-
Notifications
You must be signed in to change notification settings - Fork 5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CAMEL-21472: try assign new traceId for exchanges without parent span #16386
base: main
Are you sure you want to change the base?
CAMEL-21472: try assign new traceId for exchanges without parent span #16386
Conversation
🌟 Thank you for your contribution to the Apache Camel project! 🌟 🤖 CI automation will test this PR automatically. 🐫 Apache Camel Committers, please review the following items:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the proposed solution. Although this may fix the actual problem you've raised, I am not entirely sure it would be the correct way to handle. What we're doing with this is basically discarding any propagation mechanism that we should probably want instead. Also, see the comment on the Context.root()
:
/**
* Returns the root {@link Context} which all other {@link Context} are derived from.
*
* <p>It should generally not be required to use the root {@link Context} directly - instead, use
* {@link Context#current()} to operate on the current {@link Context}. Only use this method if
* you are absolutely sure you need to disregard the current {@link Context} - this almost always
* is only a workaround hiding an underlying context propagation issue.
*/
static Context root() {
return ContextStorage.get().root();
}
If you don't mind, let's keep this on hold while we continue the investigation and find out how to proceed in regards to context propagation that appears to be broken in general.
I have gone through a bit more troubleshooting about this. I've added some logging traces and I am not able to reproduce the problem on Spring Boot runtime. Starting a simple timer to log seems to create a new span at each exchange with actual 4.9.0-SNAPSHOT code:
What I have verified is that on Camel Main runtime we don't have any propagation at all, which is something that would require some attention. On Spring Boot, there is some injection that make this to work. How are you running the application? I wonder if the way it is executed is influencing the final behavior. |
I agree, absolutely. It feels like a workaround and there are far too many adapter-wrapper-propagation things here for me to really understand what's going on. In only made this PR because you asked me to do so you can better give feedback. I've seen this problem on release 4.8.1, I haven't tried with 4.9.0-SNAPSHOT, because I shied away from building the whole thing. Should I try with 4.9.0? I've been running my "test" with |
@thomas-gantenbein-tga I have the feeling that this has been solved already in 4.9.0-SNAPSHOT. Would you mind testing your reproducer against this version? At least, the simple Spring Boot example I run, does not reproduce the problem you've faced. |
BTW, the tests I'm doing are building the java application and run via |
Sure, will do. Currently building. Module 203 of about 500 something. I'll let you know when I've built and tested. |
@squakez, I tried with 4.9.0-SNAPSHOT. I wouldn't say it's fixed, it's just ... different. My reproducer goes as follows: from("netty-http:http://0.0.0.0:12345")
.routeId("netty")
.log("in netty route")
.to("direct:aSlowerRoute")
.to("direct:aFasterRoute");
from("timer:mytimer?period=5000&repeatCount=2&synchronous=true")
.routeId("timer")
.log("in timer route")
.to("direct:aSlowerRoute")
.to("direct:aFasterRoute");
from("direct:aSlowerRoute")
.log("in aSlowerRoute")
.delay(simple("${random(50, 350)}"))
.log("still in aSlowerRoute");
from("direct:aFasterRoute")
.log("in aFasterRoute")
.delay(simple("${random(10, 70)}"))
.log("still in aFasterRoute"); And here are the logs: camel-springboot 4.8.1Timer, fire 1
Timer, fire 2❌ traceId from fire 1
camel-springboot 4.9.0-SNAPSHOTTimer, fire 1❌ traceId and spanId not propagated to thread #3 and #4
Timer, fire 2✔ New traceId generated
curl, no traceparent header✔ New traceId generated
curl, with traceparent header 4bf92f3577b34da6a3ce929d0e0e4739✔ traceId taken from http header
curl, with traceparent header 4bf92f3577b34da6a3ce929d0e0e4740✔ traceId taken from http header
curl, no traceparent header✔ new traceId generated
|
Thanks @thomas-gantenbein-tga I will go through these scenarios and try to figure it out what's wrong. |
Thanks a lot @squakez. By the way:
Couldn't agree more 😄 I also tried this:
I didn't observe that this changed the behaviour, though. |
I did more troubleshooting but it is a complex thing to really understand. I think the changes in the PR are patching the problems you're facing by kind of resetting the trace at every exchange start. However it would not be fixing the root cause. I'm inclined to accept and merge this in order to get at least a quick patch in 4.9.0 and 4.8.2 which we should release soon, and in the while we need to continue the investigation and get a proper understanding of the propagation mechanism and how to finally fix it. @oscerd @davsclaus wdyt? |
I'd advocate to merge this. |
Would you mind letting me run my samples again with these changes on top of 4.9.0-SNAPSHOT? I had only tried on top of 4.8.1, and I wasn't paranoid enough to also check spanIds and try with several curl calls. Just want to be sure that it's actually better at least for the scenarios I've tested, not just different. |
Sure. We'll be waiting for your feedback anyway. |
I've cherry-picked this commit on top of ff5caa5 and ran my reproducer. The result: No, let's not merge this. Whatever was changed between ff5caa5 and release 4.8.1 fixed the issue of the same traceId being generated every time a consumer thread would start an exchange. Given this fix, my own "fix" on top of it even makes things slightly worse: Where I invoke the netty route without a traceparent header, thread #3 now gets a brand-new traceId, same for thread #6 in the last netty route:
|
Cherry-picking this commit on top of 4.8.1 gives me the logs below. Slightly better, by the looks of it, than 4.9.0-SNAPSHOT (ff5caa5), but still thread #3 in timer fire 2 gets traceId from timer fire 1, traceId c76b3ea6c032276dce25c4b9cbe5ae4e, spanId 477aa6e18490d3af figures two times on thread #5 ... so still not really working, not even as a quick workaround. I tried this to see how my "quick fix" compares to the state of ff5caa5. Now I see this, I don't think it's worth finding out what changed between 4.9.0-SNAPHSOT (ff5caa5) and 4.8.1 and apply this quick-"fix" instead of whatever fix was introduced so far. So I'll close this PR. Thanks for your comments and welcoming attitude, and sorry I couldn't help more.
|
Thanks a lot for informing us. I'll continue the investigation and hopefully we can have a fix within 4.10 release. |
@squakez, one more thing: When using the option
|
Thanks @thomas-gantenbein-tga . Yes, in that case we are creating a new span for each processor as well, so, likely the scope is not closed ahead of time. I still need to figure it out the entire propagation mechanism as it seems to be incomplete and broken somewhere in the |
I believe I understand why thread #3 -- the one that finishes the job after my "delay" -- doesn't get a traceId at all first and then traceIds get mixed up: Information about the current span, the span context and the context itself is stored in ThreadLocalStorage. This also explains why my "fix" makes the traceId appear already the first time on thread #3: I store the current span on the root context. So there is still no information about the current span the first time tread #3 does something in ThreadLocalStorage, but there is information in the root Context -- which may be stored statically somewhere, but I don't know for sure. And this, in turn, also explains why my "fix" still doesn't make everything work: It makes it work only the first time for a thread -- under certain conditions. If I am right, you'd have to find a way to pass the current spanContext and Context not only across routes (via an exchangeProperty and by invoking |
🌟 Thank you for your contribution to the Apache Camel project! 🌟 🤖 CI automation will test this PR automatically. 🐫 Apache Camel Committers, please review the following items:
|
Just in case you decide not to remove MDC logging altogether: I think this could work now, @squakez. At least for logs from the log EIP or log component. For logs from Java code ... hm, well, that would still be shaky. I tried with the same route as above, that is: from("netty-http:http://0.0.0.0:12345")
.routeId("netty")
.log("in netty route")
.to("direct:aSlowerRoute")
.to("direct:aFasterRoute");
from("timer:mytimer?period=5000&repeatCount=2&synchronous=true")
.routeId("timer")
.log("in timer route")
.to("direct:aSlowerRoute")
.to("direct:aFasterRoute");
from("direct:aSlowerRoute")
.log("in aSlowerRoute")
.delay(simple("${random(50, 350)}"))
.log("still in aSlowerRoute");
from("direct:aFasterRoute")
.log("in aFasterRoute")
.delay(simple("${random(10, 70)}"))
.log("still in aFasterRoute"); Waited for the timer to fire two times, then Logs:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's partially correct as you've mentioned it would not cover all scenarios. Also there is the problem of the context cleaning once the trace is complete.
@squakez, what do you think: Let's close this PR, since you are working on solving the root problem of this? And maybe decline CAMEL-21472 since what I reported turned out to be more of a symptom than a problem, and I guess your upcoming contribution has a much wider scope than that JIRA? |
Description
See https://issues.apache.org/jira/browse/CAMEL-21472