Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAMEL-21472: try assign new traceId for exchanges without parent span #16386

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

thomas-gantenbein-tga
Copy link
Contributor

Copy link
Contributor

🌟 Thank you for your contribution to the Apache Camel project! 🌟

🤖 CI automation will test this PR automatically.

🐫 Apache Camel Committers, please review the following items:

  • First-time contributors require MANUAL approval for the GitHub Actions to run

  • You can use the command /component-test (camel-)component-name1 (camel-)component-name2.. to request a test from the test bot.

  • You can label PRs using build-all, build-dependents, skip-tests and test-dependents to fine-tune the checks executed by this PR.

  • Build and test logs are available in the Summary page. Only Apache Camel committers have access to the summary.

  • ⚠️ Be careful when sharing logs. Review their contents before sharing them publicly.

Copy link
Contributor

@squakez squakez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposed solution. Although this may fix the actual problem you've raised, I am not entirely sure it would be the correct way to handle. What we're doing with this is basically discarding any propagation mechanism that we should probably want instead. Also, see the comment on the Context.root():

  /**
   * Returns the root {@link Context} which all other {@link Context} are derived from.
   *
   * <p>It should generally not be required to use the root {@link Context} directly - instead, use
   * {@link Context#current()} to operate on the current {@link Context}. Only use this method if
   * you are absolutely sure you need to disregard the current {@link Context} - this almost always
   * is only a workaround hiding an underlying context propagation issue.
   */
  static Context root() {
    return ContextStorage.get().root();
  }

If you don't mind, let's keep this on hold while we continue the investigation and find out how to proceed in regards to context propagation that appears to be broken in general.

@squakez
Copy link
Contributor

squakez commented Nov 27, 2024

I have gone through a bit more troubleshooting about this. I've added some logging traces and I am not able to reproduce the problem on Spring Boot runtime. Starting a simple timer to log seems to create a new span at each exchange with actual 4.9.0-SNAPSHOT code:

2024-11-27T13:22:20.859+01:00 DEBUG 446184 --- [ - timer://yaml] [69da4c773180c20d932deb3dbda91b50-a9710cbc4e5acb99] org.apache.camel.tracing.Tracer          : Tracing: start server span=OpenTelemetrySpanAdapter [baggage={}, span=PropagatedSpan{ImmutableSpanContext{traceId=69da4c773180c20d932deb3dbda91b50, spanId=a9710cbc4e5acb99, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}}] with parent null
....
2024-11-27T13:22:50.841+01:00 DEBUG 446184 --- [ - timer://yaml] [af918b6f238227c77a401fbb6ed07dff-0306f19c52754b1b] org.apache.camel.tracing.Tracer          : Tracing: start server span=OpenTelemetrySpanAdapter [baggage={}, span=PropagatedSpan{ImmutableSpanContext{traceId=af918b6f238227c77a401fbb6ed07dff, spanId=0306f19c52754b1b, traceFlags=00, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}}] with parent null

What I have verified is that on Camel Main runtime we don't have any propagation at all, which is something that would require some attention. On Spring Boot, there is some injection that make this to work. How are you running the application? I wonder if the way it is executed is influencing the final behavior.

@thomas-gantenbein-tga
Copy link
Contributor Author

I am not entirely sure it would be the correct way to handle.
If you don't mind, let's keep this on hold while we continue the investigation and find out how to proceed in regards to context propagation that appears to be broken in general.

I agree, absolutely. It feels like a workaround and there are far too many adapter-wrapper-propagation things here for me to really understand what's going on. In only made this PR because you asked me to do so you can better give feedback.

I've seen this problem on release 4.8.1, I haven't tried with 4.9.0-SNAPSHOT, because I shied away from building the whole thing. Should I try with 4.9.0?

I've been running my "test" with mvn spring:boot run. Details on https://github.com/thomas-gantenbein-tga/camel-opentelemetry/blob/main/pom.xml#L181-L189

@squakez
Copy link
Contributor

squakez commented Nov 27, 2024

@thomas-gantenbein-tga I have the feeling that this has been solved already in 4.9.0-SNAPSHOT. Would you mind testing your reproducer against this version? At least, the simple Spring Boot example I run, does not reproduce the problem you've faced.

@squakez
Copy link
Contributor

squakez commented Nov 27, 2024

BTW, the tests I'm doing are building the java application and run via java -jar myapp.jar.

@thomas-gantenbein-tga
Copy link
Contributor Author

@thomas-gantenbein-tga I have the feeling that this has been solved already in 4.9.0-SNAPSHOT. Would you mind testing your reproducer against this version? At least, the simple Spring Boot example I run, does not reproduce the problem you've faced.

Sure, will do. Currently building. Module 203 of about 500 something. I'll let you know when I've built and tested.

@thomas-gantenbein-tga
Copy link
Contributor Author

thomas-gantenbein-tga commented Nov 27, 2024

@squakez, I tried with 4.9.0-SNAPSHOT. I wouldn't say it's fixed, it's just ... different. My reproducer goes as follows:

 from("netty-http:http://0.0.0.0:12345")
            .routeId("netty")
            .log("in netty route")
            .to("direct:aSlowerRoute")
            .to("direct:aFasterRoute");

        from("timer:mytimer?period=5000&repeatCount=2&synchronous=true")
            .routeId("timer")
            .log("in timer route")
            .to("direct:aSlowerRoute")
            .to("direct:aFasterRoute");

        from("direct:aSlowerRoute")
            .log("in aSlowerRoute")
            .delay(simple("${random(50, 350)}"))
            .log("still in aSlowerRoute");

        from("direct:aFasterRoute")
            .log("in aFasterRoute")
            .delay(simple("${random(10, 70)}"))
            .log("still in aFasterRoute");

And here are the logs:

camel-springboot 4.8.1

Timer, fire 1

21:08:39.362 INFO  Camel (camel-1) thread #2 - timer://mytimer [eca8ede291f7aacb9515ff3dd74f2ab0,ac054cb2298eff68] timer - in timer route
21:08:39.372 INFO  Camel (camel-1) thread #2 - timer://mytimer [eca8ede291f7aacb9515ff3dd74f2ab0,f35a35101a900e77] route1 - in aSlowerRoute
21:08:39.598 INFO  Camel (camel-1) thread #3 - Delay [eca8ede291f7aacb9515ff3dd74f2ab0,916eba23ab488e92] route1 - still in aSlowerRoute
21:08:39.601 INFO  Camel (camel-1) thread #3 - Delay [eca8ede291f7aacb9515ff3dd74f2ab0,d3604be1184bfe9c] route2 - in aFasterRoute
21:08:39.657 INFO  Camel (camel-1) thread #4 - Delay [eca8ede291f7aacb9515ff3dd74f2ab0,0b938354295aa230] route2 - still in aFasterRoute

Timer, fire 2

❌ traceId from fire 1
✔ thread #3 gets spanId from fire 1

21:08:44.268 INFO  Camel (camel-1) thread #2 - timer://mytimer [eca8ede291f7aacb9515ff3dd74f2ab0,0ef83f6235ee1b8e] timer - in timer route
21:08:44.270 INFO  Camel (camel-1) thread #2 - timer://mytimer [eca8ede291f7aacb9515ff3dd74f2ab0,ddf8b1f376f4017b] route1 - in aSlowerRoute
21:08:44.456 INFO  Camel (camel-1) thread #3 - Delay [eca8ede291f7aacb9515ff3dd74f2ab0,d3604be1184bfe9c] route1 - still in aSlowerRoute
21:08:44.458 INFO  Camel (camel-1) thread #3 - Delay [eca8ede291f7aacb9515ff3dd74f2ab0,55e33e194dbad337] route2 - in aFasterRoute
21:08:44.472 INFO  Camel (camel-1) thread #4 - Delay [eca8ede291f7aacb9515ff3dd74f2ab0,ffc9bc53e60435b6] route2 - still in aFasterRoute

camel-springboot 4.9.0-SNAPSHOT

Timer, fire 1

❌ traceId and spanId not propagated to thread #3 and #4

20:51:23.689 INFO  Camel (camel-1) thread #2 - timer://mytimer [a92b2cd46ea975b5f2065cb98ca98b8a,2ddffed3056dba31] timer - in timer route
20:51:23.700 INFO  Camel (camel-1) thread #2 - timer://mytimer [a92b2cd46ea975b5f2065cb98ca98b8a,3cc7c4c0b12fffab] route1 - in aSlowerRoute
20:51:23.839 INFO  Camel (camel-1) thread #3 - Delay [,] route1 - still in aSlowerRoute
20:51:23.842 INFO  Camel (camel-1) thread #3 - Delay [a92b2cd46ea975b5f2065cb98ca98b8a,c5d45ed0f340e491] route2 - in aFasterRoute
20:51:23.876 INFO  Camel (camel-1) thread #4 - Delay [,] route2 - still in aFasterRoute

Timer, fire 2

✔ New traceId generated
❌ traceId and spanId not propagated to thread #4
❌ thread #3 gets traceId and spanId from fire 1 before getting correct traceId and new spanId
❌ thread #4 does not get a traceId

20:51:28.598 INFO  Camel (camel-1) thread #2 - timer://mytimer [4d8bdc9683b6968898cd479b8259170a,e247c14e0a6280af] timer - in timer route
20:51:28.600 INFO  Camel (camel-1) thread #2 - timer://mytimer [4d8bdc9683b6968898cd479b8259170a,0a160216be3f84e2] route1 - in aSlowerRoute
20:51:28.696 INFO  Camel (camel-1) thread #3 - Delay [a92b2cd46ea975b5f2065cb98ca98b8a,c5d45ed0f340e491] route1 - still in aSlowerRoute
20:51:28.698 INFO  Camel (camel-1) thread #3 - Delay [4d8bdc9683b6968898cd479b8259170a,6021d78f652666aa] route2 - in aFasterRoute
20:51:28.726 INFO  Camel (camel-1) thread #4 - Delay [,] route2 - still in aFasterRoute

curl, no traceparent header

✔ New traceId generated
✔ New traceId propagated to thread #4 and #8
❌ two times same spanId 18fa8a062cf5ce69

20:51:58.911 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [f42e6647e9cac76b870f4b91dd63869e,591e5965005b199f] netty - in netty route
20:51:58.913 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [f42e6647e9cac76b870f4b91dd63869e,cf69a2f758a9c90d] route1 - in aSlowerRoute
20:51:59.077 INFO  Camel (camel-1) thread #5 - Delay [f42e6647e9cac76b870f4b91dd63869e,18fa8a062cf5ce69] route1 - still in aSlowerRoute
20:51:59.079 INFO  Camel (camel-1) thread #5 - Delay [f42e6647e9cac76b870f4b91dd63869e,8d3c24f005b6e313] route2 - in aFasterRoute
20:51:59.100 INFO  Camel (camel-1) thread #6 - Delay [f42e6647e9cac76b870f4b91dd63869e,18fa8a062cf5ce69] route2 - still in aFasterRoute

curl, with traceparent header 4bf92f3577b34da6a3ce929d0e0e4739

✔ traceId taken from http header
✔ traceId propagated to thread #3 and #8
❌ thread #3 gets traceId and spanId from timer, fire 2

20:52:17.908 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,f30adaa693cc553c] netty - in netty route
20:52:17.909 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,8bb8773c982c307c] route1 - in aSlowerRoute
20:52:18.205 INFO  Camel (camel-1) thread #3 - Delay [4d8bdc9683b6968898cd479b8259170a,6021d78f652666aa] route1 - still in aSlowerRoute
20:52:18.207 INFO  Camel (camel-1) thread #3 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,b1885430d753d182] route2 - in aFasterRoute
20:52:18.255 INFO  Camel (camel-1) thread #4 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,3afe751f39b1a76f] route2 - still in aFasterRoute

curl, with traceparent header 4bf92f3577b34da6a3ce929d0e0e4740

✔ traceId taken from http header
❌ same spanId d916452083efcd5f

20:52:33.461 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,a15d1d4b16c0dcb0] netty - in netty route
20:52:33.463 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,19edc06dcd19a524] route1 - in aSlowerRoute
20:52:33.744 INFO  Camel (camel-1) thread #9 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,d916452083efcd5f] route1 - still in aSlowerRoute
20:52:33.745 INFO  Camel (camel-1) thread #9 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,71800a5a37af3c69] route2 - in aFasterRoute
20:52:33.779 INFO  Camel (camel-1) thread #10 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,d916452083efcd5f] route2 - still in aFasterRoute

curl, no traceparent header

✔ new traceId generated
❌ trace #5 gets traceId and spanId from curl, first call

20:52:53.253 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [8ea524d7af24a239376e3f9a2058dc34,5b967ade2153d3ea] netty - in netty route
20:52:53.254 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [8ea524d7af24a239376e3f9a2058dc34,3940e6071069658c] route1 - in aSlowerRoute
20:52:53.455 INFO  Camel (camel-1) thread #5 - Delay [f42e6647e9cac76b870f4b91dd63869e,8d3c24f005b6e313] route1 - still in aSlowerRoute
20:52:53.457 INFO  Camel (camel-1) thread #5 - Delay [8ea524d7af24a239376e3f9a2058dc34,5b7bbb88fc47ced8] route2 - in aFasterRoute
20:52:53.475 INFO  Camel (camel-1) thread #6 - Delay [8ea524d7af24a239376e3f9a2058dc34,c63588f5683b94e2] route2 - still in aFasterRoute

@squakez
Copy link
Contributor

squakez commented Nov 28, 2024

Thanks @thomas-gantenbein-tga I will go through these scenarios and try to figure it out what's wrong.

@thomas-gantenbein-tga
Copy link
Contributor Author

Thanks a lot @squakez.

By the way:

Context propagation is not trivial, and when done incorrectly can lead to broken traces or even mixed traces.
(from https://www.javadoc.io/doc/io.opentelemetry/opentelemetry-context/latest/io/opentelemetry/context/Context.html)

Couldn't agree more 😄

I also tried this:

We provide a debug mechanism for context propagation, which can be enabled by setting -Dio.opentelemetry.context.enableStrictContext=true in your JVM args. This will enable a strict checker that makes sure that Scopes are closed on the correct thread and that they are not garbage collected before being closed.

I didn't observe that this changed the behaviour, though.

@squakez
Copy link
Contributor

squakez commented Nov 28, 2024

I did more troubleshooting but it is a complex thing to really understand. I think the changes in the PR are patching the problems you're facing by kind of resetting the trace at every exchange start. However it would not be fixing the root cause. I'm inclined to accept and merge this in order to get at least a quick patch in 4.9.0 and 4.8.2 which we should release soon, and in the while we need to continue the investigation and get a proper understanding of the propagation mechanism and how to finally fix it.

@oscerd @davsclaus wdyt?

@oscerd
Copy link
Contributor

oscerd commented Nov 28, 2024

I'd advocate to merge this.

@thomas-gantenbein-tga
Copy link
Contributor Author

Would you mind letting me run my samples again with these changes on top of 4.9.0-SNAPSHOT? I had only tried on top of 4.8.1, and I wasn't paranoid enough to also check spanIds and try with several curl calls. Just want to be sure that it's actually better at least for the scenarios I've tested, not just different.

@squakez
Copy link
Contributor

squakez commented Nov 28, 2024

Would you mind letting me run my samples again with these changes on top of 4.9.0-SNAPSHOT? I had only tried on top of 4.8.1, and I wasn't paranoid enough to also check spanIds and try with several curl calls. Just want to be sure that it's actually better at least for the scenarios I've tested, not just different.

Sure. We'll be waiting for your feedback anyway.

@thomas-gantenbein-tga
Copy link
Contributor Author

I've cherry-picked this commit on top of ff5caa5 and ran my reproducer. The result: No, let's not merge this.

Whatever was changed between ff5caa5 and release 4.8.1 fixed the issue of the same traceId being generated every time a consumer thread would start an exchange. Given this fix, my own "fix" on top of it even makes things slightly worse: Where I invoke the netty route without a traceparent header, thread #3 now gets a brand-new traceId, same for thread #6 in the last netty route:

21:08:33.366 INFO  Camel (camel-1) thread #2 - timer://mytimer [3e79d97fa497733fd51565e48c4bc79a,5caae76825597471] timer - in timer route
21:08:33.377 INFO  Camel (camel-1) thread #2 - timer://mytimer [3e79d97fa497733fd51565e48c4bc79a,0e35b3bee30a48d6] route1 - in aSlowerRoute
21:08:33.468 INFO  Camel (camel-1) thread #3 - Delay [,] route1 - still in aSlowerRoute
21:08:33.471 INFO  Camel (camel-1) thread #3 - Delay [3e79d97fa497733fd51565e48c4bc79a,95e1d7d8028296e0] route2 - in aFasterRoute
21:08:33.498 INFO  Camel (camel-1) thread #4 - Delay [,] route2 - still in aFasterRoute

21:08:38.255 INFO  Camel (camel-1) thread #2 - timer://mytimer [8a6688d4bff62bf0d1444ff0d9aaa210,d8457081e524cdcc] timer - in timer route
21:08:38.256 INFO  Camel (camel-1) thread #2 - timer://mytimer [8a6688d4bff62bf0d1444ff0d9aaa210,9d32d658a3986dfc] route1 - in aSlowerRoute
21:08:38.338 INFO  Camel (camel-1) thread #3 - Delay [3e79d97fa497733fd51565e48c4bc79a,95e1d7d8028296e0] route1 - still in aSlowerRoute
21:08:38.340 INFO  Camel (camel-1) thread #3 - Delay [8a6688d4bff62bf0d1444ff0d9aaa210,b77cf7ed75eb0941] route2 - in aFasterRoute
21:08:38.385 INFO  Camel (camel-1) thread #4 - Delay [,] route2 - still in aFasterRoute

21:08:57.929 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [35e0dec31ec26df1204f91caa8f95091,c0a1af17056cb737] netty - in netty route
21:08:57.930 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [35e0dec31ec26df1204f91caa8f95091,e5cb71ad5987aa99] route1 - in aSlowerRoute
21:08:58.062 INFO  Camel (camel-1) thread #5 - Delay [a794c504464d2e41a73370554b431788,d4afe0b75c1c773b] route1 - still in aSlowerRoute
21:08:58.064 INFO  Camel (camel-1) thread #5 - Delay [35e0dec31ec26df1204f91caa8f95091,3cae953a6f3821e4] route2 - in aFasterRoute
21:08:58.097 INFO  Camel (camel-1) thread #6 - Delay [a794c504464d2e41a73370554b431788,d4afe0b75c1c773b] route2 - still in aFasterRoute

21:08:58.146 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,9bcb9c18fe6b3547] netty - in netty route
21:08:58.148 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,6504a15fc75cc4f7] route1 - in aSlowerRoute
21:08:58.482 INFO  Camel (camel-1) thread #3 - Delay [8a6688d4bff62bf0d1444ff0d9aaa210,b77cf7ed75eb0941] route1 - still in aSlowerRoute
21:08:58.483 INFO  Camel (camel-1) thread #3 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,711f994843db5d13] route2 - in aFasterRoute
21:08:58.500 INFO  Camel (camel-1) thread #4 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,1219ea22da276463] route2 - still in aFasterRoute

21:08:58.519 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,93029888ed44c3e3] netty - in netty route
21:08:58.520 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,ee5d491873836fc4] route1 - in aSlowerRoute
21:08:58.719 INFO  Camel (camel-1) thread #9 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,ca1e85b164b97e35] route1 - still in aSlowerRoute
21:08:58.721 INFO  Camel (camel-1) thread #9 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,3a6a7afc4d2f88e7] route2 - in aFasterRoute
21:08:58.755 INFO  Camel (camel-1) thread #10 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,ca1e85b164b97e35] route2 - still in aFasterRoute

21:08:58.774 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [8ab99ef0b2fbce66d11f747a1daf8f0c,f2baf5fab0ec1e7e] netty - in netty route
21:08:58.775 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [8ab99ef0b2fbce66d11f747a1daf8f0c,e5e2fa3ad6919297] route1 - in aSlowerRoute
21:08:59.031 INFO  Camel (camel-1) thread #5 - Delay [35e0dec31ec26df1204f91caa8f95091,3cae953a6f3821e4] route1 - still in aSlowerRoute
21:08:59.032 INFO  Camel (camel-1) thread #5 - Delay [8ab99ef0b2fbce66d11f747a1daf8f0c,0e548f6037c7f4d8] route2 - in aFasterRoute
21:08:59.088 INFO  Camel (camel-1) thread #6 - Delay [adb9fa707818304e40ec044a8a17bf93,037fa960a43b9ab2] route2 - still in aFasterRoute

@thomas-gantenbein-tga
Copy link
Contributor Author

Cherry-picking this commit on top of 4.8.1 gives me the logs below. Slightly better, by the looks of it, than 4.9.0-SNAPSHOT (ff5caa5), but still thread #3 in timer fire 2 gets traceId from timer fire 1, traceId c76b3ea6c032276dce25c4b9cbe5ae4e, spanId 477aa6e18490d3af figures two times on thread #5 ... so still not really working, not even as a quick workaround.

I tried this to see how my "quick fix" compares to the state of ff5caa5. Now I see this, I don't think it's worth finding out what changed between 4.9.0-SNAPHSOT (ff5caa5) and 4.8.1 and apply this quick-"fix" instead of whatever fix was introduced so far.

So I'll close this PR. Thanks for your comments and welcoming attitude, and sorry I couldn't help more.

21:33:28.772 INFO  Camel (camel-1) thread #2 - timer://mytimer [5317aa821834beaf211569b8fb5b817e,6ef9c33d89c30294] timer - in timer route
21:33:28.783 INFO  Camel (camel-1) thread #2 - timer://mytimer [5317aa821834beaf211569b8fb5b817e,0e99673bbeff9725] route1 - in aSlowerRoute
21:33:28.858 INFO  Camel (camel-1) thread #3 - Delay [5317aa821834beaf211569b8fb5b817e,25bc89ad835a23a6] route1 - still in aSlowerRoute
21:33:28.861 INFO  Camel (camel-1) thread #3 - Delay [5317aa821834beaf211569b8fb5b817e,b7f8defe3417006a] route2 - in aFasterRoute
21:33:28.931 INFO  Camel (camel-1) thread #4 - Delay [5317aa821834beaf211569b8fb5b817e,f845fc43adc2ca68] route2 - still in aFasterRoute

21:33:33.670 INFO  Camel (camel-1) thread #2 - timer://mytimer [78181387cecceb3da5915017c73738a7,a6ee4a4a472d6d4d] timer - in timer route
21:33:33.672 INFO  Camel (camel-1) thread #2 - timer://mytimer [78181387cecceb3da5915017c73738a7,9555f8aae3c71a91] route1 - in aSlowerRoute
21:33:33.736 INFO  Camel (camel-1) thread #3 - Delay [5317aa821834beaf211569b8fb5b817e,b7f8defe3417006a] route1 - still in aSlowerRoute
21:33:33.737 INFO  Camel (camel-1) thread #3 - Delay [78181387cecceb3da5915017c73738a7,8c095a8c44fd91cd] route2 - in aFasterRoute
21:33:33.752 INFO  Camel (camel-1) thread #4 - Delay [78181387cecceb3da5915017c73738a7,a1eb1b989c93a073] route2 - still in aFasterRoute

21:33:39.517 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [c76b3ea6c032276dce25c4b9cbe5ae4e,ad91c1c0287852de] netty - in netty route
21:33:39.518 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [c76b3ea6c032276dce25c4b9cbe5ae4e,a65fa72a4a4f2b3e] route1 - in aSlowerRoute
21:33:39.604 INFO  Camel (camel-1) thread #5 - Delay [c76b3ea6c032276dce25c4b9cbe5ae4e,0c408b291b1ac62e] route1 - still in aSlowerRoute
21:33:39.606 INFO  Camel (camel-1) thread #5 - Delay [c76b3ea6c032276dce25c4b9cbe5ae4e,477aa6e18490d3af] route2 - in aFasterRoute
21:33:39.621 INFO  Camel (camel-1) thread #6 - Delay [c76b3ea6c032276dce25c4b9cbe5ae4e,90434ec70f7877f7] route2 - still in aFasterRoute

21:33:39.687 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,2d577a9eb50864dc] netty - in netty route
21:33:39.688 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,ef865d37dbc382c7] route1 - in aSlowerRoute
21:33:39.808 INFO  Camel (camel-1) thread #3 - Delay [78181387cecceb3da5915017c73738a7,8c095a8c44fd91cd] route1 - still in aSlowerRoute
21:33:39.810 INFO  Camel (camel-1) thread #3 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,f24e260e1a00d437] route2 - in aFasterRoute
21:33:39.871 INFO  Camel (camel-1) thread #4 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,467e9f696896dbc7] route2 - still in aFasterRoute

21:33:39.905 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,bde97a94210de433] netty - in netty route
21:33:39.906 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,5aec4314c18bfdc8] route1 - in aSlowerRoute
21:33:39.972 INFO  Camel (camel-1) thread #9 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,1eed86ac5e7ce989] route1 - still in aSlowerRoute
21:33:39.974 INFO  Camel (camel-1) thread #9 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,f768efe1f4bd8cff] route2 - in aFasterRoute
21:33:39.991 INFO  Camel (camel-1) thread #10 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,b287662cf3806cdd] route2 - still in aFasterRoute

21:33:40.013 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [aaee8bcea728cf9ef2543c58509a70b6,9d144769d5605db8] netty - in netty route
21:33:40.014 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [aaee8bcea728cf9ef2543c58509a70b6,2a156c01c5a0857e] route1 - in aSlowerRoute
21:33:40.354 INFO  Camel (camel-1) thread #5 - Delay [c76b3ea6c032276dce25c4b9cbe5ae4e,477aa6e18490d3af] route1 - still in aSlowerRoute
21:33:40.356 INFO  Camel (camel-1) thread #5 - Delay [aaee8bcea728cf9ef2543c58509a70b6,29c3943b16347639] route2 - in aFasterRoute
21:33:40.414 INFO  Camel (camel-1) thread #6 - Delay [aaee8bcea728cf9ef2543c58509a70b6,7590460d39ed1fdd] route2 - still in aFasterRoute

@squakez
Copy link
Contributor

squakez commented Nov 29, 2024

Thanks a lot for informing us. I'll continue the investigation and hopefully we can have a fix within 4.10 release.

@thomas-gantenbein-tga
Copy link
Contributor Author

@squakez, one more thing: When using the option camel.opentelemetry.traceProcessors = true, the whole thing works better (logs below). Though at startup, I get warnings like this: 08:14:45.382 WARN main [,] o.a.camel.impl.engine.DefaultChannel - Interceptor: org.apache.camel.opentelemetry.OpenTelemetryTracingStrategy@339cedbb at: To[direct:aFasterRoute] does not return an AsyncProcessor instance. This causes the asynchronous routing engine to not work as optimal as possible. See more details at the InterceptStrategy javadoc. Camel will use a bridge to adapt the interceptor to the asynchronous routing engine, but its not the most optimal solution. Please consider changing your interceptor to comply.
`

08:14:47.336 INFO  Camel (camel-1) thread #2 - timer://mytimer [d9dc304941ef6510bc5565b8f254a850,3b8c99e69e89ac46] timer - in timer route
08:14:47.347 INFO  Camel (camel-1) thread #2 - timer://mytimer [d9dc304941ef6510bc5565b8f254a850,3e83038ad51081c4] route1 - in aSlowerRoute
08:14:47.436 INFO  Camel (camel-1) thread #2 - timer://mytimer [d9dc304941ef6510bc5565b8f254a850,483ec28803781cbb] route1 - still in aSlowerRoute
08:14:47.440 INFO  Camel (camel-1) thread #2 - timer://mytimer [d9dc304941ef6510bc5565b8f254a850,f1f8a8ec5bb5194d] route2 - in aFasterRoute
08:14:47.469 INFO  Camel (camel-1) thread #2 - timer://mytimer [d9dc304941ef6510bc5565b8f254a850,5dbcb616875503a7] route2 - still in aFasterRoute

08:14:52.228 INFO  Camel (camel-1) thread #2 - timer://mytimer [fc320b7df9f7c59f2c98745ab760a457,68bbbdc5c2b5b3b2] timer - in timer route
08:14:52.229 INFO  Camel (camel-1) thread #2 - timer://mytimer [fc320b7df9f7c59f2c98745ab760a457,2b54666cbe277553] route1 - in aSlowerRoute
08:14:52.404 INFO  Camel (camel-1) thread #2 - timer://mytimer [fc320b7df9f7c59f2c98745ab760a457,9c09f771b3986b17] route1 - still in aSlowerRoute
08:14:52.406 INFO  Camel (camel-1) thread #2 - timer://mytimer [fc320b7df9f7c59f2c98745ab760a457,3017c9411447ac44] route2 - in aFasterRoute
08:14:52.457 INFO  Camel (camel-1) thread #2 - timer://mytimer [fc320b7df9f7c59f2c98745ab760a457,bafe17471b587c8f] route2 - still in aFasterRoute

08:15:05.296 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [89d9acf8c4f890ac81fe2aab0bce2002,a16898568c48c78d] netty - in netty route
08:15:05.297 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [89d9acf8c4f890ac81fe2aab0bce2002,84ad1887adbc74db] route1 - in aSlowerRoute
08:15:05.524 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [89d9acf8c4f890ac81fe2aab0bce2002,e6d220ca958a4f9a] route1 - still in aSlowerRoute
08:15:05.526 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [89d9acf8c4f890ac81fe2aab0bce2002,ea825b6be46cfe15] route2 - in aFasterRoute
08:15:05.555 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [89d9acf8c4f890ac81fe2aab0bce2002,2d9057f519b5e2db] route2 - still in aFasterRoute

08:15:05.614 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,adabb6a53e513883] netty - in netty route
08:15:05.615 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,ae6c71463b133f2d] route1 - in aSlowerRoute
08:15:05.687 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,fd1545f6e746681d] route1 - still in aSlowerRoute
08:15:05.689 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,2861a768e255f9ba] route2 - in aFasterRoute
08:15:05.710 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,b6ea92a8423be3e5] route2 - still in aFasterRoute

08:15:05.741 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,d13dd0d0403eabbe] netty - in netty route
08:15:05.742 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,9b2568376d3cd991] route1 - in aSlowerRoute
08:15:05.952 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,6824ac16b7c78fe1] route1 - still in aSlowerRoute
08:15:05.953 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,a91fa67528c090e0] route2 - in aFasterRoute
08:15:05.979 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,e8b3647543f4c218] route2 - still in aFasterRoute

08:15:06.012 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [e8ef703f008340510601a72729d400fc,6171ea4d64c88c7e] netty - in netty route
08:15:06.013 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [e8ef703f008340510601a72729d400fc,9ca1ae64fe0435da] route1 - in aSlowerRoute
08:15:06.210 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [e8ef703f008340510601a72729d400fc,5669e77a4bf682ec] route1 - still in aSlowerRoute
08:15:06.213 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [e8ef703f008340510601a72729d400fc,6c5d8b45799e1c33] route2 - in aFasterRoute
08:15:06.236 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [e8ef703f008340510601a72729d400fc,0dfd964a17af815b] route2 - still in aFasterRoute

@squakez
Copy link
Contributor

squakez commented Nov 29, 2024

Thanks @thomas-gantenbein-tga . Yes, in that case we are creating a new span for each processor as well, so, likely the scope is not closed ahead of time. I still need to figure it out the entire propagation mechanism as it seems to be incomplete and broken somewhere in the camel-tracing component.

@thomas-gantenbein-tga
Copy link
Contributor Author

I believe I understand why thread #3 -- the one that finishes the job after my "delay" -- doesn't get a traceId at all first and then traceIds get mixed up: Information about the current span, the span context and the context itself is stored in ThreadLocalStorage. OpenTelemetryTracer#startExchangeBeginSpan doesn't get the chance to be executed for thread #3 when the route reaches the delay EIP, so there's no information about a current span in ThreadLocalStorage at first. Later on, when thread #3 continues to execute the second direct route, OpenTelemetryTracer#startExchangeBeginSpan gets executed, which gets the parent span with the correct traceId.

This also explains why my "fix" makes the traceId appear already the first time on thread #3: I store the current span on the root context. So there is still no information about the current span the first time tread #3 does something in ThreadLocalStorage, but there is information in the root Context -- which may be stored statically somewhere, but I don't know for sure.

And this, in turn, also explains why my "fix" still doesn't make everything work: It makes it work only the first time for a thread -- under certain conditions.

If I am right, you'd have to find a way to pass the current spanContext and Context not only across routes (via an exchangeProperty and by invoking OpenTelemetryTracer#startExchangeBeginSpan on exchangeBegin, but also across EIPs within a route.

Copy link
Contributor

🌟 Thank you for your contribution to the Apache Camel project! 🌟

🤖 CI automation will test this PR automatically.

🐫 Apache Camel Committers, please review the following items:

  • First-time contributors require MANUAL approval for the GitHub Actions to run

  • You can use the command /component-test (camel-)component-name1 (camel-)component-name2.. to request a test from the test bot.

  • You can label PRs using build-all, build-dependents, skip-tests and test-dependents to fine-tune the checks executed by this PR.

  • Build and test logs are available in the Summary page. Only Apache Camel committers have access to the summary.

  • ⚠️ Be careful when sharing logs. Review their contents before sharing them publicly.

@thomas-gantenbein-tga
Copy link
Contributor Author

Just in case you decide not to remove MDC logging altogether: I think this could work now, @squakez. At least for logs from the log EIP or log component. For logs from Java code ... hm, well, that would still be shaky.

I tried with the same route as above, that is:

        from("netty-http:http://0.0.0.0:12345")
            .routeId("netty")
            .log("in netty route")
            .to("direct:aSlowerRoute")
            .to("direct:aFasterRoute");

        from("timer:mytimer?period=5000&repeatCount=2&synchronous=true")
            .routeId("timer")
            .log("in timer route")
            .to("direct:aSlowerRoute")
            .to("direct:aFasterRoute");

        from("direct:aSlowerRoute")
            .log("in aSlowerRoute")
            .delay(simple("${random(50, 350)}"))
            .log("still in aSlowerRoute");

        from("direct:aFasterRoute")
            .log("in aFasterRoute")
            .delay(simple("${random(10, 70)}"))
            .log("still in aFasterRoute");

Waited for the timer to fire two times, then curl localhost:12345 && curl -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4739-00f067aa0ba902b7-01" localhost:12345 && curl -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4740-00f067aa0ba902b7-01" localhost:12345 && curl localhost:12345

Logs:

20:12:44.994 INFO  Camel (camel-1) thread #2 - timer://mytimer [d7d91e81efc3acf4e0730bee88601e7c,6e1c0a64a2feec44] timer - in timer route
20:12:45.015 INFO  Camel (camel-1) thread #2 - timer://mytimer [d7d91e81efc3acf4e0730bee88601e7c,1be08f81939abccc] route1 - in aSlowerRoute
20:12:45.335 INFO  Camel (camel-1) thread #3 - Delay [d7d91e81efc3acf4e0730bee88601e7c,1be08f81939abccc] route1 - still in aSlowerRoute
20:12:45.339 INFO  Camel (camel-1) thread #3 - Delay [d7d91e81efc3acf4e0730bee88601e7c,46d1457be03f7ada] route2 - in aFasterRoute
20:12:45.397 INFO  Camel (camel-1) thread #4 - Delay [d7d91e81efc3acf4e0730bee88601e7c,46d1457be03f7ada] route2 - still in aFasterRoute

20:12:49.868 INFO  Camel (camel-1) thread #2 - timer://mytimer [ef249c428653959d02560131d975b206,9141be8a7b20d5a5] timer - in timer route
20:12:49.869 INFO  Camel (camel-1) thread #2 - timer://mytimer [ef249c428653959d02560131d975b206,0fae7d3d4de2631c] route1 - in aSlowerRoute
20:12:49.983 INFO  Camel (camel-1) thread #3 - Delay [ef249c428653959d02560131d975b206,0fae7d3d4de2631c] route1 - still in aSlowerRoute
20:12:49.984 INFO  Camel (camel-1) thread #3 - Delay [ef249c428653959d02560131d975b206,72e273fc1ff50c5a] route2 - in aFasterRoute
20:12:50.028 INFO  Camel (camel-1) thread #4 - Delay [ef249c428653959d02560131d975b206,72e273fc1ff50c5a] route2 - still in aFasterRoute

20:13:17.016 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [86aeeb0e2cb978b7b376430045338db5,283b751c9e8bd65e] netty - in netty route
20:13:17.017 INFO  Camel (camel-1) thread #8 - NettyConsumerExecutorGroup [86aeeb0e2cb978b7b376430045338db5,2747fb1cca26d6dd] route1 - in aSlowerRoute
20:13:17.079 INFO  Camel (camel-1) thread #5 - Delay [86aeeb0e2cb978b7b376430045338db5,2747fb1cca26d6dd] route1 - still in aSlowerRoute
20:13:17.081 INFO  Camel (camel-1) thread #5 - Delay [86aeeb0e2cb978b7b376430045338db5,639427d08d593827] route2 - in aFasterRoute
20:13:17.123 INFO  Camel (camel-1) thread #6 - Delay [86aeeb0e2cb978b7b376430045338db5,639427d08d593827] route2 - still in aFasterRoute

20:13:17.173 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,df985068b674f995] netty - in netty route
20:13:17.174 INFO  Camel (camel-1) thread #12 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4739,e06755720e1692de] route1 - in aSlowerRoute
20:13:17.351 INFO  Camel (camel-1) thread #3 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,e06755720e1692de] route1 - still in aSlowerRoute
20:13:17.353 INFO  Camel (camel-1) thread #3 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,08f9b17fe5fbee60] route2 - in aFasterRoute
20:13:17.408 INFO  Camel (camel-1) thread #4 - Delay [4bf92f3577b34da6a3ce929d0e0e4739,08f9b17fe5fbee60] route2 - still in aFasterRoute

20:13:17.429 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,5c9ba31a6a70e867] netty - in netty route
20:13:17.430 INFO  Camel (camel-1) thread #16 - NettyConsumerExecutorGroup [4bf92f3577b34da6a3ce929d0e0e4740,7eed5ecec22b3281] route1 - in aSlowerRoute
20:13:17.689 INFO  Camel (camel-1) thread #9 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,7eed5ecec22b3281] route1 - still in aSlowerRoute
20:13:17.690 INFO  Camel (camel-1) thread #9 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,9920ef6a743629a7] route2 - in aFasterRoute
20:13:17.720 INFO  Camel (camel-1) thread #10 - Delay [4bf92f3577b34da6a3ce929d0e0e4740,9920ef6a743629a7] route2 - still in aFasterRoute

20:13:17.741 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [8f1095f44448c6632ba4e97597a2ce5a,66661fb8786cf397] netty - in netty route
20:13:17.742 INFO  Camel (camel-1) thread #20 - NettyConsumerExecutorGroup [8f1095f44448c6632ba4e97597a2ce5a,3b42be0fae97dd0e] route1 - in aSlowerRoute
20:13:17.799 INFO  Camel (camel-1) thread #5 - Delay [8f1095f44448c6632ba4e97597a2ce5a,3b42be0fae97dd0e] route1 - still in aSlowerRoute
20:13:17.800 INFO  Camel (camel-1) thread #5 - Delay [8f1095f44448c6632ba4e97597a2ce5a,e5fc8dae051fc33e] route2 - in aFasterRoute
20:13:17.839 INFO  Camel (camel-1) thread #6 - Delay [8f1095f44448c6632ba4e97597a2ce5a,e5fc8dae051fc33e] route2 - still in aFasterRoute

Copy link
Contributor

@squakez squakez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's partially correct as you've mentioned it would not cover all scenarios. Also there is the problem of the context cleaning once the trace is complete.

@thomas-gantenbein-tga
Copy link
Contributor Author

@squakez, what do you think: Let's close this PR, since you are working on solving the root problem of this? And maybe decline CAMEL-21472 since what I reported turned out to be more of a symptom than a problem, and I guess your upcoming contribution has a much wider scope than that JIRA?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants