Prevent blocking on queue overflow (#1809) #1837
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. "I signed it!").
I signed it!
CLAs look good, thanks!
LGTM
@@ -131,7 +136,19 @@ private static DisruptorEventQueue create() {
        new DisruptorEnqueuer() {
          @Override
          public void enqueue(Entry entry) {
We may consider tracing this method too (not suggesting doing it in this PR though).
We need to evaluate the potential memory leak issue with this PR. Marked as DO NOT MERGE for now.
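The diff above only shows the head of the changed enqueue method. As a minimal sketch of the non-blocking approach under discussion (drop the entry and warn instead of waiting for a free ring buffer slot), something like the following would work. The class, field, and stand-in type names here are illustrative assumptions rather than the code from this PR; only `RingBuffer.tryPublishEvent` is the real Disruptor API.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

import com.lmax.disruptor.EventTranslatorOneArg;
import com.lmax.disruptor.RingBuffer;

// Stand-ins for opencensus-java's internal queue types (not the real classes).
interface Entry {
  void process();
}

final class DisruptorEvent {
  private Entry entry;

  void setEntry(Entry entry) {
    this.entry = entry;
  }

  Entry getEntry() {
    return entry;
  }
}

final class NonBlockingEnqueuer {
  private static final Logger logger = Logger.getLogger(NonBlockingEnqueuer.class.getName());

  // Copies the caller's Entry into the pre-allocated ring buffer slot.
  private static final EventTranslatorOneArg<DisruptorEvent, Entry> TRANSLATOR =
      (event, sequence, entry) -> event.setEntry(entry);

  private final RingBuffer<DisruptorEvent> ringBuffer;
  private final AtomicLong droppedEvents = new AtomicLong();

  NonBlockingEnqueuer(RingBuffer<DisruptorEvent> ringBuffer) {
    this.ringBuffer = ringBuffer;
  }

  void enqueue(Entry entry) {
    // tryPublishEvent returns false when no slot is free instead of waiting,
    // so a full queue never blocks the application (foreground) thread.
    if (!ringBuffer.tryPublishEvent(TRANSLATOR, entry)) {
      droppedEvents.incrementAndGet();
      logger.log(Level.WARNING, "Trace event queue is full; dropping the event.");
    }
  }
}
```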
Thank you for checking this PR and the quick response! I organized my thoughts about your concern and implemented some improvements to mitigate the issue in 43d7647 & 8b886b0. Could you check this? @songy23

Potential memory leak issue: I checked the subclasses of Entry. My idea to mitigate the issue: …

I feel Idea 1-B is a feasible and effective way, so I implemented Idea 1-B in 43d7647. It also prevents a memory leak even if events come after queue shutdown (the current code has such a potential issue). (And also …)

Skipping SpanStartEvent: when we drop … currently there is only one implementation …
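As a rough illustration of the idea of releasing what a dropped span event references, the sketch below gives entries a callback that is invoked when the queue drops them. All names here (rejected(), SpanStartEvent, SpanData) are assumptions for illustration and are not taken from commits 43d7647 / 8b886b0.

```java
// Hypothetical sketch of the "tell the entry it was dropped" idea.
final class DropAwareQueueSketch {

  interface Entry {
    void process();

    // Invoked instead of process() when the queue is full, so the entry can
    // release whatever it references and nothing stays reachable forever.
    void rejected();
  }

  // Stand-in for the per-span state that the real SpanImpl would hold.
  static final class SpanData {
    volatile boolean dropped;
  }

  static final class SpanStartEvent implements Entry {
    private final SpanData span;

    SpanStartEvent(SpanData span) {
      this.span = span;
    }

    @Override
    public void process() {
      // Normal path: hand the span to the start/end handler.
    }

    @Override
    public void rejected() {
      // The start was never recorded, so mark the span as dropped; the later
      // SpanEndEvent for the same span can then be skipped as well, instead of
      // holding a reference to the span until (or beyond) queue shutdown.
      span.dropped = true;
    }
  }
}
```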
Force-pushed from 6ff8ccd to 43d7647
Friendly ping @songy23. Let me know if I can help in any way with getting this PR merged.
Apologies for getting back late to this PR and thanks for investigating the issue.
This sounds like a reasonable approach to me. We may also extend …
Hi @saiya, we would like to have a better understanding of the background of the issue. Could you describe a bit how the thread blocking impacted your application? (Depending on the impact we may want to take a different approach, e.g. do some optimization to reduce the events put onto the queue.)
Hi @songy23, here is the situation I am facing:
There are some alternative ideas I can imagine for this issue, but they are not suitable for OpenTracing:

This is why I am interested in this queue overflow issue.

Because a large system (a large-scale microservice system) is chaotic, it is hard to determine a "safe" sampling ratio / rate limit. If there is a risk of blocking, I need to set a very low sampling ratio. Even if we reduce events on the queue, the issue is still alive. In my working experience, many large service providers (not only my current company) have similar needs. It is hard to introduce additional blocking risk only for tracing/instrumentation. If the tracing SDK does not have such a risk, that is awesome. This is why I made this PR. But reducing events is also wonderful because it keeps events from being dropped. Thank you for discussing this PR; feel free to ask me for anything more I can share.
Thank you @saiya for adding the detailed background; now we have a much better understanding of your use case. There are some concerns from my teammates, and we also want to make sure this behavior is consistent across all OC languages, so I filed census-instrumentation/opencensus-specs#262 for a broader discussion.
Force-pushed from a116593 to a55ccbc
Hi @songy23! Currently this PR drops events if the queue gets full. But it calls …

Also, I rebased this PR on the latest master branch (there has been a lot of refactoring in the related classes :-) ). I know there is a discussion going on in census-instrumentation/opencensus-specs#262. Can such improvements be done in another branch/PR, or should we improve this PR? (It might make this PR huge.) This PR prevents a memory leak and shows a warning log on queue overflow. I think this is a minimal viable solution.
Hi @saiya, did you get a chance to try the 0.22.0 release? The new release included the fix proposed by @bogdandrutu (census-instrumentation/opencensus-specs#262 (comment)) and should reduce the possibility of queue blocking. #1893 implemented a similar approach to dropping spans.

@bogdandrutu had some concerns about silently dropping events. Maybe the fact that the event queue got full and started blocking is an indicator that the sampling rate is too high. In that case, having users explicitly reduce the sampling rate may be a better option. (IMO dropping queue events is actually equivalent to having a lower sampling rate.) Some quotes from the Dapper paper:

Personally I think adaptive sampling (a.k.a. a rate-limited sampler) may be the silver bullet here. But before that happens, I think it's reasonable to make the queue non-blocking and log a warning to let users know they may need to use a lower sampling rate.
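To make the "adaptive sampling (a.k.a. rate-limited sampler)" idea concrete, here is a rough sketch of a per-second budget sampler: it caps the number of sampled traces per second instead of using one fixed probability, so a traffic spike raises the drop rate rather than the queue depth. The class name and the one-second window are assumptions for illustration; this is not the OpenCensus Sampler API.

```java
// Illustration only: a simple fixed-window, rate-limited sampling decision.
final class RateLimitedSamplerSketch {
  private final int maxSampledPerSecond;
  private long windowStartMillis = System.currentTimeMillis();
  private int sampledInWindow = 0;

  RateLimitedSamplerSketch(int maxSampledPerSecond) {
    this.maxSampledPerSecond = maxSampledPerSecond;
  }

  synchronized boolean shouldSample() {
    long now = System.currentTimeMillis();
    if (now - windowStartMillis >= 1000) {
      // Start a new one-second window.
      windowStartMillis = now;
      sampledInWindow = 0;
    }
    if (sampledInWindow < maxSampledPerSecond) {
      sampledInWindow++;
      return true;  // still within the per-second budget
    }
    return false;   // budget used up for this window: skip tracing this request
  }
}
```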
Hi @songy23, thank you for the kind response.

Yes, I think so. I also know that dropping events causes a non-uniform sampling ratio (it sometimes drops, sometimes does not). If the user knows an adequate sampling ratio, the user should set it. But the key point I want to emphasize is that it is hard to know the proper sampling ratio for some kinds of environments. In our case, some microservices amplify requests (generating many outgoing API requests for each incoming request), and there are random spikes in user activity. In such a situation it is hard to know a good sampling ratio. If I set a very pessimistic sampling ratio (such as 0.001%) to avoid blocking, it can easily miss issues affecting some percentage of users.

Yes, adaptive sampling is an ideal solution for this matter. But, IMHO, I feel it will take some days to implement.

Absolutely agree. It helps us search for a good sampling ratio (rather than a very pessimistic one such as 0.001%). And we really need a non-blocking queue to avoid stopping our service in production. I can try 0.22.0 in a non-important service in our company, but to try it in our hottest service, non-blocking is important.
Yes, I was thinking about the same reason, and I agree that in this scenario it's better to have drop-and-log rather than slowing down users' whole application. @bogdandrutu WDYT? The only alternative I see here is to always recommend a low sampling rate, but like @saiya said, it has its own problems.
Hi @saiya, I chatted with @bogdandrutu about this PR today. Bogdan mentioned we'll provide the non-blocking option in the new OpenTelemetry client libraries. (As you may be aware, we're currently migrating the OpenCensus project to the merged OpenTelemetry project, and OpenCensus will be put into maintenance mode afterwards.) Given that, …
@songy23 when do you expect there'll be a beta release for OpenTelemetry or a stable release?
The current plan is to have the API ready and published by the end of this month. The SDK (a.k.a. the implementation) will be ready early next quarter.
Hi @songy23, thank you for sharing the current status.

Very good news! Is there any specification / code / PR / issue in the OpenTelemetry repositories (e.g. documentation in https://github.com/open-telemetry/opentelemetry-specification) that clarifies the non-blocking option? Or can I send a PR to any repository of OpenTelemetry?

Not urgent. I can wait for some months (but I do not want to wait for years). In my case, I have currently enabled OpenCensus in non-heavy-traffic systems only. In my understanding, OpenTelemetry will provide an OpenCensus shim, so I can reuse my OpenCensus integration to use OpenTelemetry.

Good news! I hope OpenTelemetry will provide an SDK for Stackdriver Tracing next quarter.

I am using 0.22.1 in non-heavy systems only. It is working well, but I have not enabled OpenCensus in heavy systems.
Not yet, since for now we're focusing on the API/data models. Please feel free to open an issue under https://github.com/open-telemetry/opentelemetry-specification.
Thanks for the info!
Exactly. The goal is that OpenTelemetry will be 100% backwards-compatible with OpenCensus, so the existing integration will continue to work.
I found a fresh issue: open-telemetry/opentelemetry-specification#94. @songy23, thank you for supporting this PR. I think we can continue the discussion in open-telemetry/opentelemetry-specification#94; please comment in the issue if you have any thoughts. I feel it is okay to close this PR.
Thanks for all the contributions and discussions - your time and efforts are highly appreciated! Let's continue the discussion at open-telemetry/opentelemetry-specification#94 to make sure we have consistent behavior in all OpenTelemetry languages.

Yes, that's on our radar. As of today we've finished the initial version of the APIs in OpenTelemetry-Java. The SDK implementation is in progress: https://github.com/open-telemetry/opentelemetry-java/labels/sdk.
#1809
I implemented queue overflow handling to prevent blocking the foreground (application) thread. It drops the event and shows a WARN log when the queue gets full.
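For context, here is a sketch of how such an overflow warning could be emitted without flooding the log when many events are dropped in a row. The once-per-interval policy and the dropped-event counter are assumptions for illustration; the PR description above only states that a WARN log is shown when the queue is full.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: log roughly one WARN per interval, with a count of dropped events.
final class OverflowWarningSketch {
  private static final Logger logger = Logger.getLogger(OverflowWarningSketch.class.getName());
  private static final long LOG_INTERVAL_MILLIS = 10_000;  // roughly one WARN per 10s

  private final AtomicLong droppedSinceLastLog = new AtomicLong();
  private volatile long lastLogMillis = 0;

  void onEventDropped() {
    long dropped = droppedSinceLastLog.incrementAndGet();
    long now = System.currentTimeMillis();
    if (now - lastLogMillis >= LOG_INTERVAL_MILLIS) {
      lastLogMillis = now;
      droppedSinceLastLog.set(0);
      logger.log(
          Level.WARNING,
          "Trace event queue is full; dropped {0} events since the last warning."
              + " Consider lowering the sampling probability.",
          dropped);
    }
  }
}
```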