This repository has been archived by the owner on Dec 23, 2023. It is now read-only.

Prevent blocking on queue overflow (#1809) #1837

Closed

Conversation

saiya

@saiya saiya commented Apr 9, 2019

#1809

I implemented queue overflow handling to prevent blocking the foreground (application) thread.

It drops the event and logs a WARN message when the queue is full.
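
To make the intended behavior concrete, below is a minimal sketch of a drop-and-warn enqueue path, assuming the LMAX Disruptor RingBuffer that backs DisruptorEventQueue; DisruptorEvent and NonBlockingEnqueuer are simplified stand-ins for the real internal classes, not the exact diff in this PR.

```java
import com.lmax.disruptor.InsufficientCapacityException;
import com.lmax.disruptor.RingBuffer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Simplified stand-in: the real DisruptorEventQueue wraps each EventQueue.Entry in an
// internal event object; DisruptorEvent here is a minimal illustrative version.
final class DisruptorEvent {
  private Object entry;

  void setEntry(Object entry) {
    this.entry = entry;
  }
}

final class NonBlockingEnqueuer {
  private static final Logger logger =
      Logger.getLogger(NonBlockingEnqueuer.class.getName());

  private final RingBuffer<DisruptorEvent> ringBuffer;

  NonBlockingEnqueuer(RingBuffer<DisruptorEvent> ringBuffer) {
    this.ringBuffer = ringBuffer;
  }

  void enqueue(Object entry) {
    long sequence;
    try {
      // tryNext() claims a slot without blocking; it throws instead of waiting
      // when the ring buffer is full, so the application thread is never parked.
      sequence = ringBuffer.tryNext();
    } catch (InsufficientCapacityException e) {
      logger.log(Level.WARNING, "Queue is full, dropping the event: " + entry);
      return;
    }
    try {
      ringBuffer.get(sequence).setEntry(entry);
    } finally {
      ringBuffer.publish(sequence);
    }
  }
}
```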

@saiya saiya requested review from dinooliva, rghetia, songy23 and a team as code owners April 9, 2019 02:50
@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@saiya
Author

saiya commented Apr 9, 2019

I signed it!

@googlebot

CLAs look good, thanks!


Contributor

@songy23 songy23 left a comment


LGTM

@@ -131,7 +136,19 @@ private static DisruptorEventQueue create() {
new DisruptorEnqueuer() {
@Override
public void enqueue(Entry entry) {

We may consider tracing this method too (not suggesting doing it in this PR though).

@songy23
Contributor

songy23 commented Apr 10, 2019

We need to evaluate the potential memory leak issue with this PR. Marked as DO NOT MERGE for now.

@saiya
Author

saiya commented Apr 11, 2019

Thank you for checking this PR and for the quick response!

I organized my thoughts about your concern and implemented some improvements to mitigate the issue in 43d7647 & 8b886b0. Could you take a look? @songy23

potential memory leak issue

I checked the subclasses of EventQueue.Entry. The problem is SpanEndEvent#process, because it calls runningSpanStore.onEnd to remove the span from runningSpans. If we skip it, we get a memory leak through runningSpans. In my understanding, this is the cause of your concern.

My ideas to mitigate the potential runningSpans leak:

  • Idea 1-A) Call EventQueue.Entry#process() on the foreground thread when the queue is full. This seems bad because it may cause lock contention and severe performance issues.
  • Idea 1-B) Add an EventQueue.Entry#rejected() method that implements minimum cleanup code. Call it on the foreground thread when the queue is full, and call runningSpanStore.onEnd from SpanEndEvent#rejected() to prevent the memory leak.
  • Idea 2) Change InProcessRunningSpanStoreImpl to use WeakReference to prevent the leak. This may increase memory usage, because it allocates additional objects to hold references to the spans.

I feel Idea 1-B is a feasible and effective approach, so I implemented it in 43d7647 (a sketch follows below). It also prevents a memory leak when events arrive after queue shutdown (the current code has that potential issue).

(InProcessSampledSpanStoreImpl.UnregisterSpanNameEvent also seems to have a similar potential issue, but the Map<String, PerSpanNameSamples> inside it should not grow huge because its keys are span names, not the spans themselves. So I don't think it is a serious problem.)
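
Here is a minimal sketch of Idea 1-B; the Entry and RunningSpanStore interfaces below are simplified stand-ins for the real opencensus-impl types, and the constructor shape is an assumption meant only to show where rejected() fits.

```java
import io.opencensus.trace.Span;

// Simplified stand-in for the running-span store used by SpanEndEvent.
interface RunningSpanStore {
  void onEnd(Span span);
}

// Queue entry contract with the proposed rejection hook.
interface Entry {
  void process();   // normal path: runs on the Disruptor worker thread
  void rejected();  // queue-full path: runs on the application (foreground) thread
}

final class SpanEndEvent implements Entry {
  private final Span span;
  private final RunningSpanStore runningSpanStore;

  SpanEndEvent(Span span, RunningSpanStore runningSpanStore) {
    this.span = span;
    this.runningSpanStore = runningSpanStore;
  }

  @Override
  public void process() {
    // Removes the span from runningSpans and continues with export bookkeeping.
    runningSpanStore.onEnd(span);
  }

  @Override
  public void rejected() {
    // Minimum cleanup when the event is dropped: still remove the span from the
    // running-span store so runningSpans cannot grow without bound.
    runningSpanStore.onEnd(span);
  }
}
```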

Skipping SpanStartEvent

When we drop a SpanStartEvent, RunningSpanStoreImpl#onStart(span) is skipped, so RunningSpanStoreImpl might receive onEnd(span) without a preceding onStart(span).

The only current implementation, InProcessRunningSpanStoreImpl, raises IllegalArgumentException in runningSpans.removeElement. In that case, doing nothing is the correct behavior in my understanding, so I implemented exception handling in 8b886b0 (a sketch follows below).
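
A rough sketch of that exception handling (illustrative only; the RecordEventsSpanImpl parameter type is an assumed name, while runningSpans.removeElement follows the method mentioned above):

```java
// Illustrative only: the enclosing class stands in for InProcessRunningSpanStoreImpl,
// and runningSpans stands in for its internal collection of running spans.
@Override
public void onEnd(RecordEventsSpanImpl span) {
  try {
    // removeElement throws IllegalArgumentException if the span is not present.
    runningSpans.removeElement(span);
  } catch (IllegalArgumentException e) {
    // The span was never registered because its SpanStartEvent was dropped on
    // queue overflow; there is nothing to clean up, so the failure is ignored.
  }
}
```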

@saiya
Author

saiya commented Apr 16, 2019

Friendly ping @songy23. Let me know if I can help in any way with getting this PR merged.

@songy23
Contributor

songy23 commented Apr 16, 2019

Apologies for getting back late to this PR and thanks for investigating the issue.

Idea 1-B) Add an EventQueue.Entry#rejected() method that implements minimum cleanup code. Call it on the foreground thread when the queue is full, and call runningSpanStore.onEnd from SpanEndEvent#rejected() to prevent the memory leak.

This sounds like a reasonable approach to me. We may also extend EventQueue.Entry#rejected() to support measurement recording in Stats. @bogdandrutu @dinooliva @rghetia WDYT?

@songy23
Contributor

songy23 commented Apr 22, 2019

Hi @saiya, we would like to have a better understanding of the background of this issue. Could you describe how the thread blocking impacted your application?

(Depending on the impact we may want to take a different approach, e.g. doing some optimization to reduce the number of events put onto the queue.)

@songy23 songy23 added this to the Release 0.21.0 milestone Apr 22, 2019
@songy23 songy23 mentioned this pull request Apr 22, 2019
@saiya
Author

saiya commented Apr 23, 2019

we would like to have a better understanding of the background of this issue. Could you describe how the thread blocking impacted your application?

Hi @songy23, here is the situation I am facing:

  1. There are many products/microservices in my company
    • Over 100 systems
    • There are complex dependencies between API subsystems
    • Estimated at ten billion HTTP API calls/month, at least
  2. Some products have traffic spikes
    • These cause unpredictable spikes in the related API subsystems
  3. Some important & huge services are running in an on-premises environment
    • Elastic scaling / frequent failover is difficult, so we run large instances to handle the traffic
  4. All of our services are implemented in a concurrent way (multi-threaded or event-driven)
    • Blocking is not welcome (even "possible" blocking)
    • Blocking can take a service down, which hurts profit & customer loyalty
  5. We had used Zipkin, but maintaining the Zipkin infrastructure is hard work, so we are considering OpenCensus + Stackdriver Trace

There are some alternative ideas I can imagine for this issue, but they are not suitable for OpenTracing:

  1. Use a low sampling ratio (e.g. trace only 10%); see the configuration sketch after this list
    • Cannot capture random performance degradation
    • Does not help debug complex bugs, due to missing traces
  2. Use a RateLimiting sampler
    • I found the document today; it does not seem to be supported by opencensus-java yet
    • Because I want to avoid blocking as much as possible, I need to set a conservative rate limit.
      • That means a low sampling ratio, the same as idea 1.
  3. Implement dynamic sampling to sample slow requests only
    • Does not provide uniform sampling, which may cause problems for statistics
    • Does not solve queue overflow, because we need to start tracing before we know the request is slow
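
For reference, configuring a uniform probability sampler with the public opencensus-java trace-config API looks roughly like the sketch below; the 10% value and the SamplingConfigExample class name are only illustrative.

```java
import io.opencensus.trace.Tracing;
import io.opencensus.trace.config.TraceConfig;
import io.opencensus.trace.config.TraceParams;
import io.opencensus.trace.samplers.Samplers;

public final class SamplingConfigExample {
  public static void main(String[] args) {
    // Globally apply a uniform 10% probability sampler (alternative 1 above).
    TraceConfig traceConfig = Tracing.getTraceConfig();
    TraceParams activeParams = traceConfig.getActiveTraceParams();
    traceConfig.updateActiveTraceParams(
        activeParams.toBuilder()
            .setSampler(Samplers.probabilitySampler(0.10))
            .build());
  }
}
```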

This is why I am interested in this queue overflow issue.

(Depending on the impact we may want to take a different approach, e.g. doing some optimization to reduce the number of events put onto the queue.)

Because a large system (a large-scale microservice system) is chaotic, it is hard to determine a "safe" sampling ratio / rate limit. If there is a risk of blocking, I need to set a very low sampling ratio. Even if we reduce the events on the queue, the issue remains.

In my working experience, many large service providers (not only my current company) have similar needs. It is hard to introduce an additional blocking risk only for tracing/instrumentation. If the tracing SDK does not carry such a risk, that is awesome. This is why I made this PR.

But reducing events is also wonderful, because it prevents events from being dropped.

Thank you for engaging with this PR; feel free to ask if there is anything more I can share.

@songy23
Contributor

songy23 commented Apr 23, 2019

Thank you @saiya for adding the detailed background; now we have a much better understanding of your use case. There are some concerns from my teammates, and we also want to make sure this behavior is consistent across all OC languages, so I filed census-instrumentation/opencensus-specs#262 for a broader discussion.

@songy23 songy23 removed this from the Release 0.21.0 milestone May 2, 2019
@dmichel1 dmichel1 mentioned this pull request May 3, 2019
@saiya
Author

saiya commented May 16, 2019

Hi @songy23 !

Currently this PR drops events if the queue is full, but it calls runningSpanStore.onEnd(span) even on rejected(), so it does not cause a memory leak. I also added corner-case handling in a55ccbc to prevent a memory leak in a rare case.

I also rebased this PR onto the latest master branch (there has been a lot of refactoring of the related classes :-) ).

I know there is a discussion in census-instrumentation/opencensus-specs#262. Can such improvements be done in another branch/PR, or should we improve this PR? (That might make this PR huge.) This PR prevents the memory leak and logs a warning on queue overflow; I think this is a minimal viable solution.

@songy23
Contributor

songy23 commented May 16, 2019

Hi @saiya, did you get a chance to try the 0.22.0 release? The new release included the fix proposed by @bogdandrutu (census-instrumentation/opencensus-specs#262 (comment)) and should reduce the possibility of queue blocking. #1893 implemented a similar approach to dropping spans.

@bogdandrutu had some concerns about silently dropping events. Maybe the fact that the event queue got full and blocked is an indicator that the sampling rate is too high. In that case, having users explicitly reduce the sampling rate may be a better option. (IMO dropping queue events is actually equivalent to having a lower sampling rate.) Some quotes from the Dapper paper:

In practice, we have found that there is still an adequate amount of trace data for high-volume services when using a sampling rate as low as 1/1024.

The first production version of Dapper used a uniform sampling probability for all processes at Google, averaging one sampled trace for every 1024 candidates. This simple scheme was effective for our high-throughput online services since the vast majority of events of interest were still very likely to appear often enough to be captured.

Personally I think adaptive sampling (a.k.a. a rate-limited sampler) may be the silver bullet here. But until that happens, I think it's reasonable to make the queue non-blocking and log a warning to let users know they may need to use a lower sampling rate.

@saiya
Author

saiya commented May 17, 2019

Hi @songy23, thank you for the kind response.

IMO dropping queue events is actually equivalent to having a lower sampling rate

Yes, I think so. I also know that dropping events causes a non-uniform sampling ratio (it sometimes drops events and sometimes does not). If a user knows an adequate sampling ratio, the user should set it.

But the key point I want to emphasize is that it is hard to know the proper sampling ratio for some kinds of environments. In our case, some microservices amplify requests (they generate many outgoing API requests for each incoming request), and there are random user activity spikes. In such a situation, it is hard to know a good sampling ratio. If I set a very pessimistic sampling ratio (such as 0.001%) to avoid blocking, it can easily miss issues affecting some percentage of users.

Personally I think adaptive sampling (a.k.a rate-limited sampler) may be the silver bullet here.

Yes, adaptive sampling is an ideal solution for this matter. But, IMHO, I feel it will take some days to implement.

But until that happens, I think it's reasonable to make the queue non-blocking and log a warning to let users know they may need to use a lower sampling rate.

Absolutely agree. It helps us search for a good sampling ratio (rather than a very pessimistic one such as 0.001%). And we really need a non-blocking queue to avoid stopping our service in production.

I can try 0.22.0 on a non-critical service in our company, but to try it on our hottest services, non-blocking behavior is important.

@songy23
Contributor

songy23 commented May 17, 2019

But the key point I want to emphasize is that it is hard to know the proper sampling ratio for some kinds of environments. In our case, some microservices amplify requests (they generate many outgoing API requests for each incoming request), and there are random user activity spikes. In such a situation, it is hard to know a good sampling ratio. If I set a very pessimistic sampling ratio (such as 0.001%) to avoid blocking, it can easily miss issues affecting some percentage of users.

Yes, I was thinking of the same reason, and I agree that in this scenario it's better to drop and log rather than slow down the user's whole application.

@bogdandrutu WDYT? The only alternative I see here is to always recommend a low sampling rate, but as @saiya said, that has its own problems.

@songy23
Contributor

songy23 commented Jun 7, 2019

Hi @saiya, I chatted with @bogdandrutu about this PR today. Bogdan mentioned we'll provide the non-blocking option in the new OpenTelemetry client libraries. (As you may be aware, we're currently migrating the OpenCensus project to the merged OpenTelemetry project, and OpenCensus will be put into maintenance mode afterwards.) Given that:

  1. Do you need this feature urgently in OpenCensus, or can you wait for us to implement it in OpenTelemetry?
  2. If you need this feature now in OpenCensus, can you keep blocking as the default and make non-blocking configurable through manifest/env vars/flags? Changing the default from blocking to non-blocking is kind of a breaking change, and that is something we want to avoid.
  3. Have you had a chance to try out the v0.22.1 release? We have another customer facing a similar issue, and the latest release fixed it (Memory settings #1767).

@bputt

bputt commented Jun 7, 2019

@songy23 when do you expect there'll be a beta release for OpenTelemetry, or a stable release?

@songy23
Contributor

songy23 commented Jun 7, 2019

when do you expect there'll be a beta release for OpenTelemetry, or a stable release?

The current plan is to have the API ready and published by the end of this month. The SDK (a.k.a. the implementation) will be ready early next quarter.

@saiya
Author

saiya commented Jun 8, 2019

Hi @songy23, thank you for sharing the current status.

Bogdan mentioned we'll provide the non-blocking option in the new OpenTelemetry client libraries.

Very good news! Is there any specification / code / PR / issue in the OpenTelemetry repositories (e.g. documentation in https://github.com/open-telemetry/opentelemetry-specification) that clarifies the non-blocking option? Or can I send a PR to one of the OpenTelemetry repositories?

Do you need this feature urgently in OpenCensus, or can you wait for us implement it in OpenTelemetry?

Not urgent. I can wait some months (but I do not want to wait years). In my case, I have currently enabled OpenCensus in non-heavy-traffic systems only.

In my understanding, OpenTelemetry will provide an OpenCensus shim, so I can reuse my OpenCensus integration with OpenTelemetry.

The current plan is to have the API ready and published by the end of this month. The SDK (a.k.a. the implementation) will be ready early next quarter.

Good news! I hope OpenTelemetry will provide an SDK for Stackdriver Tracing next quarter.

Have you had a chance to try out the v0.22.1 release? We have another customer facing a similar issue, and the latest release fixed it (#1767)

I am using 0.22.1 only in non-heavy systems, and it is working well. I have not enabled OpenCensus in heavy systems yet.

@songy23
Contributor

songy23 commented Jun 8, 2019

Is there any specification / code / PR / issue in the OpenTelemetry repositories (e.g. documentation in https://github.com/open-telemetry/opentelemetry-specification) that clarifies the non-blocking option? Or can I send a PR to one of the OpenTelemetry repositories?

Not yet, since for now we're focusing on the API/data models. Please feel free to open an issue under https://github.com/open-telemetry/opentelemetry-specification.

I am using 0.22.1 only in non-heavy systems, and it is working well. I have not enabled OpenCensus in heavy systems yet.

Thanks for the info!

In my understanding, OpenTelemetry will provide an OpenCensus shim, so I can reuse my OpenCensus integration with OpenTelemetry.

Exactly. The goal is for OpenTelemetry to be 100% backwards-compatible with OpenCensus, so existing integrations will continue to work.

@saiya
Author

saiya commented Jun 12, 2019

I found a fresh issue, open-telemetry/opentelemetry-specification#94 (Add recommendations about blocking / queuing / resource consumption for language libraries). I hope OpenTelemetry explicitly supports non-blocking behavior.

@songy23 Thank you for supporting this PR. I think we can continue the discussion in open-telemetry/opentelemetry-specification#94; please comment on that issue if you have any thoughts. I feel it is okay to close this PR.

@songy23
Contributor

songy23 commented Jun 12, 2019

Thanks for all the contributions and discussions - your time and efforts are highly appreciated! Let's continue the discussion at open-telemetry/opentelemetry-specification#94 to make sure we have consistent behavior across all OpenTelemetry languages.

I hope OpenTelemetry explicitly supports non-blocking behavior.

Yes, that's on our radar. As of today we've finished the initial version of the APIs in OpenTelemetry-Java. The SDK implementation is in progress: https://github.com/open-telemetry/opentelemetry-java/labels/sdk.
