This repository has been archived by the owner on Dec 23, 2023. It is now read-only.

Prevent blocking on queue overflow (#1809) #1837

Closed

Conversation

saiya

@saiya saiya commented Apr 9, 2019

#1809

I implemented queue overflow handling to prevent blocking the foreground (application) thread.

It drops the event and logs a WARN message when the queue is full.
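
To make the intended behavior concrete, below is a minimal sketch of a drop-and-warn enqueue path, assuming the LMAX Disruptor RingBuffer that backs DisruptorEventQueue; DisruptorEvent and NonBlockingEnqueuer are simplified stand-ins for the real internal classes, not the exact diff in this PR.

```java
import com.lmax.disruptor.InsufficientCapacityException;
import com.lmax.disruptor.RingBuffer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Simplified stand-in: the real DisruptorEventQueue wraps each EventQueue.Entry in an
// internal event object; DisruptorEvent here is a minimal illustrative version.
final class DisruptorEvent {
  private Object entry;

  void setEntry(Object entry) {
    this.entry = entry;
  }
}

final class NonBlockingEnqueuer {
  private static final Logger logger =
      Logger.getLogger(NonBlockingEnqueuer.class.getName());

  private final RingBuffer<DisruptorEvent> ringBuffer;

  NonBlockingEnqueuer(RingBuffer<DisruptorEvent> ringBuffer) {
    this.ringBuffer = ringBuffer;
  }

  void enqueue(Object entry) {
    long sequence;
    try {
      // tryNext() claims a slot without blocking; it throws instead of waiting
      // when the ring buffer is full, so the application thread is never parked.
      sequence = ringBuffer.tryNext();
    } catch (InsufficientCapacityException e) {
      logger.log(Level.WARNING, "Queue is full, dropping the event: " + entry);
      return;
    }
    try {
      ringBuffer.get(sequence).setEntry(entry);
    } finally {
      ringBuffer.publish(sequence);
    }
  }
}
```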

@saiya saiya requested review from dinooliva, rghetia, songy23 and a team as code owners April 9, 2019 02:50
@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@saiya
Author

saiya commented Apr 9, 2019

I signed it!

@googlebot

CLAs look good, thanks!


Contributor

@songy23 songy23 left a comment


LGTM

@@ -131,7 +136,19 @@ private static DisruptorEventQueue create() {
new DisruptorEnqueuer() {
@Override
public void enqueue(Entry entry) {

We may consider tracing this method too (not suggesting doing it in this PR though).

@songy23
Contributor

songy23 commented Apr 10, 2019

We need to evaluate the potential memory leak issue with this PR. Marked as DO NOT MERGE for now.

@saiya
Author

saiya commented Apr 11, 2019

Thank you for checking this PR and for the quick response!

I organized my thoughts about your concern and implemented some improvements to mitigate the issue in 43d7647 & 8b886b0. Could you take a look? @songy23

potential memory leak issue

I checked the subclasses of EventQueue.Entry. The problem is SpanEndEvent#process, because it calls runningSpanStore.onEnd to remove the span from runningSpans. If we skip it, we get a memory leak through runningSpans. In my understanding, this is the cause of your concern.

My ideas to mitigate the potential runningSpans leak:

  • Idea 1-A) Call EventQueue.Entry#process() on the foreground thread when the queue is full. This seems bad because it may cause lock contention and severe performance issues.
  • Idea 1-B) Add an EventQueue.Entry#rejected() method that implements minimum cleanup code. Call it on the foreground thread when the queue is full, and call runningSpanStore.onEnd from SpanEndEvent#rejected() to prevent the memory leak.
  • Idea 2) Change InProcessRunningSpanStoreImpl to use WeakReference to prevent the leak. This may increase memory usage, because it allocates additional objects to hold references to the spans.

I feel Idea 1-B is a feasible and effective approach, so I implemented it in 43d7647 (a sketch follows below). It also prevents a memory leak when events arrive after queue shutdown (the current code has that potential issue).

(InProcessSampledSpanStoreImpl.UnregisterSpanNameEvent also seems to have a similar potential issue, but the Map<String, PerSpanNameSamples> inside it should not grow huge because its keys are span names, not the spans themselves. So I don't think it is a serious problem.)
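
Here is a minimal sketch of Idea 1-B; the Entry and RunningSpanStore interfaces below are simplified stand-ins for the real opencensus-impl types, and the constructor shape is an assumption meant only to show where rejected() fits.

```java
import io.opencensus.trace.Span;

// Simplified stand-in for the running-span store used by SpanEndEvent.
interface RunningSpanStore {
  void onEnd(Span span);
}

// Queue entry contract with the proposed rejection hook.
interface Entry {
  void process();   // normal path: runs on the Disruptor worker thread
  void rejected();  // queue-full path: runs on the application (foreground) thread
}

final class SpanEndEvent implements Entry {
  private final Span span;
  private final RunningSpanStore runningSpanStore;

  SpanEndEvent(Span span, RunningSpanStore runningSpanStore) {
    this.span = span;
    this.runningSpanStore = runningSpanStore;
  }

  @Override
  public void process() {
    // Removes the span from runningSpans and continues with export bookkeeping.
    runningSpanStore.onEnd(span);
  }

  @Override
  public void rejected() {
    // Minimum cleanup when the event is dropped: still remove the span from the
    // running-span store so runningSpans cannot grow without bound.
    runningSpanStore.onEnd(span);
  }
}
```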

Skipping SpanStartEvent

When we drop a SpanStartEvent, RunningSpanStoreImpl#onStart(span) is skipped, so RunningSpanStoreImpl might receive onEnd(span) without a preceding onStart(span).

The only current implementation, InProcessRunningSpanStoreImpl, raises IllegalArgumentException in runningSpans.removeElement. In that case, doing nothing is the correct behavior in my understanding, so I implemented exception handling in 8b886b0 (a sketch follows below).
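
A rough sketch of that exception handling (illustrative only; the RecordEventsSpanImpl parameter type is an assumed name, while runningSpans.removeElement follows the method mentioned above):

```java
// Illustrative only: the enclosing class stands in for InProcessRunningSpanStoreImpl,
// and runningSpans stands in for its internal collection of running spans.
@Override
public void onEnd(RecordEventsSpanImpl span) {
  try {
    // removeElement throws IllegalArgumentException if the span is not present.
    runningSpans.removeElement(span);
  } catch (IllegalArgumentException e) {
    // The span was never registered because its SpanStartEvent was dropped on
    // queue overflow; there is nothing to clean up, so the failure is ignored.
  }
}
```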

@saiya
Author

saiya commented Apr 16, 2019

Friendly ping @songy23. Let me know if I can help in any way with getting this PR merged.

@songy23
Contributor

songy23 commented Apr 16, 2019

Apologies for getting back late to this PR and thanks for investigating the issue.

Idea 1-B) Add an EventQueue.Entry#rejected() method that implements minimum cleanup code. Call it on the foreground thread when the queue is full, and call runningSpanStore.onEnd from SpanEndEvent#rejected() to prevent the memory leak.

This sounds like a reasonable approach to me. We may also extend EventQueue.Entry#rejected() to support measurement recording in Stats. @bogdandrutu @dinooliva @rghetia WDYT?

@songy23
Contributor

songy23 commented Apr 22, 2019

Hi @saiya, we would like to have a better understanding of the background of this issue. Could you describe how the thread blocking impacted your application?

(Depending on the impact we may want to take a different approach, e.g. doing some optimization to reduce the number of events put onto the queue.)

@songy23 songy23 added this to the Release 0.21.0 milestone Apr 22, 2019
@songy23 songy23 mentioned this pull request Apr 22, 2019
@saiya
Author

saiya commented Apr 23, 2019

we would like to have a better understanding of the background of this issue. Could you describe how the thread blocking impacted your application?

Hi @songy23, here is the situation I am facing:

  1. There are many products/microservices in my company
    • Over 100 systems
    • There are complex dependencies between API subsystems
    • Estimated at ten billion HTTP API calls/month, at least
  2. Some products have traffic spikes
    • These cause unpredictable spikes in the related API subsystems
  3. Some important & huge services are running in an on-premises environment
    • Elastic scaling / frequent failover is difficult, so we run large instances to handle the traffic
  4. All of our services are implemented in a concurrent way (multi-threaded or event-driven)
    • Blocking is not welcome (even "possible" blocking)
    • Blocking can take a service down, which hurts profit & customer loyalty
  5. We had used Zipkin, but maintaining the Zipkin infrastructure is hard work, so we are considering OpenCensus + Stackdriver Trace

There are some alternative ideas I can imagine for this issue, but they are not suitable for OpenTracing:

  1. Use a low sampling ratio (e.g. trace only 10%); see the configuration sketch after this list
    • Cannot capture random performance degradation
    • Does not help debug complex bugs, due to missing traces
  2. Use a RateLimiting sampler
    • I found the document today; it does not seem to be supported by opencensus-java yet
    • Because I want to avoid blocking as much as possible, I need to set a conservative rate limit.
      • That means a low sampling ratio, the same as idea 1.
  3. Implement dynamic sampling to sample slow requests only
    • Does not provide uniform sampling, which may cause problems for statistics
    • Does not solve queue overflow, because we need to start tracing before we know the request is slow
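
For reference, configuring a uniform probability sampler with the public opencensus-java trace-config API looks roughly like the sketch below; the 10% value and the SamplingConfigExample class name are only illustrative.

```java
import io.opencensus.trace.Tracing;
import io.opencensus.trace.config.TraceConfig;
import io.opencensus.trace.config.TraceParams;
import io.opencensus.trace.samplers.Samplers;

public final class SamplingConfigExample {
  public static void main(String[] args) {
    // Globally apply a uniform 10% probability sampler (alternative 1 above).
    TraceConfig traceConfig = Tracing.getTraceConfig();
    TraceParams activeParams = traceConfig.getActiveTraceParams();
    traceConfig.updateActiveTraceParams(
        activeParams.toBuilder()
            .setSampler(Samplers.probabilitySampler(0.10))
            .build());
  }
}
```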

This is why I am interested in this queue overflow issue.

(Depending on the impact we may want to take a different approach, e.g. doing some optimization to reduce the number of events put onto the queue.)

Because a large system (a large-scale microservice system) is chaotic, it is hard to determine a "safe" sampling ratio / rate limit. If there is a risk of blocking, I need to set a very low sampling ratio. Even if we reduce the events on the queue, the issue remains.

In my working experience, many large service providers (not only my current company) have similar needs. It is hard to introduce an additional blocking risk only for tracing/instrumentation. If the tracing SDK does not carry such a risk, that is awesome. This is why I made this PR.

But reducing events is also wonderful, because it prevents events from being dropped.

Thank you for engaging with this PR; feel free to ask if there is anything more I can share.

@songy23
Contributor

songy23 commented Apr 23, 2019

Thank you @saiya for adding the detailed background; now we have a much better understanding of your use case. There are some concerns from my teammates, and we also want to make sure this behavior is consistent across all OC languages, so I filed census-instrumentation/opencensus-specs#262 for a broader discussion.

@songy23 songy23 removed this from the Release 0.21.0 milestone May 2, 2019
@dmichel1 dmichel1 mentioned this pull request May 3, 2019
@saiya
Author

saiya commented May 16, 2019

Hi @songy23 !

Currently this PR drops events if the queue is full, but it calls runningSpanStore.onEnd(span) even on rejected(), so it does not cause a memory leak. I also added corner-case handling in a55ccbc to prevent a memory leak in a rare case.

I also rebased this PR onto the latest master branch (there has been a lot of refactoring of the related classes :-) ).

I know there is a discussion in census-instrumentation/opencensus-specs#262. Can such improvements be done in another branch/PR, or should we improve this PR? (That might make this PR huge.) This PR prevents the memory leak and logs a warning on queue overflow; I think this is a minimal viable solution.

@songy23
Contributor

songy23 commented May 16, 2019

Hi @saiya, did you get a chance to try the 0.22.0 release? The new release included the fix proposed by @bogdandrutu (census-instrumentation/opencensus-specs#262 (comment)) and should reduce the possibility of queue blocking. #1893 implemented a similar approach to dropping spans.

@bogdandrutu had some concerns about silently dropping events. Maybe the fact that the event queue got full and blocked is an indicator that the sampling rate is too high. In that case, having users explicitly reduce the sampling rate may be a better option. (IMO dropping queue events is actually equivalent to having a lower sampling rate.) Some quotes from the Dapper paper:

In practice, we have found that there is still an adequate amount of trace data for high-volume services when using a sampling rate as low as 1/1024.

The first production version of Dapper used a uniform sampling probability for all processes at Google, averaging one sampled trace for every 1024 candidates. This simple scheme was effective for our high-throughput online services since the vast majority of events of interest were still very likely to appear often enough to be captured.

Personally I think adaptive sampling (a.k.a. a rate-limited sampler) may be the silver bullet here. But until that happens, I think it's reasonable to make the queue non-blocking and log a warning to let users know they may need to use a lower sampling rate.

@saiya
Author

saiya commented May 17, 2019

Hi @songy23, thank you for the kind response.

IMO dropping queue events is actually equivalent to having a lower sampling rate

Yes, I think so. I also know that dropping events causes a non-uniform sampling ratio (it sometimes drops events and sometimes does not). If a user knows an adequate sampling ratio, the user should set it.

But the key point I want to emphasize is that it is hard to know the proper sampling ratio for some kinds of environments. In our case, some microservices amplify requests (they generate many outgoing API requests for each incoming request), and there are random user activity spikes. In such a situation, it is hard to know a good sampling ratio. If I set a very pessimistic sampling ratio (such as 0.001%) to avoid blocking, it can easily miss issues affecting some percentage of users.

Personally I think adaptive sampling (a.k.a rate-limited sampler) may be the silver bullet here.

Yes, adaptive sampling is an ideal solution for this matter. But, IMHO, I feel it will take some days to implement.

But until that happens, I think it's reasonable to make the queue non-blocking and log a warning to let users know they may need to use a lower sampling rate.

Absolutely agree. It helps us search for a good sampling ratio (rather than a very pessimistic one such as 0.001%). And we really need a non-blocking queue to avoid stopping our service in production.

I can try 0.22.0 on a non-critical service in our company, but to try it on our hottest services, non-blocking behavior is important.

@songy23
Contributor

songy23 commented May 17, 2019

But the key point I want to emphasize is that it is hard to know the proper sampling ratio for some kinds of environments. In our case, some microservices amplify requests (they generate many outgoing API requests for each incoming request), and there are random user activity spikes. In such a situation, it is hard to know a good sampling ratio. If I set a very pessimistic sampling ratio (such as 0.001%) to avoid blocking, it can easily miss issues affecting some percentage of users.

Yes, I was thinking of the same reason, and I agree that in this scenario it's better to drop and log rather than slow down the user's whole application.

@bogdandrutu WDYT? The only alternative I see here is to always recommend a low sampling rate, but as @saiya said, that has its own problems.

@songy23
Contributor

songy23 commented Jun 7, 2019

Hi @saiya, I chatted with @bogdandrutu about this PR today. Bogdan mentioned we'll provide the non-blocking option in the new OpenTelemetry client libraries. (As you may be aware, we're currently migrating the OpenCensus project to the merged OpenTelemetry project, and OpenCensus will be put into maintenance mode afterwards.) Given that:

  1. Do you need this feature urgently in OpenCensus, or can you wait for us to implement it in OpenTelemetry?
  2. If you need this feature now in OpenCensus, can you keep blocking as the default and make non-blocking configurable through manifest/env vars/flags? Changing the default from blocking to non-blocking is kind of a breaking change, and that is something we want to avoid.
  3. Have you had a chance to try out the v0.22.1 release? We have another customer facing a similar issue, and the latest release fixed it (Memory settings #1767).

@bputt

bputt commented Jun 7, 2019

@songy23 when do you expect there'll be a beta release for OpenTelemetry, or a stable release?

@songy23
Contributor

songy23 commented Jun 7, 2019

when do you expect there'll be a beta release for OpenTelemetry, or a stable release?

The current plan is to have the API ready and published by the end of this month. The SDK (a.k.a. the implementation) will be ready early next quarter.

@saiya
Author

saiya commented Jun 8, 2019

Hi @songy23, thank you for sharing the current status.

Bogdan mentioned we'll provide the non-blocking option in the new OpenTelemetry client libraries.

Very good news! Is there any specification / code / PR / issue in the OpenTelemetry repositories (e.g. documentation in https://github.com/open-telemetry/opentelemetry-specification) that clarifies the non-blocking option? Or can I send a PR to one of the OpenTelemetry repositories?

Do you need this feature urgently in OpenCensus, or can you wait for us implement it in OpenTelemetry?

Not urgent. I can wait some months (but I do not want to wait years). In my case, I have currently enabled OpenCensus in non-heavy-traffic systems only.

In my understanding, OpenTelemetry will provide an OpenCensus shim, so I can reuse my OpenCensus integration with OpenTelemetry.

The current plan is to have the API ready and published by the end of this month. The SDK (a.k.a. the implementation) will be ready early next quarter.

Good news! I hope OpenTelemetry will provide an SDK for Stackdriver Tracing next quarter.

Have you had a chance to try out the v0.22.1 release? We have another customer facing a similar issue, and the latest release fixed it (#1767)

I am using 0.22.1 only in non-heavy systems, and it is working well. I have not enabled OpenCensus in heavy systems yet.

@songy23
Contributor

songy23 commented Jun 8, 2019

Is there any specification / code / PR / issue in the OpenTelemetry repositories (e.g. documentation in https://github.com/open-telemetry/opentelemetry-specification) that clarifies the non-blocking option? Or can I send a PR to one of the OpenTelemetry repositories?

Not yet, since for now we're focusing on the API/data models. Please feel free to open an issue under https://github.com/open-telemetry/opentelemetry-specification.

I am using 0.22.1 only in non-heavy systems, and it is working well. I have not enabled OpenCensus in heavy systems yet.

Thanks for the info!

In my understanding, OpenTelemetry will provide an OpenCensus shim, so I can reuse my OpenCensus integration with OpenTelemetry.

Exactly. The goal is for OpenTelemetry to be 100% backwards-compatible with OpenCensus, so existing integrations will continue to work.

@saiya
Author

saiya commented Jun 12, 2019

I found a fresh issue, open-telemetry/opentelemetry-specification#94 (Add recommendations about blocking / queuing / resource consumption for language libraries). I hope OpenTelemetry explicitly supports non-blocking behavior.

@songy23 Thank you for supporting this PR. I think we can continue the discussion in open-telemetry/opentelemetry-specification#94; please comment on that issue if you have any thoughts. I feel it is okay to close this PR.

@songy23
Contributor

songy23 commented Jun 12, 2019

Thanks for all the contributions and discussions - your time and efforts are highly appreciated! Let's continue the discussion at open-telemetry/opentelemetry-specification#94 to make sure we have consistent behavior across all OpenTelemetry languages.

I hope OpenTelemetry explicitly supports non-blocking behavior.

Yes, that's on our radar. As of today we've finished the initial version of the APIs in OpenTelemetry-Java. The SDK implementation is in progress: https://github.com/open-telemetry/opentelemetry-java/labels/sdk.
