
Memory settings #1767

Closed
bputt opened this issue Feb 21, 2019 · 8 comments · Fixed by #1893

Comments

@bputt

bputt commented Feb 21, 2019

When exporting trace information:

  1. What's the high/low watermark for heap usage, if any?
  2. If the heap usage hits a certain threshold, does the exporter become blocking at that point?
  3. Is there a way to configure high watermarks for the client?

We were able to verify that heap usage increased steadily as we added backpressure to our ingest service. This could become a big issue if we had real latency problems or if our ingest service went down.

@dmichel1
Contributor

dmichel1 commented May 2, 2019

I can confirm this behavior: when the ingest service becomes overloaded or slow, the heap steadily increases and eventually causes memory pressure, leading to an OOM event in the application.

I haven't found the same behavior when the ingest service is completely down, but perhaps I need to test that case more thoroughly.

I'm attaching a few screenshots from a heap dump I was able to capture...

[Screenshots: Screen Shot 2019-05-02 at 10 36 09 AM, Screen Shot 2019-05-02 at 10 36 21 AM]

This issue seems different from #1813, where spans that aren't closed properly will cause an OOM event (?)

@dmichel1
Contributor

dmichel1 commented May 3, 2019

It seems census-instrumentation/opencensus-specs#262 and #1837 are related to this issue.

@bogdandrutu
Contributor

@dmichel1 by any chance, could you also dump the thread stack traces?

@bogdandrutu
Contributor

Based on the attached picture, the memory is held by the SpanExporterImpl thread.

The current design is this:
ThreadExecutesRequest -> DisruptorThread -> SpanExporterImplThread -> Exporter (usually no thread here, executed from the span exporter impl thread)
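To make that concrete, here is a minimal sketch of that flow (class and method names are illustrative, not the actual opencensus-java code): a single worker thread drains the ended spans and calls every registered exporter synchronously on that same thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch of the SpanExporterImplThread loop described above;
// names are hypothetical, not the real opencensus-java internals.
final class ExporterWorkerSketch implements Runnable {

  /** A registered exporter, e.g. Stackdriver or Zipkin; export() is usually a blocking RPC. */
  interface Handler {
    void export(List<Object /* SpanData */> spans);
  }

  // Ended spans handed over by the Disruptor thread.
  private final List<Object> pending = new ArrayList<>();
  private final List<Handler> handlers = new CopyOnWriteArrayList<>();

  /** Called (conceptually) from the Disruptor thread when a span ends. */
  void addSpan(Object spanData) {
    synchronized (pending) {
      pending.add(spanData);
      pending.notifyAll();
    }
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      List<Object> batch;
      synchronized (pending) {
        while (pending.isEmpty()) {
          try {
            pending.wait();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
        batch = new ArrayList<>(pending);
        pending.clear();
      }
      // If any handler blocks here (backend unreachable, quota errors, slow push),
      // this single thread stalls and the pending list keeps growing unbounded.
      for (Handler handler : handlers) {
        handler.export(batch);
      }
    }
  }
}
```

Because everything funnels through that one thread, any blocking call inside an exporter directly backs up the whole pipeline.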

Keep in mind that spans are passed to the SpanExporterImplThread only when they end, so this is clearly not related to #1813.

For me this does not correlate with the blocking-queue issue. There may still be a correlation if the Disruptor thread bursts spans into the SpanExporterImplThread, but that is probably not the issue.

I think the issue is that one of the exporters blocks (for this explanation I will pretend that you use exporterX). Let's assume the backend for exporterX is not reachable for a few seconds: the SpanExporterImplThread will block, so the spans waiting to be exported grow unbounded during that time. This can also happen if pushing to exporterX is simply slow (slower than the rate at which spans are produced). I think we need to do a few things here:

  1. Set a deadline for all exporter requests to ensure that we don't block for a long time, and make sure that they don't have retry logic (double-check whether Stackdriver, for example, has retry logic enabled). If Stackdriver is used and a quota error is hit (which is possible, since the volume of spans suggested by this issue is high), the problem gets very bad. See the deadline sketch after this list.
  2. Make all exporter requests async, and analyze whether making them multi-threaded will help.
  3. Consider dropping spans when the exporter cannot keep up with the number of spans produced by the application.
  4. Consider suggesting that users run the oc-agent to deal with retries and maybe a higher deadline.
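
For item (1), a minimal sketch of what a per-request deadline could look like, assuming a generic Runnable export call rather than the real exporter API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: bound each exporter call with a deadline so that a slow or
// unreachable backend cannot stall the export worker indefinitely.
final class DeadlineExportSketch {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  void exportWithDeadline(Runnable exportCall, long deadlineMillis) {
    Future<?> future = executor.submit(exportCall);
    try {
      future.get(deadlineMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Deadline exceeded: abandon this batch instead of blocking the worker,
      // and do not retry here (retries are better handled by a backend/agent).
      future.cancel(true);
    } catch (Exception e) {
      // Export failed; drop the batch rather than blocking or retrying.
    }
  }
}
```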

@bputt
Author

bputt commented May 4, 2019

Can we make the async operation non-blocking so that it's fire-and-forget? We don't want the endpoint that accepts the spans to slow down the application.

Also, it would be interesting to offer an option to batch: rather than emitting one array of spans, we could send multiple arrays with a newline delimiter, or allow configuring the batch size to reduce network overhead.
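
As a rough sketch of how that could look (illustrative only, not a proposal for the actual library API): the application side does a non-blocking offer into a bounded queue and drops the span when the queue is full (fire-and-forget), while a background thread drains batches of a configurable maximum size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of a fire-and-forget, batching export path; the names and
// numbers here are illustrative, not an actual opencensus-java API.
final class FireAndForgetExportSketch {
  private final BlockingQueue<Object /* SpanData */> queue;
  private final int maxBatchSize;

  FireAndForgetExportSketch(int capacity, int maxBatchSize) {
    this.queue = new ArrayBlockingQueue<>(capacity);
    this.maxBatchSize = maxBatchSize;
  }

  /** Called from application threads: never blocks; drops the span if the queue is full. */
  boolean enqueue(Object spanData) {
    return queue.offer(spanData);
  }

  /** Runs on a dedicated background thread; batches reduce per-request network overhead. */
  void exportLoop(Consumer<List<Object>> exporter) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      // Wait briefly for the first span, then grab whatever else is already queued,
      // up to the configured maximum batch size.
      Object first = queue.poll(100, TimeUnit.MILLISECONDS);
      if (first == null) {
        continue;
      }
      List<Object> batch = new ArrayList<>(maxBatchSize);
      batch.add(first);
      queue.drainTo(batch, maxBatchSize - 1);
      exporter.accept(batch);
    }
  }
}
```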

@dmichel1
Contributor

dmichel1 commented May 6, 2019

@bogdandrutu

Here's the stack trace from the exporter thread. I got it from the hprof dump of the JVM that crashed.

ExportComponent.ServiceExporterThread-0
  at io.opencensus.implcore.trace.RecordEventsSpanImpl.toSpanData()Lio/opencensus/trace/export/SpanData; (RecordEventsSpanImpl.java:252)
  at io.opencensus.implcore.trace.export.SpanExporterImpl$Worker.fromSpanImplToSpanData(Ljava/util/List;)Ljava/util/List; (SpanExporterImpl.java:165)
  at io.opencensus.implcore.trace.export.SpanExporterImpl$Worker.run()V (SpanExporterImpl.java:194)
  at java.lang.Thread.run()V (Thread.java:834)

Set a deadline for all exporter requests to ensure that we don't block for a long time, and make sure that they don't have retry logic (double-check whether Stackdriver, for example, has retry logic enabled). If Stackdriver is used and a quota error is hit (which is possible, since the volume of spans suggested by this issue is high), the problem gets very bad.

That is an interesting example with the Stackdriver quota issue. We encountered exactly this issue a few months ago without reaching a conclusion on what went wrong.

Make all exporter requests async, and analyze whether making them multi-threaded will help.

+1 to making this async and maybe even fire-and-forget. That is similar to how we publish metrics, although in the metrics use case dropping some data usually only costs resolution, whereas here it means incomplete trace data.

Consider dropping spans when the exporter cannot keep up with the number of spans produced by the application.

+1 here too

Consider suggesting that users run the oc-agent to deal with retries and maybe a higher deadline.

I would prefer not to deploy a sidecar for tracing and to keep everything internal to the library or the vendor exporters.

@bputt I've seen exporter implementations handle batching of the span data. I believe the Stackdriver exporter does this already. The LightStep exporter does something similar as well: https://docs.lightstep.com/docs/lightstep-client-buffer.

@bogdandrutu
Contributor

As a first step to make things more deterministic, we will schedule batches of a maximum size to the exporters:
#1882
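
Roughly what that looks like, as an illustrative sketch only (the actual change is in the linked PR): whatever has accumulated is split into batches of at most a maximum size before being handed to the exporters.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split an accumulated backlog into batches of at most
// maxBatchSize so a single oversized export request is never scheduled.
final class MaxSizeBatcherSketch {
  static <T> List<List<T>> partition(List<T> pending, int maxBatchSize) {
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < pending.size(); i += maxBatchSize) {
      batches.add(new ArrayList<>(
          pending.subList(i, Math.min(i + maxBatchSize, pending.size()))));
    }
    return batches;
  }
}
```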

@bogdandrutu
Contributor

@dmichel1: the oc-agent can be deployed as a workload (service) in your cluster, so that all applications initially send data to it and the oc-agent forwards everything to the backend. This way we can configure queuing, retries, etc. to deal with the backend being unavailable for some time.

This is definitely not a requirement (more like a suggestion), but it can help if the backend is not in the same cluster or cloud region.
