
Memory settings #1767

Closed
bputt opened this issue Feb 21, 2019 · 8 comments · Fixed by #1893

Comments

@bputt

bputt commented Feb 21, 2019

When exporting trace information:

  1. What's the high/low watermark for heap usage, if any?
  2. If the heap usage hits a certain threshold, does the exporter become blocking at that point?
  3. Is there a way to configure high watermarks for the client?

We were able to verify that heap usage increased steadily as we added backpressure to our ingest service. This could become a big issue if we had real latency problems or if our ingest service went down.

@dmichel1
Contributor

dmichel1 commented May 2, 2019

I can confirm this behavior: when the ingest service becomes overloaded or slow, the heap steadily increases and eventually causes memory pressure, leading to an OOM event in the application.

I haven't found the same behavior when the ingest service is completely down, but perhaps I need to test that case more thoroughly.

I'm attaching a few screenshots from a heap dump I was able to capture...

[Screenshots: Screen Shot 2019-05-02 at 10 36 09 AM, Screen Shot 2019-05-02 at 10 36 21 AM]

This issue seems different from #1813, where spans that aren't closed properly will cause an OOM event (?)

@dmichel1
Contributor

dmichel1 commented May 3, 2019

It seems census-instrumentation/opencensus-specs#262 and #1837 are related to this issue.

@bogdandrutu
Contributor

@dmichel1 by any chance, could you also dump the thread stack traces?

@bogdandrutu
Contributor

Based on the attached picture, the memory is held by the SpanExporterImpl thread.

The current design is this:
ThreadExecutesRequest -> DisruptorThread -> SpanExporterImplThread -> Exporter (usually no thread here, executed from the span exporter impl thread)
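To make that concrete, here is a minimal sketch of that flow (class and method names are illustrative, not the actual opencensus-java code): a single worker thread drains the ended spans and calls every registered exporter synchronously on that same thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch of the SpanExporterImplThread loop described above;
// names are hypothetical, not the real opencensus-java internals.
final class ExporterWorkerSketch implements Runnable {

  /** A registered exporter, e.g. Stackdriver or Zipkin; export() is usually a blocking RPC. */
  interface Handler {
    void export(List<Object /* SpanData */> spans);
  }

  // Ended spans handed over by the Disruptor thread.
  private final List<Object> pending = new ArrayList<>();
  private final List<Handler> handlers = new CopyOnWriteArrayList<>();

  /** Called (conceptually) from the Disruptor thread when a span ends. */
  void addSpan(Object spanData) {
    synchronized (pending) {
      pending.add(spanData);
      pending.notifyAll();
    }
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      List<Object> batch;
      synchronized (pending) {
        while (pending.isEmpty()) {
          try {
            pending.wait();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
        batch = new ArrayList<>(pending);
        pending.clear();
      }
      // If any handler blocks here (backend unreachable, quota errors, slow push),
      // this single thread stalls and the pending list keeps growing unbounded.
      for (Handler handler : handlers) {
        handler.export(batch);
      }
    }
  }
}
```

Because everything funnels through that one thread, any blocking call inside an exporter directly backs up the whole pipeline.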

Keep in mind that spans are passed to the SpanExporterImplThread only when they end, so this is clearly not related to #1813.

For me this does not correlate with the blocking-queue issue. There may still be a correlation if the Disruptor thread bursts spans into the SpanExporterImplThread, but that is probably not the issue.

I think the issue is that one of the exporters blocks (for this explanation I will pretend that you use exporterX). Let's assume the backend for exporterX is not reachable for a few seconds: the SpanExporterImplThread will block, so the spans waiting to be exported grow unbounded during that time. This can also happen if pushing to exporterX is simply slow (slower than the rate at which spans are produced). I think we need to do a few things here:

  1. Set a deadline for all exporter requests to ensure that we don't block for a long time, and make sure that they don't have retry logic (double-check whether Stackdriver, for example, has retry logic enabled). If Stackdriver is used and a quota error is hit (which is possible, since the volume of spans suggested by this issue is high), the problem gets very bad. See the deadline sketch after this list.
  2. Make all exporter requests async, and analyze whether making them multi-threaded will help.
  3. Consider dropping spans when the exporter cannot keep up with the number of spans produced by the application.
  4. Consider suggesting that users run the oc-agent to deal with retries and maybe a higher deadline.
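
For item (1), a minimal sketch of what a per-request deadline could look like, assuming a generic Runnable export call rather than the real exporter API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: bound each exporter call with a deadline so that a slow or
// unreachable backend cannot stall the export worker indefinitely.
final class DeadlineExportSketch {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  void exportWithDeadline(Runnable exportCall, long deadlineMillis) {
    Future<?> future = executor.submit(exportCall);
    try {
      future.get(deadlineMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Deadline exceeded: abandon this batch instead of blocking the worker,
      // and do not retry here (retries are better handled by a backend/agent).
      future.cancel(true);
    } catch (Exception e) {
      // Export failed; drop the batch rather than blocking or retrying.
    }
  }
}
```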

@bputt
Author

bputt commented May 4, 2019

Can we make the async operation non-blocking so that it's fire-and-forget? We don't want the endpoint that accepts the spans to slow down the application.

Also, it would be interesting to offer an option to batch: rather than emitting one array of spans, we could send multiple arrays with a newline delimiter, or allow configuring the batch size to reduce network overhead.
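
As a rough sketch of how that could look (illustrative only, not a proposal for the actual library API): the application side does a non-blocking offer into a bounded queue and drops the span when the queue is full (fire-and-forget), while a background thread drains batches of a configurable maximum size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of a fire-and-forget, batching export path; the names and
// numbers here are illustrative, not an actual opencensus-java API.
final class FireAndForgetExportSketch {
  private final BlockingQueue<Object /* SpanData */> queue;
  private final int maxBatchSize;

  FireAndForgetExportSketch(int capacity, int maxBatchSize) {
    this.queue = new ArrayBlockingQueue<>(capacity);
    this.maxBatchSize = maxBatchSize;
  }

  /** Called from application threads: never blocks; drops the span if the queue is full. */
  boolean enqueue(Object spanData) {
    return queue.offer(spanData);
  }

  /** Runs on a dedicated background thread; batches reduce per-request network overhead. */
  void exportLoop(Consumer<List<Object>> exporter) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      // Wait briefly for the first span, then grab whatever else is already queued,
      // up to the configured maximum batch size.
      Object first = queue.poll(100, TimeUnit.MILLISECONDS);
      if (first == null) {
        continue;
      }
      List<Object> batch = new ArrayList<>(maxBatchSize);
      batch.add(first);
      queue.drainTo(batch, maxBatchSize - 1);
      exporter.accept(batch);
    }
  }
}
```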

@dmichel1
Contributor

dmichel1 commented May 6, 2019

@bogdandrutu

Here's the stack trace from the exporter thread. I got it from the hprof dump of the JVM that crashed.

ExportComponent.ServiceExporterThread-0
  at io.opencensus.implcore.trace.RecordEventsSpanImpl.toSpanData()Lio/opencensus/trace/export/SpanData; (RecordEventsSpanImpl.java:252)
  at io.opencensus.implcore.trace.export.SpanExporterImpl$Worker.fromSpanImplToSpanData(Ljava/util/List;)Ljava/util/List; (SpanExporterImpl.java:165)
  at io.opencensus.implcore.trace.export.SpanExporterImpl$Worker.run()V (SpanExporterImpl.java:194)
  at java.lang.Thread.run()V (Thread.java:834)

Set a deadline for all exporter requests to ensure that we don't block for a long time, and make sure that they don't have retry logic (double-check whether Stackdriver, for example, has retry logic enabled). If Stackdriver is used and a quota error is hit (which is possible, since the volume of spans suggested by this issue is high), the problem gets very bad.

That is an interesting example with the Stackdriver quota issue. We encountered exactly this issue a few months ago without reaching a conclusion on what went wrong.

Make all exporter requests async, and analyze whether making them multi-threaded will help.

+1 to making this async and maybe even fire-and-forget. That is similar to how we publish metrics, although in the metrics use case dropping some data usually only costs resolution, whereas here it means incomplete trace data.

Consider dropping spans when the exporter cannot keep up with the number of spans produced by the application.

+1 here too

Consider suggesting that users run the oc-agent to deal with retries and maybe a higher deadline.

I would prefer not to deploy a sidecar for tracing and to keep everything internal to the library or the vendor exporters.

@bputt I've seen exporter implementations handle batching of the span data. I believe the Stackdriver exporter does this already. The LightStep exporter does something similar as well: https://docs.lightstep.com/docs/lightstep-client-buffer.

@bogdandrutu
Contributor

As a first step to make things more deterministic, we will schedule batches of a maximum size to the exporters:
#1882
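
Roughly what that looks like, as an illustrative sketch only (the actual change is in the linked PR): whatever has accumulated is split into batches of at most a maximum size before being handed to the exporters.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split an accumulated backlog into batches of at most
// maxBatchSize so a single oversized export request is never scheduled.
final class MaxSizeBatcherSketch {
  static <T> List<List<T>> partition(List<T> pending, int maxBatchSize) {
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < pending.size(); i += maxBatchSize) {
      batches.add(new ArrayList<>(
          pending.subList(i, Math.min(i + maxBatchSize, pending.size()))));
    }
    return batches;
  }
}
```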

@bogdandrutu
Contributor

@dmichel1: the oc-agent can be deployed as a workload (service) in your cluster, so that all applications initially send data to it and the oc-agent forwards everything to the backend. This way we can configure queuing, retries, etc. to deal with the backend being unavailable for some time.

This is definitely not a requirement (more like a suggestion), but it can help if the backend is not in the same cluster or cloud region.
