Memory settings #1767
I can confirm this behavior: when the ingest service becomes overloaded/slow, the heap steadily increases and eventually causes memory pressure, leading to an OOM event in the application. I haven't found the same behavior when the ingest service is completely down, but perhaps I need to test that case more thoroughly. I'm attaching a few screenshots from a heap dump I was able to capture... This issue seems different from #1813, where spans that aren't closed properly will cause an OOM event (?)
It seems census-instrumentation/opencensus-specs#262 and #1837 are related to this issue.
@dmichel1 by any chance, can you dump the thread stacktraces as well?
Based on the attached picture, the memory is held by the SpanExporterImpl thread. The current design is this: keep in mind that spans are passed to the SpanExporterImpl thread only when they end, so this is clearly not related to #1813. For me this does not correlate with the blocking-queue issue. There may still be a correlation if the Disruptor thread bursts the spans into the SpanExporterImpl thread, but that is probably not the issue.

I think the issue is that one of the exporters blocks (for this explanation I will pretend that you use exporterX). Let's assume the backend for exporterX is not reachable for a few seconds: the SpanExporterImpl thread will block, so the spans to export grow unbounded during this time. This can also happen if pushing to exporterX is just slow (or slower than the rate at which spans are produced); the sketch below illustrates this failure mode. I think we need to do a few things here:
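A minimal sketch of that failure mode, with hypothetical SpanData/handler types standing in for the real ones (this is not the actual SpanExporterImpl code): while the export call to a slow or unreachable backend blocks, ended spans keep accumulating in an unbounded list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Sketch only: ended spans are handed to a single exporter worker; if the push
// to the backend blocks or is slower than the span production rate, the
// pending list keeps growing without bound.
final class BlockingExporterWorkerSketch implements Runnable {
  // Hypothetical stand-ins for SpanData and an exporter handler.
  interface SpanData {}
  interface ExporterHandler { void export(List<SpanData> spans); }

  private final List<SpanData> pending = new ArrayList<>(); // unbounded
  private final ExporterHandler exporterX;

  BlockingExporterWorkerSketch(ExporterHandler exporterX) {
    this.exporterX = exporterX;
  }

  // Called for every span when it ends (e.g. from the Disruptor thread).
  synchronized void addSpan(SpanData span) {
    pending.add(span); // nothing stops this list from growing
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      List<SpanData> batch;
      synchronized (this) {
        batch = new ArrayList<>(pending);
        pending.clear();
      }
      // If exporterX's backend is unreachable or slow, this call blocks here
      // while addSpan() keeps accumulating spans -> steady heap growth.
      exporterX.export(batch);
      try {
        TimeUnit.SECONDS.sleep(5);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```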
Can we make the async operation non-blocking so that it's fire and forget? We don't want the endpoint that accepts the spans to slow down the application. Also, it would be interesting to offer an option to batch: rather than emitting one array of spans, maybe we send multiple arrays with a newline delimiter, or allow configuration of the batch size to reduce network overhead.
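As a sketch of the fire-and-forget idea (hypothetical names, not an OpenCensus API): the application hands spans to a bounded queue with a non-blocking offer(), so back-pressure from a slow ingest endpoint turns into dropped spans rather than heap growth.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: the application thread never blocks on export. When the queue
// is full (the exporter can't keep up) the span is dropped and counted
// instead of stalling the caller or growing the heap.
final class NonBlockingHandoffSketch<T> {
  private final BlockingQueue<T> queue;
  private final AtomicLong dropped = new AtomicLong();

  NonBlockingHandoffSketch(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Returns immediately; never blocks the caller. */
  void submit(T span) {
    if (!queue.offer(span)) {
      dropped.incrementAndGet(); // back-pressure becomes drops, not memory pressure
    }
  }

  long droppedCount() {
    return dropped.get();
  }

  BlockingQueue<T> queue() {
    return queue;
  }
}
```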
Here's the stacktrace from the exporter thread. I got this from the hprof of the JVM that crashed.
That is an interesting example with the Stackdriver quota issue. We encountered exactly this issue a few months ago without reaching a conclusion on what went wrong.
+1 to making this async and maybe even fire and forget. That is similar to how we publish metrics, although in the metrics use case dropping some metrics often only causes resolution differences, versus having incomplete trace data.
+1 here too
I would prefer not to have a sidecar for deploying tracing and to keep everything internal to the library or the vendor exporters. @bputt I've seen exporter implementations handle the batching of the span data; I believe the Stackdriver exporter does this already. The LightStep exporter does something similar as well: https://docs.lightstep.com/docs/lightstep-client-buffer.
First step to make things more deterministic: we will schedule batches of a maximum size to the exporters.
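A rough illustration of draining batches of a maximum size from a bounded queue to an exporter handler; the constant and interface names are made up for the example and this is not the actual opencensus-java change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch only: the worker takes at most MAX_BATCH_SIZE spans per export call,
// so a single slow exporter sees bounded batches and the bounded queue caps
// total memory held by pending spans.
final class BatchingDrainSketch<T> implements Runnable {
  interface Handler<T> { void export(List<T> batch); }

  private static final int MAX_BATCH_SIZE = 512; // illustrative limit

  private final BlockingQueue<T> queue;
  private final Handler<T> handler;

  BatchingDrainSketch(BlockingQueue<T> queue, Handler<T> handler) {
    this.queue = queue;
    this.handler = handler;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        // Block briefly for the first span, then drain up to the batch limit.
        T first = queue.poll(1, TimeUnit.SECONDS);
        if (first == null) {
          continue; // nothing pending yet
        }
        List<T> batch = new ArrayList<>(MAX_BATCH_SIZE);
        batch.add(first);
        queue.drainTo(batch, MAX_BATCH_SIZE - 1);
        handler.export(batch); // a slow exporter now only delays this one batch
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```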
@dmichel1: oc-agent can be deployed as a workload (service) in your cluster, so that all applications initially send data to it and the oc-agent forwards everything to the backend. This way we can configure some queuing, retries, etc. to deal with the backend not being available for some time. This is definitely not a requirement (more like a suggestion), but it can help if the backend is not in the same cluster or cloud region.
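As a sketch of the queuing-and-retry behavior such a forwarding agent could add in front of an occasionally unavailable backend (purely illustrative, not the oc-agent implementation):

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Sketch only: absorb short backend outages with retries and backoff instead
// of pushing the failure back onto the applications that produced the spans.
final class RetryingForwarderSketch<T> {
  interface Backend<T> { void send(List<T> batch) throws Exception; }

  private final Backend<T> backend;
  private final int maxAttempts;

  RetryingForwarderSketch(Backend<T> backend, int maxAttempts) {
    this.backend = backend;
    this.maxAttempts = maxAttempts;
  }

  /** Returns true if the batch was delivered, false if it should be dropped. */
  boolean forward(List<T> batch) throws InterruptedException {
    long backoffMillis = 100;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        backend.send(batch);
        return true;
      } catch (Exception unavailable) {
        // Backend down or throttling: back off and retry instead of blocking
        // the producers indefinitely.
        TimeUnit.MILLISECONDS.sleep(backoffMillis);
        backoffMillis = Math.min(backoffMillis * 2, 5_000);
      }
    }
    return false; // give up after maxAttempts; the caller decides what to drop
  }
}
```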
When exporting trace information:
We were able to verify that heap usage increased steadily as we added backpressure to our ingest service. This could become a big issue if we had actual latency issues or if our ingest service were down.