
[Bug]: Shutdown doesn't flush when used with global subscriber #1961

Open
stepantubanov opened this issue Jul 24, 2024 · 10 comments
Assignees
Labels
bug Something isn't working triage:todo Needs to be triaged.

Comments

@stepantubanov

stepantubanov commented Jul 24, 2024

What happened?

As mentioned in #1625 - Tracer now holds strong reference to TracerProvider.

When opentelemetry is used as a layer with the global tracing subscriber, it is now impossible to shut down properly (the call only decrements a reference count but doesn't run Drop).

let layer = tracing_opentelemetry::layer()
  .with_tracer(tracer_provider.tracer("app"));

opentelemetry::global::set_tracer_provider(tracer_provider);
tracing::subscriber::set_global_default(Registry::default().with(layer));

// this shutdown call does not actually shut down the global tracer provider
opentelemetry::global::shutdown_tracer_provider();

As a result, some spans end up missing, flattened, etc.

EDIT: A possible workaround is to flush manually:

tracer_provider.force_flush();
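The reference-count behavior at play can be sketched with plain std::sync::Arc. This is a minimal stand-in, not the SDK's actual types: dropping the global's strong reference does not run Drop (and therefore does not flush) while the tracer still holds another strong reference.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in for the SDK provider: Drop is where the flush would happen.
static FLUSHED: AtomicBool = AtomicBool::new(false);

struct Provider;

impl Drop for Provider {
    fn drop(&mut self) {
        // the flush runs only when the LAST strong reference is dropped
        FLUSHED.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let global = Arc::new(Provider);          // global::set_tracer_provider
    let held_by_tracer = Arc::clone(&global); // the Tracer's strong reference

    // shutdown_tracer_provider: only the global's reference goes away
    drop(global);
    assert!(!FLUSHED.load(Ordering::SeqCst)); // Drop (and the flush) did not run

    // only when the tracer's reference is dropped does Drop flush
    drop(held_by_tracer);
    assert!(FLUSHED.load(Ordering::SeqCst));
    println!("flush ran only after the tracer's reference was dropped");
}
```

This is why `force_flush()` on the retained provider handle works as a workaround: it does not depend on the Drop running at all.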

API Version

0.24.0

SDK Version

0.24.1

What Exporter(s) are you seeing the problem on?

OTLP

Relevant log output

No response

@stepantubanov stepantubanov added bug Something isn't working triage:todo Needs to be triaged. labels Jul 24, 2024
@stepantubanov stepantubanov changed the title [Bug]: No way to shutdown properly when used with global subscriber [Bug]: Shutdown doesn't flush when used with global subscriber Jul 24, 2024
@lalitb
Member

lalitb commented Jul 24, 2024

When opentelemetry is used as a layer with the global tracing subscriber, it is now impossible to shut down properly (the call only decrements a reference count but doesn't run Drop).

That's expected behavior. global::shutdown_tracer_provider() should only decrement the reference it holds; per the spec, there shouldn't be any flush/shutdown mechanism exposed at the API/global level. The application needs to keep hold of the sdk::TracerProvider handle and use it to invoke shutdown/flush. During graceful shutdown of the application, shutdown is invoked implicitly through Drop, which internally flushes before shutting down. cc @TommyCpp

@stepantubanov
Author

stepantubanov commented Jul 24, 2024

That's expected behavior. global::shutdown_tracer_provider() should only decrement the reference it holds; per the spec, there shouldn't be any flush/shutdown mechanism exposed at the API/global level.

It's kind of misleading for global::shutdown_tracer_provider() not to be a shutdown mechanism for the global tracer provider. Currently the doc reads:

Shut down the current tracer provider. This will invoke the shutdown method on all span processors. span processors should export remaining spans before return
https://docs.rs/opentelemetry/0.24.0/opentelemetry/global/fn.shutdown_tracer_provider.html

If the current behavior (no shutdown while other references exist) is intentional and should remain, then I think it'd be helpful to change the documentation: explicitly state that it won't do anything while there are other references to the tracer provider, and suggest calling flush/shutdown on the tracer provider directly.

During graceful shutdown of the application, shutdown is invoked implicitly through Drop, which internally flushes before shutting down.

In the previous opentelemetry version this worked for the given example (because Tracer held a weak reference to TracerProvider), but with the new version it no longer does. It was a bit surprising to debug this issue and find that the call to shutdown_tracer_provider didn't do anything.
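The older layout mentioned above can be sketched with std::sync::Weak. This is a stand-in under the assumption that the pre-change Tracer held only a Weak reference to its provider, so the global's strong reference was the last one:

```rust
use std::sync::{Arc, Weak};

struct Provider; // stand-in for sdk::TracerProvider

fn main() {
    // old layout: the tracer held only a Weak reference
    let global = Arc::new(Provider);
    let tracer_ref: Weak<Provider> = Arc::downgrade(&global);

    // shutting down the global drops the ONLY strong reference,
    // so Drop (and the flush inside it) runs immediately
    drop(global);
    assert!(tracer_ref.upgrade().is_none()); // provider is really gone
    println!("old layout: the global shutdown was the final drop");
}
```

The trade-off, of course, is that with a Weak reference the tracer could outlive its provider, which is what the strong-reference change in #1625 was addressing.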

@lalitb
Member

lalitb commented Jul 24, 2024

If the current behavior (no shutdown while other references exist) is intentional and should remain, then I think it'd be helpful to change the documentation: explicitly state that it won't do anything while there are other references to the tracer provider, and suggest calling flush/shutdown on the tracer provider directly.

  • Agree, the doc is misleading. We can use this issue to track the improvement. Also, I think global::shutdown_tracer_provider() could be named better, say global::release_tracer_provider().

@mzabaluev

mzabaluev commented Oct 15, 2024

Here's a test against opentelemetry_sdk:

#![cfg(feature = "rt-tokio")]

use futures_util::future::BoxFuture;
use opentelemetry::global as otel_global;
use opentelemetry::trace::{TracerProvider as _, Tracer as _};
use opentelemetry_sdk::{
    export::trace::{ExportResult, SpanData, SpanExporter},
    runtime,
    trace::TracerProvider,
};
use tokio::runtime::Runtime;

use std::sync::{Arc, Mutex};

#[derive(Clone, Debug, Default)]
struct TestExporter(Arc<Mutex<Vec<SpanData>>>);

impl SpanExporter for TestExporter {
    fn export(&mut self, mut batch: Vec<SpanData>) -> BoxFuture<'static, ExportResult> {
        let spans = self.0.clone();
        Box::pin(async move {
            if let Ok(mut inner) = spans.lock() {
                inner.append(&mut batch);
            }
            Ok(())
        })
    }
}

fn test_tracer(runtime: &Runtime) -> (TracerProvider, TestExporter) {
    let _guard = runtime.enter();

    let exporter = TestExporter::default();
    let provider = TracerProvider::builder()
        .with_batch_exporter(exporter.clone(), runtime::Tokio)
        .build();

    (provider, exporter)
}

#[test]
fn shutdown_global() {
    let rt = Runtime::new().unwrap();
    let (provider, exporter) = test_tracer(&rt);

    otel_global::set_tracer_provider(provider);

    let tracer = otel_global::tracer("test");
    for _ in 0..1000 {
        tracer.start("test_span");
    }
    // drop(tracer);

    // Should flush all batched telemetry spans
    otel_global::shutdown_tracer_provider();

    let spans = exporter.0.lock().unwrap();
    assert_eq!(spans.len(), 1000);
}

Dropping the tracer before the call to shutdown_tracer_provider makes the test succeed.

As currently implemented, the behavior of shutdown_tracer_provider is fragile against any remaining references to the provider, which may still be in scope, owned by other globally set data as in tokio-rs/tracing-opentelemetry#159, or simply leaked. This is in contrast to TracerProvider::shutdown, which works as expected.

@lalitb lalitb self-assigned this Oct 15, 2024
@mzabaluev

mzabaluev commented Oct 16, 2024

This is in contrast to TracerProvider::shutdown, which works as expected.

Here's a test exercising this. However, it does produce an error printout when the tracer instance is kept alive past the shutdown.

#[test]
fn shutdown_in_scope() {
    let rt = Runtime::new().unwrap();
    let (provider, exporter) = test_tracer(&rt);

    let tracer = provider.tracer("test");
    for _ in 0..1000 {
        tracer.start("test_span");
    }
    // drop(tracer);

    // Should flush all batched telemetry spans
    provider.shutdown().unwrap();

    let spans = exporter.0.lock().unwrap();
    assert_eq!(spans.len(), 1000);
}

There is also a case of a lockup in the shutdown call unless it's made from a separate worker thread. I was able to reproduce it and traced it to this line:

futures_executor::block_on(res_receiver)

So it looks like we're not safe with the shutdown method either, but that's a subject for another bug report.
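One mitigation for that lockup, sketched below, is to run the blocking shutdown on a dedicated OS thread, so that the block_on inside the SDK can never park the async runtime's own worker thread. `shutdown_off_runtime` is a hypothetical helper, not an SDK API:

```rust
use std::thread;

// Hypothetical helper: run a blocking shutdown closure on its own OS
// thread and wait for it, so a block_on inside the closure cannot
// deadlock an async runtime's worker thread.
fn shutdown_off_runtime<F>(shutdown: F)
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(shutdown)
        .join()
        .expect("shutdown thread panicked");
}

fn main() {
    shutdown_off_runtime(|| {
        // e.g. provider.shutdown().unwrap() would go here
        println!("shutdown ran on a dedicated thread");
    });
}
```

Whether this is the right fix for the SDK itself is a separate question; it only sidesteps the deadlock from the application side.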

@pitoniak32
Contributor

pitoniak32 commented Nov 21, 2024

I just got bitten by this too, and it took me a while to find a workaround. I think at least the examples/docs need to be updated.

This was my workaround.
Setup:

// Imports and TRACER_NAME added here for completeness; they were implied
// but not shown in the original snippet.
use opentelemetry::global;
use opentelemetry::trace::TracerProvider as _;
use opentelemetry_otlp::{SpanExporter, WithExportConfig};
use opentelemetry_sdk::runtime;
use tracing_opentelemetry::OpenTelemetryLayer;
use tracing_subscriber::layer::SubscriberExt;

const TRACER_NAME: &str = "example";

pub struct OtelGuard {
    tracer_provider: opentelemetry_sdk::trace::TracerProvider,
}

impl Drop for OtelGuard {
    fn drop(&mut self) {
        println!("Dropping OtelGuard!");
        println!("Shutting down TracerProvider!");
        self.tracer_provider.shutdown().expect("TracerProvider should shutdown properly");
    }
}

pub fn setup_tracing_subscriber() -> anyhow::Result<OtelGuard> {
    let tracer_provider = opentelemetry_sdk::trace::TracerProvider::builder()
        .with_batch_exporter(
            SpanExporter::builder()
                .with_tonic()
                .with_endpoint("grpc://localhost:4317")
                .build()?,
            runtime::Tokio,
        )
        .build();
    
    global::set_tracer_provider(tracer_provider.clone());
    
    tracing::subscriber::set_global_default(
        tracing_subscriber::registry()
            .with(tracing_subscriber::fmt::layer())
            .with(OpenTelemetryLayer::new(tracer_provider.tracer(TRACER_NAME))),
    )
    .unwrap();
    
    Ok(OtelGuard { tracer_provider })
}

main:

use tracing::info_span;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let _otel_guard = otel::setup_tracing_subscriber()?;
    
    let root_span = info_span!("test");
    let _span_guard = root_span.enter();
    
    let fruit = "watermelon";
    
    tracing::debug!("Fruit was {fruit}");
    tracing::info!("Fruit was {fruit}");
    tracing::warn!("Fruit was {fruit}");
    tracing::error!("Fruit was {fruit}");
    
    Ok(())
}

@lalitb
Member

lalitb commented Nov 21, 2024

To clarify, shutdown_tracer_provider will eventually be removed from the API surface. Alternatively, it might be moved to the SDK and renamed release_tracer_provider, as its primary purpose is to release the provider from the global context. For all practical purposes, TracerProvider::shutdown() should be used for shutting down.

@pitoniak32
Contributor

pitoniak32 commented Nov 22, 2024

@lalitb, @cijothomas - Would it be welcome to update some of the tracing examples to show this more clearly? I could see this being very frustrating for newcomers to the crate who try to get it working from the examples and find their traces aren't being sent.


To clarify, shutdown_tracer_provider will eventually be removed from the API surface. Alternatively, it might be moved to the SDK and renamed release_tracer_provider, as its primary purpose is to release the provider from the global context. For all practical purposes, TracerProvider::shutdown() should be used for shutting down.

I just updated my example snippet to be more accurate.

@cijothomas
Member

@pitoniak32 I'd suggest taking a look at the Metrics example and seeing if it makes sense and is easy to use. If yes, let's replicate the same for Traces. This was always the plan; it's just that Metrics (and Logs) made progress and we left out tracing. (Sorry, we didn't have the bandwidth to tackle everything, but I realize now it would have been better not to leave a high-level inconsistency like this.)

The overall idea is:

  • Make the providers a cheaply cloneable struct pointing to the same underlying provider, and make it clear in the doc that the provider is only a handle to the underlying provider and is cheap to clone.
  • Once a user creates a provider, they hold on to it so that shutdown can be called at the end. This is the key part.
  • If there is a need to set the provider globally (so that other parts of the app or libraries can retrieve it), clone it and set the clone.

Metrics and Logs already work like this. (Small difference: Logs does not need to set a global.) Let's make Traces consistent with the rest of the signals; the docs and examples will then naturally be consistent and won't suffer from the confusion raised in this issue.
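The handle pattern described above can be sketched in plain Rust. All names here are hypothetical stand-ins, not the SDK's actual types; the point is that every clone shares one underlying pipeline, so an explicit shutdown from the retained handle is observed everywhere:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// The provider "handle" is just a cheap clone of an Arc to shared state.
#[derive(Clone)]
struct ProviderHandle {
    inner: Arc<Pipeline>,
}

// Stand-in for the shared span pipeline (processors, exporter, ...).
struct Pipeline {
    shut_down: AtomicBool,
}

impl ProviderHandle {
    fn new() -> Self {
        ProviderHandle {
            inner: Arc::new(Pipeline {
                shut_down: AtomicBool::new(false),
            }),
        }
    }

    // Explicit shutdown works from ANY clone; no reliance on Drop.
    fn shutdown(&self) {
        self.inner.shut_down.store(true, Ordering::SeqCst);
    }

    fn is_shut_down(&self) -> bool {
        self.inner.shut_down.load(Ordering::SeqCst)
    }
}

fn main() {
    let provider = ProviderHandle::new(); // the user keeps this one
    let for_global = provider.clone();    // this clone would go to global::set_*
    let for_layer = provider.clone();     // this clone would go into the layer

    provider.shutdown(); // explicit, deterministic shutdown at app exit
    assert!(for_global.is_shut_down());
    assert!(for_layer.is_shut_down());
    println!("every clone observes the shutdown");
}
```

This is the shape Metrics and Logs already use: hold one handle for shutdown, hand clones to the global and to layers.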

@pitoniak32 Is this something you can help make happen?
(One thing to note if you plan to work on this immediately: the next release, 0.27.1, should not contain any breaking changes. We can still mark things obsolete and do breaking changes, if any are needed, in 0.28.)

@pitoniak32
Contributor

Yeah, I can take a look at this! Makes sense, thank you for the outline!

Totally understand, there's a ton of stuff to maintain in here!
