Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when shutting down without disposing #50

Open
ah- opened this issue Jul 7, 2016 · 15 comments
Open

Crash when shutting down without disposing #50

ah- opened this issue Jul 7, 2016 · 15 comments

Comments

@ah-
Copy link
Owner

ah- commented Jul 7, 2016

See https://github.com/ah-/rdkafka-dotnet/pull/49

Running

private static void Runner()
        {
            const int tries = 10;
            for (var i = 0; i < tries; ++i)
            {
                var producer = new Producer(new Config
                {
                    ["compression.codec"] = "snappy",
                    ["heartbeat.interval.ms"] = "10000",
                    ["session.timeout.ms"] = "30000",
                    ["topic.metadata.refresh.sparse"] = "true"
                }, broker);
                var topic = producer.Topic("Abc", new TopicConfig
                {
                    ["request.required.acks"] = "-1",
                    ["message.timeout.ms"] = "60000"
                });
                Parallel.For(0, Count, _ => topic.Produce(null));
                Console.WriteLine($"Done: {i}.");
            }
            Console.WriteLine("Done.");
        }

sometimes crashes with "ntdll!NtWaitForSingleObject" and "A heap has been corrupted."

@ah-
Copy link
Owner Author

ah- commented Jul 7, 2016

What do you think about the following two hunches?

When SafeKafkaHandle.ReleaseHandle is called via the SafeHandle finalizer, these can be called in random order. Hence a producer can be destroyed before a topic. Basically, given that finalization order is non deterministic, is it possible that it violates proper librdkafka termination?
Finalizers are allowed to run for no more than 2 seconds. If destroying any handle takes more, can this cause any issues?
If the above are risks, then i think we do need to explicitly Dispose. What do you think?

The first one sounds like the most likely explanation. Due to GC being random we might call librdkafka cleanup functions in the wrong order, which then causes the crash.
Alternatively it might be a genuine librdkafka crash, I think we could find out by doing the same thing in C without the C# wrapper and see if it crashes there as well.

The C# wrapper objects should keep references that try to ensure that librdkafka is cleaned up in the right order. E.g. https://github.com/ah-/rdkafka-dotnet/blob/master/src/RdKafka/Internal/SafeKafkaHandle.cs#L97 keeps a reference to the Producer, which should ensure that the Producer only gets destroyed after all related Topics are gone.

If you have time to debug, I'd start by putting logging into the SafeKafkaHandle and so on and see if they do get called in the wrong order.

@MaximGurschi
Copy link
Contributor

Good idea to create a separate issue.

Is the producer going to wait for outstanding requests to be transmitted and handled? I am not sure RdKafka.Handle.Dispose gets called during finalizations.

This page suggests that one should first wait for outstanding requests:
https://github.com/edenhill/librdkafka/wiki/Proper-termination-sequence

@ah-
Copy link
Owner Author

ah- commented Jul 8, 2016

That must be it! Handle should implement the full Dispose() pattern (https://msdn.microsoft.com/en-us/library/b1yfkh5e(v=vs.110).aspx) then I think this should stop happening.

Although, reading the docs it seems like it shouldn't crash if you don't wait, so maybe that's a librdkafka issue.

@MaximGurschi
Copy link
Contributor

MaximGurschi commented Jul 8, 2016

If i modify the sample above to keep references to the topic and producers and Disposing them after the outer loop (i accumulate two lists) then the application does not seem to crash (so far).

Will you be updating the library to do additional cleanup?

Update: if alternatively i accumulate the tasks returned by produce and wait on all of them before the topic+producer reference goes out of scope, it still crashes (eventually).

@MaximGurschi
Copy link
Contributor

MaximGurschi commented Jul 8, 2016

The other thing, is that i cannot reproduce the crash if i do not call Produce in Parallel.

Correction - just done that now. See next post.

@MaximGurschi
Copy link
Contributor

MaximGurschi commented Jul 8, 2016

If i add some logging to SafeTopicHandle.ReleaseHandle and SafeKafkaHandle.ReleaseHandle and actually run everything single threaded (so in the above sample code, replace Parallel.For(0, Count, _ => topic.Produce(null)) with for (var i=0;i<Count;++i) topic.Produce(null);) i see the following output before it crashes:

SafeKafkaHandle.ReleaseHandle does:

                    var n = Counter.Next();
                    Console.WriteLine($"\t\t{n} Enter.ReleaseHandle.Producer");
                    LibRdKafka.destroy(handle);
                    Console.WriteLine($"\t\t{n} Exit.ReleaseHandle.Producer");

SafeTopicHandle.ReleaseHandle does:

                    var n = Counter.Next();
                    Console.WriteLine($"\t\t\t{n} Enter.ReleaseHandle.SafeTopicHandle");
                    LibRdKafka.topic_destroy(handle);
                    Console.WriteLine($"\t\t\t{n} Middle.ReleaseHandle.SafeTopicHandle");
                    // See SafeKafkaHandle.Topic
                    kafkaHandle.DangerousRelease();
                    Console.WriteLine($"\t\t\t{n} Exit.ReleaseHandle.SafeTopicHandle");

static class Counter { private static int _next; public static int Next() => ++_next; }

The last producer to be freed never writes "Exit.ReleaseHandle.Producer". Does this look like a RdKafka issue?

Done: 0.
                        1 Enter.ReleaseHandle.SafeTopicHandle
                        1 Middle.ReleaseHandle.SafeTopicHandle
                        1 Exit.ReleaseHandle.SafeTopicHandle
Done: 1.
                        2 Enter.ReleaseHandle.SafeTopicHandle
                        2 Middle.ReleaseHandle.SafeTopicHandle
                        2 Exit.ReleaseHandle.SafeTopicHandle
Done: 2.
                        3 Enter.ReleaseHandle.SafeTopicHandle
                        3 Middle.ReleaseHandle.SafeTopicHandle
                        3 Exit.ReleaseHandle.SafeTopicHandle
Done: 3.
                        4 Enter.ReleaseHandle.SafeTopicHandle
                        4 Middle.ReleaseHandle.SafeTopicHandle
                        4 Exit.ReleaseHandle.SafeTopicHandle
Done: 4.
                        5 Enter.ReleaseHandle.SafeTopicHandle
                        5 Middle.ReleaseHandle.SafeTopicHandle
                        5 Exit.ReleaseHandle.SafeTopicHandle
Done: 5.
                        6 Enter.ReleaseHandle.SafeTopicHandle
                        6 Middle.ReleaseHandle.SafeTopicHandle
                        6 Exit.ReleaseHandle.SafeTopicHandle
Done: 6.
Done: 7.
                        7 Enter.ReleaseHandle.SafeTopicHandle
                        7 Middle.ReleaseHandle.SafeTopicHandle
                        7 Exit.ReleaseHandle.SafeTopicHandle
                        8 Enter.ReleaseHandle.SafeTopicHandle
                        8 Middle.ReleaseHandle.SafeTopicHandle
                        8 Exit.ReleaseHandle.SafeTopicHandle
Done: 8.
                        9 Enter.ReleaseHandle.SafeTopicHandle
                        9 Middle.ReleaseHandle.SafeTopicHandle
                        9 Exit.ReleaseHandle.SafeTopicHandle
Done: 9.
Done.
                10 Enter.ReleaseHandle.Producer
                10 Exit.ReleaseHandle.Producer
                11 Enter.ReleaseHandle.Producer
                11 Exit.ReleaseHandle.Producer
                12 Enter.ReleaseHandle.Producer
                12 Exit.ReleaseHandle.Producer
                13 Enter.ReleaseHandle.Producer
                13 Exit.ReleaseHandle.Producer
                14 Enter.ReleaseHandle.Producer
                14 Exit.ReleaseHandle.Producer
                15 Enter.ReleaseHandle.Producer
                15 Exit.ReleaseHandle.Producer
                16 Enter.ReleaseHandle.Producer
                16 Exit.ReleaseHandle.Producer
                17 Enter.ReleaseHandle.Producer
                17 Exit.ReleaseHandle.Producer
                18 Enter.ReleaseHandle.Producer
                18 Exit.ReleaseHandle.Producer
                        19 Enter.ReleaseHandle.SafeTopicHandle
                        19 Middle.ReleaseHandle.SafeTopicHandle
                20 Enter.ReleaseHandle.Producer

@MaximGurschi
Copy link
Contributor

Notice how there is no matching "19 ExitReleaseHandle.SafeTopicHandle" and "20 Exit.ReleaseHandle.Producer"

@MaximGurschi
Copy link
Contributor

MaximGurschi commented Jul 8, 2016

Do you think that Handle.StartCallbackTask can be doing a poll while the handle is finalized?

1.If I comment out the poll from the task then it doesn't crash.
2.Alternatively, if i just stop the poll task before letting the application exit (I added a new function called Stop) then it doesn't crash.

Unfortunately i cannot prove that poll is being called while finalized.

Eden does say that you should not call any other API while destroying:
confluentinc/librdkafka#725

@ah-
Copy link
Owner Author

ah- commented Jul 10, 2016

Thanks for all the debugging, this was really helpful!

I think this is fixed now in a0c5343, by cancelling the polling task in the finalizer, which should guarantee it doesn't poll after shutting down the librdkafka objects.

I had a bit of a hard time reproducing the issue reliably, could you check if you still see the issue?

@MaximGurschi
Copy link
Contributor

MaximGurschi commented Jul 11, 2016

Thank you. Running it now looks like it doesn't crash. But then this happened :/


System.Runtime.InteropServices.SEHException: External component has thrown an exception.
at RdKafka.Internal.LibRdKafka.NativeMethods.rd_kafka_poll(IntPtr rk, IntPtr timeout_ms)
at RdKafka.Handle.<>c__DisplayClass27_0.b__0()
at System.Threading.Tasks.Task.Execute()
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at RdKafka.Handle.Dispose(Boolean disposing)
at RdKafka.Handle.Finalize()


Not sure if this is RdKafka or librdkafka.
Is it completely safe to call the CTS and task from the finalizer? Though non of the classes have finalizers, can Microsoft change that? Also maybe it is a good idea to Dispose the callbackTask and callbackCts in Dispose(bool)?

@MaximGurschi
Copy link
Contributor

Might be something transient as it is running for a couple of hours now and all is fine.

The test is:
1.Compile the above sample to an exe.
2.A batch file runs the exe, waits for it to finish and runs another one in an infinite loop.

@MaximGurschi
Copy link
Contributor

It has been running fine for a couple of hours now and that exception has yet to re-occur. Maybe the error above was a fluke?

What do you think about disposing the two instances of CTS and task? Also Microsoft probably wont add finalizers after the fact (as that would screw all existing finalizers).

@MaximGurschi
Copy link
Contributor

MaximGurschi commented Jul 12, 2016

I think the error was not a fluke. I believe that the finalization order was at play here. The Handle finalizer can run after the finalizers for SafeTopicHandle and SafeKafkaHanle.

A->B - indicates that A has a reference to B.
{A,B} - indicates both A and B

The following object graph can be determined:
Producer->~Handle->{CTS,Task,SafeKafkaHanle}
(continued)
{CTS,Task}->SafeKafkaHanle
(continued)
SafeKafkaHanle->SafeHandleZeroIsInvalid->~SafeHandle

Topic->SafeTopicHandle
(continued)
{Topic,SafeTopicHandle}->SafeKafkaHanle

Looks like the finalizer(~) for Handle can run after the finalizer for SafeHandle. If that happens then the callbackTask can execute one more time and access SafeKafkaHanle.

@ah-
Copy link
Owner Author

ah- commented Jul 13, 2016

You're right about the issue, it's certainly real and has most likely to do with cleanup order.

I think it's because there's no guaranteed order in which the finalizers run. Using DangerousAddRef to express the dependency would be one possible solution, but I'm not yet sure if that's the best one.

I have somewhat limited time to spend on this right now, but I'll look into it as soon as I can. On what platform are you seeing the issue? .net framework (not core) on Windows?

@MaximGurschi
Copy link
Contributor

I'm running this on Windows 7. .NET 4.6 is installed but i compile it for 4.5.2.

Yes unless all unmanaged handles are aggregated under one class (Handle, SafeTopicHandle, SafeKafkaHandle) then DangerousAddRef is a solution. In principle the code can be refactored under the hood such that there is only one owner of SafeTopicHandle[] and SafeKafkaHandle. The this type could do cleanup. But i wonder if for a large number of SafeTopicHandles the finalize will take more than 2 seconds (in which case .NET will not like that).

How about a rough solution based on this:

  1. Drop the Handle destructor

  2. Add this first line to Handle.Dispose(bool)

         //Check for null in case Init threw.
         if (callbackTask==null || Interlocked.CompareExchange(ref stoppedCallbackTask, 1, 0) == 1) return;
    
  3. Handle.Init does:

        callbackTask = StartCallbackTask(callbackCts.Token);
        handle.stopCallbackTask = () => Dispose(false);
    
  4. SafeKafkaHandle.Topic does:

        topicHandle.kafkaHandle = this;
        topicHandle.stopCallbackTask = stopCallbackTask;
    
  5. ReleaseHandle of both SafeKafkaHandle and SafeKafkaHandle call stopCallbackTask?.Invoke() first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants