Crash when shutting down without disposing #50
The first one sounds like the most likely explanation. Because GC finalization order is nondeterministic, we might call librdkafka cleanup functions in the wrong order, which then causes the crash. The C# wrapper objects should keep references that try to ensure that librdkafka is cleaned up in the right order. E.g. https://github.com/ah-/rdkafka-dotnet/blob/master/src/RdKafka/Internal/SafeKafkaHandle.cs#L97 keeps a reference to the Producer, which should ensure that the Producer only gets destroyed after all related Topics are gone. If you have time to debug, I'd start by putting logging into SafeKafkaHandle and so on and see if they do get called in the wrong order.
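Something like this would do, assuming a SafeHandle shaped roughly like the wrapper's internal handles (the class name and log text here are only illustrative, not the library's actual code):

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative only: a SafeHandle that logs when its native resource is released,
// so the relative order in which handles are torn down by the finalizer thread
// can be observed. The real wrapper's handles call into librdkafka here instead.
class LoggingSafeHandle : SafeHandle
{
    private readonly string name;

    public LoggingSafeHandle(string name, IntPtr native)
        : base(IntPtr.Zero, ownsHandle: true)
    {
        this.name = name;
        SetHandle(native);
    }

    public override bool IsInvalid
    {
        get { return handle == IntPtr.Zero; }
    }

    protected override bool ReleaseHandle()
    {
        // In SafeKafkaHandle / SafeTopicHandle this is where rd_kafka_destroy /
        // rd_kafka_topic_destroy would run; here we only record the order.
        Console.Error.WriteLine("ReleaseHandle: " + name);
        return true;
    }
}
```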
Good idea to create a separate issue. Is the producer going to wait for outstanding requests to be transmitted and handled? I am not sure RdKafka.Handle.Dispose gets called during finalization. This page suggests that one should first wait for outstanding requests:
That must be it! Handle should implement the full Dispose() pattern (https://msdn.microsoft.com/en-us/library/b1yfkh5e(v=vs.110).aspx), and then I think this should stop happening. Although, reading the docs, it seems like it shouldn't crash even if you don't wait, so maybe that's a librdkafka issue.
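For reference, a bare-bones version of that pattern (member names are illustrative; the real Handle would release its librdkafka resources in the unmanaged branch):

```csharp
using System;

class Handle : IDisposable
{
    private bool disposed;

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);   // no need to run the finalizer after an explicit Dispose
    }

    ~Handle()
    {
        Dispose(false);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (disposed)
            return;

        if (disposing)
        {
            // Called from Dispose(): safe to touch other managed objects here,
            // e.g. stop the polling task and wait for outstanding produce requests.
        }

        // Reached from both Dispose() and the finalizer:
        // release unmanaged librdkafka resources here.
        disposed = true;
    }
}
```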
If I modify the sample above to keep references to the topics and producers and dispose them after the outer loop (I accumulate two lists, roughly as sketched below), then the application does not seem to crash (so far). Will you be updating the library to do additional cleanup? Update: if I instead accumulate the tasks returned by Produce and wait on all of them before the topic and producer references go out of scope, it still crashes (eventually).
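Roughly what I'm doing (a sketch; broker address, topic name and counts are placeholders, and the Producer/Topic Dispose calls are assumed as used elsewhere in this thread):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using RdKafka;

class DisposeAfterLoop
{
    const int Iterations = 20;    // placeholder counts
    const int Count = 100000;

    static void Main()
    {
        var producers = new List<Producer>();
        var topics = new List<Topic>();

        for (var iteration = 0; iteration < Iterations; ++iteration)
        {
            var producer = new Producer("localhost:9092");
            var topic = producer.Topic("test");

            // Keep both alive so the GC cannot finalize them in an arbitrary order.
            producers.Add(producer);
            topics.Add(topic);

            Parallel.For(0, Count, _ => topic.Produce(null));
        }

        // Dispose explicitly after the outer loop: topics first, then producers.
        foreach (var topic in topics) topic.Dispose();
        foreach (var producer in producers) producer.Dispose();
    }
}
```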
The other thing is that I cannot reproduce the crash if I do not call Produce in parallel. Correction: I've just done that now. See next post.
If I add some logging to SafeTopicHandle.ReleaseHandle and SafeKafkaHandle.ReleaseHandle and actually run everything single-threaded (so in the above sample code, replace Parallel.For(0, Count, _ => topic.Produce(null)) with for (var i = 0; i < Count; ++i) topic.Produce(null);), I see the following output before it crashes. SafeKafkaHandle.ReleaseHandle does:
SafeTopicHandle.ReleaseHandle does:
The last producer to be freed never writes "Exit.ReleaseHandle.Producer". Does this look like an RdKafka issue?
Notice how there is no matching "19 ExitReleaseHandle.SafeTopicHandle" and "20 Exit.ReleaseHandle.Producer".
Do you think that Handle.StartCallbackTask can be doing a poll while the handle is finalized? If I comment out the poll from the task then it doesn't crash. Unfortunately I cannot prove that poll is being called while finalized. edenhill does say that you should not call any other API while destroying:
Thanks for all the debugging, this was really helpful! I think this is fixed now in a0c5343 by cancelling the polling task in the finalizer, which should guarantee it doesn't poll after shutting down the librdkafka objects. I had a bit of a hard time reproducing the issue reliably; could you check if you still see the issue?
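The idea is roughly the following (a hedged sketch, not the actual commit; member names and the poll call are placeholders):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Handle
{
    readonly CancellationTokenSource pollCts = new CancellationTokenSource();
    Task callbackTask;

    protected void StartCallbackTask()
    {
        callbackTask = Task.Factory.StartNew(() =>
        {
            while (!pollCts.Token.IsCancellationRequested)
            {
                // In the real wrapper this drives rd_kafka_poll via the SafeKafkaHandle.
                Poll(TimeSpan.FromSeconds(1));
            }
        }, pollCts.Token, TaskCreationOptions.LongRunning, TaskScheduler.Default);
    }

    void Poll(TimeSpan timeout)
    {
        // Placeholder for the native poll call.
    }

    ~Handle()
    {
        // Stop the polling loop before the SafeHandles destroy the librdkafka
        // objects, so no poll can run against an already-destroyed handle.
        pollCts.Cancel();
    }
}
```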
Thank you. Running it now, it looks like it doesn't crash. But then this happened :/ System.Runtime.InteropServices.SEHException: External component has thrown an exception. Not sure if this is RdKafka or librdkafka.
Might be something transient, as it has been running for a couple of hours now and all is fine. The test is:
It has been running fine for a couple of hours now and that exception has yet to recur. Maybe the error above was a fluke? What do you think about disposing the two instances of CancellationTokenSource and Task? Also, Microsoft probably won't add finalizers to those types after the fact (as that would break all existing finalizers).
I think the error was not a fluke. I believe that finalization order was at play here. The Handle finalizer can run after the finalizers for SafeTopicHandle and SafeKafkaHandle. A->B indicates that A has a reference to B. The following object graph can be determined:
Topic->SafeTopicHandle
Looks like the finalizer (~) for Handle can run after the finalizer for SafeHandle. If that happens, then the callbackTask can execute one more time and access SafeKafkaHandle.
You're right about the issue, it's certainly real and most likely has to do with cleanup order. I think it's because there's no guaranteed order in which the finalizers run. Using DangerousAddRef to express the dependency would be one possible solution, but I'm not yet sure if that's the best one. I have somewhat limited time to spend on this right now, but I'll look into it as soon as I can. On what platform are you seeing the issue? .NET Framework (not Core) on Windows?
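For illustration, the DangerousAddRef approach could look something like this (a sketch only; the class shape and naming are assumptions, not the wrapper's actual code):

```csharp
using System;
using System.Runtime.InteropServices;

// A child handle takes a ref-count on its parent SafeHandle, so the parent's
// native object cannot be released before the child's ReleaseHandle has run,
// regardless of the order in which the finalizer thread visits them.
class ChildSafeHandle : SafeHandle
{
    readonly SafeHandle parent;   // e.g. the SafeKafkaHandle owning rd_kafka_t
    bool parentRefHeld;

    public ChildSafeHandle(SafeHandle parent, IntPtr native)
        : base(IntPtr.Zero, ownsHandle: true)
    {
        this.parent = parent;
        this.parent.DangerousAddRef(ref parentRefHeld);   // pin the parent's lifetime
        SetHandle(native);
    }

    public override bool IsInvalid
    {
        get { return handle == IntPtr.Zero; }
    }

    protected override bool ReleaseHandle()
    {
        // Destroy the child's native object first (rd_kafka_topic_destroy in the
        // real wrapper), then let go of the parent so it can be released too.
        if (parentRefHeld)
            parent.DangerousRelease();
        return true;
    }
}
```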
I'm running this on Windows 7; .NET 4.6 is installed but I compile against 4.5.2. Yes, unless all unmanaged handles are aggregated under one class (Handle, SafeTopicHandle, SafeKafkaHandle), DangerousAddRef is a solution. In principle the code could be refactored under the hood so that there is only one owner of the SafeTopicHandle[] and the SafeKafkaHandle, and that type could do the cleanup. But I wonder whether, for a large number of SafeTopicHandles, finalization would take more than 2 seconds (in which case .NET will not like that). How about a rough solution based on this:
See https://github.com/ah-/rdkafka-dotnet/pull/49
Running the sample sometimes crashes with "ntdll!NtWaitForSingleObject" and "A heap has been corrupted."
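A sketch of the kind of repro discussed in this thread (broker address, topic name and counts are placeholders; nothing is disposed on purpose):

```csharp
using System.Threading.Tasks;
using RdKafka;

class Repro
{
    const int Count = 100000;   // placeholder

    static void Main()
    {
        for (var iteration = 0; iteration < 20; ++iteration)
        {
            // Deliberately not disposed: Producer, Topic and their SafeHandles are
            // left to the GC, so the finalizer thread tears them down in an
            // unpredictable order while the polling task may still be running.
            var producer = new Producer("localhost:9092");
            var topic = producer.Topic("test");
            Parallel.For(0, Count, _ => topic.Produce(null));
        }
    }
}
```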