-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Remove circular references from DataReaders #134
base: master
Are you sure you want to change the base?
WIP: Remove circular references from DataReaders #134
Conversation
I apparently wasn't paying attention carefully enough (thanks to @thijsmie doing such a wonderful work) because this is something I'd like to know a little bit more about. That exception is that listeners are called synchronously when some event occurs, and the consequence is that it may be called while holding some locks and hence that some operations can deadlock when invoked from inside a listener. So if there are listener invocations occur in parallel, or if you call In my understanding, this saga started with listeners, and it seems plausible to me that you ran into this exception. But I'd like to be sure ... |
Well I'm a little confused honestly. I've seen several different kinds of problems including segfaults and hangs. My app only ever uses asyncio's The segfaults I've been seeing are always related to listeners and reference cycles. The hangs on the other hand are a bit harder to track down and I think vary. I have three small tests on a branch which can produce 2 hangs, and 1 segfault. All use the See: master...willstott101:cyclonedds-python:repro/dds_delete_hang |
@willstott101 I tried a few things — 0.10.1 (consistently?) crashes with your tests, but 0.10.1 + this PR and after combining your tests sometimes hangs. I haven't been paying sufficient attention to this to be sure whether this is the combination I should have tried, but it is a deadlock that fits with my expectations: it tries to invoke the Python listener in one thread on receiving data, blocks trying to obtain the GIL, and then another thread blocks on trying to deliver data. It is so easy to accidentally end up in this kind of thing that I think the Cyclone C code should add support for listener invocations on another thread, make that the default and keep the current behaviour as an opt-in for those cases where the synchronous nature of the invocation is really needed. That would fix this type of deadlock.
|
Yes I don't think it's possible to create listeners which don't take the GIL right now in this lib. So another solution would be great. Both queuing an event in asyncio as well as running in another cyclonedds thread would work for me. It does make me wander if we can solve this by just releasing the GIL on more calls into cyclonedds? Though that could be a bit too delicate to get right. |
Any I have found something in the Python C API docs called Asynchronous Notifications which could be exactly what we need. I'll explore that after my holiday. |
Ok so if we're currently unable to support circular references (read: correct object finalization during the circular island collection phase in cpython where collection order is unspecified) then we need to not create reference cycles in the library ourselves at the very least.
The only case of this I've found specifically is
ReadCondition
s - I wonder if they really need to keep the reference to theDataReader
? For now in this MR I've just made ReadConditions only exist for function scopes - which fixes the segfaults and hangs I've been seeing once removing all circular references from my own code.Avoiding circular references is good practice anyway, really - but it is a shame the results are so explosive when they exist as they're so easy to introduce. In our case we want to expose users to a python API using cyclonedds so just avoiding them in app code is untenable.
I suspect it's actually a
cyclonedds
C bug that was causing the CI hanging - ifdds_delete
is meant to be idempotent and support any destruction order.FWIW I found these loops quite manually by adding prints to
Entity.__del__
to find classes exhibiting loops by printing if the object is in the_entities
dict during destruction.