Fix object destruction order #497
Conversation
```diff
@@ -2766,7 +2744,7 @@ rclpy_wait(PyObject * Py_UNUSED(self), PyObject * args)
   if (!PyArg_ParseTuple(args, "O|K", &pywait_set, &timeout)) {
     return NULL;
   }
-  rcl_wait_set_t * wait_set = (rcl_wait_set_t *)PyCapsule_GetPointer(pywait_set, "rcl_wait_set_t");
+  rcl_wait_set_t * wait_set = PyCapsule_GetPointer(pywait_set, "rcl_wait_set_t");
```
@ivanpauno why doesn't `rcl_wait_set_t` follow the same treatment that other `rcl` types' allocations do, i.e. using `rclpy_handle_t` too?
Good question. Up to now, the `rcl_wait_set_t` capsule didn't have a deleter, and deletion was handled manually in Python.

To avoid making this PR bigger, I didn't refactor how wait sets are used. That should be done in a follow-up PR.
Let's open an issue to track that work.
LGTM but we should follow up on leaking resources ASAP
Partial review. Will continue later.
The managed `rclpy_handle_t` object will not be destructed immediately if another handle called `other.requires(this)`.
In that case, the managed object will be destroyed after all its dependents are destroyed.
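To make the documented rule concrete, here is a minimal pure-Python sketch of the deferred-destruction semantics (the class and names are hypothetical, not the actual C implementation): destruction of a handle is deferred until every handle that requires it has been destroyed.

```python
class Handle:
    """Hypothetical model of the deferred-destruction rule above."""

    def __init__(self, name):
        self.name = name
        self._dependents = 0          # count of handles that called requires(self)
        self._dependency = None
        self._destroy_requested = False

    def requires(self, other):
        # self depends on other, so other must outlive self
        self._dependency = other
        other._dependents += 1

    def destroy(self):
        self._destroy_requested = True
        self._try_destroy()

    def _try_destroy(self):
        # the managed object goes away only once destruction was
        # requested AND no dependents remain
        if self._destroy_requested and self._dependents == 0:
            print(f'{self.name} destroyed')
            if self._dependency is not None:
                self._dependency._dependents -= 1
                self._dependency._try_destroy()


context = Handle('context')
node = Handle('node')
node.requires(context)

context.destroy()  # deferred: node still requires context
node.destroy()     # prints 'node destroyed', then 'context destroyed'
```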
Can `Handle` call `destroy()` on all the downstream handles too? Without that, there must be something else that has duplicate information about which handle requires what, so they can be destroyed in the right order. If I want to destroy a `Node`, I must destroy the publishers, subscribers, etc. created from it first. In practice this responsibility is owned by `Node.destroy_node()`, but that only works because `Node` also knows which handles require its handle.

That duplicate knowledge is hard to maintain for `Context`. Without it, `rclpy.shutdown()` doesn't actually shut down the library; instead everything must wait for the garbage collector.
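For illustration, here is roughly what that ordering constraint looks like with the public rclpy API of the time (the node name, topic name, and message type are placeholders):

```python
import rclpy
from std_msgs.msg import String

rclpy.init()
node = rclpy.create_node('demo')
pub = node.create_publisher(String, 'chatter', 10)

# Dependents first: the publisher must go before the node,
# and the node before the context is shut down.
node.destroy_publisher(pub)
node.destroy_node()
rclpy.shutdown()
```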
> That duplicate knowledge is hard to maintain for `Context`. Without it, `rclpy.shutdown()` doesn't actually shut down the library; instead everything must wait for the garbage collector.
Up to now, shutting down rclpy wasn't destroying all the `rcl`-created objects, as `Context` wasn't using a `Handle`. If we want that feature, I would rather add it in a follow-up PR and limit this one to fixing the destruction order problem (it can be done by keeping track of all the nodes created by a context).
There's also some inconsistency in what happens when a node is destroyed:
- Publishers, subscriptions, clients, services, timers, and guards created by a node are destroyed.
- Action clients and action servers aren't.
We can also rely on objects being explicitly destroyed in reverse order, and fail if someone attempts to destroy an object that other objects still depend on.

If I may, I have to ask: where does this need for explicitly destroying objects come from? It's somewhat incompatible with Python, it creates invalid objects, it makes resource management more complex, and it could easily be circumvented by calling `gc.collect` if need be. It almost seems to me like we're fighting the language to do things in a way it was not meant to.
> If I may, I have to ask: where does this need for explicitly destroying objects come from? It's somewhat incompatible with Python, it creates invalid objects, it makes resource management more complex, and it could easily be circumvented by calling `gc.collect` if need be. It almost seems to me like we're fighting the language to do things in a way it was not meant to.
I hear you. This is an old discussion. I've been on that side before, and I'm still mostly on that side. I agree we don't need to free memory right away. There's nothing special about our RAM usage vs. Python's, so if the Python garbage collector is good enough for some, then it is good enough for all.

The reason to destroy these objects is not their memory usage; it's about freeing system resources. Explicitly freeing system resources without waiting for the garbage collector is not fighting the language; it's pretty normal:
- https://docs.python.org/3/library/socket.html#socket.close
- https://docs.python.org/3/library/io.html#io.IOBase.close
- https://docs.python.org/3/library/subprocess.html#subprocess.Popen.kill
Maybe it could be circumvented by calling `gc.collect()`, but that's not how Python objects usually handle this situation: https://docs.python.org/3/tutorial/inputoutput.html
> If you’re not using the with keyword, then you should call f.close() to close the file and immediately free up any system resources used by it. If you don’t explicitly close a file, Python’s garbage collector will eventually destroy the object and close the open file for you, but the file may stay open for a while. Another risk is that different Python implementations will do this clean-up at different times.
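The same pattern, sketched with a plain file object; nothing rclpy-specific here, just the standard explicit-release idiom the quote describes:

```python
# Explicit release: the file descriptor is freed now, not whenever the
# garbage collector happens to run.
f = open('data.txt')
try:
    data = f.read()
finally:
    f.close()

# Equivalent, and preferred:
with open('data.txt') as f:
    data = f.read()
```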
Yes, `rcl` objects have to be destroyed in order, but that has nothing to do with why we want to destroy them. We want to destroy them because that's the only way the `rcl` API gives us to free the system resources they use. Depending on the `rmw` implementation, nodes/pubs/subs/etc. open sockets, they use network bandwidth by continuing to participate in discovery, subscribers keep receiving data from other publishers on the network, service servers may keep accepting requests even if the intent is for another server to take their place, etc. That's where the need for explicitly destroying objects comes from.
I understand the point of freeing resources. But my point is that `rclpy.shutdown` is currently not doing so. I think that responsibility should belong to the `Context` itself, and not to `Handle`. I can add that, but IMO it's the subject of a follow-up.
> my point is that `rclpy.shutdown` is currently not doing so. I think that responsibility should belong to the `Context` itself, and not to `Handle`. I can add that, but IMO it's the subject of a follow-up.
That's a fair point. I think I'm getting hung up on what I intended `Handle` to do vs. the reality of that never being fully implemented. I do like the approach this PR takes, and I think it makes it pretty easy to get to a state where everything is cleaned up right away.

If `h1.requires(h0)` caused `h0` to store a weakref to `h1`, then `h0.destroy()` could call `h1.destroy()` without worrying about the destruction order of `h1`'s dependents, because the reference counting in this PR would take care of that. I think doing that is important for the reasons above, but you're right that it doesn't need to be done right away if the feature doesn't fully exist now. I'll dismiss my review since it seems to be blocking progress. Sorry about that.
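A hypothetical sketch of that weakref idea (none of this is in the PR; the names just mirror the comment): `h1.requires(h0)` makes `h0` remember `h1` weakly, so `h0.destroy()` can cascade to whatever dependents are still alive.

```python
import weakref


class Handle:
    def __init__(self, name):
        self.name = name
        self._dependents = []  # weakrefs to handles that require self
        self._destroyed = False

    def requires(self, other):
        # self depends on other; other keeps only a weak reference back,
        # so it does not keep self alive
        other._dependents.append(weakref.ref(self))

    def destroy(self):
        if self._destroyed:
            return
        # cascade to live dependents first; dead weakrefs return None
        for ref in self._dependents:
            dependent = ref()
            if dependent is not None:
                dependent.destroy()
        self._destroyed = True
        print(f'{self.name} destroyed')


h0 = Handle('context')
h1 = Handle('node')
h1.requires(h0)

h0.destroy()  # prints 'node destroyed', then 'context destroyed'
```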
@hidmic About the regression test you asked for: I gave it a try, but it's super hard to reproduce the problem. My main reason to move lifetime management to C is that destruction order is not documented in Python, nor is how the gc works. Both are implementation details of a particular Python interpreter, and users shouldn't make any assumptions about how they work. From PEP 442:
> [PEP 442 quotation not preserved in the page capture]

I think listing that PEP to back up …
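To illustrate why relying on gc behavior is fragile, a small example (assuming CPython; the point generalizes): with a reference cycle, the collector decides when, and in what order, finalizers run.

```python
import gc


class Wrapper:
    def __init__(self, name):
        self.name = name
        self.peer = None

    def __del__(self):
        # If this wrapped an rcl object with a dependency on the peer,
        # finalizing in the wrong order would be a use-after-free in C.
        print(f'finalizing {self.name}')


a = Wrapper('node')
b = Wrapper('context')
a.peer, b.peer = b, a  # reference cycle: neither refcount reaches zero

del a, b
gc.collect()  # PEP 442 lets these finalizers run, but in no guaranteed order
```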
LGTM but for a question, though some green CI would be nice too
```python
with node1.handle:
    pass
finally:
    rclpy.shutdown(context=context)
```
@ivanpauno why were these tests dropped? They certainly have to be rewritten, but they still apply conceptually, no?
Those tests were relying on `requires`, which is no longer exposed.

I think the tests I wrote in https://github.com/ros2/rclpy/blob/39a93ac5cde247b53137441e34f431d68ba211c4/rclpy/src/test/rclpy_common/test_handle.cpp are a good replacement.
I see, fair enough.
Add node handle dependency on context handle

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
See 826e6cc.
LGTM again but for a few nits, and pending green CI
LGTM!
Supersedes #470.
See #470 (comment).

Solves strange bugs related to destruction order.

Included in this PR:
- `rclpy_handle_t` struct, related methods, and usage of it in the `Handle` class.
- `rclpy_handle_t` where needed.

Not included (subject of a follow-up):
- `rmw_request_id_t` capsule is leaking.
- `rmw_qos_profile_t` capsule is leaking.
- Refactoring of wait set related code, so it can be wrapped by a `Handle`.
- Refactoring of `rclpy_action` objects, so they can be wrapped in a `Handle`.