
Context destruction order bug #470

Closed
wants to merge 2 commits into from

Conversation

ivanpauno
Member

@ivanpauno ivanpauno commented Dec 3, 2019

Intro

As part of ros2/rmw#183, I needed to ensure that the context is destructed after the nodes/publishers/etc. That's needed because the Participant will now be part of the context, and the Participant is needed during the destruction of most of the entities (e.g. https://github.com/ros2/rmw_fastrtps/blob/a16c45ef11a19361a15c585497bf373a7ed0bae6/rmw_fastrtps_shared_cpp/src/rmw_service.cpp#L75).
Also, the rcl documentation says that not doing so is undefined behavior: https://github.com/ros2/rcl/blob/3a5c3a34191bd95491cb72c7458b9409452a28f0/rcl/include/rcl/context.h#L101-L106 (before, undefined meant nothing happened; now it means a segfault).

Commits

  • b740f90 uses a Handle object to wrap the context capsule and establishes a dependency of the node handle on the context handle. The other changes are just a consequence of this.
  • 5f7f249 adds some debug messages that I will use in the buggy example.

Buggy example

The context is sometimes being destructed out of order, though it's hard to reproduce.
Here is one example:

  • Example

    Code
    import rclpy
    from rclpy.executors import SingleThreadedExecutor
    from rclpy.node import Node
    from std_msgs.msg import String
    
    
    def main(args=None):
        rclpy.init(args=args)
        node = Node('talker')
        node2 = Node('asd')
        rclpy.shutdown()
    
    if __name__ == '__main__':
        main()
    Output
    destroying rcl_node_t   140190542497472
    destroying rcl_context_t   140190450323888
    destroying rclpy_publisher_t   140190542497824
    destroying rcl_clock_t   140190449713680
    destroying rclpy_service_t   140190542497872
    destroying rclpy_service_t   140190542497856
    destroying rclpy_service_t   140190542497888
    destroying rclpy_service_t   140190542497904
    destroying rclpy_service_t   140190542497920
    destroying rclpy_service_t   140190542497936
    destroying rcl_node_t   140190542497952
    destroying rclpy_publisher_t   140190542497968
    destroying rcl_clock_t   140190375035248
    destroying rclpy_service_t   140190542498016
    destroying rclpy_service_t   140190542498000
    destroying rclpy_service_t   140190542498032
    destroying rclpy_service_t   140190542498048
    destroying rclpy_service_t   140190542498064
    destroying rclpy_service_t   140190542498080
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying publisher
    destroyed publisher
    destroying node
    destroyed node
    context being destructed
    context destructed
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying service
    destroyed service
    destroying publisher
    destroyed publisher
    destroying node
    destroyed node
    
    Conclusion: The context is being destructed out of order. All the examples where I observed this have two nodes (maybe someone is decrementing a reference twice?).

Original bug

In the branches where I'm moving the Participant to the context, the following test failed:

def test_handle_does_not_destroy_requirements():
    context = rclpy.context.Context()
    rclpy.init(context=context)
    try:
        node1 = rclpy.create_node('test_handle_does_not_destroy_requirements1', context=context)
        node2 = rclpy.create_node('test_handle_does_not_destroy_requirements2', context=context)
        node2.handle.requires(node1.handle)
        with node1.handle:
            pass
        with node2.handle:
            pass
        node2.handle.destroy()
        with pytest.raises(InvalidHandle):
            with node2.handle:
                pass
        with node1.handle:
            pass
    finally:
        rclpy.shutdown(context=context)
Some interesting facts:

  • It always failed when calling colcon test --packages-select rclpy
  • It didn't fail when calling python3 -m pytest -s test_handle.py
  • It did fail when calling python3 -m pytest -s test_handle.py::test_handle_does_not_destroy_requirements.

I have been debugging and found that the objects are being destructed by the gc.
AFAIU, the CPython gc doesn't ensure any destruction order (my understanding of the CPython garbage collection process is really poor).

@ivanpauno
Member Author

ivanpauno commented Dec 3, 2019

@hidmic FYI
@sloretz Can you take a look at this? I'm not really sure what the problem is (nor how to fix it).

@ivanpauno ivanpauno added bug Something isn't working help wanted Extra attention is needed labels Dec 3, 2019
@sloretz
Contributor

sloretz commented Dec 4, 2019

cpython gc doesn't ensure any destruction order

Handle is responsible for the destruction order. When garbage collected, it forces the order using Python's reference counting. Both of the nodes' handles should be keeping a reference to the context handle. The context handle's reference count is non-zero until the nodes' handles are destroyed because it's still held by them. In theory the context shouldn't get garbage collected at the same time.

That's the theory anyways. From the print statements it looks like the context is being destroyed after the first node, but before the second node. Maybe there's a bug in Handle that causes the second node handle to lose its reference to the context handle? I can look at this deeper later this week.
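
As a rough illustration of that theory, here is a minimal sketch using a hypothetical FakeHandle class (not the real rclpy Handle): with plain CPython reference counting and no cycles, the dependent handle is finalized before the handle it requires.

import gc


class FakeHandle:
    """Hypothetical stand-in for Handle; not the real rclpy implementation."""

    def __init__(self, name):
        self.name = name
        self._required = []  # strong references to the handles this one depends on

    def requires(self, other):
        # Holding a strong reference keeps `other` alive at least as long as `self`.
        self._required.append(other)

    def __del__(self):
        print('destroying', self.name)


context_handle = FakeHandle('context handle')
node_handle = FakeHandle('node handle')
node_handle.requires(context_handle)

# With pure reference counting the node handle is finalized first; only then
# does the context handle's count drop to zero.
del node_handle
del context_handle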

@ivanpauno
Member Author

Handle is responsible for the destruction order. When garbage collected, it forces the order using Python's reference counting. Both of the nodes' handles should be keeping a reference to the context handle. The context handle's reference count is non-zero until the nodes' handles are destroyed because it's still held by them. In theory the context shouldn't get garbage collected at the same time.

I understand the same.

That's the theory anyways. From the print statements it looks like the context is being destroyed after the first node, but before the second node. Maybe there's a bug in Handle that causes the second node handle to lose its reference to the context handle?

That's possible, though I unfortunately haven't found a bug.

I can look at this deeper later this week.

Thanks!!!

@sloretz
Contributor

sloretz commented Dec 10, 2019

No conclusions yet, but I do have a workaround. The destruction order appears correct using ctx.handle.destroy(), but incorrect when done via the garbage collector. Destroying the handle when the context is shutdown works around the issue.

diff --git a/rclpy/rclpy/context.py b/rclpy/rclpy/context.py
index 76a2fcc..2a26696 100644
--- a/rclpy/rclpy/context.py
+++ b/rclpy/rclpy/context.py
@@ -59,6 +59,7 @@ class Context:
         with self._handle as capsule, self._lock:
             rclpy_implementation.rclpy_shutdown(capsule)
         self._call_on_shutdown_callbacks()
+        self.handle.destroy()
 
     def try_shutdown(self):
         """Shutdown this context, if not already shutdown."""
$ ./two_node_shutdown.py 
destroying service
destroyed service
destroying service
destroyed service
destroying service
destroyed service
destroying service
destroyed service
destroying service
destroyed service
destroying service
destroyed service
destroying node
destroyed node
destroying service
destroyed service
destroying service
destroyed service
destroying service
destroyed service
destroying service
destroyed service
destroying service
destroyed service
destroying service
destroyed service
destroying node
destroyed node
context being destructed
context destructed
delaying interpretter exit
destroying rcl_node_t   140172171982528
destroying rcl_context_t   140172050260376
destroying rclpy_publisher_t   140172171982880
destroying rcl_clock_t   140172050237280
destroying rclpy_service_t   140172171982928
destroying rclpy_service_t   140172171982912
destroying rclpy_service_t   140172171982944
destroying rclpy_service_t   140172171982960
destroying rclpy_service_t   140172171982976
destroying rclpy_service_t   140172171982992
destroying rcl_node_t   140172171983008
destroying rclpy_publisher_t   140172171983024
destroying rcl_clock_t   140171974628976
destroying rclpy_service_t   140172171983072
destroying rclpy_service_t   140172171983056
destroying rclpy_service_t   140172171983088
destroying rclpy_service_t   140172171983104
destroying rclpy_service_t   140172171983120
destroying rclpy_service_t   140172171983136

@ivanpauno
Member Author

No conclusions yet, but I do have a workaround. The destruction order appears correct using ctx.handle.destroy(), but incorrect when done via the garbage collector. Destroying the handle when the context is shutdown works around the issue.

Yeah, that fixes this specific problem.
But if we don't understand the bug, we may hit it again in the future and have surprising segfaults (e.g.: with handles of other capsules not being destructed in order).

@hidmic
Contributor

hidmic commented Dec 12, 2019

but incorrect when done via the garbage collector.

This. Should we test that the garbage collector behaves as we'd expect it to? If ref counting does not result in ordered destruction, then the Handle rationale falls apart.

@sloretz
Contributor

sloretz commented Dec 17, 2019

If ref counting does not result in ordered destruction, then the Handle rationale falls apart.

Yeah, that seems to be the case. It looks like the garbage collector is destroying Handle instances that are still referred to by other handles, which breaks an assumption the garbage collection part of Handle relies on.

If I modify the example so gc.collect() happens inside of main(), then the objects are destroyed in the correct order. If I modify it so gc.collect() happens after main() returns, then it happens in the wrong order. Maybe this is a special case of the garbage collector when handling a group of objects that became an island? Regardless, solving this will require a scheme that destroys the rcl instances in order regardless of the order in which the PyCapsules get garbage collected.
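
A sketch of that experiment, assuming the two-node example from the top of the thread with only the placement of gc.collect() changed:

import gc

import rclpy
from rclpy.node import Node


def main(args=None):
    rclpy.init(args=args)
    node = Node('talker')
    node2 = Node('asd')
    rclpy.shutdown()
    # Calling gc.collect() here (inside main()) gives the correct order.
    # gc.collect()


if __name__ == '__main__':
    main()
    # Calling it here, after main() returns, reproduces the wrong order.
    gc.collect()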

@ivanpauno
Member Author

Yeah, that seems to be the case. It looks like the garbage collector is destroying Handle instances that are still referred to by other handles, which breaks an assumption the garbage collection part of Handle relies on.

If I modify the example so gc.collect() happens inside of main(), then the objects are destroyed in the correct order. If I modify it so gc.collect() happens after main() returns, then it happens in the wrong order. Maybe this is a special case of the garbage collector when handling a group of objects that became an island? Regardless, solving this will require a scheme that destroys the rcl instances in order regardless of the order in which the PyCapsules get garbage collected.

I'm surprised that Python doesn't ensure destruction order in this case at all, where there is no reference cycle nor anything strange, but I reached the same conclusion.

About how to solve the problem, I'm thinking of implementing the handle type in C.
Roughly:

typedef struct rclpy_handle_t {
  void * ptr;  // Opaque pointer to the wrapped object.
  size_t ref_count;  // Reference count.
  struct rclpy_handle_t ** dependencies;  // Array of pointers to dependencies.
  size_t num_of_dependencies;  // Size of the array.
} rclpy_handle_t;

// Starts with ref count 1 and an empty dependency list.
// A deleter might be added to the signature, though we just use malloc/free in rclpy.
rclpy_handle_t *
rclpy_create_handle(void * ptr);

// Increments the ref count of the dependency, and adds the dependency to the dependent's list.
void
rclpy_handle_add_dependency(rclpy_handle_t * dependent, rclpy_handle_t * dependency);

// Decrements the ref count of the handle.
// If it reaches zero:
// - Calls `rclpy_handle_dec_ref` on its dependencies.
// - Deallocates the handle.
void
rclpy_handle_dec_ref(rclpy_handle_t * handle);

We can wrap that in a normal PyCapsule and create a python Handle class based on that.

@hidmic @sloretz @dirk-thomas Does that sound reasonable?
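
For illustration, here is a pure-Python model of the intended semantics (not actual rclpy code; the real rclpy_handle_t would live in the C extension):

class HandleModel:
    """Pure-Python model of the proposed rclpy_handle_t semantics (illustration only)."""

    def __init__(self, name, destroy_callback):
        self.name = name
        self.ref_count = 1
        self.dependencies = []
        self._destroy_callback = destroy_callback

    def add_dependency(self, dependency):
        # Mirrors rclpy_handle_add_dependency(): bump the dependency's count.
        dependency.ref_count += 1
        self.dependencies.append(dependency)

    def dec_ref(self):
        # Mirrors rclpy_handle_dec_ref(): only when the count reaches zero is the
        # wrapped object destroyed, and only then are dependencies released.
        self.ref_count -= 1
        if self.ref_count == 0:
            self._destroy_callback(self.name)
            for dependency in self.dependencies:
                dependency.dec_ref()


context = HandleModel('rcl_context_t', print)
node = HandleModel('rcl_node_t', print)
node.add_dependency(context)

context.dec_ref()  # nothing printed: the node still holds a reference
node.dec_ref()     # prints 'rcl_node_t' and then 'rcl_context_t'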

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno
Member Author

ivanpauno commented Jan 3, 2020

I tested a bit more today, and I think I understand why Python doesn't destruct the objects in the expected order:

Nodes create reference cycles when constructing a TimeSource and a ParameterService:

rclpy/rclpy/rclpy/node.py

Lines 203 to 207 in d2c30d9

self._time_source = TimeSource(node=self)
self._time_source.attach_clock(self._clock)
if start_parameter_services:
    self._parameter_service = ParameterService(self)

That means that all Node objects can't be destructed by reference counting and are destructed by the gc.
As nodes also maintain a reference to the node handle and the context, and the context maintains a reference to the context handle, all those objects also can't be destructed by reference counting and must be destructed by the gc.

The gc doesn't ensure any destruction order (as that would be impossible).
The Handle trick "usually" worked well because of how the gc is coded, but that seems to be just by chance and is not ensured at all.
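
A minimal, rclpy-independent sketch of that situation, with hypothetical Context/Node/TimeSource classes just to show the cycle:

import gc


class Context:
    def __del__(self):
        print('context destructed')


class Node:
    def __init__(self, context):
        self.context = context               # the node depends on the context
        self.time_source = TimeSource(self)  # reference cycle, like TimeSource(node=self)

    def __del__(self):
        print('node destructed')


class TimeSource:
    def __init__(self, node):
        self.node = node


ctx = Context()
n1, n2 = Node(ctx), Node(ctx)
del ctx, n1, n2
# Because of the cycles, nothing is freed by reference counting; the gc collects
# the whole island and does not guarantee that the nodes go before the context.
gc.collect()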

I commented out the mentioned lines, and the example works correctly. It's pretty hard to confirm that my assumption is correct, but I think it is.

I don't think that the problem is that we're creating reference cycles when creating nodes (which probably is not ideal); the problem is the whole Handle concept.
If we are careful enough to not create any reference cycle, ensuring objects are destructed by ref counting, this might work. But if a user happens to make that same mistake, they'll face the same problem. And triaging this kind of problem is usually a rabbit hole.

I think that all this "dependency order of objects" problem has to be solved in C; I don't think it can be solved in Python.

@dirk-thomas @sloretz @hidmic Friendly ping. I proposed a solution in my last comment; I would like some feedback.

@hidmic
Contributor

hidmic commented Jan 3, 2020

Aha! That's fun. I see why we may need to extend rclpy's C extension to deal with this too. I'd say, let's go for it. It's definitely a huge bug in Eloquent and something I think we'd want fixed for Foxy.

@dirk-thomas
Member

Nodes create reference cycles when constructing a TimeSource and a ParameterService:
That means that all Node objects can't be destructed by reference counting and are destructed by the gc.

Would breaking these cycles solve the problem? If yes, I would suggest doing that instead of a custom handle class.

E.g., considering the same in C++, we wouldn't use shared_ptr between the objects, leading to an "island" which would never be cleaned up. Instead, we would break the cycle somewhere using a weak_ptr.

@sloretz
Contributor

sloretz commented Jan 3, 2020

We can wrap that in a normal PyCapsule

I'm not sure I understand here. Would rclpy_create_context() return a PyCapsule with a pointer to an instance of rclpy_handle_t with the PyCapsule name still being "rcl_context_t"?

and create a python Handle class based on that.

What's the separation between rclpy_handle_t and class Handle? What is each responsible for?

I think I understand how rclpy_handle_t would work if it is garbage collected, but how would it support being explicitly destroyed? IIUC Handle.destroy() tries to destroy the object right away. Any handles that depend on it get destroyed first. The catch is that if a handle is being used (like a subscription in the wait set in rcl_wait()), then the handle gets marked for destruction and the last user to be done with the handle destroys it. I think this would need more information than rclpy_handle_t has: a way to tell if a handle is being used, a way to tell if the handle should be destroyed when it's done being used, and a way to traverse the DAG from an object to other objects that depend on it. Would this be part of rclpy_handle_t, or is the thought that it will be implemented in Handle above rclpy_handle_t?

@ivanpauno
Member Author

Would breaking these cycles solve the problem? If yes, I would suggest doing that instead of a custom handle class.

That partially solves the problem. The problem is that if the user creates any reference cycle, and one object in that cycle references a Node, then that node can't be destroyed by reference counting anymore.
Then the Handle trick for ensuring destruction order doesn't work and the user gets a strange segmentation fault (or maybe something hangs, or some other strange error).
And the "Python style" is that you don't have to worry about whether you created a reference cycle or not; your objects will at some point be collected and destructed without causing problems. Conditioning users to not create reference cycles wouldn't be very "Pythonic".
And yes, we shouldn't create reference cycles in rclpy.

I'm not sure I understand here. Would rclpy_create_context() return a PyCapsule with a pointer to an instance of rclpy_handle_t with the PyCapsule name still being "rcl_context_t"?

I think that's ok, maybe a little strange though.
It's also true that, after this, all places that were expecting a capsule to an rcl object will get a capsule to a rclpy_handle_t, and an extra step must be done to get the pointer to the rcl object.

IIUC Handle.destroy() tries to destroy the object right away. Any handles that depend on it get destroyed first. The catch is that if a handle is being used (like a subscription in the wait set in rcl_wait()), then the handle gets marked for destruction and the last user to be done with the handle destroys it. I think this would need more information than rclpy_handle_t has: a way to tell if a handle is being used, a way to tell if the handle should be destroyed when it's done being used, and a way to traverse the DAG from an object to other objects that depend on it. Would this be part of rclpy_handle_t, or is the thought that it will be implemented in Handle above rclpy_handle_t?

That's a good question.
Currently, when a handle is destroyed, its dependents are also destroyed (i.e. if someone called dependent_handle.requires(dependency_handle), doing dependency_handle.destroy() destroys both).
I find that really strange. What I think should happen is that if someone calls dependency_handle.destroy(), the ref count of the (dependency_)handle should be decremented, and the object shouldn't actually be destroyed until the dependents are destroyed too (and not force the dependents to be destructed at that moment).

Following that idea, what should happen when someone calls handle.destroy() is that the reference count of the wrapped rclpy_handle_t should be decremented and the capsule destroyed.
That would not actually destroy the rcl object until all the other rclpy_handle_t objects that added it as a dependency are destroyed. At that point, the reference counts of all the dependencies stored in the rclpy_handle_t are decremented, and the stored rcl object is destroyed.

@dirk-thomas
Member

we shouldn't create reference cycles in rclpy.

Then lets fix those first.

Conditioning users to not create reference cycles wouldn't be very "Pythonic".

And worry about user code which does create their own cycles in a second step.

@sloretz
Contributor

sloretz commented Jan 7, 2020

Currently, when a handle is destroyed, its dependents are also destroyed (i.e. if someone called dependent_handle.requires(dependency_handle), doing dependency_handle.destroy() destroys both).
I find that really strange. What I think should happen is that if someone calls dependency_handle.destroy(), the ref count of the (dependency_)handle should be decremented, and the object shouldn't actually be destroyed until the dependents are destroyed too (and not force the dependents to be destructed at that moment).

Yeah I agree it's a strange thing to do in Python. The goal is to remove stuff from the middleware asap. rclpy can't rely on the gc because it doesn't make any guarantees about when it will collect stuff. __enter__() and __exit__() don't quite fit here either. The closest thing I can think of is socket.close(). If someone calls node.destroy_node(), rclpy tries to destroy the node right away* so it no longer appears on the node graph, which means publishers/subscribers/clients/etc need to be destroyed first.

what should happen when someone calls handle.destroy() is that the reference count of the wrapped rclpy_handle_t should be decremented and the capsule destroyed.
That would not actually destroy the rcl object until all the other rclpy_handle_t objects that added it as a dependency are destroyed

I can see how this would work for nodes. If service_handle requires node_handle then node_handle.ref_count is 2 and service_handle.ref_count is 1. If someone calls node_handle.destroy() then both are 1, so nothing is destroyed. If this were all then the node destruction would have to wait for the garbage collector to collect service_handle, and that's to be avoided. However, most would call node.destroy_node() which calls node.destroy_service() which would call service_handle.destroy() which would then destroy service_handle followed by the node_handle without waiting for the gc.

In the context of this PR, context.shutdown() would need more information to do this. If node_handle requires context_handle, then context would need a reference to node so that context.shutdown() could call node.destroy_node(). Otherwise context.shutdown() can only decrement context_handle.ref_count, which as shown above means it would have to wait for the node to be garbage collected.

* It can be delayed by subscribers/timers/services/etc being used in rcl_wait(), but that's easier to control than the garbage collector. This is why destroy_node() wakes the executor.

@ivanpauno
Member Author

Yeah I agree it's a strange thing to do in Python. The goal is to remove stuff from the middleware asap. rclpy can't rely on the gc because it doesn't make any guarantees about when it will collect stuff. __enter__() and __exit__() don't quite fit here either. The closest thing I can think of is socket.close(). If someone calls node.destroy_node(), rclpy tries to destroy the node right away* so it no longer appears on the node graph, which means publishers/subscribers/clients/etc need to be destroyed first.

The socket example is also not the same. You can actually solve the problem using the context manager (example).
Our use case is pretty strange. We don't want to only rely on the gc, because the desire is to have a way to clean up objects immediately to free middleware resources. But we also have a lot of dependencies between objects, for which we want to ensure that they are destructed in the correct order; if not, something might crash. I unfortunately haven't seen any other Python library with this kind of problem (though I haven't used many), and I would love to see solutions that other libraries apply.

I can see how this would work for nodes. If service_handle requires node_handle then node_handle.ref_count is 2 and service_handle.ref_count is 1. If someone calls node_handle.destroy() then both are 1, so nothing is destroyed. If this were all then the node destruction would have to wait for the garbage collector to collect service_handle, and that's to be avoided. However, most would call node.destroy_node() which calls node.destroy_service() which would call service_handle.destroy() which would then destroy service_handle followed by the node_handle without waiting for the gc.

I think that users shouldn't call handle.destroy() directly; they should just call the destroy method of the corresponding object, as you commented. We should probably make the documentation clearer, or remove handle from the public API (just have an attribute node._handle that users aren't supposed to use, and don't provide the property node.handle).

In the context of this PR, context.shutdown() would need more information to do this. If node_handle requires context_handle, then context would need a reference to node so that context.shutdown() could call node.destroy_node(). Otherwise context.shutdown() can only decrement context_handle.ref_count, which as shown above means it would have to wait for the node to be garbage collected.

IMO, I don't think that context.shutdown() should destroy anything.
It should call the corresponding rcl method, which actually doesn't destruct anything (AFAIR); it just indicates that the context is invalid afterwards.
If we want to provide a way to destruct all the objects related to a context, I think we should add another method called destroy. In that case, I think that keeping track of the nodes created by a context should be enough (using weak references, to avoid a ref cycle between the nodes and the context).
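
A sketch of that bookkeeping, using hypothetical ContextModel/DemoNode classes (the attribute and method names are made up for illustration):

import weakref


class DemoNode:
    def destroy_node(self):
        print('node destroyed')


class ContextModel:
    """Illustration only; not rclpy's actual Context."""

    def __init__(self):
        # Weak references avoid creating a ref cycle between the context and its nodes.
        self._weak_nodes = weakref.WeakSet()

    def track_node(self, node):
        self._weak_nodes.add(node)

    def destroy(self):
        # Destroy whichever nodes are still alive before tearing down the context itself.
        for node in list(self._weak_nodes):
            node.destroy_node()


context = ContextModel()
node = DemoNode()
context.track_node(node)
context.destroy()  # prints 'node destroyed'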


I also want to mention that I consider my proposal "complex", but I have failed to find a simpler alternative.
I also don't like the fact that this problem might also happen in other client libraries in garbage-collected languages.
It would be good if we could find a solution that can live in rcl, but I don't know if that's possible.

@sloretz
Contributor

sloretz commented Jan 7, 2020

The socket example is also not the same

Right, not the same since there aren't any dependencies, but similar in that there's a resource to be freed without waiting for the garbage collector.

Our use case is pretty strange. We don't want to only rely on the gc, because the desire is to have a way to clean up objects immediately to free middleware resources. But we also have a lot of dependencies between objects, for which we want to ensure that they are destructed in the correct order; if not, something might crash.

That sums it up.

I would love to see solutions that other libraries apply.

If a context manager is used to clean up things without waiting for the garbage collector, then nesting context managers (nesting with statements) can clean up context managers that depend on other context managers. contextlib.ExitStack is a way to nest context managers without hardcoding with statements.
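
For example, a generic sketch of the ExitStack pattern (resource() here is just a placeholder context manager):

from contextlib import ExitStack, contextmanager


@contextmanager
def resource(name):
    print('acquiring', name)
    yield name
    print('releasing', name)


with ExitStack() as stack:
    context = stack.enter_context(resource('context'))
    node = stack.enter_context(resource('node'))
    # On exit the stack unwinds in reverse order: the node is released
    # before the context, without hardcoding nested with statements.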

I don't think that context.shutdown() should destroy anything. [...] If we want to provide a way to destruct all the objects related to a context, I think we should add another method called destroy.

It can be called context.destroy() instead of context.shutdown(). Whatever the name, it should be called by rclpy.shutdown(). I think there would be a lot of value in this PR to make rclpy.shutdown() destroy all resources used in the middleware, so nodes created by one unit test won't affect the next.

@ivanpauno
Member Author

It can be called context.destroy() instead of context.shutdown(). Whatever the name, it should be called by rclpy.shutdown(). I think there would be a lot of value in this PR to make rclpy.shutdown() destroy all resources used in the middleware, so nodes created by one unit test won't affect the next.

I forgot about rclpy.shutdown. It sounds reasonable.
I find many of these things inconsistent with rclcpp, e.g. rclcpp::shutdown doesn't clean up any objects. Maybe the comparison is not fair, though.

I think the proposal I made can also satisfy that requirement (as commented above).
I will wait for some more opinions before starting to implement something.

@hidmic
Contributor

hidmic commented Jan 8, 2020

Following that idea, what should happen when someone calls handle.destroy() is that the reference count of the wrapped rclpy_handle_t should be decremented and the capsule destroyed.
That would not actually destroy the rcl object until all the other rclpy_handle_t objects that added it as a dependency are destroyed. At that point, the reference counts of all the dependencies stored in the rclpy_handle_t are decremented, and the stored rcl object is destroyed.

Essentially recreating std::shared_ptr in C. That's what we do for rclcpp, so I guess it's fine for a relatively non-intrusive solution. BTW pybind11 supports std::shared_ptr i.e. shared ownership between Python and C++, but that'd be a much larger change to rclpy.

@ivanpauno
Member Author

Closing, superseded by #497.

@ivanpauno ivanpauno closed this Jan 20, 2020
@ivanpauno ivanpauno deleted the ivanpauno/fix-destruction-order branch January 20, 2020 17:09