-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
intraprocess bond::Bond::BondStatusCB use after free #4841
base: ipc
Are you sure you want to change the base?
Conversation
The Bond mechanism includes creation of a subscription using a reference to a member function(bondStatusCB) of the Bond class. This member function operates on member variables. The lifecycle_node was calling bond_.reset() which releases the memory as far as the lifecycle_node is concerned but this is not immediately released from the rclcpp internal mechanisms (especially intraprocess). As a result the bondStatusCB function can called after it has been freed. This use after free shows up reliably with asan when running the test_bond test. This change allows the test_bond to suceed by calling bond_->breakBond() (rather than bond_.reset()) to break the bond rather than expecting it to be done cleanly by the ~Bond() destructor. Is it enough is TBC. Other possibilities might be to get the Bond to inherit from std::enable_shared_from_this(), as Ros2 Nodes do, so that the pointer to the Bond member function bondStatusCB function remains valid until the subscription is released during destruction. Signed-off-by: Mike Wake <[email protected]>
I thought about that potentially, but that is called in the Bond Destructor already when we reset the shared pointer [1], so that should be done first thing as the object is being destroyed. That was on my list to try to see if it virtually fixed the issue, but I think we should dig a bit deeper so that this issue doesn't come back to bite us in another place or other bond users can do this too.
You provide some example fixes for bond / how we use bond (use a shared pointer instead of a unique ptr; I do want to do more than just
It sounds like your confirming a bug in [1] https://github.com/ros/bond_core/blob/ros2/bondcpp/src/bond.cpp#L163 |
By the way, I tried this, it didn't work. The rclcpp maintainers pointed to something they think is related / the cause in this comment ros2/rclcpp#2678 (comment) |
See ros/bond_core#108 for changes that use enable_shared_from_this to ensure valid lifetime of the member function subscription callback. I also commented on ros2/rclcpp#2678 (comment) suggesting that a helper function like createSafeSubscriptionMemFuncCallback() could be added to the rclcpp namespace. It could have an intermediate home in nav2_utils too. |
In order for Bond to inherit from std::enable_shared_from_this to allow its lifetime to be managed to avoid a use after free where its subcriber callback is called before it has been cleaned up from internal rclcpp mechanism. This is being explored either by modification to rclcpp or using a lambda that protects against the the member function going out of scope by testing a weak_ptr to shared_from_this for expiry before the member function bondStatusCB is called.
This wraps the creation of the node currently (ipc experiment branch) sets the node option .use_intra_process_comms(true)
I'm trying to understand better here the set of options from these PRs / issue tickets.
|
I'll have a go at explaining what I currently think for each of your points,
You suggested earlier, #4841 (comment), and I was concerted that breakBond() is not enough as you also wanted the timers to be stopped. So I reverted that change. Then for any of the the weak_ptr expiry checking mechanism to work the bond must be created(in nav2) as a shared_ptr(not a unique_ptr) to allow use of shared_from_this().
I agree that it would be best to resolve the issue in rclcpp. With that in mind I explored a way that I thought might do that in a rclcpp PR ros2/rclcpp#2725. This has prompted some discussion but I think the actual changes I made are likely to be rejected in favor of a
I think the argument against something like rclcpp::createSafeSubscriptionMemFuncCallback() is the onus is put on the user to use it. IOW, The create_subscription API does not take an argument that can be used as a lifetime pointer to capture with the callback function in an internally constructed lambda. So I don't think it can be done within the current create_subscription API call. I proposed and coded up an addition of a callback_lifetime weak_ptr to subscription options as an alternative. It works(tm) within rclcpp but still puts onus on the user to use it. ros2/rclcpp#2725 I am still trying to understand the multi-threaded executor reasoning behind the proposed
I think the demotion of weak_ptr of everything everywhere within rclcpp ONLY handles the actual subscription entity itself. There is currently not enough information provided in the create_subscription API to handle the lifetime of the registered callback function. (Now just repeating myself I think) Hope the above helps rather than hinders. |
Got it, thank you. I understood about 80% of that but enough to get the gist of the issue, but perhaps not all of the specific details of the proposed solutions. I agree that not putting the onus on the user is very important since that is a huge quality of use / life issue for ROS 2 users (and subtle to explain when it matters).
Wouldn't Michael's comment on demoting everything to weak pointers though resolve that issue? Currently compiling your bond branch + Nav2 and running the system tests to see what I see. I think Nav2's use of IPC might be blocked by a full solution in rclcpp, but at least allows me to do my final testing so I can finish my work, validate everything, and then hold to merge until ready. If Bond's PR can be merged and that's the only issue within Nav2 that this issue rears its head, then I suppose we could move forward here too - but I'd feel better with a full solution that removes this 'class' of problem, since we have nodes that contain objects that contain sub/pub in Nav2 in several places. Edit: I'm still seeing my Nav2 system tests job running 45 minutes later. That usually means that every single test is failing and waiting for its full timeout before force exiting the test (usually takes ~22 minutes). This is running your bond branch + the changes in this PR. I'm curious if you tested those changes or if I should try again? |
I tried again without avail - still shows virtually every system test failing using this branch + the bond PR branch. I don't think this resolves the issue unfortunately :( |
The Bond mechanism includes creation of a subscription using a reference to a member function(bondStatusCB) of the Bond class.
This member function operates on member variables.
The lifecycle_node was calling bond_.reset() which releases the memory as far as the lifecycle_node
is concerned but this is not immediately released
from the rclcpp internal mechanisms (especially intraprocess). As a result the bondStatusCB function can called after it has been freed. This use after free shows up reliably with asan when running the test_bond test.
This change allows the test_bond to suceed by
calling bond_->breakBond() (rather than bond_.reset()) to break the bond rather than expecting it to
be done cleanly by the ~Bond() destructor.
Is it enough is TBC.
Other possibilities might be to get the Bond to
inherit from std::enable_shared_from_this(), as Ros2 Nodes do, so that the pointer to the Bond member function bondStatusCB function remains valid until the subscription is released during destruction.
Basic Info
Description of contribution in a few bullet points
Call bond_->breakBond() in on_deactivate rather than bond_.reset() to avoid use after free caused by dangling pointer held by rclcpp intraprocess subscription callback mechanism.
Description of documentation updates required from your changes
none
Future work that may be required in bullet points
For Maintainers: