Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intraprocess bond::Bond::BondStatusCB use after free #4841

Open
wants to merge 4 commits into
base: ipc
Choose a base branch
from

Conversation

ewak
Copy link

@ewak ewak commented Jan 10, 2025

The Bond mechanism includes creation of a subscription using a reference to a member function(bondStatusCB) of the Bond class.

This member function operates on member variables.

The lifecycle_node was calling bond_.reset() which releases the memory as far as the lifecycle_node
is concerned but this is not immediately released
from the rclcpp internal mechanisms (especially intraprocess). As a result the bondStatusCB function can called after it has been freed. This use after free shows up reliably with asan when running the test_bond test.

This change allows the test_bond to suceed by
calling bond_->breakBond() (rather than bond_.reset()) to break the bond rather than expecting it to
be done cleanly by the ~Bond() destructor.

Is it enough is TBC.

Other possibilities might be to get the Bond to
inherit from std::enable_shared_from_this(), as Ros2 Nodes do, so that the pointer to the Bond member function bondStatusCB function remains valid until the subscription is released during destruction.


Basic Info

Info Please fill out this column
Ticket(s) this addresses #4691
Primary OS tested on Ubuntu
Robotic platform tested on none
Does this PR contain AI generated software? no

Description of contribution in a few bullet points

Call bond_->breakBond() in on_deactivate rather than bond_.reset() to avoid use after free caused by dangling pointer held by rclcpp intraprocess subscription callback mechanism.

Description of documentation updates required from your changes

none


Future work that may be required in bullet points

  • Consider effect of only using breakBond() rather than reset().
  • Should LifeCycleNode::destroyBond() wait for some timeout for the bond->breakBond() to take affect using waitUntilBroken() ?
  • Figure out an improvement to rclcpp intraprocess subscription mechanism to check for memory lifetime of passed in callback function.
  • Maybe make changes to bondcpp to have Bond inherit from std::enable_shared_from_this() to extend the lifetime of the subscription callback function.
  • Consider creating the bond_ as a shared_ptr rather than a unique_ptr to allow the internal mechanisms of rclcpp to hold onto valid memory until they have a chance to cleanup properly.

For Maintainers:

  • Check that any new parameters added are updated in docs.nav2.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins is added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

The Bond mechanism includes creation of a subscription
using a reference to a member function(bondStatusCB)
of the Bond class.

This member function operates on member variables.

The lifecycle_node was calling bond_.reset() which
releases the memory as far as the lifecycle_node
is concerned but this is not immediately released
from the rclcpp internal mechanisms (especially intraprocess).
As a result the bondStatusCB function can called after it
has been freed.  This use after free shows up reliably with
asan when running the test_bond test.

This change allows the test_bond to suceed by
calling bond_->breakBond() (rather than bond_.reset())
to break the bond rather than expecting it to
be done cleanly by the ~Bond() destructor.

Is it enough is TBC.

Other possibilities might be to get the Bond to
inherit from std::enable_shared_from_this(), as Ros2 Nodes do,
so that the pointer to the Bond member function bondStatusCB
function remains valid until the subscription is released
during destruction.

Signed-off-by: Mike Wake <[email protected]>
Copy link
Contributor

mergify bot commented Jan 10, 2025

@ewak, all pull requests must be targeted towards the main development branch.
Once merged into main, it is possible to backport to @ipc, but it must be in main
to have these changes reflected into new distributions.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 10, 2025

I thought about that potentially, but that is called in the Bond Destructor already when we reset the shared pointer [1], so that should be done first thing as the object is being destroyed. That was on my list to try to see if it virtually fixed the issue, but I think we should dig a bit deeper so that this issue doesn't come back to bite us in another place or other bond users can do this too.

Should LifeCycleNode::destroyBond() wait for some timeout for the bond->breakBond() to take affect using waitUntilBroken() ?

destroyBond resets the shared pointer which calls the ~Bond method, we should have already broken in the destructor and then waited. If it failed after waiting, we should have seen a log from ~Bond before .reset() returns.

You provide some example fixes for bond / how we use bond (use a shared pointer instead of a unique ptr; I do want to do more than just breakBond since that would still leave the timers on which I want completely off in deactivate). These are probably good ways to go that could resolve our issues to move forward quickly, but I think given the issues you report in rclcpp below:

but this is not immediately released from the rclcpp internal mechanisms (especially intraprocess).
As a result the bondStatusCB function can called after it has been freed. This use after free shows up reliably with asan when running the test_bond test.
Figure out an improvement to rclcpp intraprocess subscription mechanism to check for memory lifetime of passed in callback function.

It sounds like your confirming a bug in rclcpp for IPC handling that we should detail in the rclcpp ticket for a more general resolution ros2/rclcpp#2721. We shouldn't be having IPC callbacks triggered after the object containing them is destroyed. Its common to have helper objects composed into your node that have sub/pubs for various path, parameter, visualization, etc handling. Can you update with some detailed results that you found in that ticket for Alberto and Tomoya if there are potential action items for rclcpp IPC handling fixes that this uncovered (if any)?

[1] https://github.com/ros/bond_core/blob/ros2/bondcpp/src/bond.cpp#L163

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 11, 2025

Consider creating the bond_ as a shared_ptr rather than a unique_ptr to allow the internal mechanisms of rclcpp to hold onto valid memory until they have a chance to cleanup properly.

By the way, I tried this, it didn't work.

The rclcpp maintainers pointed to something they think is related / the cause in this comment ros2/rclcpp#2678 (comment)

@ewak
Copy link
Author

ewak commented Jan 11, 2025

See ros/bond_core#108 for changes that use enable_shared_from_this to ensure valid lifetime of the member function subscription callback. I also commented on ros2/rclcpp#2678 (comment) suggesting that a helper function like createSafeSubscriptionMemFuncCallback() could be added to the rclcpp namespace. It could have an intermediate home in nav2_utils too.

ewak added 3 commits January 13, 2025 16:55
In order for Bond to inherit from std::enable_shared_from_this
to allow its lifetime to be managed to avoid a use
after free where its subcriber callback is called
before it has been cleaned up from internal rclcpp
mechanism.

This is being explored either by modification to rclcpp
or using a lambda that protects against the the
member function going out of scope by testing a
weak_ptr to shared_from_this for expiry before
the member function bondStatusCB is called.
This wraps the creation of the node currently (ipc experiment branch)
sets the node option  .use_intra_process_comms(true)
@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 13, 2025

I'm trying to understand better here the set of options from these PRs / issue tickets.

  1. Here in Nav2, I'm seeing that the issue is resolved by only breakBond() rather than destroying the object due to the rclcpp issue. So I'm a little unclear why the shared_ptr was migrated, unless the shared_ptr + the bondcpp PR resolves the issue. tl;dr: don't we need one or the other, for this limited scope problem? Edit: I did some testing, I don't think breakBond() is enough anymore - still seeing 4x system tests fail reliably

  2. In bondcpp, I see how that could resolve the topic within the thread from rclcpp, but wouldn't it be best to resolve it at the rclcpp level, since this only fixes bond? I assume we're on the same page about that

  3. In rclcpp, Why not have something like createSafeSubscriptionMemFuncCallback inside of rclcpp that is always used in create_subscription and/or adding stop_callbacks to rclcpp::Subscription's destructor so that resetting a subscription's shared pointer automatically handles this? I don't see that suggestion considered in the threads, which seems like the easiest, low-impact solution. Michael's suggestion of demoting everything, everywhere to weak_ptr would probably be best, but I don't think much functionally different.

@ewak
Copy link
Author

ewak commented Jan 14, 2025

I'll have a go at explaining what I currently think for each of your points,

1. Here in Nav2, I'm seeing that the issue is resolved by only `breakBond()` rather than destroying the object due to the rclcpp issue. So I'm a little unclear why the `shared_ptr` was migrated, unless the `shared_ptr` + the bondcpp PR resolves the issue. tl;dr: don't we need one or the other, for this limited scope problem?

You suggested earlier, #4841 (comment), and I was concerted that breakBond() is not enough as you also wanted the timers to be stopped. So I reverted that change.

Then for any of the the weak_ptr expiry checking mechanism to work the bond must be created(in nav2) as a shared_ptr(not a unique_ptr) to allow use of shared_from_this().

2. In bondcpp, I see how that could resolve the topic within the thread from `rclcpp`, but wouldn't it be best to resolve it at the `rclcpp` level, since this only fixes bond? I assume we're on the same page about that

I agree that it would be best to resolve the issue in rclcpp. With that in mind I explored a way that I thought might do that in a rclcpp PR ros2/rclcpp#2725. This has prompted some discussion but I think the actual changes I made are likely to be rejected in favor of a stop_callbacks mechanism.

3. In rclcpp, Why not have something like `createSafeSubscriptionMemFuncCallback` inside of rclcpp that is always used in `create_subscription` and/or adding `stop_callbacks` to `rclcpp::Subscription`'s destructor so that resetting a subscription's shared pointer automatically handles this? I don't see that suggestion considered in the threads, which seems like the easiest, low-impact solution. 

I think the argument against something like rclcpp::createSafeSubscriptionMemFuncCallback() is the onus is put on the user to use it.

IOW, The create_subscription API does not take an argument that can be used as a lifetime pointer to capture with the callback function in an internally constructed lambda. So I don't think it can be done within the current create_subscription API call.

I proposed and coded up an addition of a callback_lifetime weak_ptr to subscription options as an alternative. It works(tm) within rclcpp but still puts onus on the user to use it. ros2/rclcpp#2725

I am still trying to understand the multi-threaded executor reasoning behind the proposed subscription.stop_callbacks mechanism. I still think bondcpp would need to call it before calling reset() on its shared_ptr to the subscriber. It's not enough to put it in the rclcpp::Subscription's destructor because that won't fire until all instances of it are released. For instance, Here is a place where the WaitSet can hold a subscription shared_ptr https://github.com/ros2/rclcpp/blob/80768ed14e5ebbeb4fbbff6de42149c9c526409d/rclcpp/include/rclcpp/wait_set_policies/dynamic_storage.hpp#L205

3.1 Michael's suggestion of demoting everything, everywhere to weak_ptr would probably be best, but I don't think much functionally different.

I think the demotion of weak_ptr of everything everywhere within rclcpp ONLY handles the actual subscription entity itself. There is currently not enough information provided in the create_subscription API to handle the lifetime of the registered callback function. (Now just repeating myself I think)

Hope the above helps rather than hinders.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 14, 2025

Got it, thank you. I understood about 80% of that but enough to get the gist of the issue, but perhaps not all of the specific details of the proposed solutions. I agree that not putting the onus on the user is very important since that is a huge quality of use / life issue for ROS 2 users (and subtle to explain when it matters).

I am still trying to understand the multi-threaded executor reasoning behind the proposed subscription.stop_callbacks mechanism. I still think bondcpp would need to call it before calling reset() on its shared_ptr to the subscriber. It's not enough to put it in the rclcpp::Subscription's destructor because that won't fire until all instances of it are released. For instance, Here is a place where the WaitSet can hold a subscription shared_ptr https://github.com/ros2/rclcpp/blob/80768ed14e5ebbeb4fbbff6de42149c9c526409d/rclcpp/include/rclcpp/wait_set_policies/dynamic_storage.hpp#L205

Wouldn't Michael's comment on demoting everything to weak pointers though resolve that issue?


Currently compiling your bond branch + Nav2 and running the system tests to see what I see. I think Nav2's use of IPC might be blocked by a full solution in rclcpp, but at least allows me to do my final testing so I can finish my work, validate everything, and then hold to merge until ready. If Bond's PR can be merged and that's the only issue within Nav2 that this issue rears its head, then I suppose we could move forward here too - but I'd feel better with a full solution that removes this 'class' of problem, since we have nodes that contain objects that contain sub/pub in Nav2 in several places.

Edit: I'm still seeing my Nav2 system tests job running 45 minutes later. That usually means that every single test is failing and waiting for its full timeout before force exiting the test (usually takes ~22 minutes). This is running your bond branch + the changes in this PR. I'm curious if you tested those changes or if I should try again?

@SteveMacenski
Copy link
Member

SteveMacenski commented Jan 14, 2025

I tried again without avail - still shows virtually every system test failing using this branch + the bond PR branch. I don't think this resolves the issue unfortunately :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants