Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate adding in explicit IPC for Nav2 nodes again #4691

Open
SteveMacenski opened this issue Sep 25, 2024 · 4 comments
Open

Investigate adding in explicit IPC for Nav2 nodes again #4691

SteveMacenski opened this issue Sep 25, 2024 · 4 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@SteveMacenski
Copy link
Member

Previously, we didn't do this since IPC required the use of only the default QoS profiles and we use several others in Nav2 for various things. We should investigate if this challenge has been overcome now and we can use IPC.

Additionally, Nav2 default launches with rviz2 and application nodes that may subscribe to topics outside of the container. In this case, these performance improvements are negated, but I suppose worthwhile to include unless there's some reason not to. I believe that only the topics that are communicating outside of the process break IPC, but worth checking on that to make sure the entire node isn't broken out of IPC for a debug logging topic or something.

It would be great to get an update from @clalancette on these thoughts -- then execute as it is seen fit.

@SteveMacenski SteveMacenski added enhancement New feature or request good first issue Good for newcomers labels Sep 25, 2024
@clalancette
Copy link
Contributor

It would be great to get an update from @clalancette on these thoughts -- then execute as it is seen fit.

So during the Jazzy development cycle, we ended up merging in ros2/rclcpp#2303 , which added transient_local functionality to the intra-process manager. However, as far as I understand, the intra-process manager still doesn't support the full range of QoS options.

Still, if transient_local functionality is enough for your use-case, you could consider turning it on. There are 2 different ways to turn it on:

  1. You can pass an rclcpp::NodeOption when constructing the node that will create all publishers and subscriptions (including the built-in ones) as intra-process.
  2. When you create individual publishers and subscriptions, you can turn on intra-process mode for just that entity.

Additionally, Nav2 default launches with rviz2 and application nodes that may subscribe to topics outside of the container. In this case, these performance improvements are negated, but I suppose worthwhile to include unless there's some reason not to. I believe that only the topics that are communicating outside of the process break IPC, but worth checking on that to make sure the entire node isn't broken out of IPC for a debug logging topic or something.

The intra-process manager is fairly sophisticated. If a publisher and subscription on the same topic with the same type are created within the same Context, and intra-process is turned on for both of them, then it will communicate between them using only intra-process. If any one of those conditions is false, it won't use it. However, even in the case where intra-process is used, it still always creates a DDS entity for the outside world to discover. That's so that things like ros2 topic list still show the topic. In the case where you have an intra-process publisher, an intra-process subscription, and a non-intra-process subscription, then what happens is that intra-process is still used between the publisher and the intra-process subscription, and a copy is used to send out to the non-intra-process subscription.

All of that is to say is that the intra-process manager will do the correct thing in all situations, though the efficiency goes down somewhat in situations other than a single intra-process publisher and single intra-process subscription.

@SteveMacenski
Copy link
Member Author

still doesn't support the full range of QoS options.

Are the 'main' 4 (history, durability, depth, reliability) fully covered? We use throughout Nav2 all the options for those four in various places, but not sensitive about the deadline, lifespan, liveliness, or lease duration (yet, at least).

In the case where you have an intra-process publisher, an intra-process subscription, and a non-intra-process subscription, then what happens is that intra-process is still used between the publisher and the intra-process subscription, and a copy is used to send out to the non-intra-process subscription.
All of that is to say is that the intra-process manager will do the correct thing in all situations, though the efficiency goes down somewhat in situations other than a single intra-process publisher and single intra-process subscription.

I believe then I was either initially misinformed or something has changed since I last looked -- was at any point it true that if a subscription was outside of IPC (intra-), it changed all subscriptions to be inter-process?

Follow up question: If we have 2 subscriptions, one intra-process and one inter-process, is the performance any worse on the system by enabling (intra-) IPC than if we did not at all? I assume not, but your the wording of that response makes me want to clarify if there's any reason not to always enable IPC now.

It also used to be true that IPC (intra-) would throw if a sub/pub was using QoS that IPC couldn't handle. Is that still the case or can we enable IPC (intra-) globally now in the software and it'll use where it can, but default back to inter-process (though still with composition benefits) where it cannot?

Thanks for the follow up :-) It looks like IPC has made some great gains since the last I looked at this in the early Foxy days. We got such good performance boosts from Composition and the QoS would throw for the non-default profile, we weren't able to move further. It sounds like we might be able to fully embrace IPC now across the board for another joyous performance boost!

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Dec 12, 2024

I've started on this today and ran into some problems. I filed a ticket for better logging on IPC-specific errors since its really difficult to find where they're coming from in applications with 20+ subscribers and publishers ros2/rclcpp#2703

I believe the system default QoS doesn't actually set the depth size to > 0 which causes a problem with IPC. I think that should probably be updated to force it to be > 0 so that folks don't run into this issue. Feels like a not-awesome user experience to have to manually adjust this (rclcpp::SystemDefaultsQoS().keep_last(1)) for what should be the "default" commonly-used QoS profile. I still need to get around that globally, but after fixing the instances I needed to move on for a particular node. I'm not sure why this one case is such a problem but others using SystemDefaultsQoS work fine, but the next issue is:

transient_local publisher needs to passa valid publisher buffer ptr when calling add_publisher()

which I'm not 100% sure what is going on there, since I thought Jazzy/Rolling supported transient_local QoS now (on Rolling for this work). It appears to be coming from the following, which is interesting since its not even transient local QoS. Removing this line or changing to QoS to a non-default-QoS modified setting makes the warning disappear. Though, there is some transient_local QoS immediately after this line (see collapsed snippet below).

Code Snippet
  auto custom_qos = rclcpp::QoS(rclcpp::KeepLast(1)).transient_local().reliable();
  costmap_pub_ = node->create_publisher<nav_msgs::msg::OccupancyGrid>(
    topic_name, custom_qos);
  costmap_raw_pub_ = node->create_publisher<nav2_msgs::msg::Costmap>(
    topic_name + "_raw", custom_qos);
  costmap_update_pub_ = node->create_publisher<map_msgs::msg::OccupancyGridUpdate>(
    topic_name + "_updates", custom_qos);
  costmap_raw_update_pub_ = node->create_publisher<nav2_msgs::msg::CostmapUpdate>(
    topic_name + "_raw_updates", custom_qos);
  auto qos = rclcpp::SystemDefaultsQoS().keep_last(1);
  footprint_pub_ = create_publisher<geometry_msgs::msg::PolygonStamped>(
    "published_footprint", qos);

Error's coming from here: https://github.com/ros2/rclcpp/blob/8c0161a07f1a44db023650d10f1b2e581c64e1f1/rclcpp/src/rclcpp/intra_process_manager.cpp#L48

The buffer on face value looks like it should be valid from https://github.com/ros2/rclcpp/blob/rolling/rclcpp/include/rclcpp/publisher.hpp#L168-L172. I don't see anywhere in create_intra_process_buffer that it would return a nullptr without itself a throw which would be the log I should see. Even separately of anything I can be doing wrong, that's a state I don't see logically how I could enter.

Here's my working branch https://github.com/ros-navigation/navigation2/tree/ipc

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Dec 12, 2024

Moving past that since its not blocking, though I found the root cause, SystemDefaultsQoS profile works fine for subscriptions but does not for publishers. Filing ros2/rclcpp#2705

I generally see things working!

The only major issue is that the transient local topics that are only published once are not responding to late-joining subscriptions to transfer information. For example, the map is only published once on startup as transient local, but rviz doesn't consistently get it (depending if it was up in time to receive it) and ros2 topic echo /map --qos-durability transient_local only returns a value if it is running at the same time as the initial publication -- it never receives the map afterward.

Is transient local publication tested to work on late joining subscriptions in Rolling/Jazzy? I looks like a bug to me ros2/rclcpp#2704


TODO List

  • enable IPC on all nodes; check general bringup functional
  • Remove the individual node IPC enabling setting and do it in the lifecycle node shared base class instead
  • Create a nav2::DefaultQoS which wraps the SystemDefaultQoS but contains a depth of 1 rather than 0 (temporarily?) SystemDefaultsQoS - force to have depth > 0 for Publishers ros2/rclcpp#2705
  • File ticket on IPC transient-local bug, followup until completion. Blocks remaining tasks [Bug] Jazzy / Rolling Transient-Local Publishers with IPC do not deliver messages to Inter-process subscribers  ros2/rclcpp#2704
  • Test all the plugins, topics are received, nodes (costmap/footprint subscribers. data subscriptions, transient-local topics, test all nodes bringup and run correctly, actions run reliably for a few hours, services, simulation, CI, loopback)
  • Check memory addresses are the same internally. Check if anything going over DDS/wire and resolve.
  • Complete benchmarking using IPC with and without rviz/external subscriptions.
  • Check if other design updates are needed to make best use of IPC (i.e. publish/subscribe smart pointers, )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants