-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Double spin required since 28.2.0 #2589
Comments
Thanks for reporting, will take a look. Definitely sounds fishy. |
I wonder if this issue is related to the use of the notifier waitable?
so with the new approach, when you call spin_* the only item of work ready is the waitable to refresh the entities (even if those entities already have work to do), calling spin a second time will now act on the updated waitset (which includes the new entities and will thus see their items of work) The purpose of this change is that the previous behavior was a huge CPU bottleneck. |
IMO this "bug" is expected behavior with the current code(although it's a breaking change and we didn't advertise it...). See how in the unit-tests we have a The only ways that I can think of to fix it, consist of treating the notifier waitable as a "special entity", so that spin_some and spin_all go back to spinning if that's triggered, but I don't really like it. |
That seems like potentially an oversight. For subscribers on regular data, its not a huge deal IMO to drop the first message while the system is 'warming up' (i.e. first lidar scan), but many subscriptions are the result of infrequent or irregular events (like button pushes) and missing the first of an event class is comparatively bad to be the standard ROS 2 out-of-the-box policy.
I don't think that anything about these need to change, but it would be good to have a unit test dedicated to regressions on 'first-spinning' events. In effect, that's what we caught in Nav2 |
No events are lost. The only difference is that if you call The "warm up" spin is needed only for entities that are added after the node / cb group is added to the executor. In your example, if you create the subscription before adding the callback group to the executor you should solve the issue (for example moving this line to the end of the constructor https://github.com/ros-navigation/navigation2/blob/main/nav2_behavior_tree/plugins/decorator/goal_updater_node.cpp#L40) |
I am not sure how to fix this... but wanted to post the comment for the direction we take...
IMO this breaks the user application, huge behavior change... i really do not think requiring |
This only happens when The problem affect mostly unit-tests where developers are trying to synchronize some operations: I do X, I spin and expect to get the message. The "issue" is that executors have internal events and calling |
After writing that message, I went to read the documentation of the
|
That is my verbal mistake. Events are not triggered when attempted to process the first time 👍
I could to do that for this case in Nav2, but I agree with the above comment from @fujitatomoya - this is not a good UX for ROS 2 users that would cause some head scratching and frustration (and feed the 'ROS sucks' trolls). That's a pretty subtle detail that doesn't seem like from an application developer's perspective should matter, but I can ack from a developer side why that could happen:
|
I wrote a unit-test to exercise this behavior and run it across all executors... However:
The only way in which I made it fail was to do |
This issue has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/next-client-library-wg-meeting-friday-2nd-august-2024/38881/1 |
I guess we need to modify wait_for_work in the following way :
Alternative:
Second version sounds cleaner to me. |
Something like this : if(notify_waitable_)
{
// rcl::Wait
executors::ExecutorNotifyWaitable::SharedPtr cpy = std::make_shared<executors::ExecutorNotifyWaitable>(*notify_waitable_);
rclcpp::WaitSet tmp_wait_set_;
tmp_wait_set_.add_waitable(cpy);
auto res = tmp_wait_set_.wait(std::chrono::nanoseconds(0));
if(res.kind() == WaitResultKind::Ready)
{
auto & rcl_wait_set = res.get_wait_set().get_rcl_wait_set();
if (cpy->is_ready(rcl_wait_set)) {
cpy->execute(cpy->take_data());
}
}
} I just tested this. This indeed fixes the bug, but some other tests fail because of it... |
This is a first attempt to fix ros2#2589. It passes all tests, and seems to work, but I am 90% certain it introduces unwanted races and lost wakeups. Anyway, its a starting point for a discussion...
I digged into this issue and I found the root cause. This bug affects the
The current usage of the This problem can be mitigated if you have multiple entities in your executor, but this is not the case in the Nav2 unit-test, where the notify waitable is the only entity managed by the executor when the test starts (there's also the goal updated subscription but due to the ordering of operations this is not immediately known to the executor). This implies that the notify waitable is only executed explicitly here https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L744-L748, and never as part of a
This lines happens at the end of a |
The TL;DR is that:
@jmachowinski PR fixes both points for the waitset-based executors: #2595 Nothing should be done for We should add a disclaimer about the |
This issue has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/next-client-library-wg-meeting-friday-16th-august-2024/39130/1 |
Is there progress here? |
@SteveMacenski I discussed this issue further with @jmachowinski . The events executor PR has some minor issues to fix before merging it We understood the root cause of the problem for the waitset-baser executors. I should have some time to work on both PRs next week. |
Ok, thanks! |
Fully agree here @fujitatomoya
Thanks, following closely as I am attempting to migrate our huge code base to Jazzy and we have a lot of |
Hello, any update on this quite important bug? Thanks |
Hi @tonynajjar to the best of my knowledge this has been fixed here #2591 and backported to Jazzy here: #2628 |
Hi @doisyg , thanks for the quick reply. As far as I understood from @alsora's comment, there is only a fix for the EventsExecutor and the more general one for all executors is pending no? |
Ha, good catch. Not sure then. Following. |
This issue has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/preparing-for-jazzy-sync-and-patch-release-2024-12-23/41213/5 |
…dded Fixes ros2#2589 Signed-off-by: Janosch Machowinski <[email protected]>
@tonynajjar can you verify, that #2724 fixes the issue for you ? |
…dded Fixes ros2#2589 Signed-off-by: Janosch Machowinski <[email protected]>
Bug report
Required Info:
Expected behavior
Spin All, Spin Some work as previously expected
Actual behavior
Spinning some or all requires a warmup spin that doesn't do anything before processing useful work.
Additional information
Nav2's CI broke recently with a unit test on a behavior tree node that I got to looking into yesterday. I was chatting with @clalancette about it in a comment thread and we realized that this is a
rclcpp
regression.goal_updater_node
spins an internal executor to process a callback function to get some data that has always worked until recently: https://github.com/ros-navigation/navigation2/blob/main/nav2_behavior_tree/plugins/decorator/goal_updater_node.cpp#L60. In debugging, I found that it wasn't getting the data the unit test was sending.Blaming
Executor::spin_some_impl(std::chrono::nanoseconds max_duration, bool exhaustive)
, I see there's been alot of poking around there recently in the last few months, and a PR to fix some other regression in May, which makes me think maybe there's another regression lurking in that corner.Test A:
Pushing the
50
ms up to200
ms still doesn't workTest B:
This works
Test C:
Attempting to use spin some instead of spin all also fails
Test D:
Lets try ticking the node multiple times while having the single
spin_all(50ms)
(which would cause the spin to be called multiple times)This works
This experiment proves to me that something is odd with spinning behavior right now requiring multiple calls (whether sequentially or as part of the broader application loop) to process data. The first run is having no effect. I see this running in Nav2's CI that is running Cyclone and I can replicate this locally in a Docker container with Fast-DDS, so I think its a ROS 2 side issue, not DDS.
The text was updated successfully, but these errors were encountered: