Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race conditions in rmw_wait and map queries to clients #153

Merged
merged 7 commits into from
Apr 13, 2024

Conversation

Yadunund
Copy link
Member

@Yadunund Yadunund commented Apr 13, 2024

Fixes #152.

clalancette and others added 5 commits April 5, 2024 16:10
To very briefly explain, rmw_wait:
1.  Checks to see if any of the entities (subscriptions, clients, etc)
have data ready to go.
2.  If they have data ready to go, then we skip attaching the
condition variable and waiting.
3.  If they do not have data ready to go, then we attach the
condition variable to all entities, take the condition variable
lock, and call wait_for/wait on the condition variable.
4.  Regardless of whether we did 3 or 4, we check every entity
to see if there is data ready, and mark that as appropriate
in the wait set.

There is a race in all of this, however.  If data comes in after
we've checked the entity (1), but before we've attached the
condition variable (3), then we will never be woken up.  In most
cases, this means that we'll wait the full timeout for the wait_for,
which is not what we want.

Fix this by adding another step to 3.  In particular, after we've
locked the condition variable mutex, check the entities again.
Since we change the entities to *also* take the lock before we
notify, this ensures that the entities cannot make changes that
get lost.

Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: Chris Lalancette <[email protected]>
I'm not really sure that this is correct, but do it for now.

Signed-off-by: Chris Lalancette <[email protected]>
In particular, make sure that we track requests from
individual clients separately so that we don't mix them
up.  To do that, we store the client gid in the server
set along with the sequence_number and Query itself.

Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: Yadunund <[email protected]>
@Yadunund Yadunund changed the title Fixes for test_rclcpp Fix test #101 in test_rclcpp Apr 13, 2024
@Yadunund Yadunund marked this pull request as ready for review April 13, 2024 11:05
@Yadunund
Copy link
Member Author

Yadunund commented Apr 13, 2024

With the changes in this PR I am able to get test_rclcpp to pass

100% tests passed, 0 tests failed out of 118

Label Time Summary:
copyright      =   3.02 sec*proc (1 test)
cppcheck       =   0.39 sec*proc (1 test)
cpplint        =   1.09 sec*proc (1 test)
gtest          =  78.04 sec*proc (60 tests)
launch_test    = 214.65 sec*proc (52 tests)
lint_cmake     =   0.36 sec*proc (1 test)
linter         =   6.28 sec*proc (6 tests)
uncrustify     =   0.50 sec*proc (1 test)
xmllint        =   0.91 sec*proc (1 test)

Total Test time (real) = 299.03 sec

The following tests did not run:
	 28 - test_node_name__rmw_cyclonedds_cpp__rmw_zenoh_cpp (Skipped)
	 56 - test_node_name__rmw_fastrtps_cpp__rmw_zenoh_cpp (Skipped)
	 84 - test_node_name__rmw_fastrtps_dynamic_cpp__rmw_zenoh_cpp (Skipped)
	109 - test_node_name__rmw_zenoh_cpp__rmw_cyclonedds_cpp (Skipped)
	110 - test_node_name__rmw_zenoh_cpp__rmw_fastrtps_cpp (Skipped)
	111 - test_node_name__rmw_zenoh_cpp__rmw_fastrtps_dynamic_cpp (Skipped)

@Yadunund Yadunund changed the title Fix test #101 in test_rclcpp Fix race conditions in rmw_wait and map queries to clients Apr 13, 2024
Signed-off-by: Yadunund <[email protected]>
@Yadunund Yadunund force-pushed the clalancette/wait-fixes branch from 7862207 to cb32cad Compare April 13, 2024 11:11
@Yadunund
Copy link
Member Author

Well some tests seem to be flaky. Especially with exiting correctly. They seem to pass with

colcon test --packages-select test_rclcpp --retest-until-pass 2

@clalancette
Copy link
Collaborator

Well some tests seem to be flaky. Especially with exiting correctly. They seem to pass with

Yes, they are definitelly flaky. When I did debugging on them, what I saw was that sometimes, a request was sent from the client, it was taken by the server, and the server sent a response, but that response never caused a callback to happen on the client. It seems to be something in Zenoh proper, but I need to do a bit more debugging and speak to Zettascale people.

@clalancette
Copy link
Collaborator

Regardless, I think this is overall an improvement, so I'm going to go ahead and approve it (even though some of it is my own work).

@Yadunund Yadunund merged commit c1c6f95 into rolling Apr 13, 2024
5 of 6 checks passed
@delete-merged-branch delete-merged-branch bot deleted the clalancette/wait-fixes branch April 13, 2024 15:04
@Yadunund
Copy link
Member Author

I'm seeing issues with the nav2 demo after merging this in. Specifically, the maps don't show up on RViz. I think it might be best if we split the changes here into 1) rmw_wait fixes and 2) map queries to clients.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[test_rclcpp] #101 test_services_cpp (test_rclcpp.TestTwoExecutables.test_services_cpp)
2 participants