Fix race conditions in rmw_wait and map queries to clients #153
Conversation
To very briefly explain, rmw_wait:
1. Checks to see if any of the entities (subscriptions, clients, etc.) have data ready to go.
2. If they have data ready to go, then we skip attaching the condition variable and waiting.
3. If they do not have data ready to go, then we attach the condition variable to all entities, take the condition variable lock, and call wait_for/wait on the condition variable.
4. Regardless of whether we did 2 or 3, we check every entity to see if there is data ready, and mark that as appropriate in the wait set.
There is a race in all of this, however. If data comes in after we've checked the entity (1), but before we've attached the condition variable (3), then we will never be woken up. In most cases, this means that we'll wait the full timeout of the wait_for, which is not what we want. Fix this by adding another step to 3. In particular, after we've locked the condition variable mutex, check the entities again. Since we change the entities to *also* take the lock before they notify, this ensures that the entities cannot make changes that get lost. Signed-off-by: Chris Lalancette <[email protected]>
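A minimal sketch of the double-checked wait pattern described above, with hypothetical names (EventQueue, wait_for_data) rather than the actual rmw_zenoh_cpp types. The key points are that the waiter re-checks for data after taking the mutex, and the data-producing side takes the same mutex before notifying, so an arrival between the initial check and the wait cannot be lost.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

class EventQueue
{
public:
  // Called from the callback thread when new data arrives.
  void push(int event)
  {
    {
      // Take the same mutex the waiter holds while re-checking, so a
      // notification cannot slip in between the re-check and the wait.
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(event);
    }
    cv_.notify_one();
  }

  // Called from the wait path. Returns true if data is ready.
  bool wait_for_data(std::chrono::nanoseconds timeout)
  {
    // Step 1: cheap check without blocking.
    if (has_data()) {
      return true;
    }
    // Step 3 (with the fix): lock, then re-check via the predicate before
    // waiting. Data that arrived between the first check and this lock is
    // seen here instead of being lost.
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] {return !queue_.empty();});
  }

private:
  bool has_data()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return !queue_.empty();
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<int> queue_;
};
```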
Signed-off-by: Chris Lalancette <[email protected]>
I'm not really sure that this is correct, but do it for now. Signed-off-by: Chris Lalancette <[email protected]>
In particular, make sure that we track requests from individual clients separately so that we don't mix them up. To do that, we store the client gid in the server set along with the sequence_number and Query itself. Signed-off-by: Chris Lalancette <[email protected]>
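A hypothetical sketch of keying the server-side query map by (client gid, sequence number), as this commit describes; the real rmw_zenoh_cpp code uses rmw GIDs and Zenoh query types, and the names below are illustrative only.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <mutex>
#include <utility>

// Stand-ins for the real GID and query types.
using ClientGid = std::array<uint8_t, 16>;
struct Query {};

class ServiceQueryMap
{
public:
  // Store the query under (client gid, sequence number) so that two
  // clients that happen to use the same sequence numbers never collide.
  void add(const ClientGid & gid, int64_t sequence_number, Query query)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    queries_.emplace(std::make_pair(gid, sequence_number), std::move(query));
  }

  // Remove and return the stored query when the matching response is sent.
  bool take(const ClientGid & gid, int64_t sequence_number, Query & out)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = queries_.find(std::make_pair(gid, sequence_number));
    if (it == queries_.end()) {
      return false;
    }
    out = std::move(it->second);
    queries_.erase(it);
    return true;
  }

private:
  std::mutex mutex_;
  std::map<std::pair<ClientGid, int64_t>, Query> queries_;
};
```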
Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
With the changes in this PR, I am able to get: 100% tests passed, 0 tests failed out of 118
Label Time Summary:
copyright = 3.02 sec*proc (1 test)
cppcheck = 0.39 sec*proc (1 test)
cpplint = 1.09 sec*proc (1 test)
gtest = 78.04 sec*proc (60 tests)
launch_test = 214.65 sec*proc (52 tests)
lint_cmake = 0.36 sec*proc (1 test)
linter = 6.28 sec*proc (6 tests)
uncrustify = 0.50 sec*proc (1 test)
xmllint = 0.91 sec*proc (1 test)
Total Test time (real) = 299.03 sec
The following tests did not run:
28 - test_node_name__rmw_cyclonedds_cpp__rmw_zenoh_cpp (Skipped)
56 - test_node_name__rmw_fastrtps_cpp__rmw_zenoh_cpp (Skipped)
84 - test_node_name__rmw_fastrtps_dynamic_cpp__rmw_zenoh_cpp (Skipped)
109 - test_node_name__rmw_zenoh_cpp__rmw_cyclonedds_cpp (Skipped)
110 - test_node_name__rmw_zenoh_cpp__rmw_fastrtps_cpp (Skipped)
111 - test_node_name__rmw_zenoh_cpp__rmw_fastrtps_dynamic_cpp (Skipped)
Signed-off-by: Yadunund <[email protected]>
Force-pushed from 7862207 to cb32cad
Well, some tests seem to be flaky, especially with exiting correctly. They seem to pass with
Yes, they are definitely flaky. When I did debugging on them, what I saw was that sometimes a request was sent from the client, it was taken by the server, and the server sent a response, but that response never caused a callback to happen on the client. It seems to be something in Zenoh proper, but I need to do a bit more debugging and speak to Zettascale people.
Regardless, I think this is overall an improvement, so I'm going to go ahead and approve it (even though some of it is my own work).
I'm seeing issues with the nav2 demo after merging this in. Specifically, the maps don't show up in RViz. I think it might be best if we split the changes here into 1) rmw_wait fixes and 2) map queries to clients.
)" This reverts commit c1c6f95.
Fixes #152.