Fix race conditions in rmw_wait and map queries to clients #153
Conversation
To very briefly explain, rmw_wait:
1. Checks to see if any of the entities (subscriptions, clients, etc.) have data ready to go.
2. If they have data ready to go, then we skip attaching the condition variable and waiting.
3. If they do not have data ready to go, then we attach the condition variable to all entities, take the condition variable lock, and call wait_for/wait on the condition variable.
4. Regardless of whether we did 2 or 3, we check every entity to see if there is data ready, and mark that as appropriate in the wait set.
There is a race in all of this, however. If data comes in after we've checked the entity (1), but before we've attached the condition variable (3), then we will never be woken up. In most cases, this means that we'll wait the full timeout of the wait_for, which is not what we want. Fix this by adding another step to 3. In particular, after we've locked the condition variable mutex, check the entities again. Since we change the entities to *also* take the lock before they notify, this ensures that the entities cannot make changes that get lost. Signed-off-by: Chris Lalancette <[email protected]>
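A minimal sketch of the double-checked wait pattern described above, with hypothetical names (EventQueue, wait_for_data) rather than the actual rmw_zenoh_cpp types. The key points are that the waiter re-checks for data after taking the mutex, and the data-producing side takes the same mutex before notifying, so an arrival between the initial check and the wait cannot be lost.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

class EventQueue
{
public:
  // Called from the callback thread when new data arrives.
  void push(int event)
  {
    {
      // Take the same mutex the waiter holds while re-checking, so a
      // notification cannot slip in between the re-check and the wait.
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(event);
    }
    cv_.notify_one();
  }

  // Called from the wait path. Returns true if data is ready.
  bool wait_for_data(std::chrono::nanoseconds timeout)
  {
    // Step 1: cheap check without blocking.
    if (has_data()) {
      return true;
    }
    // Step 3 (with the fix): lock, then re-check via the predicate before
    // waiting. Data that arrived between the first check and this lock is
    // seen here instead of being lost.
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] {return !queue_.empty();});
  }

private:
  bool has_data()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return !queue_.empty();
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<int> queue_;
};
```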
Signed-off-by: Chris Lalancette <[email protected]>
I'm not really sure that this is correct, but do it for now. Signed-off-by: Chris Lalancette <[email protected]>
In particular, make sure that we track requests from individual clients separately so that we don't mix them up. To do that, we store the client gid in the server set along with the sequence_number and Query itself. Signed-off-by: Chris Lalancette <[email protected]>
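A hypothetical sketch of keying the server-side query map by (client gid, sequence number), as this commit describes; the real rmw_zenoh_cpp code uses rmw GIDs and Zenoh query types, and the names below are illustrative only.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <mutex>
#include <utility>

// Stand-ins for the real GID and query types.
using ClientGid = std::array<uint8_t, 16>;
struct Query {};

class ServiceQueryMap
{
public:
  // Store the query under (client gid, sequence number) so that two
  // clients that happen to use the same sequence numbers never collide.
  void add(const ClientGid & gid, int64_t sequence_number, Query query)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    queries_.emplace(std::make_pair(gid, sequence_number), std::move(query));
  }

  // Remove and return the stored query when the matching response is sent.
  bool take(const ClientGid & gid, int64_t sequence_number, Query & out)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = queries_.find(std::make_pair(gid, sequence_number));
    if (it == queries_.end()) {
      return false;
    }
    out = std::move(it->second);
    queries_.erase(it);
    return true;
  }

private:
  std::mutex mutex_;
  std::map<std::pair<ClientGid, int64_t>, Query> queries_;
};
```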
Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
With the changes in this PR, I am able to get: 100% tests passed, 0 tests failed out of 118
Label Time Summary:
copyright = 3.02 sec*proc (1 test)
cppcheck = 0.39 sec*proc (1 test)
cpplint = 1.09 sec*proc (1 test)
gtest = 78.04 sec*proc (60 tests)
launch_test = 214.65 sec*proc (52 tests)
lint_cmake = 0.36 sec*proc (1 test)
linter = 6.28 sec*proc (6 tests)
uncrustify = 0.50 sec*proc (1 test)
xmllint = 0.91 sec*proc (1 test)
Total Test time (real) = 299.03 sec
The following tests did not run:
28 - test_node_name__rmw_cyclonedds_cpp__rmw_zenoh_cpp (Skipped)
56 - test_node_name__rmw_fastrtps_cpp__rmw_zenoh_cpp (Skipped)
84 - test_node_name__rmw_fastrtps_dynamic_cpp__rmw_zenoh_cpp (Skipped)
109 - test_node_name__rmw_zenoh_cpp__rmw_cyclonedds_cpp (Skipped)
110 - test_node_name__rmw_zenoh_cpp__rmw_fastrtps_cpp (Skipped)
111 - test_node_name__rmw_zenoh_cpp__rmw_fastrtps_dynamic_cpp (Skipped)
Signed-off-by: Yadunund <[email protected]>
Force-pushed from 7862207 to cb32cad
Well, some tests seem to be flaky, especially with exiting correctly. They seem to pass with
Yes, they are definitely flaky. When I did debugging on them, what I saw was that sometimes a request was sent from the client, it was taken by the server, and the server sent a response, but that response never caused a callback to happen on the client. It seems to be something in Zenoh proper, but I need to do a bit more debugging and speak to Zettascale people.
Regardless, I think this is overall an improvement, so I'm going to go ahead and approve it (even though some of it is my own work).
I'm seeing issues with the nav2 demo after merging this in. Specifically, the maps don't show up in RViz. I think it might be best if we split the changes here into 1) rmw_wait fixes and 2) map queries to clients.
)" This reverts commit c1c6f95.
Fixes #152.