-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thread-safe access to rmw_node_data_t #259
Conversation
ae9d50d
to
8f897b5
Compare
eb936da
to
f6d3ab1
Compare
@Yadunund Friendly ping to resolve merge conflicts. |
959da06
to
f106c17
Compare
8f897b5
to
0c091ee
Compare
} | ||
|
||
// Create the liveliness token. | ||
zc_owned_liveliness_token_t token = zc_liveliness_declare_token( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to an std::recursive_mutex
in 5295774 as this zc_liveliness_declare_token
call will publish a liveliness token which will cause the graph_subscriber in the same session to receive this token and trigger the graph callback. This graph callback also locks the same mutex within rmw_context_impl_s
which causes a deadlock without a recursive_mutex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit curious. What recursive_mutex
does is to allow the same thread to take the lock multiple times without deadlocking. But my understanding (which may be wrong) was that the graph callback ran on a different thread of execution, which recursive_mutex
will not guard against. Can you maybe explain a bit more so I can figure this out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But my understanding (which may be wrong) was that the graph callback ran on a different thread of execution, which recursive_mutex will not guard against.
It seems that this is not the case. See my stack trace:
With the Zenoh router running, run rcl/test_client__rmw_zenoh_cpp
or the usual ros2 run demo_nodes_cpp talker
.
/home/yadunund/ros2_rolling/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py /home/yadunund/ros2_rolling/build/rcl/test_results/rcl/test_client__rmw_zenoh_cpp.gtest.xml --package-name rcl --output-file /home/yadunund/ros2_rolling/build/rcl/ament_cmake_gtest/test_client__rmw_zenoh_cpp.txt --env RMW_IMPLEMENTATION=rmw_zenoh_cpp --command /home/yadunund/ros2_rolling/build/rcl/test/test_client --gtest_output=xml:/home/yadunund/ros2_rolling/build/rcl/test_results/rcl/test_client__rmw_zenoh_cpp.gtest.xml
Then get the pid of this process
yadunund@ubuntu-noble-20240225:~/ros2_rolling$ ps -a | grep test_client
1477812 pts/7 00:00:00 test_client
The pid is
1477812
.
Then use gdb
in a second terminal to attach to the pid above.
sudo gdb -q test_client 1477812
Run info threads
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff76328c0 (LWP 1477812) "test_client" futex_wait (private=0, expected=2, futex_word=0x555555619040) at ../sysdeps/nptl/futex-internal.h:146
2 Thread 0x7fffefa006c0 (LWP 1477822) "rx-1" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
3 Thread 0x7fffefe006c0 (LWP 1477821) "rx-0" 0x00007ffff792a042 in epoll_wait (epfd=19, events=0x7fffe8131bf0, maxevents=1024, timeout=7979)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
4 Thread 0x7ffff44006c0 (LWP 1477820) "tx-0" 0x00007ffff792a042 in epoll_wait (epfd=16, events=0x7fffe8009ef0, maxevents=1024, timeout=2475)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
5 Thread 0x7ffff4c006c0 (LWP 1477818) "acc-0" 0x00007ffff792a042 in epoll_wait (epfd=12, events=0x555555699850, maxevents=1024, timeout=-1)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
6 Thread 0x7ffff54006c0 (LWP 1477816) "net-0" 0x00007ffff792a042 in epoll_wait (epfd=8, events=0x555555694fd0, maxevents=1024, timeout=-1)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
7 Thread 0x7ffff58006c0 (LWP 1477815) "app-0" 0x00007ffff792a042 in epoll_wait (epfd=3, events=0x555555690750, maxevents=1024, timeout=-1)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
8 Thread 0x7ffff6c006c0 (LWP 1477814) "test_client-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
9 Thread 0x7ffff76006c0 (LWP 1477813) "test_client-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
Inspect thread 1
and get the backtrace with bt
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff76328c0 (LWP 1477812))]
#0 futex_wait (private=0, expected=2, futex_word=0x555555619040) at ../sysdeps/nptl/futex-internal.h:146
146 in ../sysdeps/nptl/futex-internal.h
(gdb) bt
#0 futex_wait (private=0, expected=2, futex_word=0x555555619040) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x555555619040, private=0) at ./nptl/lowlevellock.c:49
#2 0x00007ffff78a00f1 in lll_mutex_lock_optimized (mutex=0x555555619040) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x555555619040) at ./nptl/pthread_mutex_lock.c:93
#4 0x00007ffff6d35184 in __gthread_mutex_lock (__mutex=0x555555619040) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#5 0x00007ffff6d351d8 in std::mutex::lock (this=0x555555619040) at /usr/include/c++/13/bits/std_mutex.h:113
#6 0x00007ffff6d35392 in std::lock_guard<std::mutex>::lock_guard (this=0x7ffffffe3480, __m=...) at /usr/include/c++/13/bits/std_mutex.h:249
#7 0x00007ffff6d7e87a in rmw_context_impl_s::graph_sub_data_handler (sample=0x7ffffffe3570, data=0x555555619030)
at /home/yadunund/ros2_rolling/src/ros2/rmw_zenoh/rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp:57
#8 0x00007ffff5ba5fc5 in zenohc::liveliness::zc_liveliness_declare_subscriber::{{closure}} ()
from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#9 0x00007ffff5df2b6c in zenoh::session::Session::handle_data () from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#10 0x00007ffff5e061c5 in <zenoh::session::Session as zenoh::net::primitives::Primitives>::send_declare ()
from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#11 0x00007ffff5b6f179 in <zenoh::session::Session as zenoh::net::primitives::EPrimitives>::send_declare ()
from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#12 0x00007ffff5dd5447 in <zenoh::net::routing::dispatcher::face::Face as zenoh::net::primitives::Primitives>::send_declare ()
from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#13 0x00007ffff5ba5093 in zc_liveliness_declare_token () from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#14 0x00007ffff6d94ee3 in rmw_zenoh_cpp::NodeData::make (id=0, session=..., domain_id=0, namespace_="/", node_name="test_client_node", enclave="/")
at /home/yadunund/ros2_rolling/src/ros2/rmw_zenoh/rmw_zenoh_cpp/src/detail/rmw_node_data.cpp:57
#15 0x00007ffff6d80b34 in rmw_context_impl_s::create_node_data (this=0x555555610730, node=0x5555556369c0, ns="/", node_name="test_client_node")
at /home/yadunund/ros2_rolling/src/ros2/rmw_zenoh/rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp:474
#16 0x00007ffff6d9fcae in rmw_create_node (context=0x5555556101b0, name=0x5555555b1b0b "test_client_node", namespace_=0x55555561e9d0 "/")
at /home/yadunund/ros2_rolling/src/ros2/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:236
#17 0x00007ffff7f8c3e9 in rcl_node_init () from /home/yadunund/ros2_rolling/install/rcl/lib/librcl.so
#18 0x000055555556ee9f in TestClientFixture::SetUp() ()
#19 0x00005555555ac6d1 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#20 0x0000555555592b51 in testing::Test::Run() ()
#21 0x0000555555592d85 in testing::TestInfo::Run() ()
#22 0x0000555555592fc7 in testing::TestSuite::Run() ()
#23 0x00005555555a2acc in testing::internal::UnitTestImpl::RunAllTests() ()
#24 0x00005555555931af in testing::UnitTest::Run() ()
#25 0x0000555555565fb4 in main ()
Looking at the stack trace above you can see that the same Thread 1
calls rmw_context_impl_s::create_node_data()
which invokes zc_liveliness_declare_token()
then blocks on rmw_context_impl_s::graph_sub_data_handler
since the same mutex is already locked.
Hope this clarifies!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Yeah, it is the case that the same thread is re-entering the lock, which is why we need a recursive_mutex here.
This is ready for another round of review. |
} | ||
|
||
// Create the liveliness token. | ||
zc_owned_liveliness_token_t token = zc_liveliness_declare_token( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit curious. What recursive_mutex
does is to allow the same thread to take the lock multiple times without deadlocking. But my understanding (which may be wrong) was that the graph callback ran on a different thread of execution, which recursive_mutex
will not guard against. Can you maybe explain a bit more so I can figure this out?
///============================================================================= | ||
NodeData::~NodeData() | ||
{ | ||
zc_liveliness_undeclare_token(z_move(token_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a huge fan of having different entities declare/undeclare the token. In this case, I'd prefer if we moved the declaration of the token into the constructor; that way there is no ambiguity of which entity owns it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call out. It was this way before I moved the constructor to public
scope. Now i've implemented a make()
function that will return nullptr
if construction of the entity or liveliness token fails. bda3ed6
ce72182
to
3fd565b
Compare
f569818
to
5eb967c
Compare
9d3274d
to
0053d8d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks good to me. This will need to be retargeted to rolling
and rebased, and @ahcorde 's comments, fixed, but otherwise I think this is good to go.
Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
1af387e
to
9c100ac
Compare
This PR builds off #258.
NodeData
class which stores information about anrmw_node_t
(RMW node). In subsequent PRs this class will be extended to storeunordered_maps
ofPublisherData
,SubscriptionData
, etc.rmw_context_impl_s
has APIs to create, retrieve and deleteNodeData
objects created for variousrmw_node_t
.