Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to one participant per context model #145

Merged
merged 9 commits into from
Apr 22, 2020

Conversation

eboasson
Copy link
Collaborator

@eboasson eboasson commented Apr 9, 2020

@ivanpauno this seems to work and it would be great if you could find the time to look it over. I have only tested it locally using rcl/test and test_communication.

  • Secure ROS can't find the "sros2"-generated key stores, but that seems to be the case with Fast-RTPS as well.
  • I've really relied on the Fast-RTPS RMW layer, so various parts were copied nearly verbatim (like demangling, type support and introspection functions), and some parts (e.g., context setup, discovery-related extensions to creating/destroying nodes, publishers, subscriptions) borrow heavily from the work done there.
  • I was surprised that local readers/writers require a add_entity call in addition to the associate call, but it is the only logical conclusion after observing that skipping the local entities in my discovery thread it makes a ton of tests fail.
  • It does seem to me that it would make sense to define the (de)mangling of topic and type names in rmw_dds_common.
  • I suspect some of the tests have become timing-dependent now because the local readers/writers get added to the graph cache asynchronously at the moment. I could of course do the local add_reader/add_writer calls synchronously, but it would be good to know what is expected.

I think there'll be at least one round of cleaning up to come 😄 and I'm also not quite sure I've done the right thing with the unique_ptrs (not that much of a C++ man ... @rotu will hopefully be able to advise there).

@ivanpauno
Copy link
Member

Secure ROS can't find the "sros2"-generated key stores, but that seems to be the case with Fast-RTPS as well.

There's a problem in Focal with the version of SSL, but I tried in both Bionic and Windows and it's working. You can check also Windows nightlies (from some days ago, as it's broken for other reasons right now).

I was surprised that local readers/writers require a add_entity call in addition to the associate call, but it is the only logical conclusion after observing that skipping the local entities in my discovery thread it makes a ton of tests fail.

That's needed, so a reader/writer can be matched with a Node. If not, you can only match it with the Participant that created it.

It does seem to me that it would make sense to define the (de)mangling of topic and type names in rmw_dds_common.

Yeah, I agree. I was thinking to move all that logic there, but I didn't have the time to do it.

@ivanpauno
Copy link
Member

I suspect some of the tests have become timing-dependent now because the local readers/writers get added to the graph cache asynchronously at the moment. I could of course do the local add_reader/add_writer calls synchronously, but it would be good to know what is expected.

What can happen is that a reader is shown when listing subscriptions, but not shown when listing the subscription that a node created.
I think it's fine if that continues being asynchronous.

This could be probably avoided if the information is somehow communicated through reader/writer userdata (see ros2/design#250 (comment)).
The main reason I didn't choose that path, is that there were not support in all DDS vendors that ROS2 is using.

@ivanpauno
Copy link
Member

@eboasson this was fast! I will do a review ASAP.

@eboasson
Copy link
Collaborator Author

eboasson commented Apr 9, 2020

@rotu just wanted to remark that it conflicts with your #119 but only in a trivial way: a rename of one local variable to eliminate a warning that it shadowed another one. I purposely left the code alone where it had nothing to do with the change in mapping.

Copy link
Member

@ivanpauno ivanpauno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some minimal comments and questions, otherwise LGTM with green CI.

I'm not used to cyclonedds API, so I might be missing some details.

@@ -64,13 +69,18 @@ target_include_directories(rmw_cyclonedds_cpp PUBLIC
$<INSTALL_INTERFACE:include>
)

target_link_libraries(rmw_cyclonedds_cpp CycloneDDS::ddsc)
target_link_libraries(rmw_cyclonedds_cpp
${rmw_dds_common_LIBRARIES__rosidl_typesupport_introspection_cpp}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line shouldn't be necessary after ros2/rmw_dds_common#12

@@ -0,0 +1,53 @@
// Copyright 2019 Open Source Robotics Foundation, Inc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file shouldn't be necessary after ros2/rmw_dds_common#12.

}
/* I suppose I could attach lambda functions one way or another, which would
definitely be more elegant, but this avoids having to deal with the C++
freakishness that is involved and works, too. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, using lambdas would probably be more confusing.

std::lock_guard<std::mutex> lock(gcdds.domains_lock);
if (!check_create_domain_locked(did, localhost_only)) {
return nullptr;
if (!check_create_domain(domain_id, options->localhost_only)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In rmw_fastrtps, a Participant is created when the first Node is created within a Context.
This is also a valid option, and shouldn't create cross-vendor problems.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea when the context gets initialized relative to when the first node is created. If the two always go hand in hand, then it is not worth changing. But if applications often create the context much earlier (say, before checking their command-line arguments), it would be better to do it the same way as in rmw_fastrtps. I'm happy to go with this for now.

if (ppant_qos == nullptr) {
return RMW_RET_BAD_ALLOC;
}
std::string user_data = std::string("securitycontext=") + std::string(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

securitycontext was renamed enclave

// - Whatever the graph cache implementation does, it shouldn't involve much more than local state
// updates and triggering a guard condition, and so that should be safe.
// - There might be a significant cost to calling the graph cache function to update its internal
// state, and having that occur on the thread receiving data from the network may not be wise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the last point is the main reason I've used a separate thread.
rmw_dds_common can probably be largely improved (and the way ros discovery data is shared too).

rmw_cyclonedds_cpp/src/rmw_node.cpp Outdated Show resolved Hide resolved
context->implementation_identifier,
eclipse_cyclonedds_identifier,
return RMW_RET_INCORRECT_RMW_IMPLEMENTATION);
delete context->impl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this deleting everything created in rmw_init correctly? I didn't check thoroughly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! The RMW publisher, subscription and guard condition objects for interfacing wiht the graph cache weren't deleted (the underlying DDS objects did get deleted as a consequence of deleting the participant).

I've added deleting them to the rmw_context_impl_t destructor and adjusted the initialisation code a bit so that I think it should all work fine now. I've tried it, also with some fault injection, but I might be getting some of the C++-ey bits wrong. @rotu perhaps you could have a look as well?

rmw_cyclonedds_cpp/src/rmw_node.cpp Show resolved Hide resolved
@eboasson eboasson force-pushed the ppant-per-context branch from c8108b4 to 81b6c61 Compare April 16, 2020 10:34
@eboasson
Copy link
Collaborator Author

@ivanpauno I've rebased it to get all the updates rmw_dds_common and some of the other changes. That's the force-push. The other changes I made based on your comments are in separate comments.

Signed-off-by: Erik Boasson <[email protected]>
Copy link
Member

@ivanpauno ivanpauno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment, LGTM!

rmw_cyclonedds_cpp/src/rmw_node.cpp Outdated Show resolved Hide resolved
@ivanpauno
Copy link
Member

ivanpauno commented Apr 16, 2020

Full CI

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

(edit) retriggered CI jobs that failed because of a connection problem.

@ivanpauno
Copy link
Member

@eboasson test_echo_pub of ros2topic has hung in all platforms.
In some of them, I tried twice.

Either this PR introduced a race condition, or this change randomly triggered a race condition elsewhere.

I wouldn't discard the second option, considering ros2/build_farmer#248.
I will test locally if I have the time, but I don't promise anything.

Let me know if you can dig into the problem.

@eboasson
Copy link
Collaborator Author

@eboasson test_echo_pub of ros2topic has hung in all platforms.
In some of them, I tried twice.

Either this PR introduced a race condition, or this change randomly triggered a race condition elsewhere.

I wouldn't discard the second option, considering ros2/build_farmer#248.
I will test locally if I have the time, but I don't promise anything.

Let me know if you can dig into the problem.

I should be able to at least do some investigation on macOS

@eboasson
Copy link
Collaborator Author

@ivanpauno instead of hanging, that test fails for me:

======================================================================
FAIL: test_echo_basic (test_echo_pub.TestROS2TopicEchoPub) (topic='/clitest/topic/echo_incompatible_qos', provide_qos=True, compatible_qos=False)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/erik/ros2_ws/src/ros2/ros2cli/ros2topic/test/test_echo_pub.py", line 250, in test_echo_basic
    assert not command.output, (
AssertionError: Echo CLI should not have received anything with incompatible QoS
assert not 'Incompatible QoS Policy detected: RELIABILITY_QOS_POLICY\n'
 +  where 'Incompatible QoS Policy detected: RELIABILITY_QOS_POLICY\n' = <launch_testing.tools.process.ProcessProxy object at 0x1090be7d0>.output

There's no way the QoS matching suddenly stopped working, but it could be a publisher process that keeps running for too long. That also bears some similarity to a hanging test.

I also get, at the very end:

ros discovery info listener thread: wait failed, will shutdown ...

and that is weird, too, and might be a clue to the other symptoms. (@dennis-adlink has kindly put some effort into reproducing it as well so I could do some other urgent things, but last I heard he didn't encounter any failure.) These details need to be understood first ...

Also, the CI run now fails because I added a function to Cyclone (dds_get_guid, eclipse-cyclonedds/cyclonedds#483) to avoid having to write some truly ridiculous code in rmw_cyclonedds, but the CI run now uses Cyclone's releases/0.5.x branch which doesn't have that. I think for Foxy the CI should be using master. @rotu, I know I saw a few PRs here and there regarding the switching of Cyclone branches used by ROS2 but I never delved into the details. Would you have time to check why our local CI (and perhaps the "real" one, too) switched branches?

The gcdds destructur always runs upon termination of the process, also
when the process hasn't deleted the RMW objects first, and so possibly
before the rmw_context_impl_t destructor runs.  As a consequence the
discovery thread may never be stopped and run into an error from
dds_waitset_wait because the waitset has been deleted.

Deleting the domain in the destructor of rmw_context_impl_t avoids this
problem.

Signed-off-by: Erik Boasson <[email protected]>
@eboasson
Copy link
Collaborator Author

The

ros discovery info listener thread: wait failed, will shutdown ...

message was caused by a global destructor shutting down Cyclone DDS while the discovery thread was still running, which in turn was a consequence of the process not deleting the RMW objects before terminating. I've pushed a fix for that (which simply means it ends abruptly, but that is probably the intended behaviour).

Other than that, the error I'm getting:

test_echo_basic (test_echo_pub.TestROS2TopicEchoPub) ... [WARN] [1587305691.385956000] [cli_echo_pub.cli_echo_pub_test_node]: New subscription discovered on this topic, requesting incompatible QoS. No messages will be sent to it. Last incompatible policy: RELIABILITY_QOS_POLICY

is caused by test script being unable to handle the warnings for incompatible QoS pairs that ROS2 now prints. When using Fast-RTPS the test succeeds because that Fast-RTPS doesn't support the feature yet (eProsima/Fast-DDS#852).

Each test takes ridiculously long, but I have not seen any hanging tests. So I assume your hunch that the hangs tests might be caused by some things in other packages is correct.

I think we ought to do another CI run with the fixes I pushed since the previous one and see what comes out.

@eboasson eboasson changed the title WIP: Switch to one participant per context model Switch to one participant per context model Apr 20, 2020
@ivanpauno
Copy link
Member

ivanpauno commented Apr 21, 2020

Would you have time to check why our local CI (and perhaps the "real" one, too) switched branches?

We're also using releases/0.5.x in CI:
https://github.com/ros2/ros2/blob/5086f2436cbf8450f91fd91770f28883aed52375/ros2.repos#L29

I think we ought to do another CI run with the fixes I pushed since the previous one and see what comes out.

I will give it a try locally and then run CI, using both this change and cyclonedds master.

@ivanpauno
Copy link
Member

ivanpauno commented Apr 21, 2020

Locally, it works fine.
Another CI round:

  • Linux Build Status (unrelated failures)
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Edit: Originally, I triggered some of the jobs with only rmw_cyclonedds. That's currently broken (http://build.ros2.org/view/Fci/job/Fci__nightly-cyclonedds_ubuntu_focal_amd64/), so I retriggered including Fast-RTPS too.

@ivanpauno
Copy link
Member

@eboasson Failures are unrelated, so I will merge this after the update in ros2.repos file is merged.

Thanks for working on this!

@ivanpauno
Copy link
Member

Ok, I will merge this one, as cyclonedds branch was moved to master again, thanks @eboasson.

@rotu I think it would be good to update the checks here to use master again.

@ivanpauno ivanpauno merged commit 8a1d3fc into ros2:master Apr 22, 2020
@rotu
Copy link
Collaborator

rotu commented Apr 22, 2020

@rotu I think it would be good to update the checks here to use master again.

If you mean ros2/ros2#905, then yes, this corrects my mistake. If you mean something else, please elaborate.

@clalancette
Copy link
Contributor

@ivanpauno @eboasson I think this commit is causing build warnings on Linux:

15:37:21 [ 10%] Building CXX object CMakeFiles/rmw_cyclonedds_cpp.dir/src/rmw_node.cpp.o
15:37:21 /home/jenkins-agent/workspace/ci_linux/ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp: In function ‘void discovery_thread(rmw_context_impl_t*)’:
15:37:21 /home/jenkins-agent/workspace/ci_linux/ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:579:80: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
15:37:21   579 |   auto sub = static_cast<CddsSubscription const * const>(impl->common.sub->data);
15:37:21       |                                                                                ^
15:37:21 /home/jenkins-agent/workspace/ci_linux/ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:580:96: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
15:37:21   580 |   auto gc = static_cast<CddsGuardCondition const * const>(impl->common.listener_thread_gc->data);
15:37:21       |                              

@ivanpauno
Copy link
Member

Sorry, I checked the test failures but I didn't see the warnings.
I will work on a fix.

@ivanpauno ivanpauno mentioned this pull request Apr 22, 2020
@rotu
Copy link
Collaborator

rotu commented Apr 23, 2020

I think this created a build mess by introducing a dependency on FastRTPS.

colcon graph --packages-up-to rmw_cyclonedds_cpp --packages-above fastrtps   
fastrtps                         +**.......
rosidl_typesupport_fastrtps_cpp   +*..***..
rosidl_typesupport_fastrtps_c      +*..**..
rosidl_typesupport_c                +****..
rosidl_generator_py                  + **..
rosidl_typesupport_cpp                +**..
rosidl_default_generators              + * 
rosidl_default_runtime                  +*.
rmw_dds_common                           +*
rmw_cyclonedds_cpp                        +

@ivanpauno
Copy link
Member

I think this created a build mess by introducing a dependency on FastRTPS.

colcon graph --packages-up-to rmw_cyclonedds_cpp --packages-above fastrtps   
fastrtps                         +**.......
rosidl_typesupport_fastrtps_cpp   +*..***..
rosidl_typesupport_fastrtps_c      +*..**..
rosidl_typesupport_c                +****..
rosidl_generator_py                  + **..
rosidl_typesupport_cpp                +**..
rosidl_default_generators              + * 
rosidl_default_runtime                  +*.
rmw_dds_common                           +*
rmw_cyclonedds_cpp                        +

No, it doesn't. Please continue the discussion in ros2/rmw_dds_common#16.

rotu added a commit to RoverRobotics-forks/rmw_cyclonedds that referenced this pull request Apr 25, 2020
Since ros2#145, the CI build of rmw_cyclonedds_cpp has been failing on Windows due to inadvertently injecting fastrtps into the build process.
fastrtps fails to build (eProsima/Fast-DDS#1173) causing the CI to fail.
There doesn't seem to be a better way to suppress this in action-ros-ci ros-tooling/action-ros-ci#177

Fixes ros2#164
rotu added a commit to RoverRobotics-forks/rmw_cyclonedds that referenced this pull request Apr 25, 2020
Since ros2#145, the CI build of rmw_cyclonedds_cpp has been failing on Windows due to inadvertently injecting fastrtps into the build process.
fastrtps fails to build (eProsima/Fast-DDS#1173) causing the CI to fail.
There doesn't seem to be a better way to suppress this in action-ros-ci ros-tooling/action-ros-ci#177

Fixes ros2#164
rotu added a commit to RoverRobotics-forks/rmw_cyclonedds that referenced this pull request Apr 28, 2020
Since ros2#145, the CI build of rmw_cyclonedds_cpp has been failing on Windows due to inadvertently injecting fastrtps into the build process.
fastrtps fails to build (eProsima/Fast-DDS#1173) causing the CI to fail.
There doesn't seem to be a better way to suppress this in action-ros-ci ros-tooling/action-ros-ci#177

Fixes ros2#164
rotu added a commit to RoverRobotics-forks/rmw_cyclonedds that referenced this pull request Apr 28, 2020
Since ros2#145, the CI build of rmw_cyclonedds_cpp has been failing on Windows due to inadvertently injecting fastrtps into the build process.
fastrtps fails to build (eProsima/Fast-DDS#1173) causing the CI to fail.
There doesn't seem to be a better way to suppress this in action-ros-ci ros-tooling/action-ros-ci#177

Fixes ros2#164
rotu added a commit to RoverRobotics-forks/rmw_cyclonedds that referenced this pull request Apr 28, 2020
Since ros2#145, the CI build of rmw_cyclonedds_cpp has been failing on Windows due to inadvertently injecting fastrtps into the build process.
fastrtps fails to build (eProsima/Fast-DDS#1173) causing the CI to fail.
There doesn't seem to be a better way to suppress this in action-ros-ci ros-tooling/action-ros-ci#177

Fixes ros2#164
rotu added a commit that referenced this pull request Apr 29, 2020
Since #145, the CI build of rmw_cyclonedds_cpp has been failing on Windows due to inadvertently injecting fastrtps into the build process.
fastrtps fails to build (eProsima/Fast-DDS#1173) causing the CI to fail.
There doesn't seem to be a better way to suppress this in action-ros-ci ros-tooling/action-ros-ci#177

Fixes #164
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants