Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't block in rmw_init checking for the router. #308

Merged
merged 3 commits into from
Nov 19, 2024

Conversation

clalancette
Copy link
Collaborator

It is much cleaner to just have rmw_init return without looping, so we switch to only checking for the router without waiting. However, if we don't find a router, warn the user.

It still works to start a publisher and subscription, then start the router; they'll connect at that point.

This is potentially somewhat controversial, but I think this is a better experience for users.

@clalancette clalancette requested a review from Yadunund November 12, 2024 21:06
@Yadunund
Copy link
Member

Not sure how I feel about this given that we call zc_liveliness_get to retrieve the graph cache right after checking for the router. If there is no router present, I don't know if we would receive liveliness tokens from other nodes in the graph. We certainly wont get them from nodes on a different host. And I'd say having the graph cache be accurate the first time we call ros2 node list or other introspection methods is a huge advantage of this RMW.

What do you think about

  1. moving the existing router check + zc_liveliness_get into a function which is called once during rmw_init.
  2. If we don't find the router in rmw_init, before we create a node, pub/sub, etc, we call the function from 1) each time?

PS we should also rmb to update the README if we decide to remove this behavior https://github.com/ros2/rmw_zenoh?tab=readme-ov-file#checking-for-a-zenoh-router

@clalancette
Copy link
Collaborator Author

I'm going to answer this in the reverse order.

  1. moving the existing router check + zc_liveliness_get into a function which is called once during rmw_init.

  2. If we don't find the router in rmw_init, before we create a node, pub/sub, etc, we call the function from 1) each time?

I thought about this as well, and the problem here is what we do in e.g. rmw_create_publisher if the router isn't running. We have 3 options:

  1. Return an error. I think this is problematic because the upper layers don't generally expect this, and I think will start throwing exceptions if rmw_create_publisher returns an error.
  2. Return an RMW_RET_OK without doing any work. This isn't correct in that we wouldn't actually be creating the publisher, so when things do come online the publisher wouldn't really exist.
  3. Print a warning and continue on. This is basically equivalent to what this PR is doing, but noisier.

If there is no router present, I don't know if we would receive liveliness tokens from other nodes in the graph.

So in my local testing, what I saw was that at startup, if the router is not running, then we do not receive any liveliness tokens from anything else in the network (even local nodes). However, we do receive liveliness token from ourself.

Once the router starts, then all liveliness tokens seem to be propagated from all nodes to all other nodes in the network. In other words, as soon as the router starts up, ros2 topic list does exactly what we expect.

@Yadunund
Copy link
Member

Once the router starts, then all liveliness tokens seem to be propagated from all nodes to all other nodes in the network. In other words, as soon as the router starts up, ros2 topic list does exactly what we expect.

That is good to know!

How do you feel about having this PR update the default value of zenoh_router_check_attempts() to 1 which would have the same effect as your changes? ie if the envar is unspecified, we check once and continue with the rest of the initialization instead of throwing an error.

That would give us some time to play around with rmw_zenoh with the new behavior while retaining mechanisms to switch back if we find it undesirable.

It is much cleaner to just have rmw_init return without
looping, so we switch to only checking for the router
without waiting.  However, if we don't find a router,
warn the user.

It still works to start a publisher and subscription, then
start the router; they'll connect at that point.

Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: Chris Lalancette <[email protected]>
@clalancette clalancette force-pushed the clalancette/remove-router-check branch from b13f168 to e4b97f2 Compare November 14, 2024 01:59
@clalancette
Copy link
Collaborator Author

How do you feel about having this PR update the default value of zenoh_router_check_attempts() to 1 which would have the same effect as your changes? ie if the envar is unspecified, we check once and continue with the rest of the initialization instead of throwing an error.

All right, I ended up doing this in the latest commits. Note that I still had to make some changes here so that we end up only sleeping when we have to, but still print out every second when we don't. But I tested this out locally, and ZENOH_ROUTER_CHECK_ATTEMPTS is still honored here.

Please take another look when you get a chance!

@clalancette clalancette merged commit 3d771c8 into rolling Nov 19, 2024
8 checks passed
@clalancette clalancette deleted the clalancette/remove-router-check branch November 19, 2024 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants