[Bug] Debian/Ubuntu zenoh-bridge-ros2dds.service panic when interface not yet up during boot. #77

Closed
aosmw opened this issue Feb 2, 2024 · 4 comments · Fixed by #112
Labels
bug Something isn't working

Comments


aosmw commented Feb 2, 2024

Describe the bug

I am running zenoh-bridge-ros2dds as a service that otherwise works, with a conf.json5 whose listen entry specifies an endpoint address that is not yet available when zenoh attempts to bind to it. Maybe zenoh could retry, possibly with a timeout, rather than panic?

listen: {
endpoints: ["tcp/192.168.0.241:7447"]
},

It appears that After/Wants=network-online.target is not enough to ensure the interface is up and ready to be bound to.

head -5 /etc/systemd/system/zenoh-bridge-ros2dds.service
[Unit]
Description = Eclipse Zenoh Bridge for ROS2 with a DDS RMW
Documentation=https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds
After=network-online.target
Wants=network-online.target

I can restart the service successfully after the interface is actually up.

Feb 02 11:54:54 iwdbase systemd[1]: Started Eclipse Zenoh Bridge for ROS2 with a DDS RMW.
Feb 02 11:54:54 iwdbase zenoh-bridge-ros2dds[9499]: [2024-02-02T00:54:54Z INFO  zenoh_bridge_ros2dds] zenoh-bridge-ros2dds v0.10.1-rc.2 built with rustc 1.72.0 (5680fa18f 2023-08-23)
Feb 02 11:54:54 iwdbase zenoh-bridge-ros2dds[9499]: [2024-02-02T00:54:54Z DEBUG zenoh::net::runtime] Zenoh Rust API v0.10.1-rc
Feb 02 11:54:54 iwdbase zenoh-bridge-ros2dds[9499]: [2024-02-02T00:54:54Z INFO  zenoh::net::runtime] Using PID: d391bd5b8b9524caa7d489bae271cff0
Feb 02 11:54:54 iwdbase zenoh-bridge-ros2dds[9499]: [2024-02-02T00:54:54Z DEBUG zenoh::net::routing::network] [Routers network] Add node (self) d391bd5b8b9524caa7d489bae271cff0
Feb 02 11:54:54 iwdbase zenoh-bridge-ros2dds[9499]: [2024-02-02T00:54:54Z DEBUG zenoh::net::routing::network] [Peers network] Add node (self) d391bd5b8b9524caa7d489bae271cff0
Feb 02 11:54:54 iwdbase zenoh-bridge-ros2dds[9499]: [2024-02-02T00:54:54Z ERROR zenoh::net::runtime::orchestrator] Unable to open listener tcp/192.168.0.241:7447: Can not create a new TCP listener bound to tcp/192.168.0.241:7447: [192.168.0.241:7447: Cannot assign requested address (os error 99) at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zenoh-link-tcp-0.10.1-rc/src/unicast.rs:245.] at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zenoh-link-tcp-0.10.1-rc/src/unicast.rs:333.
Feb 02 11:54:54 iwdbase zenoh-bridge-ros2dds[9499]: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Can not create a new TCP listener bound to tcp/192.168.0.241:7447: [192.168.0.241:7447: Cannot assign requested address (os error 99) at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zenoh-link-tcp-0.10.1-rc/src/unicast.rs:245.] at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zenoh-link-tcp-0.10.1-rc/src/unicast.rs:333.', zenoh-bridge-ros2dds/src/main.rs:77:62
Feb 02 11:54:54 iwdbase zenoh-bridge-ros2dds[9499]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Feb 02 11:54:55 iwdbase systemd[1]: zenoh-bridge-ros2dds.service: Main process exited, code=dumped, status=6/ABRT
Feb 02 11:54:55 iwdbase systemd[1]: zenoh-bridge-ros2dds.service: Failed with result 'core-dump'.

To reproduce

  1. Start zenoh-bridge-ros2dds.service with a listen endpoint whose address is not yet assigned to any interface.
  2. The bridge panics.

System info

  • Ubuntu 22.04.3 LTS
  • zenoh-bridge-ros2dds v0.10.1-rc.2 built with rustc 1.72.0 (5680fa18f 2023-08-23)

cat /etc/systemd/system/zenoh-bridge-ros2dds.service
[Unit]
Description = Eclipse Zenoh Bridge for ROS2 with a DDS RMW
Documentation=https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
Environment=RUST_LOG=debug
#Environment=RUST_BACKTRACE=1
Environment=ROS_DISTRO=humble
Environment=RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
Environment=CYCLONEDDS_URI="file:///etc/aos/zenoh/zenoh-bridge-cyclonedds-config.xml"
Environment=ROS_DOMAIN_ID="10"
ExecStart = /usr/bin/zenoh-bridge-ros2dds -c /etc/aos/zenoh/conf.json5
KillMode=mixed
KillSignal=SIGINT
RestartKillSignal=SIGINT
Restart=on-failure
PermissionsStartOnly=true
User=zenoh-bridge-ros2dds
StandardOutput=journal
StandardError=journal
SyslogIdentifier=zenoh-bridge-ros2dds

[Install]
WantedBy=multi-user.target

cat /etc/aos/zenoh/zenoh-bridge-cyclonedds-config.xml
<?xml version="1.0" encoding="utf-8"?>
<CycloneDDS
    xmlns="https://cdds.io/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd"
>
    <Domain Id="any">
        <General>
            <Interfaces>
                <NetworkInterface address="127.0.0.1" multicast="true"/>
            </Interfaces>
            <DontRoute>true</DontRoute>
        </General>
    </Domain>
</CycloneDDS>
@aosmw aosmw added the bug Something isn't working label Feb 2, 2024
Author

aosmw commented Feb 2, 2024

To have the service restart every 5 seconds, I added StartLimitBurst, StartLimitIntervalSec, and RestartSec:

[Unit]
Description = Eclipse Zenoh Bridge for ROS2 with a DDS RMW
Documentation=https://github.com/eclipse-zenoh/zenoh-plugin-ros2dds
After=network-online.target
Wants=network-online.target
StartLimitBurst=3
StartLimitIntervalSec=10

[Service]
Type=simple
Environment=RUST_LOG=debug
#Environment=RUST_BACKTRACE=1
Environment=ROS_DISTRO=humble
Environment=RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
Environment=CYCLONEDDS_URI="file:///etc/aos/zenoh/zenoh-bridge-cyclonedds-config.xml"
Environment=ROS_DOMAIN_ID="10"
ExecStart = /usr/bin/zenoh-bridge-ros2dds -c /etc/aos/zenoh/conf.json5
KillMode=mixed
KillSignal=SIGINT
RestartKillSignal=SIGINT
Restart=on-failure
RestartSec=5
PermissionsStartOnly=true
User=zenoh-bridge-ros2dds
StandardOutput=journal
StandardError=journal
SyslogIdentifier=zenoh-bridge-ros2dds

[Install]
WantedBy=multi-user.target

My systemd version is

systemd --version
systemd 249 (249.11-0ubuntu3.12)

Member

JEnoch commented Feb 2, 2024

I agree that ideally Zenoh should periodically retry opening the configured listen endpoints (in the same way it does for configured connect endpoints). I created eclipse-zenoh/zenoh#712 to address this.

Meanwhile, it makes sense to ship zenoh-bridge-ros2dds.service configured by default with StartLimitBurst, StartLimitIntervalSec, and RestartSec (and to have the bridge exit cleanly with an error rather than panic).

I'll address this soon.

Member

JEnoch commented Apr 11, 2024

I created #112 which sets RestartSec=5 for the service.
On reflection, I don't think configuring a StartLimit is a good idea: a Wi-Fi connection, for instance, might be established a very long time after system start. In that case, it makes sense to never stop trying to start the bridge until the interface is up and running.
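
For readers who want this "never stop trying" behaviour on their own system, here is a minimal sketch of a drop-in override. It assumes the drop-in is created with `systemctl edit zenoh-bridge-ros2dds.service`; the file path shown is that command's usual default, not something taken from #112. Setting StartLimitIntervalSec=0 disables systemd's start rate limiting, so Restart=on-failure keeps retrying indefinitely.

```ini
# /etc/systemd/system/zenoh-bridge-ros2dds.service.d/override.conf (assumed path)
[Unit]
# 0 disables start rate limiting, so systemd never gives up restarting the unit.
StartLimitIntervalSec=0

[Service]
# Restart after a non-zero exit code or an abort, such as the panic reported above.
Restart=on-failure
# Wait 5 seconds between attempts (the value set by #112).
RestartSec=5
```

After editing the drop-in by hand, run `systemctl daemon-reload` so systemd picks up the change.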

A better solution is probably for the user to change the service configuration so that it waits for a specific interface to be up and assigned an IP address, as explained here
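
One possible way to do this, not necessarily the approach in the linked explanation, is an ExecStartPre step that blocks until the listen address is present. The sketch below is an assumption-laden example: it reuses the address 192.168.0.241 from the report, assumes `ip` and `grep` are on the service's PATH, and assumes the drop-in path that `systemctl edit` would create.

```ini
# /etc/systemd/system/zenoh-bridge-ros2dds.service.d/wait-for-address.conf (assumed path)
[Service]
# Poll once per second until some interface carries the configured listen address.
# 192.168.0.241 is the address from the report above; replace it with your own.
ExecStartPre=/bin/sh -c 'until ip -4 -o addr show | grep -qF "inet 192.168.0.241/"; do sleep 1; done'
# The default start timeout would abort a long wait; "infinity" disables it.
TimeoutStartSec=infinity
```

With this in place the bridge is only launched once the bind can succeed, so a later failure to open the listener points to a real misconfiguration rather than a boot-order race.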

Finally, eclipse-zenoh/zenoh#770 introduced the possibility to configure Zenoh with a retry configuration for listen endpoints. However, it will silently and periodically retry binding the address, hiding possible issues or misconfiguration from the user. I don't think that's desirable for a service.

Member

Mallets commented Apr 22, 2024

I echo @JEnoch's reply. I believe the default configuration of Zenoh should make everything explicit, reproducible, and easy to inspect. A StartLimit makes the system less dependable in cases where the network becomes available only after the StartLimit has already expired. It is better not to impose such a limit in the shipped .service file, since the right value depends on the actual deployment.
