Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wakunode2 cannot be restarted immediately after stopping #959

Open
dao opened this issue May 10, 2022 · 6 comments
Open

wakunode2 cannot be restarted immediately after stopping #959

dao opened this issue May 10, 2022 · 6 comments

Comments

@dao
Copy link

dao commented May 10, 2022

After stopping wakunode2 with ctrl-c or systemctl restart wakunode2, I see the following logs on failure to restart:

...
DBG 2022-05-10 01:36:47.100+00:00 5/7 Starting node and mounted protocols    topics="wakunode.setup" tid=2801968 file=wakunode2.nim:1346
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1348) wakunode2
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1255) startNode
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(924) start
/mnt/c/source/status-im/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(930) start
/mnt/c/source/status-im/nim-waku/vendor/nim-libp2p/libp2p/switch.nim(239) start
/mnt/c/source/status-im/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(368) futureContinue
/mnt/c/source/status-im/nim-waku/vendor/nim-libp2p/libp2p/switch.nim(260) start
[[reraised from:
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1348) wakunode2
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1255) startNode
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(924) start
/mnt/c/source/status-im/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(368) futureContinue
/mnt/c/source/status-im/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(275) start
]]
[[reraised from:
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1348) wakunode2
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1255) startNode
]]
Error: unhandled exception: Failed to start one transport [TransportError]
wakunode2.service: Main process exited, code=exited, status=1/FAILURE
wakunode2.service: Failed with result 'exit-code'.
wakunode2.service: Scheduled restart job, restart counter is at 7.

In this case, wakunode2 was able to be restarted 60-70 seconds after the command to restart was issued (RestartSec=10, restart counter at 7; subsequent restart worked normally)

@kaiserd
Copy link
Contributor

kaiserd commented May 13, 2022

Thank you for posting this.

The reason is that sockets are currently not freed when the node is interrupted during setup. I takes a bit until the sockets are freed by the OS.
(For local testing, you can use --ports-shift=1 (incrementing after each restart).)

In the current version, this should only happen when the node is interrupted during setup. After the setup is complete, ctrl+c should cause a graceful shutdown.

Still, we should free resources when interrupted during setup, too.

@dao
Copy link
Author

dao commented May 13, 2022

It also happens after setup. In this case, the node had been operating for some amount of time before it crashed, initiating the restart sequence seen in the log. I have also seen it happen in testing where it's been running a while and then i restart to launch with a new config. I'll try to capture detailed debug logs of shutdown before this busy port startup error

@kaiserd
Copy link
Contributor

kaiserd commented May 13, 2022

Thank you for the info!

@kaiserd
Copy link
Contributor

kaiserd commented May 16, 2022

@dao

I have also seen it happen in testing where it's been running a while and then i restart to launch with a new config.

So, the problem in any case is a quick restart after a node shutdown?

I added issues

which cover aspects of your issue.

The missing part is resources not being freed even if the node ran for longer (past the setup).
Systemd should send SIGTERM when calling restart, which is handled by nwaku.
So, you should see info "Shutting down after receiving SIGTERM"

The SIGTERM default in systemd can be changed to SIGKILL, which would explain the behaviour... but that's unlikely.

I'll try to capture detailed debug logs of shutdown before this busy port startup error

That would be nice :).

@dao
Copy link
Author

dao commented May 17, 2022

That would be nice :).

All I've got so far, just tried to restart because I was seeing a bunch of invalid peers and wanted to update config.

INF 2022-05-17 00:27:37.729+00:00 Attempting dial                            topics="wakunode" tid=2973638 file=wakunode2.nim:790 wireAddr=/ip4/216.238.69.161/tcp/8000/wss peerId=16U*K8PVSD
INF 2022-05-17 00:27:37.730+00:00 Dialing peer from manager                  topics="wakupeers" tid=2973638 file=peer_manager.nim:51 wireAddr=/ip4/216.238.69.161/tcp/8000/wss peerId=16U*K8PVSD
DBG 2022-05-17 00:27:37.730+00:00 Dialing peer                               topics="libp2p dialer" tid=2973638 file=dialer.nim:53 peerId=16U*K8PVSD
DBG 2022-05-17 00:27:37.732+00:00 exiting pubsub read loop                   topics="libp2p pubsubpeer" tid=2973638 file=pubsubpeer.nim:160 conn=16U*K8PVSD:6282eb1eafefe64834159a15 peer=16U*K8PVSD closed=true
DBG 2022-05-17 00:27:37.732+00:00 exiting pubsub read loop                   topics="libp2p pubsubpeer" tid=2973638 file=pubsubpeer.nim:160 conn=16U*K8PVSD:628258a2afefe64834159854 peer=16U*K8PVSD closed=true
DBG 2022-05-17 00:27:37.733+00:00 Could not establish send connection        topics="libp2p pubsubpeer" tid=2973638 file=pubsubpeer.nim:205 msg="Couldn't get muxed stream"
DBG 2022-05-17 00:27:37.733+00:00 Error dialing                              topics="libp2p dialer" tid=2973638 file=dialer.nim:240 conn=16U*K8PVSD:628258a2afefe64834159852 msg="Stream EOF!"
DBG 2022-05-17 00:27:37.733+00:00 Dialing remote peer failed                 topics="wakupeers" tid=2973638 file=peer_manager.nim:73 msg="Stream EOF!"
ERR 2022-05-17 00:27:37.734+00:00 Failed to connect to peer                  topics="wakunode" tid=2973638 file=wakunode2.nim:797 wireAddr=/ip4/216.238.69.161/tcp/8000/wss peerId=16U*K8PVSD
nwaku.service: Succeeded.
Stopped nwaku.
Started nwaku.
DBG 2022-05-17 00:27:45.449+00:00 1/7 Setting up storage                     topics="wakunode.setup" tid=3069664 file=wakunode2.nim:1307
DBG 2022-05-17 00:27:45.450+00:00 2/7 Retrieve dynamic bootstrap nodes       topics="wakunode.setup" tid=3069664 file=wakunode2.nim:1320
ERR 2022-05-17 00:27:45.450+00:00 2/7 Retrieving dynamic bootstrap nodes failed. Continuing without dynamic bootstrap nodes. topics="wakunode.setup" tid=3069664 file=wakunode2.nim:1325
DBG 2022-05-17 00:27:45.450+00:00 3/7 Initializing node                      topics="wakunode.setup" tid=3069664 file=wakunode2.nim:1329
INF 2022-05-17 00:27:45.458+00:00 Initializing networking                    topics="wakunode.setup" tid=3069664 file=wakunode2.nim:236 addrs="@[/dns4/relayer.of.holdings/tcp/60000, /dns4/relayer.of.holdings/tcp/8000/wss]"
DBG 2022-05-17 00:27:45.465+00:00 creating new PeerManager                   topics="wakunode.setup" tid=3069664 file=peer_manager.nim:127
DBG 2022-05-17 00:27:45.465+00:00 no peer storage found                      topics="wakunode.setup" tid=3069664 file=peer_manager.nim:139
DBG 2022-05-17 00:27:45.466+00:00 4/7 Mounting protocols                     topics="wakunode.setup" tid=3069664 file=wakunode2.nim:1339
DBG 2022-05-17 00:27:45.466+00:00 initWakuRelay                              topics="wakurelay" tid=3069664 file=waku_relay.nim:48
DBG 2022-05-17 00:27:45.466+00:00 init WakuRelay                             topics="wakurelay" tid=3069664 file=waku_relay.nim:25
DBG 2022-05-17 00:27:45.466+00:00 init WakuRelay                             topics="wakurelay" tid=3069664 file=waku_relay.nim:25
INF 2022-05-17 00:27:45.466+00:00 mounting relay                             topics="wakunode" tid=3069664 file=wakunode2.nim:702 relayMessages=true
INF 2022-05-17 00:27:45.466+00:00 relay mounted successfully                 topics="wakunode" tid=3069664 file=wakunode2.nim:723
INF 2022-05-17 00:27:45.466+00:00 mounting libp2p ping protocol              topics="wakunode" tid=3069664 file=wakunode2.nim:742
INF 2022-05-17 00:27:45.466+00:00 mounting swap                              topics="wakunode" tid=3069664 file=wakunode2.nim:475 mode=Soft
INF 2022-05-17 00:27:45.466+00:00 wakuSwap init 2                            topics="wakunode" tid=3069664 file=waku_swap.nim:284
INF 2022-05-17 00:27:45.467+00:00 wakuSwap init 1                            topics="wakuswap" tid=3069664 file=waku_swap.nim:216
INF 2022-05-17 00:27:45.467+00:00 mounting store                             topics="wakunode" tid=3069664 file=wakunode2.nim:482
DBG 2022-05-17 00:27:45.467+00:00 mounting store with swap                   topics="wakunode" tid=3069664 file=wakunode2.nim:488
DBG 2022-05-17 00:27:45.467+00:00 init                                       topics="wakunode" tid=3069664 file=waku_store.nim:386
DBG 2022-05-17 00:27:45.467+00:00 5/7 Starting node and mounted protocols    topics="wakunode.setup" tid=3069664 file=wakunode2.nim:1346
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1348) wakunode2
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1255) startNode
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(924) start
/mnt/c/source/status-im/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(930) start
/mnt/c/source/status-im/nim-waku/vendor/nim-libp2p/libp2p/switch.nim(239) start
/mnt/c/source/status-im/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(368) futureContinue
/mnt/c/source/status-im/nim-waku/vendor/nim-libp2p/libp2p/switch.nim(260) start
[[reraised from:
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1348) wakunode2
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1255) startNode
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(924) start
/mnt/c/source/status-im/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(368) futureContinue
/mnt/c/source/status-im/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(275) start
]]
[[reraised from:
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1348) wakunode2
/mnt/c/source/status-im/nim-waku/waku/v2/node/wakunode2.nim(1255) startNode
]]
Error: unhandled exception: Failed to start one transport [TransportError]
nwaku.service: Main process exited, code=exited, status=1/FAILURE
nwaku.service: Failed with result 'exit-code'.
nwaku.service: Scheduled restart job, restart counter is at 1.
Stopped nwaku.

@dao
Copy link
Author

dao commented May 18, 2022

@dao

I have also seen it happen in testing where it's been running a while and then i restart to launch with a new config.

So, the problem in any case is a quick restart after a node shutdown?

I added issues

which cover aspects of your issue.

The missing part is resources not being freed even if the node ran for longer (past the setup). Systemd should send SIGTERM when calling restart, which is handled by nwaku. So, you should see info "Shutting down after receiving SIGTERM"

The SIGTERM default in systemd can be changed to SIGKILL, which would explain the behaviour... but that's unlikely.

I'll try to capture detailed debug logs of shutdown before this busy port startup error

That would be nice :).

latest stop

DBG 2022-05-18 02:05:48.541+00:00 get_waku_v2_relay_v1_messages              topics="relay api" tid=3096284 file=relay_api.nim:72 topic=/waku/2/default-waku/proto


Stopping nwaku...
INF 2022-05-18 02:33:02.669+00:00 Shutting down after receiving SIGTERM      topics="wakunode.setup" tid=3096284 file=wakunode2.nim:1380
DBG 2022-05-18 02:33:02.670+00:00 stop                                       topics="wakurelay" tid=3096284 file=waku_relay.nim:93
INF 2022-05-18 02:33:02.674+00:00 Dialing peer from manager                  topics="wakupeers" tid=3096284 file=peer_manager.nim:51 wireAddr=/ip4/18.117.34.173/tcp/60000 peerId=16U*LgUomj
^Tnwaku.service: State 'stop-final-sigterm' timed out. Killing.
nwaku.service: Killing process 3096284 (wakunode2) with signal SIGKILL.
nwaku.service: Killing process 3096292 (wakunode2) with signal SIGKILL.
nwaku.service: Failed with result 'timeout'.
Stopped nwaku.

systemctl stop nwaku resulted in nwaku hearing the SIGTERM signal and knowing it should stop. Then it starts dialing a new peer. The stop instruction times out and is SIGKILLd.

After waiting a minute before restarting (because I was typing this), start fails exactly as above .. 8 times (at 10s intervals).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Icebox
Development

No branches or pull requests

5 participants