Right now the manager exits EVerest whenever a child process (module) dies, and systemd then usually restarts EVerest to recover. If a module hangs in a command handler, the framework also times out and exits the module process, which results in a restart of EVerest. We recently added deadlock-detecting mutexes to EvseManager as well, to ensure a restart if that module hangs somewhere.
There is, however, no generic watchdog functionality that a module can use to watch a long-running thread (e.g. a main loop thread or an I/O thread) and check that it is still running. As a result, some threads in some modules may still hang without causing the module process to exit.
We should ensure that such scenarios will restart EVerest.
Work has started in #514 and corresponding PRs in framework and utils. Each module now has a WatchdogSupervisor that can be used within the module to register watchdogs for individual threads. The plan is to extend this further, so that the full chain down to the hardware watchdog is covered:
1. The module registers a watchdog with its supervisor.
2. The supervisor checks in its own thread whether the target thread is still alive.
3. The supervisor thread itself sends MQTT pings to the manager.
4. The manager ensures that the supervisor threads of all modules are running and that MQTT communication still works.
5. The manager (optionally) sends watchdog pings to systemd. This needs to be set up in the service file.
6. systemd itself uses the hardware watchdog device to ensure it is still alive.
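
To make the first two links of that chain concrete, here is a minimal sketch of how a per-module supervisor could track registered threads. It is an illustration only and not the code from #514: the `register_watchdog` and `expired` names, the feed-callback design, and the timeout handling are all assumptions, and the MQTT ping to the manager is left out.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; names and signatures are NOT the actual EVerest API.
class WatchdogSupervisor {
public:
    using Clock = std::chrono::steady_clock;

    // Registers a watchdog for one thread and returns a "feed" function.
    // The watched thread must call feed() at least once per timeout.
    std::function<void()> register_watchdog(const std::string& name,
                                            std::chrono::milliseconds timeout) {
        assert(timeout.count() > 0);
        auto entry = std::make_shared<Entry>(name, timeout);
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.push_back(entry);
        // The callback only touches an atomic, so feeding is cheap and lock-free.
        return [entry] {
            entry->last_feed.store(Clock::now().time_since_epoch().count());
        };
    }

    // Returns the names of all watchdogs that were not fed within their timeout.
    // The supervisor thread would call this periodically and stop pinging the
    // manager (or exit the process) as soon as the result is non-empty.
    std::vector<std::string> expired(Clock::time_point now = Clock::now()) const {
        std::vector<std::string> result;
        std::lock_guard<std::mutex> lock(mutex_);
        for (const auto& entry : entries_) {
            const Clock::time_point last{Clock::duration{entry->last_feed.load()}};
            if (now - last > entry->timeout) {
                result.push_back(entry->name);
            }
        }
        return result;
    }

private:
    struct Entry {
        std::string name;
        std::chrono::milliseconds timeout;
        std::atomic<Clock::rep> last_feed; // time of last feed, in clock ticks

        Entry(std::string n, std::chrono::milliseconds t)
            : name(std::move(n)), timeout(t),
              last_feed(Clock::now().time_since_epoch().count()) {}
    };

    mutable std::mutex mutex_;
    std::vector<std::shared_ptr<Entry>> entries_;
};
```

In this sketch a main loop thread would keep the returned function and call it once per iteration. The supervisor thread polls `expired()` and treats a non-empty result as a hung thread. Letting `expired()` take the current time as a parameter keeps the check testable without sleeping.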