Watchdog in EVerest #539

Open
corneliusclaussen opened this issue Feb 9, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@corneliusclaussen
Contributor

Right now the manager exits EVerest whenever a child process (module) dies; systemd then usually restarts EVerest to recover. If a module hangs in a command handler, the framework also times out and exits the module process, which results in a restart of EVerest. We recently added deadlock-detecting mutexes to EvseManager as well, to ensure a restart if a module hangs somewhere.

However, there is no generic watchdog functionality that a module can use to watch a long-running thread (e.g. a main loop thread or an IO thread) and check that it is still running. Some threads in some modules can therefore still hang without causing the module process to exit.

We should ensure that such scenarios will restart EVerest.

@corneliusclaussen corneliusclaussen added the enhancement New feature or request label Feb 9, 2024
@corneliusclaussen
Contributor Author

Work has started in #514 and corresponding PRs in framework and utils. Each module now has a WatchdogSupervisor that can be used within the module to register watchdogs for individual threads. The plan is to extend this further, so that the full chain down to the hardware watchdog is covered:

1. A module registers a watchdog with its supervisor.
2. The supervisor checks in its own thread whether the target thread is still alive.
3. The supervisor thread itself sends MQTT pings to the manager.
4. The manager ensures that the supervisor threads of all modules are running and that MQTT communication still works.
5. The manager optionally sends watchdog pings to systemd (this needs to be set up in the service file).
6. systemd itself uses the hardware watchdog device to ensure it is still alive.
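The first two links of this chain (a thread registers a watchdog, and the supervisor checks it from its own thread) could look roughly like the sketch below. This is not the API from #514 or from the framework and utils PRs: the constructor arguments, `register_watchdog`, the returned feed function, and the timeout handler are all assumptions made for illustration. The MQTT ping to the manager is only marked with a comment.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch; names and signatures are assumptions, not the EVerest API.
class WatchdogSupervisor {
public:
    using Clock = std::chrono::steady_clock;
    using TimeoutHandler = std::function<void(const std::string&)>;

    WatchdogSupervisor(std::chrono::milliseconds check_interval, TimeoutHandler on_timeout)
        : check_interval_(check_interval), on_timeout_(std::move(on_timeout)),
          supervisor_thread_([this] { run(); }) {}

    ~WatchdogSupervisor() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        supervisor_thread_.join();
    }

    // Returns a "feed" function. The watched thread must call it more often
    // than `timeout`, otherwise the supervisor reports the watchdog as expired.
    std::function<void()> register_watchdog(std::string description, std::chrono::milliseconds timeout) {
        auto entry = std::make_shared<Entry>();
        entry->description = std::move(description);
        entry->timeout = timeout;
        entry->last_feed.store(Clock::now().time_since_epoch().count());
        {
            std::lock_guard<std::mutex> lock(mutex_);
            entries_.push_back(entry);
        }
        return [entry] { entry->last_feed.store(Clock::now().time_since_epoch().count()); };
    }

private:
    struct Entry {
        std::string description;
        std::chrono::milliseconds timeout{0};
        std::atomic<Clock::rep> last_feed{0};
    };

    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stop_) {
            const auto now = Clock::now();
            for (const auto& entry : entries_) {
                const Clock::time_point last{Clock::duration{entry->last_feed.load()}};
                if (now - last > entry->timeout) {
                    // Called on every check while expired and with the mutex held;
                    // a real handler would typically terminate the process.
                    on_timeout_(entry->description);
                }
            }
            // In the chain described above, the supervisor would ping the manager here.
            cv_.wait_for(lock, check_interval_, [this] { return stop_; });
        }
    }

    std::chrono::milliseconds check_interval_;
    TimeoutHandler on_timeout_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_{false};
    std::vector<std::shared_ptr<Entry>> entries_;
    std::thread supervisor_thread_; // declared last so it starts after all other members exist
};
```

A main loop or IO thread would call the returned feed function once per iteration. Because the handler in this sketch runs with the supervisor's mutex held, it must not call `register_watchdog` itself.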
