Workflow Locking and Recovery
nFlow uses a locking mechanism to ensure that workflow instances are processed exclusively by one executor at a time. This is achieved by setting the nflow_workflow.executor_id field at the database level, which prevents other executors from modifying the same workflow instance.
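The locking idea can be illustrated with a minimal in-memory sketch. This is not nFlow's implementation: the class LockSketch, its methods, and the map that stands in for the nflow_workflow.executor_id column are all hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the nflow_workflow.executor_id column (illustration only). */
class LockSketch {
    // workflow instance id -> id of the executor that currently holds the lock
    private final Map<Long, Integer> executorIdByWorkflow = new ConcurrentHashMap<>();

    /** Claims the instance only if no executor holds it; returns true if the claim succeeded. */
    boolean tryLock(long workflowId, int executorId) {
        // putIfAbsent is atomic, so only one of several competing executors can win
        return executorIdByWorkflow.putIfAbsent(workflowId, executorId) == null;
    }

    /** Releases the lock, but only if it is still held by the given executor. */
    boolean unlock(long workflowId, int executorId) {
        return executorIdByWorkflow.remove(workflowId, executorId);
    }
}
```

In this sketch a second executor calling tryLock for an already claimed instance gets false and leaves the instance alone, which mirrors the exclusivity described above.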
However, if an executor fails (e.g., due to a crash or network issues), the lock must be released to prevent workflow instances from becoming stuck indefinitely. nFlow provides a mechanism called workflow instance recovery to handle such scenarios.
When an executor in an executor group (defined by nflow_executor.executor_group) loses its heartbeat (nflow_executor.active), other executors in the group can recover the locked workflow instances (identified by nflow_workflow.executor_id).
When recovery occurs:
- The nflow_workflow.executor_id is cleared.
- The workflow instance becomes eligible for re-execution. This is why it's important to ensure that state implementations are idempotent.
- A workflow instance action is generated, which allows you to track recovery events.
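The recovery steps above can be sketched in memory as follows. This is an illustration under stated assumptions, not nFlow's actual code: RecoverySketch, its fields, and the text of the recorded action are hypothetical, and the map stands in for the nflow_workflow.executor_id column.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative recovery of workflow instances locked by dead executors. */
class RecoverySketch {
    // workflow instance id -> executor id holding the lock (absent key means unlocked)
    final Map<Long, Integer> executorIdByWorkflow = new HashMap<>();
    // Recorded recovery events, standing in for workflow instance actions
    final List<String> actions = new ArrayList<>();

    /** Clears the locks held by the given dead executors and returns the recovered instance ids. */
    List<Long> recover(Set<Integer> deadExecutorIds) {
        List<Long> recovered = new ArrayList<>();
        for (Map.Entry<Long, Integer> entry : executorIdByWorkflow.entrySet()) {
            if (deadExecutorIds.contains(entry.getValue())) {
                recovered.add(entry.getKey());
            }
        }
        for (Long workflowId : recovered) {
            // Clearing the executor id makes the instance eligible for re-execution
            Integer deadExecutor = executorIdByWorkflow.remove(workflowId);
            // Record an action so the recovery event can be tracked later
            actions.add("recovered workflow " + workflowId + " from executor " + deadExecutor);
        }
        return recovered;
    }
}
```

Instances locked by live executors are left untouched, so only the work of the failed executor is released for re-execution.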
By default, recovery is triggered when an executor's heartbeat (nflow_executor.active) is delayed by more than 900 seconds (15 minutes). This interval can be configured using the nflow.executor.timeout.seconds property.
Each executor checks for workflow instances eligible for recovery when it updates its heartbeat. By default, this happens every 60 seconds, and can be adjusted using the nflow.executor.keepalive.seconds property.
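The timeout rule can be expressed as a small check. The sketch below is illustrative only: HeartbeatSketch and isExpired are hypothetical names, and the two constants simply mirror the default values quoted above.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative heartbeat expiry check using the default values quoted in the text. */
class HeartbeatSketch {
    // Default of nflow.executor.timeout.seconds according to the text above
    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(900);
    // Default of nflow.executor.keepalive.seconds according to the text above
    static final Duration DEFAULT_KEEPALIVE = Duration.ofSeconds(60);

    /** Returns true when the last heartbeat is older than the timeout (strictly more than). */
    static boolean isExpired(Instant lastActive, Instant now, Duration timeout) {
        return Duration.between(lastActive, now).compareTo(timeout) > 0;
    }
}
```

With these defaults, an executor has to miss roughly 15 consecutive heartbeats before it is considered dead. Because the check itself runs only on the heartbeat of the other executors, recovery would presumably start up to about one keepalive interval after the timeout has elapsed.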
After executing a workflow state, an executor keeps trying to persist the resulting state indefinitely, unless the workflow instance has been recovered by another executor. In that case, the state update from the original executor is ignored, and a warning is logged.
- If the database becomes unavailable, an executor keeps retrying the state update until the database is available again.
- If the executor is restarted during the outage, it is treated as a new executor (with a new ID). The workflow instance will then be recovered according to the configuration mentioned above.
- If only one executor loses access to the database, other executors will recover its locked workflow instances according to the recovery process.
- If all executors lose access to the database, a race condition could occur. There’s no guarantee that the pending state update will complete before recovery. However, it is guaranteed that the workflow instance state will be processed at least once (hence the importance of idempotent state implementations).
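How a late state update from the original executor ends up being ignored can be sketched as an ownership check at persist time. This is a simplified in-memory illustration, not nFlow's implementation: PersistSketch, its fields, and the warning text are hypothetical, and releasing the lock on a successful update is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative state update that is applied only while the executor still owns the instance. */
class PersistSketch {
    // workflow instance id -> executor id holding the lock (absent key means unlocked)
    final Map<Long, Integer> executorIdByWorkflow = new HashMap<>();
    // workflow instance id -> last persisted state
    final Map<Long, String> stateByWorkflow = new HashMap<>();

    /** Persists the new state if the executor still holds the lock; otherwise ignores the update. */
    boolean persist(long workflowId, int executorId, String newState) {
        Integer owner = executorIdByWorkflow.get(workflowId);
        if (owner == null || owner != executorId) {
            // The instance was recovered in the meantime, so the stale update is dropped
            System.err.println("WARN: ignoring state update for workflow " + workflowId
                    + " from executor " + executorId);
            return false;
        }
        stateByWorkflow.put(workflowId, newState);
        // Assumption of this sketch: the lock is released together with the state update
        executorIdByWorkflow.remove(workflowId);
        return true;
    }
}
```

Since the ignored update means the same state is executed again by another executor, any side effects of that state run more than once, which is why idempotent state implementations matter.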