Workflow Locking and Recovery

nFlow uses a locking mechanism to ensure that each workflow instance is processed by only one executor at a time. The lock is held in the database: the executor that picks up an instance writes its own ID to the nflow_workflow.executor_id field, and while that field is set, other executors do not process or modify the instance.
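
The locking rule above behaves like a compare-and-set on the executor_id field. The following is a minimal in-memory sketch of that idea, not nFlow's actual code: the class WorkflowLockSketch and its methods are hypothetical, and a concurrent map stands in for the nflow_workflow table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the nflow_workflow table; not nFlow's actual code.
class WorkflowLockSketch {
  // Workflow instance id -> executor_id holding the lock (no entry means unlocked).
  private final Map<Long, Integer> executorIdByInstance = new ConcurrentHashMap<>();

  // Claims the instance only if no executor holds it, much like a conditional UPDATE would.
  boolean tryLock(long instanceId, int executorId) {
    return executorIdByInstance.putIfAbsent(instanceId, executorId) == null;
  }

  // Releases the lock only if the given executor still owns it.
  boolean unlock(long instanceId, int executorId) {
    return executorIdByInstance.remove(instanceId, executorId);
  }
}
```

In this sketch a second executor calling tryLock on a locked instance simply gets false and moves on, which is the exclusivity described above.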

However, if an executor fails (e.g., due to a crash or network issues), the lock must be released to prevent workflow instances from becoming stuck indefinitely. nFlow provides a mechanism called workflow instance recovery to handle such scenarios.

Workflow Instance Recovery and State Persistence

When an executor in an executor group (defined by nflow_executor.executor_group) stops updating its heartbeat (nflow_executor.active), the other executors in the same group can recover the workflow instances it has locked (identified by nflow_workflow.executor_id).

Recovery Process

When recovery occurs:

The nflow_workflow.executor_id is cleared.
The workflow instance becomes eligible for re-execution. The interrupted state may therefore run again, which is why state implementations should be idempotent.
A workflow instance action is generated, which allows you to track recovery events.
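
The three recovery steps can be sketched with a small in-memory model. This is an illustration under stated assumptions, not nFlow's implementation: RecoverySketch, its fields and the action text are hypothetical, and plain maps stand in for the nflow_executor and nflow_workflow tables.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the recovery steps; names are illustrative, not nFlow's API.
class RecoverySketch {
  // Executor id -> last heartbeat (stand-in for nflow_executor.active).
  final Map<Integer, Instant> heartbeats = new HashMap<>();
  // Workflow instance id -> locking executor id (stand-in for nflow_workflow.executor_id).
  final Map<Long, Integer> locks = new HashMap<>();
  // One entry per recovered instance (stand-in for workflow instance actions).
  final List<String> actions = new ArrayList<>();

  // Clears locks held by executors whose heartbeat is older than the timeout.
  int recover(Instant now, Duration timeout) {
    int recovered = 0;
    Iterator<Map.Entry<Long, Integer>> it = locks.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Long, Integer> lock = it.next();
      long instanceId = lock.getKey();
      int executorId = lock.getValue();
      Instant lastHeartbeat = heartbeats.get(executorId);
      boolean expired = lastHeartbeat == null
          || Duration.between(lastHeartbeat, now).compareTo(timeout) > 0;
      if (expired) {
        // Steps 1 and 2: clearing the lock makes the instance eligible for re-execution.
        it.remove();
        // Step 3: record an action so the recovery can be tracked later.
        actions.add("recovered instance " + instanceId + " from executor " + executorId);
        recovered++;
      }
    }
    return recovered;
  }
}
```

Instances locked by executors with a recent heartbeat are left untouched, so only work owned by a presumed-dead executor is released.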

Recovery Trigger

By default, an executor is considered dead, and its workflow instances become recoverable, when its heartbeat (nflow_executor.active) is more than 900 seconds (15 minutes) old. This timeout can be configured using the nflow.executor.timeout.seconds property.

Recovery Check Frequency

Each executor checks for workflow instances eligible for recovery when it updates its heartbeat. By default, this happens every 60 seconds, and can be adjusted using the nflow.executor.keepalive.seconds property.
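
Taken together, the two properties give a rough upper bound on how long an instance can stay locked after its executor dies: the heartbeat must first exceed the timeout, and a surviving executor then notices at its next heartbeat. The sketch below works through that arithmetic. The class RecoveryTimingSketch and its methods are hypothetical, and the bound ignores clock skew and database latency.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for estimating recovery delay; not part of nFlow.
class RecoveryTimingSketch {
  // Parses properties-file text, e.g. the contents of an application properties file.
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  // Rough upper bound, in seconds, between a dead executor's last heartbeat and recovery.
  static long worstCaseRecoveryDelaySeconds(Properties props) {
    // The fallback values are the defaults stated in this document.
    long timeout = Long.parseLong(props.getProperty("nflow.executor.timeout.seconds", "900"));
    long keepalive = Long.parseLong(props.getProperty("nflow.executor.keepalive.seconds", "60"));
    return timeout + keepalive;
  }
}
```

With the defaults this estimate is 900 + 60 = 960 seconds. Lowering the timeout shortens the wait but makes it more likely that a slow, still-running executor is treated as dead.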

State Persistence After Recovery

After executing a state, an executor keeps retrying to persist the result until it succeeds, with no retry limit. The exception is a workflow instance that has meanwhile been recovered by another executor: in that case the state update from the original executor is ignored, and a warning is logged.
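
The retry behaviour can be sketched as a loop that only stops when the update is saved or the lock turns out to belong to someone else. This is a simplified model under stated assumptions: PersistRetrySketch, its nested Store class and StoreUnavailableException are hypothetical stand-ins, and the warning text is illustrative rather than nFlow's actual log message.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of "retry persisting until saved or recovered"; not nFlow's code.
class PersistRetrySketch {
  static class StoreUnavailableException extends RuntimeException {
    StoreUnavailableException(String message) { super(message); }
  }

  // Stand-in for the database.
  static class Store {
    final Map<Long, Integer> lockOwner = new ConcurrentHashMap<>();
    final Map<Long, String> state = new ConcurrentHashMap<>();
    int failuresLeft; // number of upcoming save attempts that fail (simulated outage)

    // Returns true if saved, false if the instance is no longer locked by this executor.
    boolean save(long instanceId, int executorId, String newState) {
      if (failuresLeft > 0) {
        failuresLeft--;
        throw new StoreUnavailableException("database unavailable");
      }
      if (!Integer.valueOf(executorId).equals(lockOwner.get(instanceId))) {
        return false; // recovered by another executor in the meantime
      }
      state.put(instanceId, newState);
      lockOwner.remove(instanceId); // release the lock together with the update
      return true;
    }
  }

  // Retries without limit while the store is unavailable.
  static boolean persistUntilDone(Store store, long instanceId, int executorId,
      String newState, long retryDelayMillis) {
    while (true) {
      try {
        boolean saved = store.save(instanceId, executorId, newState);
        if (!saved) {
          System.err.println("WARN: instance " + instanceId + " was recovered, update ignored");
        }
        return saved;
      } catch (StoreUnavailableException e) {
        try {
          Thread.sleep(retryDelayMillis);
        } catch (InterruptedException interrupted) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while retrying", interrupted);
        }
      }
    }
  }
}
```

Checking lock ownership as part of the save is what lets the original executor detect that recovery has happened and drop its stale update.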

Example Scenario: Database Unavailability

Single Executor

If the database becomes unavailable, a single executor keeps retrying to persist the state until the database is available again and the update succeeds.
If the executor is restarted during the outage, it is treated as a new executor (with a new ID). The workflow instance remains locked by the old executor ID until it is recovered according to the timeout and check frequency described above.

Multiple Executors

If only one executor loses access to the database, other executors will recover its locked workflow instances according to the recovery process.
If all executors lose access to the database, the pending state update and the recovery race each other once the database is available again. There is no guarantee that the pending state update will complete before recovery. However, it is guaranteed that the workflow instance state will be processed at least once (hence the importance of idempotent state implementations).
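
Because a state can be processed more than once, a common way to keep it safe is to guard the side effect with a key that identifies the work already done. The sketch below shows one such pattern. It is an assumption-laden illustration, not an nFlow API: IdempotentStateSketch and sendConfirmation are hypothetical, and a real implementation would keep the processed keys in durable storage rather than in memory.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical example of an idempotent side effect; not an nFlow API.
class IdempotentStateSketch {
  // Keys of work already performed. In-memory only for this sketch.
  private final Set<String> processedKeys = new HashSet<>();
  private int confirmationsSent = 0;

  // Safe to call repeatedly for the same instance and state: only the first call has an effect.
  boolean sendConfirmation(long instanceId, String stateName) {
    String key = instanceId + ":" + stateName;
    if (!processedKeys.add(key)) {
      return false; // already handled, e.g. before the instance was recovered
    }
    confirmationsSent++; // stand-in for the real side effect
    return true;
  }

  int confirmationsSent() {
    return confirmationsSent;
  }
}
```

If the same state runs again after recovery, the second call returns false and the side effect is not repeated.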