m_etcd.ClaimTTL is critical to ensuring a task is executed exactly once in a Metafora cluster.
Currently the claim TTL is 120s and the claim is refreshed every 90s. If a refresh fails and the task is lost, its handler has at most 30s (the 120s TTL minus the 90s refresh interval) to exit before the exactly-once guarantee is lost and the task becomes eligible for simultaneous execution within the cluster.
Claim TTL and refresh interval should be configurable because:
They are critical to Metafora's correct operation.
Solution: Configurable TTL, documented refresh calculation
Make the TTL configurable on m_etcd.EtcdCoordinator instead of via a global.
Document the refresh calculation from taskmgr.go and/or make it configurable.
Future Improvements
The coordinator could inform the task handler when the claim will expire via Stop() or metadata on a statemachine.Message. This would allow a handler to detect that simultaneous execution may have occurred and, if possible, choose to roll back a transaction, not flush data, avoid checkpointing, etc.
Claim TTL / Refresh Interval may be more appropriate to define per task, since a safe interval depends on how long a task handler takes to exit. Ideally a handler would checkpoint at intervals shorter than the Claim TTL, so that if the claim expires it can simply skip its final checkpoint, since another node may already be executing that task.