re-enqueue pending workflows during recovery #739

maxdml · 2025-01-29T19:21:09Z

When performing recovery, we now re-enqueue workflows that came from a queue, so that recovered queue tasks respect the queue's concurrency limits.
Re-enqueueing means resetting the start time and executor assignment in the queue table, which ensures the task is re-inserted at the same position in the queue.
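
As a rough illustration of the reset described above, here is a minimal in-memory sketch. The `QueueRow` shape, its field names, and the `reEnqueue` helper are hypothetical stand-ins for the queue table and are not taken from the DBOS code; only the idea (clear the start time and executor, keep the enqueue time, set the status back to ENQUEUED) comes from this PR.

```typescript
// Hypothetical in-memory model of a queue table row; field names are illustrative.
interface QueueRow {
  workflowId: string;
  queueName: string;
  createdAtMs: number; // enqueue time, left untouched on re-enqueue
  startedAtMs: number | null; // set when a worker dequeues the task
  executorId: string | null; // worker that claimed the task
}

type WorkflowStatus = "ENQUEUED" | "PENDING";

// Clear the start time and executor assignment so the task looks unclaimed again.
// The input row is not mutated; a reset copy is returned with the new status.
function reEnqueue(row: QueueRow): { row: QueueRow; status: WorkflowStatus } {
  return {
    row: { ...row, startedAtMs: null, executorId: null },
    status: "ENQUEUED",
  };
}
```

Because `createdAtMs` is preserved, a dequeue that orders by enqueue time would see the task at its original position.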

src/dbos-executor.ts

maxdml · 2025-01-30T06:09:12Z

src/wfqueue.ts

        logger.info("Workflow queues:");
        for (const [qn, q] of this.wfQueuesByName) {
-            const conc = q.concurrency !== undefined ? `${q.concurrency}` : 'No concurrency limit set';
+            const conc = q.concurrency !== undefined ? `global concurrency limit: ${q.concurrency}` : 'No concurrency limit set';
            logger.info(`    ${qn}: ${conc}`);
+            const workerconc = q.workerConcurrency !== undefined ? `worker concurrency limit: ${q.workerConcurrency}` : 'No worker concurrency limit set';
+            logger.info(`    ${qn}: ${workerconc}`);


@apoliakov btw I noticed these logs are not in order in the dashboard. I assume this is due to timestamp granularity (and also to concurrency, because the workflow queue runner microtask goes to sleep periodically).

src/system_database.ts

chuck-dbos · 2025-01-30T21:23:10Z

I am having a real problem understanding what this is supposed to accomplish, and that the test shows that it is accomplished.

OK, so it says it will change the database record contents and it does. (But why though?) This is a reverse state change (PENDING->ENQUEUED, where the prior invariant was that it always went the other way) and the consequences of that are not really tested.

maxdml · 2025-01-30T21:50:26Z

> I am having a real problem understanding what this is supposed to accomplish, and that the test shows that it is accomplished.
>
> OK, so it says it will change the database record contents and it does. (But why though?) This is a reverse state change (PENDING->ENQUEUED, where the prior invariant was that it always went the other way) and the consequences of that are not really tested.

What this does is ensure that, upon explicit runs of the recovery logic, we don't immediately start executing workflows that are otherwise assigned to a queue. This effectively re-enqueues them at the exact place they were. The reason is that the current logic violates a queue's concurrency limits: a given worker can process queue tasks from two threads, the queue thread itself and the recovery thread.
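
To make the violation concrete, here is a small toy model, not the DBOS implementation. `WorkerModel` and its methods are hypothetical names: the queue runner checks the worker concurrency limit, direct recovery does not, and re-enqueueing routes the recovered task back through the runner.

```typescript
// Hypothetical model of one worker; names are illustrative, not DBOS APIs.
class WorkerModel {
  running: string[] = [];
  queue: string[] = [];
  workerConcurrency: number;

  constructor(workerConcurrency: number) {
    this.workerConcurrency = workerConcurrency;
  }

  // Queue runner: starts tasks only while under the concurrency limit.
  runQueueOnce(): void {
    while (this.running.length < this.workerConcurrency && this.queue.length > 0) {
      this.running.push(this.queue.shift()!);
    }
  }

  // Earlier recovery behavior: execute immediately, bypassing the limit check.
  recoverDirectly(task: string): void {
    this.running.push(task);
  }

  // Recovery via re-enqueue: put the task back at the front of the queue.
  recoverByReEnqueue(task: string): void {
    this.queue.unshift(task);
  }
}

// Direct recovery: task B is already running when task A is recovered.
const before = new WorkerModel(1);
before.queue.push("B");
before.runQueueOnce();
before.recoverDirectly("A"); // two tasks now run despite a limit of 1

// Re-enqueue: task A goes back through the queue runner.
const after = new WorkerModel(1);
after.queue.push("B");
after.recoverByReEnqueue("A");
after.runQueueOnce(); // only A runs; B waits for a free slot
```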

Are you worried about the FIFO properties of the queue? At this point the task has already been dequeued, and workers might have dequeued other tasks (within the concurrency limits, if any). So with the current queue semantics we have no way to enforce FIFO for recovered workflows.

What this does is re-enqueue with the "highest" priority (i.e., leave the enqueue time unchanged such that the task will be dequeued first at the next iteration.) We could decide to re-enqueue with a new created_at value (place the recovered task at the end of the queue.)
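
The two options above can be compared with a short sketch. It assumes, for illustration only, that tasks are dequeued in ascending order of their enqueue timestamp; `QueuedTask`, `createdAt`, and `dequeueOrder` are hypothetical names rather than the actual schema or code.

```typescript
// Hypothetical task shape: dequeue order is assumed to follow createdAt ascending.
interface QueuedTask {
  id: string;
  createdAt: number;
}

// Return task ids in the order they would be dequeued (oldest first).
function dequeueOrder(tasks: QueuedTask[]): string[] {
  return [...tasks].sort((a, b) => a.createdAt - b.createdAt).map((t) => t.id);
}

const waiting: QueuedTask[] = [
  { id: "wf-2", createdAt: 200 },
  { id: "wf-3", createdAt: 300 },
];
const recovered: QueuedTask = { id: "wf-1", createdAt: 100 };

// Option taken in this PR: keep the original enqueue time, so wf-1 is dequeued first.
const keepOriginal = dequeueOrder([...waiting, recovered]);

// Alternative: assign a fresh enqueue time, so wf-1 moves to the end of the queue.
const refreshed = dequeueOrder([...waiting, { ...recovered, createdAt: 400 }]);
```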

What the test does is verify that the recovery code indeed clears the task's assignment, and that the task is then dequeued again and executed. We could make the test more complex and ensure that concurrency is respected, but I trust this is already covered by the existing worker concurrency tests.

Are you suggesting I missed places in the code that rely on PENDING workflows never being ENQUEUED again?

maxdml · 2025-02-07T02:22:35Z

src/system_database.ts

@@ -325,7 +327,7 @@ export class PostgresSystemDatabase implements SystemDatabase {
    // Every time we init the status, we increment `recovery_attempts` by 1.
    // Thus, when this number becomes equal to `maxRetries + 1`, we should mark the workflow as `RETRIES_EXCEEDED`.
    const attempts = resRow.recovery_attempts;
-    if (attempts > initStatus.maxRetries) {
+    if (attempts > initStatus.maxRetries + 1) {


In the last TS PR I implemented "max attempts" instead of "max recoveries"; this change fixes the check.
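
One reading of the difference, sketched with hypothetical helper names (neither function exists in the codebase): if the counter is incremented on every status init, the first execution already counts as attempt 1, so allowing `maxRetries` recoveries means allowing `maxRetries + 1` attempts in total.

```typescript
// Hypothetical helper mirroring the corrected comparison in the diff:
// the first execution plus maxRetries recoveries are allowed.
function recoveriesExceeded(attempts: number, maxRetries: number): boolean {
  return attempts > maxRetries + 1;
}

// The earlier comparison, which effectively capped total attempts at maxRetries.
function attemptsExceeded(attempts: number, maxRetries: number): boolean {
  return attempts > maxRetries;
}
```

Under this reading, with `maxRetries = 2`, the third attempt (the second recovery) is still allowed by the corrected check but rejected by the earlier one.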

maxdml marked this pull request as ready for review January 29, 2025 23:12

qianl15 requested a review from chuck-dbos January 30, 2025 19:46

chuck-dbos reviewed Jan 30, 2025

maxdml marked this pull request as draft February 4, 2025 02:05

maxdml force-pushed the fix-queue-recovery branch from 77b11dc to 1229320 Compare February 5, 2025 18:51

maxdml added 15 commits February 6, 2025 13:58

re-enqueue pending workflows during recovery

c6a05ab

add debug logs and handle serialization failures

5354952

also set wf status to ENQUEUED when re-enqueuing

fa30da8

add a test

8557d81

revert some formatting stuff

feaab7e

add back missing new property

1858d90

remove unused argument

bd80ebc

add logging

d288c8e

getPendingWorkflow also returns queueName

27352c2

handle exceptions to rollback and close the client

938bb2e

better test

211a518

nits + lint

0e32e27

simplify

4e66a83

fix merge conflict

0b65884

fix merge conflict, prettier

0f21f36

maxdml force-pushed the fix-queue-recovery branch from 0dba40b to 0f21f36 Compare February 6, 2025 22:07

maxdml added 4 commits February 6, 2025 14:10

rename function

51f7d42

simplify + do not throw

0896bf6

simplify

282df71

fix max recovery check

d8d1f83

maxdml added 3 commits February 6, 2025 16:52

clear

1ae95c3

fix DLQ tests

0d54539

fix recovery test and improve new test

2ad16b1

maxdml marked this pull request as ready for review February 7, 2025 02:27
