
Retrying a task that has been aborted #1131

Closed
majorov-daniil opened this issue Jul 29, 2024 · 16 comments · Fixed by #1133
Labels
Issue appropriate for: Occasional contributors 😉 This issue will be best tackled by people with a minimum of experience on the project

Comments

@majorov-daniil

Hi guys!

I've been doing some manual QA for a project I'm working on and found some issues with the newly added abort logic for task cancellation. By the way, this is a neat feature I've been waiting for; glad it has been added!

Here's the general context: I have a single task type in procrastinate, with a retry strategy on it to handle some expected exceptions. I also call context.should_abort_async within the task to handle cancellation. The issue appears with the following steps:

  • a worker starts processing a task, marking its state as doing;
  • in the middle of processing, the task is aborted from a different process, which changes its state to aborting;
  • after that, an expected error happens in the task itself (before the should_abort_async check), causing the worker to retry the task because of the retry strategy;
  • during the retry, the procrastinate_retry_job SQL function is called, which raises a Job was not found or not in "doing" status error.

This shuts down the worker process and leaves the task in the procrastinate_jobs table with an aborting status. If I simply restart the worker, it also won't process newly added tasks, seemingly because of the hanging aborting task.
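
For reference, here is a minimal sketch of the setup described above (the names, payload, and retry policy are illustrative stand-ins, not my actual project code):

import procrastinate
from procrastinate import exceptions

app = procrastinate.App(connector=procrastinate.PsycopgConnector())

# Illustrative policy: retry an expected error type a few times.
@app.task(
    pass_context=True,
    retry=procrastinate.RetryStrategy(max_attempts=5, retry_exceptions={ConnectionError}),
)
async def process_item(context, *, item_id: int):
    fetch_and_process(item_id)  # an expected error raised here takes the retry path
    # Cooperative cancellation check; if the error above fires first,
    # the worker never reaches this point.
    if await context.should_abort_async():
        raise exceptions.JobAborted

def fetch_and_process(item_id: int) -> None:
    """Stand-in for the real work; may raise an expected, retryable error."""
    ...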

I understand that this might be a niche scenario, but I wonder whether this interaction between the retry logic and the task abortion process has been considered. A quick and simple fix would be to account for the aborting status in the job retry logic, but there is probably more to it.

Thanks for your attention!

@ewjoachim
Member

I understand that this might be a niche scenario

No, it's a fully fledged bug you've found, and it deserves to be fixed as much as any other bug.

@ewjoachim ewjoachim added Issue type: Bug 🐞 Issue appropriate for: Occasional contributors 😉 This issue will be best tackled by people with a minimum of experience on the project labels Jul 29, 2024
@ewjoachim
Member

ewjoachim commented Jul 29, 2024

So, from my understanding, a minimal reproduction would be:

import asyncio

import procrastinate

app = procrastinate.App(connector=procrastinate.PsycopgConnector())

@app.task(pass_context=True, retry=True)
async def mytask(context):
    # Request abortion of the currently running job; abort=True is what
    # moves a job from "doing" to "aborting".
    await app.job_manager.cancel_job_by_id_async(context.job.id, abort=True)

    0 / 0  # deliberate error, so the retry machinery kicks in while "aborting"

async def main():
    async with app.open_async():
        await mytask.defer_async()
        await app.run_worker_async()

asyncio.run(main())

(note: haven't tested this)

There are two things we should do:

  • This case should be expected, and we should react appropriately
  • It's probably not expected that the Job was not found or not in "doing" status error crashes the worker
  • (add the above example as a test)

@medihack
Member

I will provide a fix as soon as possible.

  • This case should be expected, and we should react appropriately

It should be enough to also allow the aborting status in the procrastinate_retry_job function. This will require another migration. Is there anything else you can think of?

  • It's probably not expected that the Job was not found or not in "doing" status error crashes the worker
  • (add the above example as a test)

Why do you think so? This goes in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then isn't a worker crash okay? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected).

@medihack
Member

medihack commented Jul 29, 2024

When I think about it ... maybe the job should not be retried then, but aborted nevertheless (the abort request came earlier than the error). That way, we wouldn't need an additional migration. Both ways seem acceptable to me. @ewjoachim What do you think?

And when I think about it even more 😉 ... the migration is the better solution. The user may want to handle the abort themselves, only at well-defined points in the task. I would respect that by just doing the retry.

@ewjoachim
Member

Why do you think so? This goes in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then isn't a worker crash okay? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected).

I think it could go either way, but it's possible that, without any bug in procrastinate, a user could find themselves in a situation where the job can't be found anymore (or isn't in the expected state). While I would expect this to be noisy, crashing the worker over it would itself be the bug, rather than the normal consequence of what happened.

Let's say the user was a bit too aggressive in cleaning up stalled jobs that were actually still running, or deleted a job from the DB directly. There may be other examples.
Even if those are not good practices, I wonder whether crashing the worker is 1/ the right and 2/ the expected thing to do.

I guess the question is: if this happens, is it more likely a user error or a bug?
That said, both here and in the other issue they seemed to be bugs, so maybe you're right...

@ewjoachim
Member

For the other question:
I'm a bit hesitant; I think the cancellation request bears some weight. The user has requested that the job not go any further. It feels a bit weird to reschedule a job that was requested to be cancelled: the job would need to wait in the queue, potentially for a long time, just to finally start and, hopefully, stop immediately.
Also, unless I'm mistaken, if the cancellation had happened while the job had already failed and was waiting for a retry, rather than while it was still running, it would have been cancelled immediately.

@medihack
Member

medihack commented Jul 29, 2024

I guess the question is: if this happens, is it more likely a user error or a bug? That said, both here and in the other issue they seemed to be bugs, so maybe you're right...

Maybe we could add an option to make the worker more fault-tolerant, possibly once the worker refactoring (#1114) is finished.
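
In the meantime, users could also wrap the worker themselves; a purely hypothetical sketch of the idea (not a proposed API):

import asyncio
import logging

import procrastinate

async def run_worker_forever(app: procrastinate.App) -> None:
    # Hypothetical fault-tolerant loop: log unexpected worker crashes
    # instead of letting them take the process down, then restart.
    while True:
        try:
            await app.run_worker_async()
        except asyncio.CancelledError:
            raise  # let deliberate shutdown pass through
        except Exception:
            logging.exception("Worker crashed; restarting")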

For the other question: I'm a bit hesitant; I think the cancellation request bears some weight. The user has requested that the job not go any further. It feels a bit weird to reschedule a job that was requested to be cancelled: the job would need to wait in the queue, potentially for a long time, just to finally start and, hopefully, stop immediately. Also, unless I'm mistaken, if the cancellation had happened while the job had already failed and was waiting for a retry, rather than while it was still running, it would have been cancelled immediately.

I am torn about which is the better choice, but I am also okay with aborting such a task without retrying it. I have also noticed another problem with my fix: the task won't be aborted at all after a retry; the retry will just reset it to todo. I will create a new PR with an alternative implementation that aborts the job.

Another question then is how to handle a job that raises, has an aborting status, but is not meant to be retried. Should it also end up with an aborted status, or only the failed jobs that would have been retried?

Another alternative would be to set a job that raises and has an aborting status to failed, regardless of whether it should be retried. We could add this to the documentation. That looks the most consistent to me.
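
To make that concrete, here is a tiny sketch of the proposed rule (the function and its arguments are hypothetical; only the status names are procrastinate's):

def status_after_error(aborting: bool, retry_scheduled: bool) -> str:
    # Hypothetical illustration of the proposal above, not procrastinate code:
    # a job that raises while in "aborting" always ends up "failed",
    # regardless of its retry strategy.
    if aborting:
        return "failed"
    return "todo" if retry_scheduled else "failed"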

@majorov-daniil
Author

From the user side of things, it feels right to set the job status to failed:

  • An error occurred during abortion, so the status reflects that;
  • The retry logic shouldn't apply to errors raised during abortion (though I guess this is debatable);
  • Ultimately, if a job has been marked for abortion, I want its processing to finish. By that logic, if an error occurs and the job ends up failed rather than aborted, that feels expected.

@ewjoachim
Member

(thanks a lot for sharing your voice)

@majorov-daniil
Author

(thanks a lot for sharing your voice)

Thanks for such a quick reaction to the issue, and in general for making this project; it's good to have a PostgreSQL-based task queue alternative 😊

@onlyann
Contributor

onlyann commented Jul 30, 2024

Is it too late to reconsider the "aborting" status?

I wonder if things wouldn't be simpler without such a status.
Yes, we would need to store somewhere that a task has been requested to abort, but is a status the appropriate way to do that?

Also, it might be a good idea to have a diagram of the state transitions.

See https://github.com/timgit/pg-boss/blob/master/docs/readme.md#job-states for reference.

@medihack
Member

I'm sorry, but I think that ship has already sailed. Getting rid of that status now would be a breaking change. I also still think it wasn't a wrong decision to have it as a status; I would think differently if it masked another state, but it doesn't. But yes, a state diagram would be nice. Would you create a new issue so that we don't forget?

@onlyann
Contributor

onlyann commented Jul 30, 2024

Yes, it would be a breaking change and require a new major version.

It is masking the doing state.
State transitions from aborting are the same as those from doing.

A job marked as aborting is not actually aborting until the worker reacts to the change.
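
To illustrate the point, a rough sketch of the transitions as I understand them from this thread (unofficial and deliberately simplified, not an authoritative diagram):

# Unofficial, simplified sketch of procrastinate's job state transitions
# as discussed here.
TRANSITIONS = {
    "todo": {"doing", "cancelled"},
    "doing": {"succeeded", "failed", "todo", "aborting"},  # "todo" via retry
    # "aborting" exits mostly the same way "doing" does (plus "aborted"),
    # which is why it can be seen as masking "doing".
    "aborting": {"succeeded", "failed", "aborted"},
}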

@medihack
Member

medihack commented Jul 30, 2024

Okay, after some more thought, I think moving the aborting flag to a separate database field could make sense. Also, since the user does not necessarily have to honor the abort request immediately, the flag would then be preserved across a possible subsequent retry. As for the breaking change, one could also argue that the job states are internal API.
The disadvantage is that such a change would require a fairly large migration.
@ewjoachim What do you think? Is this a breaking change or not?
@onlyann Also, we should move this to a new issue (I will close this one with a patch that fixes the reported error).

@ewjoachim
Member

I think it's worth breaking, but let's discuss it more in a dedicated issue.

@ewjoachim ewjoachim mentioned this issue Jul 30, 2024
@onlyann
Contributor

onlyann commented Jul 30, 2024

@medihack thank you for giving it consideration. This is much appreciated.
+1 on making it a separate issue.
