
Retrying a task that has been aborted #1131

Closed
majorov-daniil opened this issue Jul 29, 2024 · 16 comments · Fixed by #1133
Labels
Issue appropriate for: Occasional contributors 😉 This issue will be best tackled by people with a minimum of experience on the project

Comments

@majorov-daniil

Hi guys!

I've been doing some manual QA for a project I'm working on and found some issues with the newly added abort logic for task cancellation. By the way, this is a neat feature I've been waiting for; glad it has been added!

Here's the general context: I have a single task type in procrastinate, with a retry strategy on it to handle some expected exceptions. I also call context.should_abort_async within the task to handle cancellation. The issue appears with the following steps:

  • a worker starts processing a task, marking its state as doing;
  • in the middle of processing, the task is aborted from a different process, which changes its state to aborting;
  • after that, an expected error happens in the task itself (before the should_abort_async check), causing the worker to retry the task because of the retry strategy;
  • during the retry, the procrastinate_retry_job SQL function is called, which raises a Job was not found or not in "doing" status error.

This shuts down the worker process and leaves the task in the procrastinate_jobs table with an aborting status. If I simply restart the worker, it also won't process newly added tasks, seemingly because of the hanging aborting task.
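
For reference, here is a minimal sketch of the setup described above (the names, payload, and retry policy are illustrative stand-ins, not my actual project code):

import procrastinate
from procrastinate import exceptions

app = procrastinate.App(connector=procrastinate.PsycopgConnector())

# Illustrative policy: retry an expected error type a few times.
@app.task(
    pass_context=True,
    retry=procrastinate.RetryStrategy(max_attempts=5, retry_exceptions={ConnectionError}),
)
async def process_item(context, *, item_id: int):
    fetch_and_process(item_id)  # an expected error raised here takes the retry path
    # Cooperative cancellation check; if the error above fires first,
    # the worker never reaches this point.
    if await context.should_abort_async():
        raise exceptions.JobAborted

def fetch_and_process(item_id: int) -> None:
    """Stand-in for the real work; may raise an expected, retryable error."""
    ...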

I understand that this might be a niche scenario, but I wonder whether this interaction between the retry logic and the task abortion process has been considered. A quick and simple fix would be to account for the aborting status in the job retry logic, but there is probably more to it.

Thanks for your attention!

@ewjoachim
Member

I understand that this might be a niche scenario

No, it's a fully fledged bug you've found, and it deserves to be fixed as much as any other bug.

@ewjoachim ewjoachim added Issue type: Bug 🐞 Issue appropriate for: Occasional contributors 😉 This issue will be best tackled by people with a minimum of experience on the project labels Jul 29, 2024
@ewjoachim
Member

ewjoachim commented Jul 29, 2024

So, from my understanding, a minimal reproduction would be:

import asyncio

import procrastinate

app = procrastinate.App(connector=procrastinate.PsycopgConnector())

@app.task(pass_context=True, retry=True)
async def mytask(context):
    # Request abortion of the currently running job; abort=True is what
    # moves a job from "doing" to "aborting".
    await app.job_manager.cancel_job_by_id_async(context.job.id, abort=True)

    0 / 0  # deliberate error, so the retry machinery kicks in while "aborting"

async def main():
    async with app.open_async():
        await mytask.defer_async()
        await app.run_worker_async()

asyncio.run(main())

(note: haven't tested this)

There are two things we should do:

  • This case should be expected, and we should react appropriately
  • It's probably not expected that the Job was not found or not in "doing" status error crashes the worker
  • (add the above example as a test)

@medihack
Member

I will provide a fix as soon as possible.

  • This case should be expected, and we should react appropriately

It should be enough to also allow the aborting status in the procrastinate_retry_job function. This will require another migration. Is there anything else you can think of?

  • It's probably not expected that the Job was not found or not in "doing" status error crashes the worker
  • (add the above example as a test)

Why do you think so? This goes in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then isn't a worker crash okay? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected).

@medihack
Member

medihack commented Jul 29, 2024

When I think about it ... maybe the job should not be retried then, but aborted nevertheless (the abort request came earlier than the error). That way, we wouldn't need an additional migration. Both ways seem acceptable to me. @ewjoachim What do you think?

And when I think about it even more 😉 ... the migration is the better solution. The user may want to handle the abort themselves, only at well-defined points in the task. I would respect that by just doing the retry.

@ewjoachim
Member

Why do you think so? This goes in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then isn't a worker crash okay? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected).

I think it could go either way, but it's possible that, without any bug in procrastinate, a user could find themselves in a situation where the job can't be found anymore (or isn't in the expected state). While I would expect this to be noisy, crashing the worker over it would itself be the bug, rather than the normal consequence of what happened.

Let's say the user was a bit too aggressive in cleaning up stalled jobs that were actually still running, or deleted a job from the DB directly. There may be other examples.
Even if those are not good practices, I wonder whether crashing the worker is 1/ the right and 2/ the expected thing to do.

I guess the question is: if this happens, is it more likely a user error or a bug?
That said, both here and in the other issue they seemed to be bugs, so maybe you're right...

@ewjoachim
Member

For the other question:
I'm a bit hesitant; I think the cancellation request bears some weight. The user has requested that the job not go any further. It feels a bit weird to reschedule a job that was requested to be cancelled: the job would need to wait in the queue, potentially for a long time, just to finally start and, hopefully, stop immediately.
Also, unless I'm mistaken, if the cancellation had happened while the job had already failed and was waiting for a retry, rather than while it was still running, it would have been cancelled immediately.

@medihack
Member

medihack commented Jul 29, 2024

I guess the question is: if this happens, is it more likely a user error or a bug? That said, both here and in the other issue they seemed to be bugs, so maybe you're right...

Maybe we could add an option to make the worker more fault-tolerant, possibly once the worker refactoring (#1114) is finished.
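
In the meantime, users could also wrap the worker themselves; a purely hypothetical sketch of the idea (not a proposed API):

import asyncio
import logging

import procrastinate

async def run_worker_forever(app: procrastinate.App) -> None:
    # Hypothetical fault-tolerant loop: log unexpected worker crashes
    # instead of letting them take the process down, then restart.
    while True:
        try:
            await app.run_worker_async()
        except asyncio.CancelledError:
            raise  # let deliberate shutdown pass through
        except Exception:
            logging.exception("Worker crashed; restarting")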

For the other question: I'm a bit hesitant; I think the cancellation request bears some weight. The user has requested that the job not go any further. It feels a bit weird to reschedule a job that was requested to be cancelled: the job would need to wait in the queue, potentially for a long time, just to finally start and, hopefully, stop immediately. Also, unless I'm mistaken, if the cancellation had happened while the job had already failed and was waiting for a retry, rather than while it was still running, it would have been cancelled immediately.

I am torn about which is the better choice, but I am also okay with aborting such a task without retrying it. I have also noticed another problem with my fix: the task won't be aborted at all after a retry; the retry will just reset it to todo. I will create a new PR with an alternative implementation that aborts the job.

Another question then is how to handle a job that raises, has an aborting status, but is not meant to be retried. Should it also end up with an aborted status, or only the failed jobs that would have been retried?

Another alternative would be to set a job that raises and has an aborting status to failed, regardless of whether it should be retried. We could add this to the documentation. That looks the most consistent to me.
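
To make that concrete, here is a tiny sketch of the proposed rule (the function and its arguments are hypothetical; only the status names are procrastinate's):

def status_after_error(aborting: bool, retry_scheduled: bool) -> str:
    # Hypothetical illustration of the proposal above, not procrastinate code:
    # a job that raises while in "aborting" always ends up "failed",
    # regardless of its retry strategy.
    if aborting:
        return "failed"
    return "todo" if retry_scheduled else "failed"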

@majorov-daniil
Author

From the user side of things, it feels right to set the job status to failed:

  • An error occurred during abortion, so the status reflects that;
  • The retry logic shouldn't apply to errors raised during abortion (though I guess this is debatable);
  • Ultimately, if a job has been marked for abortion, I want its processing to finish. By that logic, if an error occurs and the job ends up failed rather than aborted, that feels expected.

@ewjoachim
Member

(thanks a lot for sharing your voice)

@majorov-daniil
Author

(thanks a lot for sharing your voice)

Thanks for such a quick reaction to the issue, and in general for making this project; it's good to have a PostgreSQL-based task queue alternative 😊

@onlyann
Contributor

onlyann commented Jul 30, 2024

Is it too late to reconsider the "aborting" status?

I wonder if things wouldn't be simpler without such a status.
Yes, we would need to store somewhere that a task has been requested to abort, but is a status the appropriate way to do that?

Also, it might be a good idea to have a diagram of the state transitions.

See https://github.com/timgit/pg-boss/blob/master/docs/readme.md#job-states for reference.

@medihack
Member

I'm sorry, but I think that ship has already sailed. Getting rid of that status now would be a breaking change. I also still think it wasn't a wrong decision to have it as a status; I would think differently if it masked another state, but it doesn't. But yes, a state diagram would be nice. Would you create a new issue so that we don't forget?

@onlyann
Contributor

onlyann commented Jul 30, 2024

Yes, it would be a breaking change and require a new major version.

It is masking the doing state.
State transitions from aborting are the same as those from doing.

A job marked as aborting is not actually aborting until the worker reacts to the change.
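
To illustrate the point, a rough sketch of the transitions as I understand them from this thread (unofficial and deliberately simplified, not an authoritative diagram):

# Unofficial, simplified sketch of procrastinate's job state transitions
# as discussed here.
TRANSITIONS = {
    "todo": {"doing", "cancelled"},
    "doing": {"succeeded", "failed", "todo", "aborting"},  # "todo" via retry
    # "aborting" exits mostly the same way "doing" does (plus "aborted"),
    # which is why it can be seen as masking "doing".
    "aborting": {"succeeded", "failed", "aborted"},
}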

@medihack
Member

medihack commented Jul 30, 2024

Okay, after some more thought, I think moving the aborting flag to a separate database field could make sense. Also, since the user does not necessarily have to honor the abort request immediately, the flag would then be preserved across a possible subsequent retry. As for the breaking change, one could also argue that the job states are internal API.
The disadvantage is that such a change would require a fairly large migration.
@ewjoachim What do you think? Is this a breaking change or not?
@onlyann Also, we should move this to a new issue (I will close this one with a patch that fixes the reported error).

@ewjoachim
Member

I think it's worth breaking, but let's discuss it more in a dedicated issue.

@ewjoachim ewjoachim mentioned this issue Jul 30, 2024
@onlyann
Contributor

onlyann commented Jul 30, 2024

@medihack thank you for giving it consideration. This is much appreciated.
+1 on making it a separate issue.
