-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retrying a task, that has been aborted #1131
Comments
No it's a fully fledged bug you've found, and it deserves being fixed as much as any other bug. |
So from my understanding, a minimal reproduction code would be: import procrastinate
app = procrastinate.ProcrastinateApp(connector=procrastinate.PsycopgConnector())
@app.task(pass_context=True, retry=True)
async def mytask(context):
await await app.job_manager.cancel_job_by_id_async(context.job.id)
0/0
async def main():
await mytask.defer_async()
await app.run_worker_async()
asyncio.run(main()) (note: haven't tested this) There are 2 things we should do:
|
I will provide a fix as soon as possible.
It should be enough to also allow
Why do you think so? This is in the same direction as #1108, where a user reported that "small errors" crashed the worker. If we say it's a bug, then a worker crash is ok, or? Of course, we could just log the error to the console and keep the worker running (but then such bugs could remain undetected). |
When I think about it ... the job should maybe not be retried then, but nevertheless aborted (the abortion request came earlier than the error). That way we wouldn't need an additional migration. Both ways seem acceptable to me. @ewjoachim What do you think? And when I even think more about it 😉 ... the migration is a better solution. The user may only want to handle the abortion in well-defined states by himself. So, I would respect this by just doing the retry then. |
I think it could go either way, but I think it might be possible that without a bug from procrastinate, the user would find themselves in a situation where the job couldn't be found anymore (or in the correct state), and while I would expect this to be noisy, crashing the worker for that would, itself, be the bug more than the normal consequence of what happened. Let's say the user was a bit too harsh in cleaning up stalled jobs, that were actually still running. Let's say the user deleted a job from the DB. There may be other examples. I guess the question is: if this happens, is it more likely that it's a user error or a bug ? |
For the other question: |
Maybe we could add an option to make the worker more fault-tolerant. Eventually, after the worker refactoring #1114 was finished,
I am torn about which is the better choice, but I am also okay with aborting such a task without retrying it. I also have also noticed another problem with my fix. The task won't be aborted at all after a retry. The retry will just reset it to Another question then might be how to handle a job of a task that raises, has an aborting status, but is not intended to be retried. Should this also have an aborted status at the end or only the failed ones that should be retried? Another alternative would be to set a task that raises and has an aborting status to |
From the user side of things, it feels right to set the job status as
|
(thanks a lot for sharing your voice) |
Thanks for such a quick reaction to the issue. And just in general for making this project - it's good to have a PSQL-based task queue alternative 😊 |
Is it too late to reconsider the "aborting" status? I wonder if things wouldn't be simpler by not having such a status. Also, maybe a good idea to have a state diagram of the state transitions. See https://github.com/timgit/pg-boss/blob/master/docs/readme.md#job-states for reference. |
I am sorry, but I think that ship has already sailed. If we were to get rid of that status now, then it would be a breaking change. I also still think it's not a wrong decision to have it as a status. I would think differently if it masked another state, but it doesn't. But yes, a state diagram would be nice. Would you create a new issue so that we remember that? |
Yes, it would be a breaking change and require a new major version. It is masking the A job marked as |
Okay, after some more thought, I think moving the aborting state to a separate database field could make sense. Also, since the user does not necessarily have to follow this request, it would still be preserved in a subsequent possible retry. As for the breaking change, one could also argue that the job states are internal API. |
I think it's worth breaking, but let's discuss it more in a dedicated issue. |
@medihack thank you for giving it consideration. This is much appreciated. |
Hi guys!
I've been doing some manual QA for a project I'm working on and found some issues with the newly added
abort
logic for task cancellation. By the way, a neat feature I was waiting for, glad it has been added!Here's the general context - I have a single type of task for the procrastinate, and I use a retry strategy over it to handle some expected exceptions. I also use a
context.should_abort_async
within the task to handle its cancellation. The issue appears with the following steps:doing
;aborting
;should_abort_async
check), causing the worker to retry this task due to a retry strategy being used;procrastinate_retry_job
SQL function will be used, which will raise aJob was not found or not in "doing" status
error;This will shut down the worker process and will leave a task in the
procrastinate_jobs
table with anaborting
status. If I just restart it, it won't also process newly added tasks for what it seems due to hanging anaborting
task.I understand that this might be a niche scenario, but I wonder if this behavior with the retry logic has been considered for a task abortion process. A quick and simple fix for this would be to mind the
aborting
status during job retry logic, but there is probably more to it.Thanks for your attention!
The text was updated successfully, but these errors were encountered: