"inprogress" jobs aren't actually executing after a worker terminates abnormally #542
Comments
Hey @Ed1lan, sorry about that! Those jobs should be released automatically (marked as failed) the next time the supervisor starts. Is this happening in development only? Was your computer going to sleep or something when all those workers died?
Hey @rosa, thank you for your fast reply! This happened in the production environment, and at that moment a backup snapshot of the server was in progress. Maybe the backup shut down the workers? But after that, the jobs didn't get marked as failed as they were supposed to be.
Hmm no, that shouldn't be related 🤔 They could have crashed or something, but that shouldn't happen 🤔 I just noticed, in your first screenshot above, that the code that should have marked your in-progress jobs as failed did run, these are the lines that say:
and a list of job IDs. You should see similar lines for the jobs in your other screenshots, the ones in progress for which the process doesn't exist. That didn't happen?
No, it didn't. Those jobs stayed in progress and claimed, as shown in the other screenshots.
And they didn't get released when you restarted the supervisor? Releasing in-progress jobs happens automatically at start, and from that log line above, I know it's happening correctly in your case.
It usually does release the jobs, but it didn't this time: those jobs stayed claimed for two days until I released them manually. Restarting solid_queue or the server itself didn't help.
Hi! I am working with SolidQueue in a project and recently ran into a problem: the mission_control-jobs gem shows some jobs as running, but when I check the system processes, they do not exist.
Investigating further, I discovered a few things. I don't know why, but the workers shut down and restarted automatically, then tried to reclaim the jobs that were running at that moment. This failed, showing the following logs:
I have read the documentation and seen that it covers the case in which “someone pulls the cable”. I checked that, as described there, the jobs that were running at that time are in the SolidQueue::ClaimedExecution table, and the current status of each of those jobs is “inprogress”. However, the associated SolidQueue::Process does not exist.
I ran a little code to show the status of the jobs claimed by processes that do not exist:
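A check of that kind can be sketched in plain Ruby. This is an in-memory model only, not the snippet that was actually run and not Solid Queue's API: `ProcessRow`, `ClaimRow` and `orphaned_claims` are hypothetical names standing in for rows of the SolidQueue::Process and SolidQueue::ClaimedExecution tables and for the lookup between them.

```ruby
# Hypothetical stand-ins for table rows; these are NOT Solid Queue classes.
ProcessRow = Struct.new(:id, :name)
ClaimRow   = Struct.new(:job_id, :process_id)

# Returns the claims whose process_id matches no registered process.
def orphaned_claims(claims, processes)
  live_ids = processes.map(&:id)
  claims.reject { |claim| live_ids.include?(claim.process_id) }
end

processes = [ProcessRow.new(2, "worker-b")]
claims = [
  ClaimRow.new(101, 1),  # process 1 is no longer registered
  ClaimRow.new(102, 2),  # process 2 is still registered
  ClaimRow.new(103, nil) # no process recorded at all
]

orphaned_claims(claims, processes).each do |claim|
  puts "job #{claim.job_id} is claimed by missing process #{claim.process_id.inspect}"
end
```

In a real application the same question would be asked of the database, so the two arrays here would be replaced by queries against the corresponding tables.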
To see how it behaved, I modified the “finished_at” field of one of the jobs. This caused it to be marked internally as :finished, but it still appears in the list of inprogress jobs in mission_control and is still in the SolidQueue::ClaimedExecution table.
Additionally, if I inspect that job specifically, it now shows up as :finished.
How can I release these jobs correctly? How can I prevent this from happening? Am I missing something?
For the moment, I plan to manually mark those jobs as finished by setting the finished_at date and removing them from the SolidQueue::ClaimedExecution table, but I would like to know how I could prevent this from happening, or whether this is a bug.
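As a rough sketch of that manual plan, the two steps (set finished_at, then drop the claim) can be modelled in plain Ruby. Everything here is an assumption for illustration: `Job`, `Claim` and `clean_up_orphans!` are hypothetical names, the data lives in arrays rather than in the Solid Queue tables, and this follows the plan of marking jobs as finished rather than marking them as failed.

```ruby
# Hypothetical in-memory rows; these are NOT Solid Queue models.
Job   = Struct.new(:id, :finished_at)
Claim = Struct.new(:job_id, :process_id)

# Marks each job held by a missing process as finished and drops its claim.
# Mutates `jobs` and `claims`; returns the ids of the jobs that were cleaned up.
def clean_up_orphans!(jobs, claims, live_process_ids, now: Time.now)
  orphaned, kept = claims.partition { |c| !live_process_ids.include?(c.process_id) }

  orphaned.each do |claim|
    job = jobs.find { |j| j.id == claim.job_id }
    # Step 1: set finished_at, keeping any value that is already there.
    job.finished_at ||= now if job
  end

  # Step 2: remove the orphaned claims, keeping the ones held by live processes.
  claims.replace(kept)
  orphaned.map(&:job_id)
end

jobs   = [Job.new(101, nil), Job.new(102, nil)]
claims = [Claim.new(101, 1), Claim.new(102, 2)]

cleaned = clean_up_orphans!(jobs, claims, [2]) # only process 2 is alive
puts "cleaned jobs: #{cleaned.inspect}"
puts "remaining claims: #{claims.map(&:job_id).inspect}"
```

Both steps are needed: setting finished_at alone changes the job's reported status but leaves the claim row in place.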
PS: Sorry if my English is poor.