-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not fail job on error in GlobalJobPreLoad #15
Do not fail job on error in GlobalJobPreLoad #15
Conversation
… server connection timeout. Fix ynput#14
NOTE: This should be tested on linux mashines too. |
@fabiaserra - any chance you could roll this out on Linux machines and test? |
Yeah, this is the exact same change I did on my fork when we started discussing it but I haven't been able yet to do a test to confirm it's working, I will try soon |
BTW bump version of deadline plugin pls |
By the way - there may be cases where failing the job still makes sense. E.g. the: if ayon_publish_job == "1" and ayon_render_job == "1":
raise RuntimeError(
"Misconfiguration. Job couldn't be both render and publish."
) Or if the if not exe_list:
raise RuntimeError(
"Path to AYON executable not configured."
"Please set it in Ayon Deadline Plugin."
) Will always fail - since it's set on the job or deadlineplugin and will be the same result for all machines. So it may make sense to fail the job then? Maybe this: if not all(add_kwargs.values()):
raise RuntimeError((
"Missing required env vars: AYON_PROJECT_NAME,"
" AYON_FOLDER_PATH, AYON_TASK_NAME, AYON_APP_NAME"
)) May also make sense to always fail since it should behave quite similar across the workers/machines? There are also cases where it may make sense to directly mark the Worker as bad for the job. For example this: exe_list = get_ayon_executable()
exe = FileUtils.SearchFileList(exe_list)
if not exe:
raise RuntimeError((
"Ayon executable was not found in the semicolon "
"separated list \"{}\"."
"The path to the render executable can be configured"
" from the Plugin Configuration in the Deadline Monitor."
).format(exe_list)) This may fail per worker depending on whether it has the exe to be found at any of the paths. There is a high likelihood that that machine may not find it the next run either? As such - we may do a follow-up PR or issue to be more specific with our raised errors. For example raising a dedicated error for when we should fail the job. class AYONJobConfigurationError(RuntimeError):
"""An error of which we know when raised that the full job should fail
and retrying by other machines will be worthless.
This may be the case if e.g. not the fully required env vars are configured
to inject the AYON environment.
""" Or a dedicated error when we should mark the Worker as bad: class AYONWorkerBadForJobError(RuntimeError):
"""When raised, the worker will be marked bad for the current job.
This should be raised when we know that the machine will most likely
also fail on subsequent tries.
""" However - a server timeout should allow the job to just error and let it requeue with the same worker.. so it can try again. |
I can confirm this works in Linux |
Works for me on Windows - @kalisp @iLLiCiTiT ready for merge when you've decided next steps. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I am wondering if we should handle exceptions more with more granularity, because for example fo code reading json might fail because either the file doesn't exist on that particular machine (but if that wasn't cought by process handling above, we have a bug somewhere). This is putting responsibility on Deadline (which is fine), I am merely pointing out to that too broad Exception.
A client also ended up reporting this worked for them and actually helped them - so, on to merging! :) |
…ource_maya_cache_interface_from_repo into develop Reviewed-on: http://bepic-docker01:3090/ayon/ayon-deadline/pulls/15
Changelog Description
Do not force the full job to fail - it may be just this machine has a server connection timeout.
This allows a single machine to fail - yet others to continue in the queue.
This also allows a single machine to try again - since it'll use Deadline's default "mark bad after X errors" setting to decide when a worker is 'bad'.
Additional info
Fix #14
Testing notes: