Fail GlobalJobPreLoad harder when we know for sure repeating the task won't work #20

Open
2 tasks done
BigRoy opened this issue Jul 29, 2024 · 0 comments
BigRoy commented Jul 29, 2024

Is there an existing issue for this?

  • I have searched the existing issues.

Please describe the feature you have in mind and explain what the current shortcomings are?

Since #15 the full job won't fail if the GlobalJobPreLoad has an error during the AYON environment injection.

However, there may well be cases where we know that restarting won't help, as reported here

For example:

    if ayon_publish_job == "1" and ayon_render_job == "1":
        raise RuntimeError(
            "Misconfiguration. Job couldn't be both render and publish."
        )

Or if the AyonExecutable is not configured at all.

    if not exe_list:
        raise RuntimeError(
            "Path to AYON executable not configured. "
            "Please set it in Ayon Deadline Plugin."
        )

Both will always fail, since the value is set on the job or the Deadline plugin and gives the same result on every machine. So it may make sense to fail the job then?

Maybe this:

        if not all(add_kwargs.values()):
            raise RuntimeError((
                "Missing required env vars: AYON_PROJECT_NAME,"
                " AYON_FOLDER_PATH, AYON_TASK_NAME, AYON_APP_NAME"
            ))

It may also make sense to always fail the job here, since this check should behave much the same across all workers/machines?
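A minimal sketch of how the three job-level checks above could raise one dedicated error type instead of a plain `RuntimeError`. The class name follows the proposal later in this issue; the helper `validate_job_config`, its arguments, and the `AYON_PUBLISH_JOB` / `AYON_RENDER_JOB` keys are assumptions made for illustration, not existing ayon-deadline code.

```python
class AYONJobConfigurationError(RuntimeError):
    """Job-level misconfiguration: retrying on another worker cannot help."""


REQUIRED_KEYS = (
    "AYON_PROJECT_NAME",
    "AYON_FOLDER_PATH",
    "AYON_TASK_NAME",
    "AYON_APP_NAME",
)


def validate_job_config(env, exe_list):
    """Hypothetical helper: raise if the job would fail on every worker.

    `env` is a plain dict of job values; the publish/render key names
    used here are assumed, not taken from the plugin.
    """
    if env.get("AYON_PUBLISH_JOB") == "1" and env.get("AYON_RENDER_JOB") == "1":
        raise AYONJobConfigurationError(
            "Misconfiguration. Job couldn't be both render and publish."
        )
    if not exe_list:
        raise AYONJobConfigurationError(
            "Path to AYON executable not configured. "
            "Please set it in Ayon Deadline Plugin."
        )
    # Report only the keys that are actually missing or empty.
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise AYONJobConfigurationError(
            "Missing required env vars: " + ", ".join(missing)
        )
```

Listing only the missing keys is a small change from the original message, but it would make the failure report easier to act on.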


There are also cases where it may make sense to directly mark the Worker as bad for the job.

For example this:

        exe_list = get_ayon_executable()
        exe = FileUtils.SearchFileList(exe_list)

        if not exe:
            raise RuntimeError((
                "Ayon executable was not found in the semicolon "
                "separated list \"{}\". "
                "The path to the render executable can be configured"
                " from the Plugin Configuration in the Deadline Monitor."
            ).format(exe_list))

This may fail per worker, depending on whether the executable exists at any of the configured paths on that machine.

There is a high likelihood that the same machine won't find it on the next run either.
So we could mark the worker "bad" for the job? Using RepositoryUtils.AddBadSlaveForJob...
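A rough sketch of the per-worker case, assuming the lookup raises a dedicated error that a caller could translate into marking the worker bad. Here `os.path.isfile` is only a stand-in for Deadline's `FileUtils.SearchFileList`, and `find_ayon_executable` is a hypothetical name.

```python
import os


class AYONWorkerBadForJobError(RuntimeError):
    """This worker is unlikely to succeed on a retry of the current job."""


def find_ayon_executable(exe_list, is_file=os.path.isfile):
    """Return the first existing path from a semicolon separated list.

    `is_file` is injectable so the lookup can be tested without touching
    the file system; it stands in for FileUtils.SearchFileList (assumption).
    """
    for path in exe_list.split(";"):
        path = path.strip()
        if path and is_file(path):
            return path
    # Nothing found on this machine: signal a worker-level failure.
    raise AYONWorkerBadForJobError(
        "Ayon executable was not found in the semicolon "
        "separated list \"{}\".".format(exe_list)
    )
```

The code catching `AYONWorkerBadForJobError` would be the place to call `RepositoryUtils.AddBadSlaveForJob`, so the lookup itself stays free of Deadline side effects.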

How would you imagine the implementation of the feature?

For example, by raising a dedicated error for when we should fail the job.

class AYONJobConfigurationError(RuntimeError):
    """Raised when we know the full job should fail, because retrying on
    other machines would be pointless.

    This may be the case if e.g. not all required env vars are configured
    to inject the AYON environment.
    """

Or a dedicated error when we should mark the Worker as bad:

class AYONWorkerBadForJobError(RuntimeError):
    """When raised, the worker will be marked bad for the current job.

    This should be raised when we know that the machine will most likely
    also fail on subsequent tries.
    """

However, a server timeout should just let the job error and requeue with the same worker, so it can try again.
So most errors caused by not being able to reach the server itself should not trigger such a hard failure.
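One way this could fit together, sketched with injected callbacks so it runs outside Deadline. The function `run_preload` and its parameters are hypothetical; in a real pre-load script `fail_job` and `mark_worker_bad` would presumably wrap `RepositoryUtils.FailJob` and `RepositoryUtils.AddBadSlaveForJob`, whose exact arguments should be checked against the Deadline scripting reference.

```python
class AYONJobConfigurationError(RuntimeError):
    """The full job should fail; no worker can succeed."""


class AYONWorkerBadForJobError(RuntimeError):
    """The current worker should be marked bad for this job."""


def run_preload(inject, fail_job, mark_worker_bad):
    """Run `inject` and escalate only the two dedicated error types.

    All three arguments are zero-argument callables (assumption).
    Any other exception, such as a server timeout, propagates untouched
    so the usual error handling can requeue the task.
    """
    try:
        inject()
    except AYONJobConfigurationError:
        fail_job()
        raise
    except AYONWorkerBadForJobError:
        mark_worker_bad()
        raise
```

Re-raising after the callback keeps the original error visible in the task report, while anything not explicitly classified falls back to the default retry behavior.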

Are there any labels you wish to add?

  • I have added the relevant labels to the enhancement request.

Describe alternatives you've considered:

Just leave it completely up to the Deadline settings for 'monitoring failures' instead of forcing a behavior onto it. At the same time, we do want to avoid many machines retrying many times when we know early on that all of them would fail regardless.

Additional context:

No response

@BigRoy BigRoy added the type: enhancement Improvement of existing functionality or minor addition label Jul 29, 2024
@dee-ynput dee-ynput removed the type: enhancement Improvement of existing functionality or minor addition label Nov 1, 2024
tweak-wtf pushed a commit to tweak-wtf/ayon-deadline that referenced this issue Dec 13, 2024
…pr_structure into develop

Reviewed-on: http://bepic-docker01:3090/ayon/ayon-deadline/pulls/20