chore: Add retry to pipeline templates constructors to add retrier to each pipeline step #179

ca-nguyen · 2021-11-05T02:59:26Z

Description

Fix build failures caused by SageMaker ThrottlingException when running the pipeline integration tests.

Fixes #(issue) - N/A

Why is the change necessary?

Recent build failures were caused by SageMaker ThrottlingException (Rate exceeded) during the pipeline integration tests.

Solution

Add an optional retry argument to the pipeline template constructors (InferencePipeline and TrainingPipeline) so that a retry strategy can be attached to each pipeline step. The same retrier is added to every step.

Caveat: This fix applies the retry strategy to all steps in the pipeline. The customer won't be able to customize the strategy for each step.
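
A minimal sketch of the proposed behavior, assuming the constructor simply loops over its steps and attaches the same retrier to each one. `FakeRetry`, `FakeStep`, and `build_pipeline_steps` are hypothetical stand-ins used for illustration only; they are not the SDK's classes and not the actual code in this PR:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class FakeRetry:
    """Hypothetical stand-in for a retrier (not the SDK's Retry class)."""
    error_equals: tuple
    interval_seconds: int = 1
    max_attempts: int = 3
    backoff_rate: float = 2.0


@dataclass
class FakeStep:
    """Hypothetical stand-in for a pipeline step that can hold retriers."""
    name: str
    retriers: List[FakeRetry] = field(default_factory=list)

    def add_retry(self, retry: FakeRetry) -> None:
        self.retriers.append(retry)


def build_pipeline_steps(step_names, retry: Optional[FakeRetry] = None):
    """Create the steps and, if a retrier is given, attach it to every step."""
    steps = [FakeStep(name) for name in step_names]
    if retry is not None:
        for step in steps:
            step.add_retry(retry)
    return steps


retry = FakeRetry(error_equals=("SageMaker.AmazonSageMakerException",))
steps = build_pipeline_steps(["training_step", "model_step"], retry=retry)
no_retry_steps = build_pipeline_steps(["training_step"])
```

Because `retry` defaults to `None`, existing callers that do not pass it would see no change in behavior under this sketch.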

Alternate solution 1:

We could let the client customize the retry strategy for each pipeline step by accepting a dict, in addition to accepting a Retry object.

Caveat: The retry strategy dict keys must correspond exactly to the step variable names. A validation step could be added to warn the customer of any unrecognized keys.

For example:

retry_strategy_per_step = {
   'training_step': <training_retry_strategy>,
   'model_step': <model_retry_strategy>,
   'endpoint_config_step': <endpoint_config_retry_strategy>,
   'deploy_step': <deploy_retry_strategy>
}

If a dict is received, only add retriers to steps with defined strategies in that dict.
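
A rough sketch of how that dispatch and validation might look, assuming the four step names from the example above. `KNOWN_STEPS` and `resolve_retry_strategies` are hypothetical names, and the retriers are treated as opaque objects (plain strings are used as placeholders in the usage lines):

```python
import warnings
from typing import Dict

# Step variable names taken from the example dict above.
KNOWN_STEPS = ("training_step", "model_step", "endpoint_config_step", "deploy_step")


def resolve_retry_strategies(retry) -> Dict[str, list]:
    """Map each known step name to the list of retriers to attach to it."""
    if retry is None:
        # No retry argument: no step gets a retrier.
        return {name: [] for name in KNOWN_STEPS}
    if isinstance(retry, dict):
        # Per-step strategies: warn about keys that match no step.
        unknown = sorted(set(retry) - set(KNOWN_STEPS))
        if unknown:
            warnings.warn(f"Ignoring retry strategies for unrecognized steps: {unknown}")
        return {name: [retry[name]] if name in retry else [] for name in KNOWN_STEPS}
    # A single retrier: apply it to every step.
    return {name: [retry] for name in KNOWN_STEPS}


# Placeholder strings stand in for real retrier objects.
per_step = resolve_retry_strategies({"training_step": "training_retry"})
same_for_all = resolve_retry_strategies("shared_retry")
```

Warning instead of raising keeps a misspelled key from breaking pipeline creation, at the cost of the typo going unnoticed if warnings are suppressed.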

Alternate solution 2:

Only add retries in the integration tests, by updating the pipeline workflow with the added retries.

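from stepfunctions.steps import Chain, Retry

# Once the pipeline is created (`pipeline` is an existing pipeline object), do something like:
sagemaker_retry_strategy = Retry(
    error_equals=["SageMaker.AmazonSageMakerException"],
    interval_seconds=5,
    max_attempts=5,
    backoff_rate=2
)

# Attach the same retrier to every step in the workflow definition
steps = pipeline.workflow.definition.branch.steps
for step in steps:
    step.add_retry(sagemaker_retry_strategy)

# Update the workflow with the modified steps
pipeline.workflow.update(Chain(steps))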

Caveat: If the fix is only applied to the integration tests, customers who want to add retry strategies to the pipeline steps will have to do this each time they create a pipeline.
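
For reference, the retry settings used in Alternate solution 2 imply the following wait times. In Step Functions, the interval is the wait before the first retry, each later wait is multiplied by the backoff rate, and max attempts caps the number of retries. The helper `retry_delays` is a hypothetical name for this arithmetic and is not part of the SDK:

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait before each retry attempt, in seconds."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


delays = retry_delays(interval_seconds=5, max_attempts=5, backoff_rate=2)
total_wait = sum(delays)

print(delays)      # [5, 10, 20, 40, 80]
print(total_wait)  # 155
```

So a step that keeps getting throttled would wait up to 155 seconds in total (not counting the time each attempt itself takes) before the failure is surfaced.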

Testing

Updated integ test and added unit test
Generated doc locally

Pull Request Checklist

Please check all boxes (including N/A items)

Testing

Unit tests added
Integration test added
Manual testing - why was it necessary? could it be automated? - N/A

Documentation

docs: All relevant docs updated
docstrings: All public APIs documented

Title and description

Change type: Title is prefixed with change type: and follows conventional commits
References: Indicate issues fixed via: Fixes #xxx - N/A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license.


wong-a · 2021-11-12T19:21:22Z

This is a feature, not a chore. Adding new functionality should be motivated from the customer's POV, not just to fix the tests. That being said, I see some value here. Ideally, we should've had retries added by default. But it's still possible to add retriers by updating the Chain directly, right? Do we have any open issues related to this?

It would be nice to add a preconfigured retry strategy like the one you defined in the tests. It's not uncommon for SDKs to have default and reusable retry strategies. Customers using the pipeline classes probably don't want to deal with many of the lower-level ASL constructs.

wong-a · 2021-11-12T19:16:16Z

src/stepfunctions/template/pipeline/inference.py

@@ -54,6 +54,7 @@ def __init__(self, preprocessor, estimator, inputs, s3_bucket, role, client=None
                * (list[`sagemaker.amazon.amazon_estimator.RecordSet`]) - A list of `sagemaker.amazon.amazon_estimator.RecordSet` objects, where each instance is a different channel of training data.
            s3_bucket (str): S3 bucket under which the output artifacts from the training job will be stored. The parent path used is built using the format: ``s3://{s3_bucket}/{pipeline_name}/models/{job_name}/``. In this format, `pipeline_name` refers to the keyword argument provided for TrainingPipeline. If a `pipeline_name` argument was not provided, one is auto-generated by the pipeline as `training-pipeline-<timestamp>`. Also, in the format, `job_name` refers to the job name provided when calling the :meth:`TrainingPipeline.run()` method.
            client (SFN.Client, optional): boto3 client to use for creating and interacting with the inference pipeline in Step Functions. (default: None)
+            retry (Retry): A retrier that defines the each pipeline step's retry policy. See `Error handling in Step Functions <https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-retrying-after-an-error>`_ for more details. (default: None)


Suggested change

-            retry (Retry): A retrier that defines the each pipeline step's retry policy. See `Error handling in Step Functions <https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-retrying-after-an-error>`_ for more details. (default: None)
+            retry (Retry): A retrier that defines the retry policy for each step in the pipeline. See `Error handling in Step Functions <https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-retrying-after-an-error>`_ for more details. (default: None)

Any reason to not make this a list for multiple retriers?
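
One way the list form could be supported without breaking the single-retrier call, shown as a hypothetical sketch. `as_retrier_list` is an illustrative helper, not something defined in this PR or the SDK:

```python
def as_retrier_list(retry):
    """Normalize None, a single retrier, or a list/tuple of retriers to a list."""
    if retry is None:
        return []
    if isinstance(retry, (list, tuple)):
        return list(retry)
    return [retry]


# Placeholder strings stand in for real retrier objects.
single = as_retrier_list("throttling_retry")
several = as_retrier_list(["throttling_retry", "timeout_retry"])
```

Each step would then have every retrier in the resulting list added to it in order.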

StepFunctions-Bot · 2021-11-12T20:44:16Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-sEHrOdk7acJc
Commit ID: aea996c
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Add retry arg to pipeline constructors to add retrier to each pipeline step

7bff6e6

ca-nguyen marked this pull request as ready for review November 5, 2021 06:46

ca-nguyen requested a review from shivlaks November 5, 2021 16:51

Merge branch 'main' into add-retry-to-pipeline-test

aea996c

wong-a reviewed Nov 12, 2021
