-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Detect when a backfill has stopped producing runs but some of its target subset was never materialized and fail the backfill #27885
base: master
Are you sure you want to change the base?
Conversation
Summary: AssetGraphSubset.empty() returns an empty subset. What we wanted here was a boolean that tells you whether the subset was empty. This was (in master) causing asset backfills to always log that there was nothing to request, even if they then went on to request things Test Plan: New test case, Launch an asset backfill and view the logs, verify that output now includes the asset partitions being requested.
"Backfill stopped producing runs, but the following partitions were never materialized or failed, " | ||
"possibly because the runs that were supposed to materialize them succeeded but never materialized them: {_asset_graph_subset_to_str(partitions_without_materialization_status, asset_graph)}. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing f
prefix on the string, which means the expression {_asset_graph_subset_to_str(...)}
won't be interpolated. Add the prefix to make it:
f"Backfill stopped producing runs, but the following partitions were never materialized or failed, possibly because the runs that were supposed to materialize them succeeded but never materialized them: {_asset_graph_subset_to_str(partitions_without_materialization_status, asset_graph)}. "
Spotted by Graphite Reviewer
Is this helpful? React 👍 or 👎 to let us know.
dc39b02
to
43d6337
Compare
@@ -3290,14 +3295,21 @@ def test_asset_backfill_not_complete_until_retries_complete( | |||
parent_run_id=run_to_retry.run_id, | |||
) | |||
|
|||
print("RETREID RUN: " + str(retried_run)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There appears to be a typo in the debug print statement: RETREID RUN
should be RETRIED RUN
Spotted by Graphite Reviewer
Is this helpful? React 👍 or 👎 to let us know.
Condition 1 ensures that for each asset partition we have attempted to materialize it or have determined we | ||
cannot materialize it because of a failed dependency. Condition 2 ensures that no retries of failed runs are | ||
in progress. Condition 3 guards against a race condition where a failed run could be automatically retried | ||
but it was not added into the queue in time to be caught by condition 2. | ||
cannot materialize it because of a failed dependency. Condition 1 ensures that no retries of failed runs are | ||
in progress. Condition 2 guards against a race condition where a failed run could be automatically retried | ||
but it was not added into the queue in time to be caught by condition 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring has a small error. The sentence "Condition 1 ensures that no retries of failed runs are in progress" should be "Condition 2 ensures that no retries of failed runs are in progress", since this matches the actual conditions described in the docstring above.
Spotted by Graphite Reviewer
Is this helpful? React 👍 or 👎 to let us know.
raise DagsterBackfillFailedError( | ||
"Backfill stopped producing runs, but the following partitions were never materialized or failed, " | ||
"possibly because the runs that were supposed to materialize them succeeded but never materialized them: " | ||
f"{_asset_graph_subset_to_str(partitions_without_materialization_status, asset_graph)}. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message string already has a period at the end - there's one after the asset_graph
parameter. Adding another would result in two consecutive periods.
Spotted by Graphite Reviewer
Is this helpful? React 👍 or 👎 to let us know.
43d6337
to
7b26baf
Compare
…get subset was never materialized and fail the backfill
7b26baf
to
dedccbf
Compare
73cfd57
to
a21fc83
Compare
@jamiedemaria @clairelin135 before I go in and clean this up and add tests i'm curious if you have any concerns about the overall approach here? There would be a bit of additional data fetching on each backfill iteration but i don't think its significantly more than we are already doing in _get_failed_and_downstream_asset_graph_subset (one of the changes I would make to clean things up is pass the list of runs through so that they don't need to be fetched twice) |
if ( | ||
partitions_without_materialization_status.is_empty or asset_backfill_data_unchanged | ||
) and backfill_runs_are_complete( | ||
logger=logger, | ||
backfill_runs=backfill_runs, | ||
): | ||
if not partitions_without_materialization_status.is_empty: | ||
raise DagsterBackfillFailedError( | ||
"Backfill stopped producing runs, but the following partitions were never materialized or failed, " | ||
"possibly because the runs that were supposed to materialize them succeeded without creating the expected materializations: " | ||
f"{_asset_graph_subset_to_str(partitions_without_materialization_status, asset_graph)}." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's probably a less confusing way to structure this and separate out the two cases
case 1: backfill is not making progress, there are requested partitions not in a terminal state, and there are no more in progress or retrying runs - assume its stuck and fail the backfill
case 2: all requested partitions have been materialized or failed and there are no in progress runs - move to a successful or failed state depending on whether there are any failed partitions
Summary & Motivation
Possible solution for #17745.
The idea is:
How I Tested These Changes
Going to add tests based on initial feedback
Changelog