Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Detect when a backfill has stopped producing runs but some of its target subset was never materialized and fail the backfill #27885

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

gibsondan
Copy link
Member

@gibsondan gibsondan commented Feb 18, 2025

Summary & Motivation

Possible solution for #17745.

The idea is:

  • at the start of the tick, get all the in progress and failed runs for the backfill
  • If we don't make any progress this tick, check to see if there are runs that are still producing work
  • If there aren't, assuming that one of those runs didn't materialize partitions it was supposed to and fail the backfill.

How I Tested These Changes

Going to add tests based on initial feedback

Changelog

Summary:
AssetGraphSubset.empty() returns an empty subset. What we wanted here was a boolean that tells you whether the subset was empty. This was (in master) causing asset backfills to always log that there was nothing to request, even if they then went on to request things

Test Plan:
New test case, Launch an asset backfill and view the logs, verify that output now includes the asset partitions being requested.
Comment on lines 1088 to 1089
"Backfill stopped producing runs, but the following partitions were never materialized or failed, "
"possibly because the runs that were supposed to materialize them succeeded but never materialized them: {_asset_graph_subset_to_str(partitions_without_materialization_status, asset_graph)}. "
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing f prefix on the string, which means the expression {_asset_graph_subset_to_str(...)} won't be interpolated. Add the prefix to make it:

f"Backfill stopped producing runs, but the following partitions were never materialized or failed, possibly because the runs that were supposed to materialize them succeeded but never materialized them: {_asset_graph_subset_to_str(partitions_without_materialization_status, asset_graph)}. "

Spotted by Graphite Reviewer

Is this helpful? React 👍 or 👎 to let us know.

@gibsondan gibsondan force-pushed the raiseontargetwithoutrequest branch from dc39b02 to 43d6337 Compare February 18, 2025 03:38
@@ -3290,14 +3295,21 @@ def test_asset_backfill_not_complete_until_retries_complete(
parent_run_id=run_to_retry.run_id,
)

print("RETREID RUN: " + str(retried_run))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There appears to be a typo in the debug print statement: RETREID RUN should be RETRIED RUN

Spotted by Graphite Reviewer

Is this helpful? React 👍 or 👎 to let us know.

Comment on lines 917 to +921
Condition 1 ensures that for each asset partition we have attempted to materialize it or have determined we
cannot materialize it because of a failed dependency. Condition 2 ensures that no retries of failed runs are
in progress. Condition 3 guards against a race condition where a failed run could be automatically retried
but it was not added into the queue in time to be caught by condition 2.
cannot materialize it because of a failed dependency. Condition 1 ensures that no retries of failed runs are
in progress. Condition 2 guards against a race condition where a failed run could be automatically retried
but it was not added into the queue in time to be caught by condition 1.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring has a small error. The sentence "Condition 1 ensures that no retries of failed runs are in progress" should be "Condition 2 ensures that no retries of failed runs are in progress", since this matches the actual conditions described in the docstring above.

Spotted by Graphite Reviewer

Is this helpful? React 👍 or 👎 to let us know.

raise DagsterBackfillFailedError(
"Backfill stopped producing runs, but the following partitions were never materialized or failed, "
"possibly because the runs that were supposed to materialize them succeeded but never materialized them: "
f"{_asset_graph_subset_to_str(partitions_without_materialization_status, asset_graph)}. "
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message string already has a period at the end - there's one after the asset_graph parameter. Adding another would result in two consecutive periods.

Spotted by Graphite Reviewer

Is this helpful? React 👍 or 👎 to let us know.

@gibsondan gibsondan force-pushed the raiseontargetwithoutrequest branch from 43d6337 to 7b26baf Compare February 18, 2025 03:45
…get subset was never materialized and fail the backfill
@gibsondan gibsondan force-pushed the raiseontargetwithoutrequest branch from 7b26baf to dedccbf Compare February 18, 2025 03:51
@gibsondan gibsondan changed the title Detect when a backfill has stopped producing runs but some of its target subset was never materialized and fail the backfill RFC: Detect when a backfill has stopped producing runs but some of its target subset was never materialized and fail the backfill Feb 18, 2025
@gibsondan
Copy link
Member Author

@jamiedemaria @clairelin135 before I go in and clean this up and add tests i'm curious if you have any concerns about the overall approach here? There would be a bit of additional data fetching on each backfill iteration but i don't think its significantly more than we are already doing in _get_failed_and_downstream_asset_graph_subset (one of the changes I would make to clean things up is pass the list of runs through so that they don't need to be fetched twice)

Comment on lines +1088 to +1099
if (
partitions_without_materialization_status.is_empty or asset_backfill_data_unchanged
) and backfill_runs_are_complete(
logger=logger,
backfill_runs=backfill_runs,
):
if not partitions_without_materialization_status.is_empty:
raise DagsterBackfillFailedError(
"Backfill stopped producing runs, but the following partitions were never materialized or failed, "
"possibly because the runs that were supposed to materialize them succeeded without creating the expected materializations: "
f"{_asset_graph_subset_to_str(partitions_without_materialization_status, asset_graph)}."
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's probably a less confusing way to structure this and separate out the two cases

case 1: backfill is not making progress, there are requested partitions not in a terminal state, and there are no more in progress or retrying runs - assume its stuck and fail the backfill

case 2: all requested partitions have been materialized or failed and there are no in progress runs - move to a successful or failed state depending on whether there are any failed partitions

Base automatically changed from fixemptylogging to master February 18, 2025 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant