-
-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Better filtering of data & metadata in chart-diff #3667
Conversation
Quick links (staging server):
Login: chart-diff: ✅No charts for review.data-diff: ✅ No differences foundLegend: +New ~Modified -Removed =Identical Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included Edited: 2024-12-02 08:47:43 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this!
I left a few suggestions to also be able to catch changes in snapshots.
And to be able to get the list of affected steps, I don't think we have any alternative but using vt.steps_df. Maybe we could invest some time to optimize that process (since it's used in multiple places).
However, what you propose is already an improvement, so feel free to merge as-is.
|
||
# Add all downstream dependencies of those datasets. | ||
DAG = load_dag() | ||
dag_steps = list(filter_to_subgraph(DAG, dataset_catalog_paths, downstream=True).keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is OK, as it filters down the DAG a little bit. But using VersionTracker.steps_df would be much more precise. You could do:
steps_df[(steps_df["step"].isin([...])]["all_active_usages"]
And that would give you only the steps that are affected by the changed files. That would be ultimately what we need. But I understand that loading steps_df
is very slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I profiled steps_df
, but couldn't find any low-hanging fruit that would significantly speed it up. It just does a lot of things, which takes time. We'd have to refactor it a lot to make it both fast enough for such a simple use case as this and flexible for ETL dashboard. Anyway, I copied your comment to code to not get lost.
e4421ca
to
c1d1e82
Compare
Config change
for unrelated charts.I intended to use the function
get_datasets_from_version_tracker
, but callingvt.steps_df
was too slow. Instead, I added a new function,get_all_changed_catalog_paths
, which identifies the paths of changed steps along with their downstream dependencies. Let me know if a faster function for this purpose already exists.Parsing catalog paths over and over again with
.split("/")
is tedious and prone to errors. I wanted to create a special objectStepPath
in this PR #3165, but wasn't fully satisfied with it and didn't end up merging. It'd be nice to refactor it all at one point.