✨ Better filtering of data & metadata in chart-diff #3667

Marigold · 2024-12-02T08:45:39Z

Fixed a bug that displayed Config change for unrelated charts.
Filtered data and metadata changes by checking which files were modified on the staging server.

I intended to use the function get_datasets_from_version_tracker, but calling vt.steps_df was too slow. Instead, I added a new function, get_all_changed_catalog_paths, which identifies the paths of changed steps along with their downstream dependencies. Let me know if a faster function for this purpose already exists.
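As a rough illustration of the idea (not the code added in this PR), the sketch below collects a set of changed steps together with everything downstream of them. The DAG shape (each step mapped to the set of steps it depends on), the helper name `changed_with_downstream`, and the step names are all assumptions made for the example.

```python
from collections import defaultdict, deque


def changed_with_downstream(dag: dict[str, set[str]], changed: set[str]) -> set[str]:
    """Return the changed steps plus every step that depends on them, directly or transitively."""
    # Invert the DAG so we can look up "who uses this step?".
    dependents = defaultdict(set)
    for step, deps in dag.items():
        for dep in deps:
            dependents[dep].add(step)

    # Breadth-first walk starting from the changed steps.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical DAG: each step maps to the set of steps it depends on.
toy_dag = {
    "meadow/a": set(),
    "garden/a": {"meadow/a"},
    "grapher/a": {"garden/a"},
    "garden/b": set(),
    "grapher/b": {"garden/b"},
}
affected = changed_with_downstream(toy_dag, {"meadow/a"})
```

Here a change to `meadow/a` also marks `garden/a` and `grapher/a` as affected, while the unrelated `garden/b` branch is left out.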

Parsing catalog paths over and over again with .split("/") is tedious and error-prone. I wanted to introduce a dedicated StepPath object in PR #3165, but wasn't fully satisfied with it and didn't end up merging it. It'd be nice to refactor it all at some point.
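To make the idea concrete, here is a minimal sketch of what such an object could look like. It is not the StepPath from PR #3165; the four-part layout (`channel/namespace/version/dataset`), the field names, and the example path are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepPath:
    """Hypothetical parsed form of a path like 'channel/namespace/version/dataset'."""

    channel: str
    namespace: str
    version: str
    dataset: str

    @classmethod
    def from_str(cls, path: str) -> "StepPath":
        # Split once, validate once, instead of repeating .split("/") at every call site.
        parts = path.strip("/").split("/")
        if len(parts) != 4:
            raise ValueError(f"Expected 4 path components, got {len(parts)}: {path!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return "/".join((self.channel, self.namespace, self.version, self.dataset))


# Hypothetical example path.
step = StepPath.from_str("garden/demography/2024-12-02/population")
```

Call sites can then read `step.namespace` or `step.version` rather than indexing into a split list, and malformed paths fail early with a clear error.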

owidbot · 2024-12-02T08:47:44Z

Quick links (staging server):

Site Dev	Site Preview	Admin	Wizard	Docs

Login: ssh owid@staging-site-chartdiff-data-metadata

chart-diff: ✅

No charts for review.

data-diff: ✅ No differences found

Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Edited: 2024-12-02 08:47:43 UTC
Execution time: 16.02 seconds

pabloarosado

Thanks for fixing this!
I left a few suggestions so that changes in snapshots are caught as well.
To get the list of affected steps, I don't think we have any alternative to using vt.steps_df. Maybe we could invest some time in optimizing that process, since it's used in multiple places.
However, what you propose is already an improvement, so feel free to merge as-is.

apps/wizard/utils/io.py

pabloarosado · 2024-12-04T09:38:42Z

+
+    # Add all downstream dependencies of those datasets.
+    DAG = load_dag()
+    dag_steps = list(filter_to_subgraph(DAG, dataset_catalog_paths, downstream=True).keys())


This is OK, as it filters down the DAG a little bit. But using VersionTracker.steps_df would be much more precise. You could do:

steps_df[steps_df["step"].isin([...])]["all_active_usages"]

And that would give you only the steps that are affected by the changed files. That would be ultimately what we need. But I understand that loading steps_df is very slow.
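For illustration, the lookup can be tried on a toy DataFrame. Only the column names `step` and `all_active_usages` come from the snippet above. The rows, the step names, and the assumption that each `all_active_usages` cell holds a set of step names are made up for this sketch, and the real steps_df may differ.

```python
import pandas as pd

# Toy stand-in for VersionTracker.steps_df; real contents and dtypes may differ.
steps_df = pd.DataFrame(
    {
        "step": ["snapshot/a", "meadow/a", "garden/b"],
        "all_active_usages": [
            {"meadow/a", "garden/a"},
            {"garden/a"},
            set(),
        ],
    }
)

# Hypothetical list of steps whose files changed.
changed_steps = ["snapshot/a", "garden/b"]

# Select the rows for the changed steps, then merge their usages into one set.
usages = steps_df[steps_df["step"].isin(changed_steps)]["all_active_usages"]
affected_steps = set().union(*usages)
```

The result contains only the steps that actually use the changed ones, which is narrower than taking the whole downstream subgraph of a dataset.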

Marigold · 2024-12-05

I profiled steps_df but couldn't find any low-hanging fruit that would significantly speed it up. It simply does a lot of work, which takes time. We'd have to refactor it substantially to make it both fast enough for a simple use case like this one and flexible enough for the ETL dashboard. In the meantime, I copied your comment into the code so it doesn't get lost.

Marigold marked this pull request as ready for review December 2, 2024 08:45

github-actions bot assigned Marigold Dec 2, 2024

Marigold requested a review from pabloarosado December 2, 2024 08:59

pabloarosado approved these changes Dec 4, 2024

Marigold added 2 commits December 5, 2024 09:29

✨ Better filtering of data & metadata in chart-diff

d642582

wip

c1d1e82

Marigold force-pushed the chartdiff-data-metadata branch from e4421ca to c1d1e82 Compare December 5, 2024 08:29

Marigold added 2 commits December 5, 2024 09:52

wip

d717203

wip

99e5989

Marigold merged commit b20b45b into master Dec 5, 2024
4 of 7 checks passed

Marigold deleted the chartdiff-data-metadata branch December 5, 2024 09:04
