Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Better filtering of data & metadata in chart-diff #3667

Merged
merged 4 commits into from
Dec 5, 2024

Conversation

Marigold
Copy link
Collaborator

@Marigold Marigold commented Dec 2, 2024

  • Fixed a bug that displayed Config change for unrelated charts.
  • Filtered data and metadata changes by checking which files were modified on the staging server.

I intended to use the function get_datasets_from_version_tracker, but calling vt.steps_df was too slow. Instead, I added a new function, get_all_changed_catalog_paths, which identifies the paths of changed steps along with their downstream dependencies. Let me know if a faster function for this purpose already exists.

Parsing catalog paths over and over again with .split("/") is tedious and prone to errors. I wanted to create a special object StepPath in this PR #3165, but wasn't fully satisfied with it and didn't end up merging. It'd be nice to refactor it all at one point.

@Marigold Marigold marked this pull request as ready for review December 2, 2024 08:45
@owidbot
Copy link
Contributor

owidbot commented Dec 2, 2024

Quick links (staging server):

Site Dev Site Preview Admin Wizard Docs

Login: ssh owid@staging-site-chartdiff-data-metadata

chart-diff: ✅ No charts for review.
data-diff: ✅ No differences found
Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included

Edited: 2024-12-02 08:47:43 UTC
Execution time: 16.02 seconds

@Marigold Marigold requested a review from pabloarosado December 2, 2024 08:59
Copy link
Contributor

@pabloarosado pabloarosado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!
I left a few suggestions to also be able to catch changes in snapshots.
And to be able to get the list of affected steps, I don't think we have any alternative but using vt.steps_df. Maybe we could invest some time to optimize that process (since it's used in multiple places).
However, what you propose is already an improvement, so feel free to merge as-is.

apps/wizard/utils/io.py Outdated Show resolved Hide resolved
apps/wizard/utils/io.py Show resolved Hide resolved

# Add all downstream dependencies of those datasets.
DAG = load_dag()
dag_steps = list(filter_to_subgraph(DAG, dataset_catalog_paths, downstream=True).keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK, as it filters down the DAG a little bit. But using VersionTracker.steps_df would be much more precise. You could do:

steps_df[(steps_df["step"].isin([...])]["all_active_usages"]

And that would give you only the steps that are affected by the changed files. That would be ultimately what we need. But I understand that loading steps_df is very slow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I profiled steps_df, but couldn't find any low-hanging fruit that would significantly speed it up. It just does a lot of things, which takes time. We'd have to refactor it a lot to make it both fast enough for such a simple use case as this and flexible for ETL dashboard. Anyway, I copied your comment to code to not get lost.

@Marigold Marigold force-pushed the chartdiff-data-metadata branch from e4421ca to c1d1e82 Compare December 5, 2024 08:29
@Marigold Marigold merged commit b20b45b into master Dec 5, 2024
4 of 7 checks passed
@Marigold Marigold deleted the chartdiff-data-metadata branch December 5, 2024 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants