Add a very high level comparison tool to diff the results of entire branches of the ETL #282

Closed
danyx23 opened this issue Jul 11, 2022 · 4 comments

@danyx23
Contributor

danyx23 commented Jul 11, 2022

#280 adds a high-level diff tool that works on one table at a time. What we would like for staging (or feature branches) is a tool that can efficiently diff everything against master. We don't want to calculate diffs for every file in the catalog twice, so I think we should first check md5 hashes (do we have those at the file level already?) and then compare them against a target catalog. It would be great if that target catalog could also be our main remote one (I think S3 can provide md5 hashes for files without having to download them?).

It would also be good to be able to check only a subset of all tables with this tool (e.g. everything in garden/OWID).

@Marigold
Collaborator

> check md5 hashes (do we have those on the file level already?)

We have them as a checksum column in our catalog, so it's enough to download both the production and staging catalogs and compare them to find out-of-sync data.
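A rough sketch of that comparison, in case it helps the discussion. It assumes each catalog has already been loaded into a plain dict mapping a table path to its checksum; the function name `diff_catalogs`, the `CatalogDiff` type, and the example paths are hypothetical and not taken from the ETL code. The optional `prefix` argument covers the "only a subset" case such as garden/OWID.

```python
from typing import Dict, List, NamedTuple, Optional


class CatalogDiff(NamedTuple):
    added: List[str]    # tables present only in the branch catalog
    removed: List[str]  # tables present only in the target catalog
    changed: List[str]  # tables present in both, with different checksums


def diff_catalogs(
    branch: Dict[str, str],
    target: Dict[str, str],
    prefix: Optional[str] = None,
) -> CatalogDiff:
    """Compare two {table path: checksum} mappings (hypothetical layout)."""

    def keep(path: str) -> bool:
        # With no prefix every table is compared; otherwise only matching paths.
        return prefix is None or path.startswith(prefix)

    b = {p: c for p, c in branch.items() if keep(p)}
    t = {p: c for p, c in target.items() if keep(p)}

    added = sorted(b.keys() - t.keys())
    removed = sorted(t.keys() - b.keys())
    changed = sorted(p for p in b.keys() & t.keys() if b[p] != t[p])
    return CatalogDiff(added, removed, changed)


if __name__ == "__main__":
    # Made-up paths and checksums, for illustration only.
    staging = {"garden/owid/a": "111", "garden/owid/b": "222", "meadow/x": "999"}
    production = {"garden/owid/a": "111", "garden/owid/b": "333", "garden/owid/c": "444"}
    print(diff_catalogs(staging, production, prefix="garden/owid"))
```

Only the tables listed under `changed` would then need the detailed per-table diff from #280.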

> I think S3 can provide md5 hashes for files without having to download them?

True!
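One caveat worth keeping in mind: the hash S3 exposes without a download is the object's ETag, and that is a plain MD5 of the content only for single-request uploads that are unencrypted or use SSE-S3. For multipart uploads the ETag is commonly observed to be the MD5 of the concatenated per-part MD5 digests followed by `-<number of parts>`. The sketch below reproduces both forms locally so a local file can be compared with a remote ETag; the helper name `s3_style_etag` is hypothetical, and the multipart format is an assumption based on observed behaviour rather than a documented guarantee.

```python
import hashlib
from typing import Optional


def s3_style_etag(data: bytes, part_size: Optional[int] = None) -> str:
    """Return the ETag S3 would be expected to report for `data`.

    part_size=None models a single PUT: the ETag is the hex MD5 of the content.
    Otherwise a multipart upload with fixed-size parts is modelled (assumed format).
    """
    if part_size is None:
        return hashlib.md5(data).hexdigest()
    if part_size <= 0:
        raise ValueError("part_size must be positive")

    # Split into parts; an empty payload is treated as one empty part.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]
    # MD5 over the concatenated binary digests of all parts, plus the part count.
    combined = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(combined).hexdigest()}-{len(parts)}"


if __name__ == "__main__":
    payload = b"hello world"
    print(s3_style_etag(payload))               # single-PUT form
    print(s3_style_etag(payload, part_size=5))  # multipart form, 3 parts
```

If the remote catalog files were uploaded in multiple parts, comparing against a plain local MD5 would report a mismatch even when the content is identical, so the part size used for the upload would need to be known.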

@stale

stale bot commented Sep 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Sep 9, 2022
@stale stale bot closed this as completed Sep 16, 2022
@Marigold
Collaborator

@danyx23 how about reopening this one? It's quite relevant for our ETL CI efforts.

@danyx23 danyx23 reopened this Sep 20, 2022
@stale stale bot removed the wontfix label Sep 20, 2022
@danyx23 danyx23 added the pinned label Sep 20, 2022
@Marigold
Collaborator

Marigold commented Jan 3, 2023

Implemented with datadiff.

@Marigold Marigold closed this as completed Jan 3, 2023