-
-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add compare tool #280
Add compare tool #280
Conversation
As I mentioned above, most of the diffing logic is done in the other PR. If you can, please focus on the cli itself in this one. To run it, do poetry install
poetry shell
compare dataframes DF1 DF2
# or
compare etl_catalog CHANNEL NAMESPACE DATASET TABLE |
Hi Daniel, thanks a lot for this! The CLI looks very neat. I can't go through the PR now in detail, but I have some general preliminary comments: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good! Couple minor comments (feel free to ignore). I'm gonna try it on real datasets when I have a chance, but that shouldn't prevent merging this.
def yield_list_lines( | ||
description: str, items: Iterable[Any] | ||
) -> Generator[str, None, None]: | ||
sublines = [item for item in items] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Could be list(items)
I think)
) | ||
|
||
|
||
def load_table(path_str: str) -> catalog.Table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(There's a similar method that could be used instead, but it'd probably look weirder than your method)
help="Print truncated lists if they are longer than the given length.", | ||
) | ||
def etl_catalog( | ||
channel: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(For me it would be more natural to write it as a path / URI channel/namespace/dataset/table
instead of having them separated)
1dbf7bb
to
f29bc79
Compare
This PR implements the first version of a compare CLI that allows easy comparison of either arbitrary dataframes (feather, csv, parquet) or etl files (in that case against the version currently uploaded in the production catalog). Much of the logic for the diffing lives in the data-utils-py repo and is managed in this PR: owid/owid-datautils-py#29 .
This PR currently includes a temporary copy of the new HighlevelDiff class for development. It will be removed here and imported from data-utils-py when the PR over there is merged.