Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add compare tool #280

Merged
merged 4 commits into from
Aug 1, 2022
Merged

Add compare tool #280

merged 4 commits into from
Aug 1, 2022

Conversation

danyx23
Copy link
Contributor

@danyx23 danyx23 commented Jul 11, 2022

This PR implements the first version of a compare CLI that allows easy comparison of either arbitrary dataframes (feather, csv, parquet) or etl files (in that case against the version currently uploaded in the production catalog). Much of the logic for the diffing lives in the data-utils-py repo and is managed in this PR: owid/owid-datautils-py#29 .

This PR currently includes a temporary copy of the new HighlevelDiff class for development. It will be removed here and imported from data-utils-py when the PR over there is merged.

@danyx23 danyx23 requested review from pabloarosado and Marigold July 11, 2022 11:31
@danyx23
Copy link
Contributor Author

danyx23 commented Jul 11, 2022

As I mentioned above, most of the diffing logic is done in the other PR. If you can, please focus on the cli itself in this one. To run it, do

poetry install
poetry shell
compare dataframes DF1 DF2 
# or
compare etl_catalog CHANNEL NAMESPACE DATASET TABLE

@pabloarosado
Copy link
Contributor

Hi Daniel, thanks a lot for this! The CLI looks very neat. I can't go through the PR now in detail, but I have some general preliminary comments:

  • When you have many columns with diffs, the formatting gets a bit messy:
    Screen Shot 2022-07-11 at 14 01 50
  • Is your plan to move this back to data-utils? I think it makes sense to keep the tool there, instead of having it split between the two repos.

Copy link
Collaborator

@Marigold Marigold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good! Couple minor comments (feel free to ignore). I'm gonna try it on real datasets when I have a chance, but that shouldn't prevent merging this.

def yield_list_lines(
description: str, items: Iterable[Any]
) -> Generator[str, None, None]:
sublines = [item for item in items]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Could be list(items) I think)

)


def load_table(path_str: str) -> catalog.Table:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(There's a similar method that could be used instead, but it'd probably look weirder than your method)

help="Print truncated lists if they are longer than the given length.",
)
def etl_catalog(
channel: str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For me it would be more natural to write it as a path / URI channel/namespace/dataset/table instead of having them separated)

@larsyencken larsyencken merged commit 7fd919b into master Aug 1, 2022
@larsyencken larsyencken deleted the compare-tool branch August 1, 2022 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants