Add compare tool #280

danyx23 · 2022-07-11T11:31:36Z

This PR implements the first version of a compare CLI that makes it easy to compare either arbitrary dataframes (feather, csv, parquet) or ETL files (in that case against the version currently uploaded in the production catalog). Much of the diffing logic lives in the data-utils-py repo and is handled in this PR: owid/owid-datautils-py#29.

This PR currently includes a temporary copy of the new HighLevelDiff class for development. It will be removed here and imported from data-utils-py once the PR over there is merged.

danyx23 · 2022-07-11T11:38:01Z

As I mentioned above, most of the diffing logic is done in the other PR. If you can, please focus on the CLI itself in this one. To run it:

poetry install
poetry shell
compare dataframes DF1 DF2 
# or
compare etl_catalog CHANNEL NAMESPACE DATASET TABLE
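
For readers who want to see how a two-subcommand interface of this shape can be wired up, here is a minimal sketch using the standard library's `argparse`. It is not the code from this PR: the parser layout, the `build_parser` helper, and the help strings are assumptions for illustration, and only the subcommand names and positional arguments come from the usage above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two subcommands shown above."""
    parser = argparse.ArgumentParser(prog="compare")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # compare dataframes DF1 DF2
    dataframes = subparsers.add_parser(
        "dataframes", help="Compare two dataframe files."
    )
    dataframes.add_argument("df1", help="Path to the first dataframe file.")
    dataframes.add_argument("df2", help="Path to the second dataframe file.")

    # compare etl_catalog CHANNEL NAMESPACE DATASET TABLE
    etl_catalog = subparsers.add_parser(
        "etl_catalog", help="Compare a local table against a catalog version."
    )
    for name in ("channel", "namespace", "dataset", "table"):
        etl_catalog.add_argument(name)

    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is
# easy to try out; the file names are placeholders.
args = build_parser().parse_args(["dataframes", "old.csv", "new.csv"])
print(args.command, args.df1, args.df2)
```

Each subcommand gets its own positional arguments, so a missing argument is reported by the parser before any comparison logic runs.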

pabloarosado · 2022-07-11T12:09:31Z

Hi Daniel, thanks a lot for this! The CLI looks very neat. I can't go through the PR now in detail, but I have some general preliminary comments:

When you have many columns with diffs, the formatting gets a bit messy.

Is your plan to move this back to data-utils? I think it makes sense to keep the tool there, instead of having it split between the two repos.

Marigold

Looks good! A couple of minor comments (feel free to ignore). I'm going to try it on real datasets when I have a chance, but that shouldn't prevent merging this.

Marigold · 2022-07-11T13:55:54Z

etl/tempcompare.py

+def yield_list_lines(
+    description: str, items: Iterable[Any]
+) -> Generator[str, None, None]:
+    sublines = [item for item in items]


(Could be list(items) I think)
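
To illustrate the suggestion, here is a small sketch showing that `list(items)` and the identity comprehension produce the same result for any iterable, including one-shot generators. The two helper functions are hypothetical names introduced only for this comparison; they are not part of the PR.

```python
from typing import Any, Iterable, List


def materialize_with_comprehension(items: Iterable[Any]) -> List[Any]:
    # Form used in the diff above: copies each element one by one.
    return [item for item in items]


def materialize_with_list(items: Iterable[Any]) -> List[Any]:
    # Suggested form: same result, stated more directly.
    return list(items)


# Both forms consume a generator exactly once and give equal lists.
squares_a = materialize_with_comprehension(n * n for n in range(4))
squares_b = materialize_with_list(n * n for n in range(4))
print(squares_a == squares_b)
```

Either way, the point of materializing is that the result supports `len()` and repeated iteration, which a bare generator does not.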

Marigold · 2022-07-12T09:21:52Z

etl/compare.py

+    )
+
+
+def load_table(path_str: str) -> catalog.Table:


(There's a similar method that could be used instead, but it'd probably look weirder than your method)

Marigold · 2022-07-12T09:26:24Z

etl/compare.py

+    help="Print truncated lists if they are longer than the given length.",
+)
+def etl_catalog(
+    channel: str,


(For me it would be more natural to write it as a single path / URI, channel/namespace/dataset/table, instead of having the parts separated)
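
One way the single-path form could work is sketched below: a combined `channel/namespace/dataset/table` string is split into its four parts and validated. `TablePath` and `parse_table_path` are hypothetical names, not functions from the etl repo, and the example path values are placeholders.

```python
from typing import NamedTuple


class TablePath(NamedTuple):
    """The four components of a hypothetical catalog table path."""

    channel: str
    namespace: str
    dataset: str
    table: str


def parse_table_path(path: str) -> TablePath:
    """Split 'channel/namespace/dataset/table' into its four parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            f"Expected channel/namespace/dataset/table, got: {path!r}"
        )
    return TablePath(*parts)


# Placeholder values, chosen only to show the shape of the input.
parsed = parse_table_path("some_channel/some_namespace/some_dataset/some_table")
print(parsed.channel, parsed.table)
```

The trade-off is that a single argument is easier to copy and paste, while four separate arguments let the CLI parser itself report which component is missing.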

danyx23 requested review from pabloarosado and Marigold July 11, 2022 11:31

This was referenced Jul 11, 2022

High level diff utility should check and report metadata differences #281

Closed

Add a very high level comparison tool to diff the results of entire branches of the ETL #282

Closed

Marigold approved these changes Jul 12, 2022

pabloarosado approved these changes Jul 15, 2022

danyx23 added 4 commits August 1, 2022 11:50

feat: implement first version of dataframe compare cli tool

f6248bc

nit: minor formatting and output changes

f4a7f7d

formatting

0be9ba7

enhance: sync HighLevelDiff temp copy from data-utils-py, fix formatting and lint issues

f29bc79

larsyencken force-pushed the compare-tool branch from 1dbf7bb to f29bc79 Compare August 1, 2022 09:51

larsyencken merged commit 7fd919b into master Aug 1, 2022

larsyencken deleted the compare-tool branch August 1, 2022 10:17

larsyencken assigned danyx23 Aug 1, 2022
