Skip to content
This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

feat: add a high level dataframe diff utility class #29

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

danyx23
Copy link
Contributor

@danyx23 danyx23 commented Jul 6, 2022

This adds a utility class to diff two dataframes and get a structured high level representation of the differences (if any). I'm not super familiar with pandas so please let me know if you see issues with this. It's also only barely tested but I'd like to get a proof of concept up that I can use in the ETL for a high level diffing tool and then we can all iterate on improving the core diff utility class here.

@danyx23 danyx23 requested review from Marigold and pabloarosado July 6, 2022 15:06
@codecov
Copy link

codecov bot commented Jul 6, 2022

Codecov Report

Merging #29 (5a4299c) into main (1d6d379) will increase coverage by 0.97%.
The diff coverage is 91.80%.

❗ Current head 5a4299c differs from pull request most recent head 84dc1c0. Consider uploading reports for the commit 84dc1c0 to get more accurate results

@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   84.36%   85.34%   +0.97%     
==========================================
  Files           7        7              
  Lines         403      464      +61     
==========================================
+ Hits          340      396      +56     
- Misses         63       68       +5     
Impacted Files Coverage Δ
owid/datautils/dataframes.py 93.54% <91.80%> (-0.86%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d6d379...84dc1c0. Read the comment docs.

@pabloarosado pabloarosado requested a review from lucasrodes July 7, 2022 06:41
Copy link
Contributor

@Marigold Marigold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start as a core class for diffs. It's still missing a summary method showing high level differences, but I assume this will be added in the future. Curious how is this gonna look on real world datasets.

I also found a similar project (surprisingly there aren't more). It's quite limited though, but give it a try if you want.

This will be an array of index values. If the index is a MultiIndex, the index values will be tuples.
"""
if self.value_differences is None:
return pd.array([])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columns.values type is np.ndarray, shouldn't this be np.array([]) then? (pd.array is used very rarely and mostly for pandas internal purposes)

Copy link
Collaborator

@pabloarosado pabloarosado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, it's going to be very useful. I made some suggestions, feel free to disregard them if you disagree.

@@ -218,6 +218,179 @@ def are_equal(
return equal, compared


class DataFrameHighLevelDiff:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are in the dataframes module, I don't think it's necessary to mention DataFrame in the name. Maybe just Compare (which I think is more accurate than Diff).

df1: pd.DataFrame,
df2: pd.DataFrame,
absolute_tolerance: float,
relative_tolerance: float,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not giving these two a default value? In numpy isclose the default values are rtol=1e-05, and atol=1e-08.

self.diff()

@property
def columns_with_differences(self) -> Any:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be called columns (and it could potentially return columns without differences if an argument return_equal is True).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changes this a bit now and removed this and the simliar rows property for now (instead you access the value_differences dataframe which is now subset to the structural overlap of the two datasets

return self.value_differences.columns.values

@property
def rows_with_differences(self) -> Any:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be called rows (and it could potentially return rows without differences if an argument return_equal is True).

return pd.array([])
return self.value_differences.index.values

def diff(self) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be a hidden method. Otherwise, if compare.diff() returns nothing, one may think that this means the dataframes are equal.

@@ -223,6 +223,238 @@ def test_on_dataframes_with_object_columns_with_nans(self):
)[0]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to apply the old TestAreDataFramesEqual on the new are_equal method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea - did that!

Also incorporates PR feedback. 4 new tests are failing, these still need to be discussed.
@danyx23
Copy link
Contributor Author

danyx23 commented Jul 11, 2022

Hi @Marigold and @pabloarosado - thanks a lot for your feedback, much appreciated! I did another round now that also adds a matching CLI in the etl repo and added the high level human readable diff output generation here. For testing, the ETL repo is nicer (because of the CLI), but if possible please review the diff logic again here and let me know what you think. I'll also post in the data-analysts channel a little overview with screenshots about what it looks like ATM

@pabloarosado as you suggested I added a copy of the existing compare tests (great idea!). 4 of these are currently failing and I'd like to discuss with you what makes more sense (e.g. I am leaning towards not raising changed order of columns as a difference in this high level tool but one can certainly argue about this).

@pabloarosado
Copy link
Collaborator

Hi Daniel! as I said in the other PR, I can't go through it now, but have some general comments:

  1. You call this "high level diff", but what would be another (low level) diff? If we have no plan to do something conceptually different, I suppose this should be simply called "Compare" (as I mentioned before).
  2. I think that if the columns are in a different order, the dataframes should not be considered equal. Currently, that's the logic for rows, so it makes sense to apply the same logic to columns.
  3. Another thought (possibly for another PR). Currently, we search for differences among common columns. I'm thinking: Maybe, instead of comparing the dataframes row by row, we should also merge on common indexes to search for differences among common rows/columns, what do you think?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants