-
-
Notifications
You must be signed in to change notification settings - Fork 1
feat: add a high level dataframe diff utility class #29
base: main
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## main #29 +/- ##
==========================================
+ Coverage 84.36% 85.34% +0.97%
==========================================
Files 7 7
Lines 403 464 +61
==========================================
+ Hits 340 396 +56
- Misses 63 68 +5
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start as a core class for diffs. It's still missing a summary
method showing high level differences, but I assume this will be added in the future. Curious how is this gonna look on real world datasets.
I also found a similar project (surprisingly there aren't more). It's quite limited though, but give it a try if you want.
owid/datautils/dataframes.py
Outdated
This will be an array of index values. If the index is a MultiIndex, the index values will be tuples. | ||
""" | ||
if self.value_differences is None: | ||
return pd.array([]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columns.values
type is np.ndarray
, shouldn't this be np.array([])
then? (pd.array
is used very rarely and mostly for pandas internal purposes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, it's going to be very useful. I made some suggestions, feel free to disregard them if you disagree.
owid/datautils/dataframes.py
Outdated
@@ -218,6 +218,179 @@ def are_equal( | |||
return equal, compared | |||
|
|||
|
|||
class DataFrameHighLevelDiff: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we are in the dataframes
module, I don't think it's necessary to mention DataFrame
in the name. Maybe just Compare
(which I think is more accurate than Diff
).
owid/datautils/dataframes.py
Outdated
df1: pd.DataFrame, | ||
df2: pd.DataFrame, | ||
absolute_tolerance: float, | ||
relative_tolerance: float, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not giving these two a default value? In numpy isclose
the default values are rtol=1e-05
, and atol=1e-08
.
self.diff() | ||
|
||
@property | ||
def columns_with_differences(self) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could just be called columns
(and it could potentially return columns without differences if an argument return_equal
is True).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changes this a bit now and removed this and the simliar rows property for now (instead you access the value_differences dataframe which is now subset to the structural overlap of the two datasets
return self.value_differences.columns.values | ||
|
||
@property | ||
def rows_with_differences(self) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could just be called rows
(and it could potentially return rows without differences if an argument return_equal
is True).
owid/datautils/dataframes.py
Outdated
return pd.array([]) | ||
return self.value_differences.index.values | ||
|
||
def diff(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be a hidden method. Otherwise, if compare.diff()
returns nothing, one may think that this means the dataframes are equal.
@@ -223,6 +223,238 @@ def test_on_dataframes_with_object_columns_with_nans(self): | |||
)[0] | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to apply the old TestAreDataFramesEqual
on the new are_equal
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea - did that!
Also incorporates PR feedback. 4 new tests are failing, these still need to be discussed.
Hi @Marigold and @pabloarosado - thanks a lot for your feedback, much appreciated! I did another round now that also adds a matching CLI in the etl repo and added the high level human readable diff output generation here. For testing, the ETL repo is nicer (because of the CLI), but if possible please review the diff logic again here and let me know what you think. I'll also post in the data-analysts channel a little overview with screenshots about what it looks like ATM @pabloarosado as you suggested I added a copy of the existing compare tests (great idea!). 4 of these are currently failing and I'd like to discuss with you what makes more sense (e.g. I am leaning towards not raising changed order of columns as a difference in this high level tool but one can certainly argue about this). |
Hi Daniel! as I said in the other PR, I can't go through it now, but have some general comments:
|
This adds a utility class to diff two dataframes and get a structured high level representation of the differences (if any). I'm not super familiar with pandas so please let me know if you see issues with this. It's also only barely tested but I'd like to get a proof of concept up that I can use in the ETL for a high level diffing tool and then we can all iterate on improving the core diff utility class here.