feat: add a high level dataframe diff utility class #29

danyx23 · 2022-07-06T15:06:12Z

This adds a utility class to diff two dataframes and get a structured high level representation of the differences (if any). I'm not super familiar with pandas so please let me know if you see issues with this. It's also only barely tested but I'd like to get a proof of concept up that I can use in the ETL for a high level diffing tool and then we can all iterate on improving the core diff utility class here.

codecov · 2022-07-06T15:10:35Z

Codecov Report

Merging #29 (5a4299c) into main (1d6d379) will increase coverage by 0.97%.
The diff coverage is 91.80%.

❗ Current head 5a4299c differs from pull request most recent head 84dc1c0. Consider uploading reports for the commit 84dc1c0 to get more accurate results

@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   84.36%   85.34%   +0.97%     
==========================================
  Files           7        7              
  Lines         403      464      +61     
==========================================
+ Hits          340      396      +56     
- Misses         63       68       +5

Impacted Files	Coverage Δ
owid/datautils/dataframes.py	`93.54% <91.80%> (-0.86%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d6d379...84dc1c0. Read the comment docs.

Marigold

Good start as a core class for diffs. It's still missing a summary method showing high level differences, but I assume this will be added in the future. Curious how is this gonna look on real world datasets.

I also found a similar project (surprisingly there aren't more). It's quite limited though, but give it a try if you want.

Marigold · 2022-07-07T10:18:00Z

owid/datautils/dataframes.py

+        This will be an array of index values. If the index is a MultiIndex, the index values will be tuples.
+        """
+        if self.value_differences is None:
+            return pd.array([])


columns.values type is np.ndarray, shouldn't this be np.array([]) then? (pd.array is used very rarely and mostly for pandas internal purposes)

pabloarosado

Thanks for this, it's going to be very useful. I made some suggestions, feel free to disregard them if you disagree.

pabloarosado · 2022-07-08T12:47:01Z

owid/datautils/dataframes.py

@@ -218,6 +218,179 @@ def are_equal(
    return equal, compared


+class DataFrameHighLevelDiff:


Given that we are in the dataframes module, I don't think it's necessary to mention DataFrame in the name. Maybe just Compare (which I think is more accurate than Diff).

pabloarosado · 2022-07-08T12:52:24Z

owid/datautils/dataframes.py

+        df1: pd.DataFrame,
+        df2: pd.DataFrame,
+        absolute_tolerance: float,
+        relative_tolerance: float,


Why not giving these two a default value? In numpy isclose the default values are rtol=1e-05, and atol=1e-08.

pabloarosado · 2022-07-08T13:02:39Z

owid/datautils/dataframes.py

+        self.diff()
+
+    @property
+    def columns_with_differences(self) -> Any:


This could just be called columns (and it could potentially return columns without differences if an argument return_equal is True).

I changes this a bit now and removed this and the simliar rows property for now (instead you access the value_differences dataframe which is now subset to the structural overlap of the two datasets

pabloarosado · 2022-07-08T13:02:56Z

owid/datautils/dataframes.py

+        return self.value_differences.columns.values
+
+    @property
+    def rows_with_differences(self) -> Any:


This could just be called rows (and it could potentially return rows without differences if an argument return_equal is True).

pabloarosado · 2022-07-08T13:19:38Z

owid/datautils/dataframes.py

+            return pd.array([])
+        return self.value_differences.index.values
+
+    def diff(self) -> None:


This should probably be a hidden method. Otherwise, if compare.diff() returns nothing, one may think that this means the dataframes are equal.

pabloarosado · 2022-07-08T13:24:29Z

tests/test_dataframes.py

@@ -223,6 +223,238 @@ def test_on_dataframes_with_object_columns_with_nans(self):
        )[0]



It would be good to apply the old TestAreDataFramesEqual on the new are_equal method.

Great idea - did that!

Also incorporates PR feedback. 4 new tests are failing, these still need to be discussed.

danyx23 · 2022-07-11T11:41:53Z

Hi @Marigold and @pabloarosado - thanks a lot for your feedback, much appreciated! I did another round now that also adds a matching CLI in the etl repo and added the high level human readable diff output generation here. For testing, the ETL repo is nicer (because of the CLI), but if possible please review the diff logic again here and let me know what you think. I'll also post in the data-analysts channel a little overview with screenshots about what it looks like ATM

@pabloarosado as you suggested I added a copy of the existing compare tests (great idea!). 4 of these are currently failing and I'd like to discuss with you what makes more sense (e.g. I am leaning towards not raising changed order of columns as a difference in this high level tool but one can certainly argue about this).

pabloarosado · 2022-07-11T12:19:32Z

Hi Daniel! as I said in the other PR, I can't go through it now, but have some general comments:

You call this "high level diff", but what would be another (low level) diff? If we have no plan to do something conceptually different, I suppose this should be simply called "Compare" (as I mentioned before).
I think that if the columns are in a different order, the dataframes should not be considered equal. Currently, that's the logic for rows, so it makes sense to apply the same logic to columns.
Another thought (possibly for another PR). Currently, we search for differences among common columns. I'm thinking: Maybe, instead of comparing the dataframes row by row, we should also merge on common indexes to search for differences among common rows/columns, what do you think?

feat: add a high level dataframe diff utility class

5a4299c

danyx23 requested review from Marigold and pabloarosado July 6, 2022 15:06

pabloarosado requested a review from lucasrodes July 7, 2022 06:41

Marigold approved these changes Jul 7, 2022

View reviewed changes

pabloarosado approved these changes Jul 8, 2022

View reviewed changes

feat: add highlevel diff summary line generator

84dc1c0

Also incorporates PR feedback. 4 new tests are failing, these still need to be discussed.

danyx23 mentioned this pull request Jul 11, 2022

Add compare tool owid/etl#280

Merged

danyx23 requested review from pabloarosado and Marigold July 11, 2022 11:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add a high level dataframe diff utility class #29

feat: add a high level dataframe diff utility class #29

danyx23 commented Jul 6, 2022

codecov bot commented Jul 6, 2022 •

edited

Loading

Marigold left a comment

Marigold Jul 7, 2022

pabloarosado left a comment

pabloarosado Jul 8, 2022

pabloarosado Jul 8, 2022

pabloarosado Jul 8, 2022

danyx23 Jul 8, 2022

pabloarosado Jul 8, 2022

pabloarosado Jul 8, 2022

pabloarosado Jul 8, 2022

danyx23 Jul 11, 2022

danyx23 commented Jul 11, 2022

pabloarosado commented Jul 11, 2022

		@@ -218,6 +218,179 @@ def are_equal(
		return equal, compared


		class DataFrameHighLevelDiff:

		@@ -223,6 +223,238 @@ def test_on_dataframes_with_object_columns_with_nans(self):
		)[0]

feat: add a high level dataframe diff utility class #29

Are you sure you want to change the base?

feat: add a high level dataframe diff utility class #29

Conversation

danyx23 commented Jul 6, 2022

codecov bot commented Jul 6, 2022 • edited Loading

Codecov Report

Marigold left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pabloarosado left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danyx23 commented Jul 11, 2022

pabloarosado commented Jul 11, 2022

codecov bot commented Jul 6, 2022 •

edited

Loading