Implement groupby multi level support #486

phofl · 2023-12-06T14:54:09Z

No description provided.

milesgranger

I'm okay with this, though the repeated/similar code I commented on may be a byproduct of the current design.

milesgranger · 2023-12-11T13:03:50Z

dask_expr/_groupby.py

+    @functools.cached_property
+    def _by_meta(self):
+        if isinstance(self.by, Expr):
+            return meta_nonempty(self.by._meta)
+        elif is_scalar(self.by):
+            return self.by
+        else:
+            return [
+                meta_nonempty(x._meta) if isinstance(x, Expr) else x for x in self.by
+            ]
+
+    @functools.cached_property
+    def _by_columns(self):
+        if isinstance(self.by, Expr):
+            return []
+        else:
+            return [x for x in self.by if not isinstance(x, Expr)]
+


Having identical or near-identical code across classes doesn't feel great: GroupByReduction._by_columns and GroupByApplyConcatApply._by_columns are exactly the same, and the two _by_meta implementations are nearly the same. Do you think the logic could be combined into ApplyConcatApply?

No, ApplyConcatApply isn't groupby-specific, so it's not the place for this logic.

The groupby structure needs a refactor anyway to make this more consistent, but that's something for a follow-up once the actual implementation is ironed out.

I rewrote parts of the implementation. I'm still not happy with it, but there is less duplicated code now.
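For illustration only, one groupby-specific way to share the `_by_columns` logic shown in the diff is a small mixin. This is a sketch and not what the PR does: `Expr` here is an empty stand-in class, and `ByColumnsMixin`, `ReductionSketch` and `ApplySketch` are hypothetical names.

```python
import functools


class Expr:
    """Empty stand-in for dask_expr's Expr (assumption, illustration only)."""


class ByColumnsMixin:
    """Hypothetical mixin holding the logic both groupby classes repeat."""

    @functools.cached_property
    def _by_columns(self):
        # Same branches as the diff: a lone Expr contributes no column names,
        # otherwise keep every entry of ``by`` that is not an Expr.
        if isinstance(self.by, Expr):
            return []
        return [x for x in self.by if not isinstance(x, Expr)]


class ReductionSketch(ByColumnsMixin):
    def __init__(self, by):
        self.by = by


class ApplySketch(ByColumnsMixin):
    def __init__(self, by):
        self.by = by
```

Both sketch classes then resolve `_by_columns` through the same code path, without placing groupby logic in a generic base class.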

crusaderky · 2023-12-11T12:28:40Z

dask_expr/_groupby.py

@@ -860,6 +907,22 @@ def _extract_meta(x, nonempty=False):
 ###


+def _validate_by_expr(obj, by):


Could you think of a better name, and possibly add a docstring?
Elsewhere in dask, "validate" functions typically contain a bunch of asserts and return None.
This function seems to return sometimes a column name, sometimes an Expr, sometimes something else.
Maybe "clean_by_expr" or "preprocess_by_expr"?

_clean_by_expr seems fine

crusaderky · 2023-12-11T12:29:31Z

dask_expr/_groupby.py

+        if not are_co_aligned(obj.expr, by.expr):
+            raise ValueError("by must be in the DataFrames columns.")
+        return by.expr
+    return by


Could you add a comment explaining which use cases are not caught by any of the branches above?
I gather from elsewhere that Expr is a possible use case; are there others?

Added. It can be a proper column name, e.g. by="a".
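The fall-through discussed in this thread can be sketched as follows. Only the alignment check, the `ValueError`, `return by.expr` and the final `return by` are visible in the diff. The enclosing `isinstance` condition, the `Collection` wrapper, the `root` attribute and the body of `are_co_aligned` are stand-ins assumed for illustration.

```python
class Expr:
    """Stand-in expression; ``root`` is a hypothetical marker of its source."""

    def __init__(self, root):
        self.root = root


class Collection:
    """Stand-in for a user-facing object that wraps an expression."""

    def __init__(self, expr):
        self.expr = expr


def are_co_aligned(left, right):
    # Assumption: two expressions count as aligned when they share a root.
    return left.root == right.root


def clean_by_expr(obj, by):
    """Return the expression behind ``by``, or ``by`` itself (sketch)."""
    if isinstance(by, Collection):  # assumed condition, not shown in the diff
        if not are_co_aligned(obj.expr, by.expr):
            raise ValueError("by must be in the DataFrames columns.")
        return by.expr
    # Fall-through: anything else, such as a plain column name (by="a").
    return by
```

Under these assumptions, a grouping key built from the same frame comes back as its expression, while a string such as `"a"` is returned unchanged.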

crusaderky · 2023-12-11T12:32:59Z

dask_expr/_groupby.py

-            self.by = by.expr
-        else:
-            self.by = [by] if np.isscalar(by) else list(by)
+        self.by = [by] if np.isscalar(by) or isinstance(by, Expr) else list(by)


Everywhere else you added code branches that allow the by attribute to be either a scalar (Expr or otherwise) or a list. This, however, says that by is always a list?

I much prefer the latter. Test coverage for all these new code branches is quite spotty - something that coercing everything into a list as soon as it's acquired from the user would prevent.

Very good point; that was an oversight on my part. It will always be a list once this PR is in.
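A minimal sketch of the "coerce to a list on entry" idea, so that downstream code only ever handles one shape. `Expr` is a stand-in class, `normalize_by` is a hypothetical helper name, and `_is_scalar` is a simplified stdlib substitute for the `np.isscalar` call used in the diff.

```python
import numbers


class Expr:
    """Empty stand-in for dask_expr's Expr (assumption, illustration only)."""


def _is_scalar(value):
    # Simplified substitute for np.isscalar: strings, bytes and numbers.
    return isinstance(value, (str, bytes, numbers.Number))


def normalize_by(by):
    """Coerce a user-supplied ``by`` into a list (hypothetical helper)."""
    # A single column name or a single expression is wrapped in a list;
    # any other iterable is materialized as a list.
    if _is_scalar(by) or isinstance(by, Expr):
        return [by]
    return list(by)
```

With this normalization done once, properties such as `_by_columns` would no longer need a separate branch for a bare scalar or a bare `Expr`.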

phofl · 2023-12-15T12:15:54Z

Merging this, since it blocks follow-ups.

phofl added 2 commits December 6, 2023 15:40

Implement groupby multi level support

1a2efe9

Implement groupby multi level support

ab9a663

milesgranger approved these changes Dec 11, 2023

Merge branch 'main' into list_raise

b717928

crusaderky reviewed Dec 11, 2023

phofl added 7 commits December 12, 2023 11:09

Update

23a2c1d

Add commen

af4ee66

Rewrite

17d1308

Add tests and fixups

f2ee711

Merge branch 'main' into list_raise

30d458b

Merge branch 'main' into list_raise

6047b82

# Conflicts: # dask_expr/_groupby.py # dask_expr/tests/test_groupby.py

Fixup

dc5bfea

phofl merged commit 02a449c into dask:main Dec 15, 2023
9 checks passed

phofl deleted the list_raise branch December 15, 2023 12:16

phofl restored the list_raise branch December 15, 2023 12:16

phofl deleted the list_raise branch December 15, 2023 15:59

phofl commented Dec 6, 2023

milesgranger left a comment

milesgranger Dec 11, 2023

phofl Dec 11, 2023

phofl Dec 13, 2023

crusaderky Dec 11, 2023

phofl Dec 12, 2023

crusaderky Dec 11, 2023

phofl Dec 12, 2023

crusaderky Dec 11, 2023

phofl Dec 12, 2023

phofl commented Dec 15, 2023
