-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement groupby multi level support #486
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with this, but the comment on repeated/similar code is maybe a by product of the current design?
dask_expr/_groupby.py
Outdated
@functools.cached_property | ||
def _by_meta(self): | ||
if isinstance(self.by, Expr): | ||
return meta_nonempty(self.by._meta) | ||
elif is_scalar(self.by): | ||
return self.by | ||
else: | ||
return [ | ||
meta_nonempty(x._meta) if isinstance(x, Expr) else x for x in self.by | ||
] | ||
|
||
@functools.cached_property | ||
def _by_columns(self): | ||
if isinstance(self.by, Expr): | ||
return [] | ||
else: | ||
return [x for x in self.by if not isinstance(x, Expr)] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels not great to have identical and/or very similar code (GroupByReduction/GroupByApplyConcatApply._by_columns are exactly the same, and GroupByReduction/GroupByApplyConcatApply._by_meta are nearly the same) between classes. Do you think the logic can be combined into ApplyConcatApply
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's not groupby specific, so that's not the place for this logic.
The groupby structure needs a refactor anyway to make this more consistent but that's something for a follow up if the actual implementation is ironed out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote parts of the implementation, still not happy with it, but less duplicated code now
dask_expr/_groupby.py
Outdated
@@ -860,6 +907,22 @@ def _extract_meta(x, nonempty=False): | |||
### | |||
|
|||
|
|||
def _validate_by_expr(obj, by): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you think of a better name, and possibly add some docstring?
Elsewhere in dask, "validate" functions typically contain a bunch of asserts and return None.
This function seems to extract sometimes a column name, sometimes an Expr, sometimes something else.
Maybe "clean_by_expr" or "preprocess_by_expr"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_clean_by_expr
seems fine
if not are_co_aligned(obj.expr, by.expr): | ||
raise ValueError("by must be in the DataFrames columns.") | ||
return by.expr | ||
return by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment explaining what use cases are not collected by any of the above switches?
I gather from elsewhere that Expr is a possible use case; are there others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added,
it can be a proper column name, e.g. by="a"
self.by = by.expr | ||
else: | ||
self.by = [by] if np.isscalar(by) else list(by) | ||
self.by = [by] if np.isscalar(by) or isinstance(by, Expr) else list(by) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everywhere else you added code branches that allow the by attribute to be either a scalar (Expr or otherwise) or a list. This however says that by is always a list?
I much prefer the latter. Test coverage for all these new code branches is quite spotty - something that coercing everything into a list as soon as it's acquired from the user would prevent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point, that was an oversight on my part. It will always be a list after this pr is in
# Conflicts: # dask_expr/_groupby.py # dask_expr/tests/test_groupby.py
merging this, it blocks follow ups |
No description provided.