Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement groupby multi level support #486

Merged
merged 10 commits into from
Dec 15, 2023
Merged

Implement groupby multi level support #486

merged 10 commits into from
Dec 15, 2023

Conversation

phofl
Copy link
Collaborator

@phofl phofl commented Dec 6, 2023

No description provided.

Copy link
Collaborator

@milesgranger milesgranger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with this, but the comment on repeated/similar code is maybe a by product of the current design?

Comment on lines 424 to 441
@functools.cached_property
def _by_meta(self):
if isinstance(self.by, Expr):
return meta_nonempty(self.by._meta)
elif is_scalar(self.by):
return self.by
else:
return [
meta_nonempty(x._meta) if isinstance(x, Expr) else x for x in self.by
]

@functools.cached_property
def _by_columns(self):
if isinstance(self.by, Expr):
return []
else:
return [x for x in self.by if not isinstance(x, Expr)]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels not great to have identical and/or very similar code (GroupByReduction/GroupByApplyConcatApply._by_columns are exactly the same, and GroupByReduction/GroupByApplyConcatApply._by_meta are nearly the same) between classes. Do you think the logic can be combined into ApplyConcatApply?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's not groupby specific, so that's not the place for this logic.

The groupby structure needs a refactor anyway to make this more consistent but that's something for a follow up if the actual implementation is ironed out

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote parts of the implementation, still not happy with it, but less duplicated code now

@@ -860,6 +907,22 @@ def _extract_meta(x, nonempty=False):
###


def _validate_by_expr(obj, by):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you think of a better name, and possibly add some docstring?
Elsewhere in dask, "validate" functions typically contain a bunch of asserts and return None.
This function seems to extract sometimes a column name, sometimes an Expr, sometimes something else.
Maybe "clean_by_expr" or "preprocess_by_expr"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_clean_by_expr seems fine

if not are_co_aligned(obj.expr, by.expr):
raise ValueError("by must be in the DataFrames columns.")
return by.expr
return by
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment explaining what use cases are not collected by any of the above switches?
I gather from elsewhere that Expr is a possible use case; are there others?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added,

it can be a proper column name, e.g. by="a"

self.by = by.expr
else:
self.by = [by] if np.isscalar(by) else list(by)
self.by = [by] if np.isscalar(by) or isinstance(by, Expr) else list(by)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everywhere else you added code branches that allow the by attribute to be either a scalar (Expr or otherwise) or a list. This however says that by is always a list?

I much prefer the latter. Test coverage for all these new code branches is quite spotty - something that coercing everything into a list as soon as it's acquired from the user would prevent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, that was an oversight on my part. It will always be a list after this pr is in

@phofl
Copy link
Collaborator Author

phofl commented Dec 15, 2023

merging this, it blocks follow ups

@phofl phofl merged commit 02a449c into dask:main Dec 15, 2023
9 checks passed
@phofl phofl deleted the list_raise branch December 15, 2023 12:16
@phofl phofl restored the list_raise branch December 15, 2023 12:16
@phofl phofl deleted the list_raise branch December 15, 2023 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants