Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Accumulator::evaluate and Accumulator::state to take &mut self #8925

Merged
merged 5 commits into from
Jan 24, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 21, 2024

Which issue does this PR close?

Closes #8934
Part of #5472

Rationale for this change

I want to make a fast COUNT DISTINCT accumulator for strings, and a key optimization is to avoid copying the strings more than once. However, the current signature of Accumulator makes this tough

What changes are included in this PR?

  1. Change Accumulator::evaluate and Accumulator::state to take &mut self
  2. Update all uses
  3. Fix median and distinct accumulators to avoid a copy

Alternate Considered: Accumulator::evaluate and Accumulator::state take self

Given the Accumulator is never used after calls to evaluate and merge a consuming API might make more sense -- and in fact I tried making these APIs consuming like this:

trait Accumulator {
...
    fn evaluate(self) -> Result<ScalarValue>;
...
}

However, the aggregator actually has Box<dyn Accumulator>:
https://github.com/apache/arrow-datafusion/blob/70ffb2fd3df7fbc744587ce3299ffb52c8ca6cbd/datafusion/physical-expr/src/aggregate/groups_accumulator/adapter.rs#L55-L63

I get a compiler error about about them not being sized when I try to call

let boxed_accumulator: Box<dyn Accumulator> = ...;
boxed_accumulator.evaluate()?

I think we could work around this, as described here by adding a evaluate_boxed and state_boxed variants, for example

trait Accumulator {
...
    fn evaluate(self) -> Result<ScalarValue>;
    fn evaluate_boxed(self: Box<Self>) -> Result<ScalarValue>;
...
}

However, I think that would just make the trait harder to use and &mut self is enough to implement zero copy semantics for those that want it

Are these changes tested?

Covered by existing tests

Are there any user-facing changes?

Yes, the signature of user defined Accumulators needs to be changed to take &mut self

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate substrait labels Jan 21, 2024
@alamb alamb added the api change Changes the API exposed to users of the crate label Jan 21, 2024
///
/// This function gets a `mut` accumulator to allow for the accumulator to
/// use an arrow compatible internal state when possible.
fn evaluate(&mut self) -> Result<ScalarValue>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core change in this PR -- make evaluate and state get a &mut reference so the Accumulator can avoid copies

This is technically possible today (with "inner mutability", for example a Mutex) but that is both not documented and quite unobvious

@@ -104,7 +104,7 @@ impl Accumulator for GeometricMean {
// This function serializes our state to `ScalarValue`, which DataFusion uses
// to pass this state between execution stages.
// Note that this can be arbitrary data.
fn state(&self) -> Result<Vec<ScalarValue>> {
fn state(&mut self) -> Result<Vec<ScalarValue>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of what DataFusion users have to do if this PR is merged -- they have to change the signatures to take mut

// TODO: evaluate could pass &mut self
let mut d = self.all_values.clone();
fn evaluate(&mut self) -> Result<ScalarValue> {
let mut d = std::mem::take(&mut self.all_values);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 here is an example showing why this API change is good -- you can avoid this copy

This change this might actually improve the performance of median for large inputs measurably (though I have not measured it)

@alamb alamb marked this pull request as ready for review January 21, 2024 12:15
@alamb
Copy link
Contributor Author

alamb commented Jan 22, 2024

While this is an API change I think it is a minimally invasive one and will not cause too much pain downstream, and thus I plan to merge it tomorrow Tue Jan 22 2024 unless others would like more chance to review

Thank you for the PR review @Dandandan

@alamb alamb merged commit 5d70c32 into apache:main Jan 24, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change Accumulator::evaluate and Accumulator::state to take &mut self
2 participants