-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change Accumulator::evaluate
and Accumulator::state
to take &mut self
#8925
Conversation
/// | ||
/// This function gets a `mut` accumulator to allow for the accumulator to | ||
/// use an arrow compatible internal state when possible. | ||
fn evaluate(&mut self) -> Result<ScalarValue>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the core change in this PR -- make evaluate
and state
get a &mut
reference so the Accumulator can avoid copies
This is technically possible today (with "inner mutability", for example a Mutex
) but that is both not documented and quite unobvious
@@ -104,7 +104,7 @@ impl Accumulator for GeometricMean { | |||
// This function serializes our state to `ScalarValue`, which DataFusion uses | |||
// to pass this state between execution stages. | |||
// Note that this can be arbitrary data. | |||
fn state(&self) -> Result<Vec<ScalarValue>> { | |||
fn state(&mut self) -> Result<Vec<ScalarValue>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example of what DataFusion users have to do if this PR is merged -- they have to change the signatures to take mut
// TODO: evaluate could pass &mut self | ||
let mut d = self.all_values.clone(); | ||
fn evaluate(&mut self) -> Result<ScalarValue> { | ||
let mut d = std::mem::take(&mut self.all_values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 here is an example showing why this API change is good -- you can avoid this copy
This change this might actually improve the performance of median for large inputs measurably (though I have not measured it)
While this is an API change I think it is a minimally invasive one and will not cause too much pain downstream, and thus I plan to merge it tomorrow Tue Jan 22 2024 unless others would like more chance to review Thank you for the PR review @Dandandan |
Which issue does this PR close?
Closes #8934
Part of #5472
Rationale for this change
I want to make a fast COUNT DISTINCT accumulator for strings, and a key optimization is to avoid copying the strings more than once. However, the current signature of
Accumulator
makes this toughWhat changes are included in this PR?
Accumulator::evaluate
andAccumulator::state
to take&mut self
median
anddistinct
accumulators to avoid a copyAlternate Considered:
Accumulator::evaluate
andAccumulator::state
takeself
Given the Accumulator is never used after calls to
evaluate
andmerge
a consuming API might make more sense -- and in fact I tried making these APIs consuming like this:However, the aggregator actually has
Box<dyn Accumulator>
:https://github.com/apache/arrow-datafusion/blob/70ffb2fd3df7fbc744587ce3299ffb52c8ca6cbd/datafusion/physical-expr/src/aggregate/groups_accumulator/adapter.rs#L55-L63
I get a compiler error about about them not being sized when I try to call
I think we could work around this, as described here by adding a
evaluate_boxed
andstate_boxed
variants, for exampleHowever, I think that would just make the trait harder to use and
&mut self
is enough to implement zero copy semantics for those that want itAre these changes tested?
Covered by existing tests
Are there any user-facing changes?
Yes, the signature of user defined Accumulators needs to be changed to take
&mut self