-
Notifications
You must be signed in to change notification settings - Fork 819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mutable OffsetBufferBuilder #5440
Conversation
arrow-buffer/src/builder/offset.rs
Outdated
|
||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct OffsetsBuilder { | ||
offsets: Vec<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this builds a Vec<usize>
, that means it's not O(1)
to an OffsetBuffer
? Should this be Vec<O>
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. OffsetSizeTrait
is proper but it's in crate arrow-array
. If we want to bound O with OffsetSizeTrait
we need move it into arrow-buffer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having O: ArrowNativeType
is fine, but you should store a Vec<O>
, not Vec<usize>
, and only use usize in the API. That also means you get overflow errors when pushing new elements, not on finish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OffsetSizeTrait
extends std::ops::AddAssign
so that we can get next offset by offset + length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need AddAssign... You can copy to a new variable and use plain std::ops::Add
right?
arrow-buffer/src/builder/offset.rs
Outdated
|
||
/// takes the builder itself and returns an [`OffsetBuffer`] | ||
pub fn finish(self) -> OffsetBuffer<O> { | ||
OffsetBuffer::new(ScalarBuffer::from(self.offsets)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written this will validate the offsets again
arrow-buffer/src/builder/offset.rs
Outdated
#[inline] | ||
pub fn push_length(&mut self, length: O) { | ||
let last_offset = self.offsets.last().unwrap(); | ||
let next_offset = *last_offset + length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be checked arithmetic if we plan to elide the check on finish
So my understanding is there are two potential motivations for this functionality:
Currently this PR I believe addresses 1, but not really 2. I would perhaps suggest taking a look at the way Finally I think I would reduce the number of methods we provide to just the bare minimum, it is always easy to add methods in future, but removing methods is hard, so we should only commit to methods with clear use-cases |
arrow-buffer/src/builder/offset.rs
Outdated
|
||
/// push a length into the builder. | ||
#[inline] | ||
pub fn push_length(&mut self, length: O) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend only providing a single push_length
method that takes a usize
, and panics on overflow. I can't think of many use-cases for appending a length specified in O
, as offsets are typically derived from slices which have a length measured in usize
arrow-buffer/src/builder/offset.rs
Outdated
} | ||
|
||
/// get an iterator of lengths from builder's underlying offsets | ||
pub fn lengths(&self) -> impl IntoIterator<Item = O> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the use-case for this?
arrow-buffer/src/builder/offset.rs
Outdated
/// create from offsets | ||
/// caller guarantees that offsets are monotonically increasing values | ||
#[inline] | ||
pub fn new_unchecked(offsets: Vec<O>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn new_unchecked(offsets: Vec<O>) -> Self { | |
pub unsafe fn new_unchecked(offsets: Vec<O>) -> Self { |
Presuming we change to using unchecked in finish
arrow-buffer/src/builder/offset.rs
Outdated
} | ||
} | ||
|
||
impl<IntoIter: IntoIterator<Item = O>, O: ArrowNativeType + Add<Output = O> + Sub<Output = O>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Extend
or FromIterator
would be more idiomatic, but I'm not sure of the use-case for this tbh
arrow-buffer/src/builder/offset.rs
Outdated
use crate::{ArrowNativeType, OffsetBuffer, ScalarBuffer}; | ||
|
||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct OffsetsBuilder<O: ArrowNativeType> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub struct OffsetsBuilder<O: ArrowNativeType> { | |
pub struct OffsetBufferBuilder<O: ArrowNativeType> { |
For consistency with the other naming
arrow-buffer/src/builder/offset.rs
Outdated
offsets: Vec<O>, | ||
} | ||
|
||
/// builder for [`OffsetBuffer`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// builder for [`OffsetBuffer`] | |
/// Builder for [`OffsetBuffer`] |
Can we capitalize the first word of doc comments please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the APIs in terms of O
, they're confusing imo and given the intended use with slices which have usize
length, I think they're hard to use
I also wonder if you saw this
In particular, tracking the in-progress size using usize, incrementing it using checked arithmetic, and converting to O with usize_as. Then at the end, i.e. in OffsetsBuilder::finish, checking if the value of usize overflows O. I'd also be very interested in any benchmarks of this approach.
arrow-buffer/src/builder/offset.rs
Outdated
} | ||
|
||
/// Push a length of usize type into builder | ||
pub fn push_length(&mut self, length: O) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn push_length(&mut self, length: O) { | |
pub fn push_length(&mut self, length: usize) { |
arrow-buffer/src/builder/offset.rs
Outdated
|
||
/// Extend from another OffsetsBuilder. | ||
/// It gets a lengths iterator from another builder and extend the builder with the iter. | ||
pub fn extend_with_builder(&mut self, another: OffsetBufferBuilder<O>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this method, I'm not sure of the use-case for it
arrow-buffer/src/builder/offset.rs
Outdated
/// It gets a lengths iterator from another builder and extend the builder with the iter. | ||
pub fn extend_with_builder(&mut self, another: OffsetBufferBuilder<O>) { | ||
self.reserve(another.len() - 1); | ||
let mut last_offset = another.offsets[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw windows(2)
is a nicer pattern for this
I get offset by adding last offset with length, I have to convert length to |
The |
I think we're getting our wires crossed slightly, I'll get a PR up showing what I'm looking for in the next couple of days |
Which issue does this PR close?
Closes #5384.
Rationale for this change
See #5384
What changes are included in this PR?
Add a mutable OffsetsBuilder
Are there any user-facing changes?
No