Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutable OffsetBufferBuilder #5440

Closed
wants to merge 13 commits into from
Closed

Conversation

doki23
Copy link
Contributor

@doki23 doki23 commented Feb 27, 2024

Which issue does this PR close?

Closes #5384.

Rationale for this change

See #5384

What changes are included in this PR?

Add a mutable OffsetsBuilder

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 27, 2024

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OffsetsBuilder {
offsets: Vec<usize>,
Copy link
Contributor

@kylebarron kylebarron Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this builds a Vec<usize>, that means it's not O(1) to an OffsetBuffer? Should this be Vec<O> instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. OffsetSizeTrait is proper but it's in crate arrow-array. If we want to bound O with OffsetSizeTrait we need move it into arrow-buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having O: ArrowNativeType is fine, but you should store a Vec<O>, not Vec<usize>, and only use usize in the API. That also means you get overflow errors when pushing new elements, not on finish.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OffsetSizeTrait extends std::ops::AddAssign so that we can get next offset by offset + length.

Copy link
Contributor

@kylebarron kylebarron Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need AddAssign... You can copy to a new variable and use plain std::ops::Add right?

@doki23 doki23 marked this pull request as ready for review March 4, 2024 04:03

/// takes the builder itself and returns an [`OffsetBuffer`]
pub fn finish(self) -> OffsetBuffer<O> {
OffsetBuffer::new(ScalarBuffer::from(self.offsets))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written this will validate the offsets again

#[inline]
pub fn push_length(&mut self, length: O) {
let last_offset = self.offsets.last().unwrap();
let next_offset = *last_offset + length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be checked arithmetic if we plan to elide the check on finish

@tustvold
Copy link
Contributor

tustvold commented Mar 4, 2024

So my understanding is there are two potential motivations for this functionality:

  1. To provide a more idiomatic API for constructing an OffsetBuffer
  2. To provide a more efficient API for constructing an OffsetBuffer

Currently this PR I believe addresses 1, but not really 2. I would perhaps suggest taking a look at the way OffsetBuffer::from_lengths is implemented. In particular, tracking the in-progress size using usize, incrementing it using checked arithmetic, and converting to O with usize_as. Then at the end, i.e. in OffsetsBuilder::finish, checking if the value of usize overflows O. I'd also be very interested in any benchmarks of this approach.

Finally I think I would reduce the number of methods we provide to just the bare minimum, it is always easy to add methods in future, but removing methods is hard, so we should only commit to methods with clear use-cases


/// push a length into the builder.
#[inline]
pub fn push_length(&mut self, length: O) {
Copy link
Contributor

@tustvold tustvold Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend only providing a single push_length method that takes a usize, and panics on overflow. I can't think of many use-cases for appending a length specified in O, as offsets are typically derived from slices which have a length measured in usize

}

/// get an iterator of lengths from builder's underlying offsets
pub fn lengths(&self) -> impl IntoIterator<Item = O> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use-case for this?

/// create from offsets
/// caller guarantees that offsets are monotonically increasing values
#[inline]
pub fn new_unchecked(offsets: Vec<O>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn new_unchecked(offsets: Vec<O>) -> Self {
pub unsafe fn new_unchecked(offsets: Vec<O>) -> Self {

Presuming we change to using unchecked in finish

}
}

impl<IntoIter: IntoIterator<Item = O>, O: ArrowNativeType + Add<Output = O> + Sub<Output = O>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Extend or FromIterator would be more idiomatic, but I'm not sure of the use-case for this tbh

use crate::{ArrowNativeType, OffsetBuffer, ScalarBuffer};

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OffsetsBuilder<O: ArrowNativeType> {
Copy link
Contributor

@tustvold tustvold Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub struct OffsetsBuilder<O: ArrowNativeType> {
pub struct OffsetBufferBuilder<O: ArrowNativeType> {

For consistency with the other naming

offsets: Vec<O>,
}

/// builder for [`OffsetBuffer`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// builder for [`OffsetBuffer`]
/// Builder for [`OffsetBuffer`]

Can we capitalize the first word of doc comments please

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the APIs in terms of O, they're confusing imo and given the intended use with slices which have usize length, I think they're hard to use

I also wonder if you saw this

In particular, tracking the in-progress size using usize, incrementing it using checked arithmetic, and converting to O with usize_as. Then at the end, i.e. in OffsetsBuilder::finish, checking if the value of usize overflows O. I'd also be very interested in any benchmarks of this approach.

}

/// Push a length of usize type into builder
pub fn push_length(&mut self, length: O) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn push_length(&mut self, length: O) {
pub fn push_length(&mut self, length: usize) {


/// Extend from another OffsetsBuilder.
/// It gets a lengths iterator from another builder and extend the builder with the iter.
pub fn extend_with_builder(&mut self, another: OffsetBufferBuilder<O>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this method, I'm not sure of the use-case for it

/// It gets a lengths iterator from another builder and extend the builder with the iter.
pub fn extend_with_builder(&mut self, another: OffsetBufferBuilder<O>) {
self.reserve(another.len() - 1);
let mut last_offset = another.offsets[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw windows(2) is a nicer pattern for this

@doki23
Copy link
Contributor Author

doki23 commented Mar 15, 2024

Can we remove the APIs in terms of O, they're confusing imo and given the intended use with slices which have usize length, I think they're hard to use

I also wonder if you saw this

In particular, tracking the in-progress size using usize, incrementing it using checked arithmetic, and converting to O with usize_as. Then at the end, i.e. in OffsetsBuilder::finish, checking if the value of usize overflows O. I'd also be very interested in any benchmarks of this approach.

I get offset by adding last offset with length, I have to convert length to O. A push_length(&mut self, length: O) helps reduce reduplicate code. @tustvold

arrow-buffer/src/builder/offset.rs Outdated Show resolved Hide resolved
arrow-buffer/src/builder/offset.rs Outdated Show resolved Hide resolved
arrow-buffer/src/lib.rs Outdated Show resolved Hide resolved
arrow-buffer/src/builder/offset.rs Outdated Show resolved Hide resolved
arrow-buffer/src/builder/offset.rs Outdated Show resolved Hide resolved
@doki23
Copy link
Contributor Author

doki23 commented Mar 16, 2024

The extend function doesn't return a Result type, so I add a try_extend method. PTAL @tustvold

@doki23 doki23 changed the title Mutable OffsetsBuilder Mutable OffsetBufferBuilder Mar 16, 2024
@tustvold
Copy link
Contributor

I think we're getting our wires crossed slightly, I'll get a PR up showing what I'm looking for in the next couple of days

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a mutable builder for Offsets
3 participants