Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a mutable builder for Offsets #5384

Closed
alamb opened this issue Feb 11, 2024 · 8 comments · Fixed by #5532
Closed

Add a mutable builder for Offsets #5384

alamb opened this issue Feb 11, 2024 · 8 comments · Fixed by #5532
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Feb 11, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of a discussion in Discord: https://discord.com/channels/885562378132000778/885562378132000781/1205552581422878730

@kylebarron notes that arrow-rs has an immutable OffsetBuffer but no mutable builder counterpart.

arrow2 had a mutable builder called Offsets to help building up the offsets, with helpers like try_push_usize that would enforce the monotonically increasing invariant

As part of his arrow2 -> arrow-rs conversion he vendored that builder in my own code https://github.com/geoarrow/geoarrow-rs/blob/011e77736ac25c55141608c135852c1f5d3198dd/src/array/offset_builder.rs

Describe the solution you'd like
Consider upstreaming OffsetBuilder into arrow-rs

Describe alternatives you've considered
We can leave the current situation alone.

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Feb 11, 2024
@alamb
Copy link
Contributor Author

alamb commented Feb 11, 2024

FWIW I built the offsets directly via Vec<O> while working on a native string grouping accumulator in apache/datafusion#8827 but having an Offsets builder would likely have made the code more convenient and easier to follow.

@doki23
Copy link
Contributor

doki23 commented Feb 19, 2024

Maybe we need a PR which makes the trait ArrowNativeType extends std::ops::Add first, so that we can get next offset by offset + length when implement try_push_length.

@alamb
Copy link
Contributor Author

alamb commented Feb 19, 2024

Maybe we need a PR which makes the trait ArrowNativeType extends std::ops::Add first, so that we can get next offset by offset + length when implement try_push_length.

Maybe @kylebarron has some thoughts. I think @tustvold is out for the next week or two so his responses may be delayed)

@alamb
Copy link
Contributor Author

alamb commented Feb 19, 2024

The code in geoarrow seems to use to from_usize / to_usize to convert to/from usize and O:

https://github.com/geoarrow/geoarrow-rs/blob/011e77736ac25c55141608c135852c1f5d3198dd/src/array/offset_builder.rs#L122-L130

@doki23
Copy link
Contributor

doki23 commented Feb 20, 2024

The code in geoarrow seems to use to from_usize / to_usize to convert to/from usize and O

As an arg of the function try_push_usize, length is usize so that we can convert it to usize. But it doesn't work with try_push.

I notice that geoarrow-rs uses OffsetSizeTrait to constrain generic O which implements std::ops::AddAssign. Given that we should implement OffsetBuilder in arrow_buffer, we could not directly use it that comes from arrow_array.

@tustvold
Copy link
Contributor

tustvold commented Feb 20, 2024

Given OffsetBuffer is used to track offsets within what is effectively a Vec<usize>, and the offets must be monotonically increasing, I would recommend consistently using usize in the interface. This is not only simpler, but also easier to use.

@doki23
Copy link
Contributor

doki23 commented Feb 28, 2024

As the discussion, Vec<usize> causes the conversion from builder to buffer to require O(n) complexity. Maybe we should move OffsetSizeTrait to crate arorw-buffer and use Vec<O> where O: OffsetSizeTrait

@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow'} from #5440

@tustvold tustvold added the arrow Changes to the arrow crate label Apr 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants