-
Notifications
You must be signed in to change notification settings - Fork 803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Allocator
type param to MutableBuffer
#6336
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
return; | ||
} | ||
|
||
#[cfg(feature = "allocator_api")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entire MutableBuffer
has been gated under allocator_api
. Do we still need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may misread this. Only those new public APIs are limited under the feature gate.
@@ -28,6 +31,23 @@ use crate::{ | |||
|
|||
use super::Buffer; | |||
|
|||
#[cfg(not(feature = "allocator_api"))] | |||
pub trait Allocator: private::Sealed {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first glance, it's somewhat confusing to define a sealed trait here. Maybe add some comments here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, please take a look
@@ -28,6 +31,23 @@ use crate::{ | |||
|
|||
use super::Buffer; | |||
|
|||
#[cfg(not(feature = "allocator_api"))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's a good idea to introduce allocator-api2 as a compatibility layer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your information. It looks like allocator-api2
provides many things we don't need. And I'm hesitant to add a new dep for the normal case. We can add it in the future if it turns out that allocator-api2
suits our use case well. What we need here is
- Placeholders to standard library things that are not available without unstable features (
Allocator
andGlobal
) - Wrapped
alloc
anddealloc
methods
As for now we can define them by ourselves in few lines.
Signed-off-by: Ruihang Xia <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your work. Overall, it looks good to me.
By the way, I still believe introducing allocator_api2
is a good idea, as it eliminates the need for a feature that requires nightly Rust.
The decision is up to you.
This PR also makes me wonder if we can introduce an allocator
for all similar APIs.
Thanks for your review @Xuanwo
I plan to add this for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @waynexia and @Xuanwo for the reviews
I think this idea is quite cool 🙏 However, in order to get it ready to merge a few more things are needed:
- Documentation (specifically document the new feature here https://crates.io/crates/arrow)
- Tests
In terms of testing, what I think would be the most useful would be an "end to end" test / example of how to use this feature. For example to create a MutableBuffer
with a custom allocator and report on the allocations performed or something.
Perhaps we could add such an example to the examples directory https://github.com/apache/arrow-rs/tree/master/arrow/examples and add a pointer to that that to the documentation?
@@ -61,8 +61,8 @@ jobs: | |||
submodules: true | |||
- name: Setup Rust toolchain | |||
uses: ./.github/actions/setup-builder | |||
- name: Test arrow-buffer with all features | |||
run: cargo test -p arrow-buffer --all-features | |||
- name: Test arrow-buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it confusing at first why this line doesn't follow the model of the rest of the tests in this workflow which run with --all-features
I am also concerned this may remove coverage for some of the other features such as prettyprint, ffi and pyarrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow-buffer
only has one feature gate allocator_api
from this PR. This specific change won't affect other tests like prettyprint
as the command -p arrow-buffer
only runs on this sub-crate. But I agree it's easy to forget when the new feature comes in the future... Sadly, cargo seems not to support opt-out one feature in CLI 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, the same problem comes to the main arrow
lib because I added a delegate feature allocator_api = ["arrow-buffer/allocator_api"]
. Looks like we have to enumerate all available features except allocator_api
in CI?
Signed-off-by: Ruihang Xia <[email protected]>
Thanks for reviewing @alamb. I've created a new example
I tried to use |
arrow-buffer/src/buffer/mutable.rs
Outdated
#[cfg(not(feature = "allocator_api"))] | ||
allocator: Global, | ||
#[cfg(feature = "allocator_api")] | ||
allocator: *value.allocator(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The From
impl would benefit from having a type parameter for the allocator, otherwise the allocator here would always implicitly be the global one. That probably requires two separate impl blocks depending on the feature flag.
The same would be useful for Buffer::from_vec(Vec)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will default to Global
without allocator_api
, and inherit from the vector via Vec::allocator()
API. Is this behavior ideal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got what you mean https://github.com/apache/arrow-rs/pull/6336/files#diff-371342744df1b634b0bd9d90f4fe38c1eb0096df322fd3cc2fbc513f3428046cR692-R696
We should have two impl blocks indeed for different type parameters
A minimal allocator implementation that tracks the memory usage shouldn't be too big to include directly in the example, without dependencies. |
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more |
I plan to look into this in more detail as it might be useful in our codebase, which makes good use of a custom allocator to track memory usage. A unit test with a custom allocator, to be run with miri, would be very nice. I have a concern that after turning the |
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Good catch 👍 I've limited those
New test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand this API correctly that if a user directly creates Arrays from a Vec then they could provide their own allocator, but otherwise the default allocator will be used?
I am thinking about a usecase where
- A user creates a new
Array
that uses a custom allocator - Then calls some kernel like
take
orfilter
that creates a newArray
It seems like the new Array
will use the default system allocator rather than the custom allocator which will limit the effectiveness of this approach
Would it be possible with this API to support changing the global allocator for all new Arrays that are created?
100, | ||
allocator_tracker.clone(), | ||
); | ||
let mut buffer = MutableBuffer::from(vector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very cool
Would it be possible to open that followup PR already in draft mode? I think it would be easier to review the impact of this PR if I could already see how it would affect the |
Yes, it will be
Still take I don't wish for the full support of
Here it is #6455, I only changed the |
I've not had time to look at this PR in detail, but my initial response is to echo the concern Andrew raised regarding the general utility of this abstraction. As I understand it this mechanism would require making every codepath that allocates arrays be generic on an Allocator and take an Allocator as a parameter. Even putting aside the fact that this would only be possible on nightly, this is a monumental amount of work effectively doubling the API surface. Taking a step back, memory allocation in arrow should rarely be on a hot path. As such having monomorphised access to the allocator implementation is perhaps not as important as for say a standard library collection type, we can afford to have a moderate additional overhead per allocation. Additionally the proposed use-cases appear to largely center around memory tracking, as opposed to altering the behaviour of the system allocator. As such we could potentially just store a If this sounds like an avenue worth exploring I can probably find time to write the idea up in more detail / file a PR with this |
IMO this is not that horrible in two aspects:
There is still much work to do to adapt those methods. But they are not that tightly coupled together like
If we don't need per-object tracking, simply replace the global allocator as @alamb mentioned before should be okay. And this is already viable in the current version as we don't need to do any change in the arrow side. And considering the arrow is a large group of libs, we don't have to take this |
This is the concern, by going down this path are we signing up for a long slog that we already suspect isn't going to ever be finished? I think we should approach this from first principles, specifically what are we trying to achieve, e.g. allow tracking per-query memory usage reliably, and then discuss and propose concrete designs to achieve this. As it stands this PR adds some functionality without really articulating a practical vision for how this might be used in a downstream library like datafusion to achieve a stated goal |
I agree. I think we should continue the big picture discussion on #3960 |
Alternate proposal - #6590 |
Which issue does this PR close?
Part of #3960
Rationale for this change
I find #3960 for the same reason -- tracking memory consumption with
Array
s. The most straightforward way is to use the unstableAllocator
API.By adding a decoration layer to
Allocator
we can keep track of the real-time memory consumption of each container instance. There is a tiny tool that implements this https://github.com/waynexia/unkaiWhat changes are included in this PR?
The most important changes in this PR are
allocator_api
, same name as the unstable rust featureA: Allocator = Global
toMutableBuffer
likeVec
Since this requires an unstable rust feature,
allocator_api
can only be used in the nightly toolchain. To keep this library still working in the stable toolchain when the feature gate is disabled, this PR defines many dummy structs likeGlobal
andAllocator
to substitute those from std lib as they are not referencable without unstable feature.When
allocator_api
is enabled, users can either appoint theirAllocator
by new APIsnew_in
andwith_capacity_in
or invokingFrom<Vec<T, A>>
which will inherit the allocator from vec.This PR is tested with MIRI:
Are there any user-facing changes?
New APIs as described above