-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement virtual arrays #3364
base: main
Are you sure you want to change the base?
Conversation
e76182b
to
3b9ad2f
Compare
…uffers working in a notebook
1eb5c3f
to
166fcb0
Compare
Oh I take that back actually, it may be best to do the opposite. Raise in error in the lookup and prompt the user to intentionally materialize. |
I like that! 👍 |
@pfackeldey @ianna I marked this as ready for review 😃 |
I'll try to get to this on Monday if it's still open! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is excellent!
From my side, this PR is ready to be merged 🎉
We have ideas for further improvements, especially related to the length
handling of VirtualArrays during construction. These can be followed up in a separate PR - given the size of this PR already.
I should add that the level of testing is very sophisticated here:
- You added >6k lines of tests alone for VirtualArrays
- You ran the whole (other) test-suite of awkward (for both cpu & cuda backend) against VirtualArrays by auto-wrapping any buffer at construction in a VirtualArray
- You ran the coffea adl-benchmarks with VirtualArrays
- You ran the AGC (for 1 chunk) with VirtualArrays
All of these tests passed, and also didn't show signs of over-materialization. I wanted to highlight @ikrommyd's testing effort here! 👍
I'll actually take a look tomorrow (I'm on leave today). Really looking forward to it! |
Analysis-like benchmarks repo: https://github.com/ikrommyd/coffea-virtual-array-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ikrommyd - Good job! I have a few questions, requests, and concerns. Please see the comments for details.
There are a few style changes in the code that are unrelated to the new feature. According to Google's best practices, it's recommended to keep modifications minimal and focused to improve code review efficiency. This helps reviewers concentrate on the new changes without unnecessary distractions.
To keep the Array API in line with the standard, I recommend removing tolist
from it to maintain consistency and adherence to best practices.
It looks like we have an extra call to materialize the arrays. What is the performance overhead of this additional operation? Could it be optimized or deferred to the user to control when materialization occurs?
@@ -44,6 +44,9 @@ def nbytes(self) -> ShapeItem: ... | |||
@abstractmethod | |||
def T(self) -> Self: ... | |||
|
|||
@abstractmethod | |||
def tolist(self) -> list: ... | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -35,10 +36,7 @@ def _nplike_concatenate_has_casting(module: Any) -> bool: | |||
return True | |||
|
|||
|
|||
ArrayLikeT = TypeVar("ArrayLikeT", bound=ArrayLike) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we going to loose lose specific subclass information without TypeVar
in functions that return ArrayLike
?
return self.zeros(x.shape, dtype=dtype or x.dtype) | ||
else: | ||
return self._module.zeros_like(x, dtype=dtype) | ||
return self._module.zeros_like(*materialize_if_virtual(x), dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is materialization needed here given that zeros_like should return an array of zeros with the same shape and type as a given array?
return self.ones(x.shape, dtype=dtype or x.dtype) | ||
else: | ||
return self._module.ones_like(x, dtype=dtype) | ||
return self._module.ones_like(*materialize_if_virtual(x), dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same question: is materialization needed?
return self.full(x.shape, fill_value, dtype=dtype or x.dtype) | ||
else: | ||
return self._module.full_like( | ||
x, self._module.array(fill_value), dtype=dtype | ||
*materialize_if_virtual(x), self._module.array(fill_value), dtype=dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same question: is materialization needed?
@@ -359,6 +373,8 @@ def _reconstitute( | |||
backend.index_nplike.max(index) + 1 | |||
) | |||
) | |||
# free memory | |||
free(index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it needed?
@@ -405,6 +421,8 @@ def _reconstitute( | |||
next_length = ( | |||
0 if len(starts) == 0 else backend.index_nplike.max(reduced_stops) | |||
) | |||
# free memory | |||
free(starts, stops) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it needed?
@@ -437,6 +455,8 @@ def _reconstitute( | |||
next_length = unknown_length | |||
else: | |||
next_length = 0 if len(offsets) == 1 else offsets[-1] | |||
# free memory | |||
free(offsets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it needed?
@@ -522,6 +542,8 @@ def _reconstitute( | |||
lengths.append(0) | |||
else: | |||
lengths.append(backend.index_nplike.max(selected_index) + 1) | |||
# free memory | |||
free(index, tags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it needed?
def free(*arrays: ArrayLike) -> None: | ||
""" | ||
Free any materialized resources associated with virtual array(s). | ||
""" | ||
for array in arrays: | ||
if isinstance(array, VirtualArray): | ||
array._array = UNMATERIALIZED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def free(*arrays: ArrayLike) -> None: | |
""" | |
Free any materialized resources associated with virtual array(s). | |
""" | |
for array in arrays: | |
if isinstance(array, VirtualArray): | |
array._array = UNMATERIALIZED | |
def free(*arrays: ArrayLike) -> None: | |
"""Free any materialized resources associated with virtual array(s).""" | |
for array in filter(lambda x: isinstance(x, VirtualArray), arrays): | |
array._array = UNMATERIALIZED |
Implement Virtual Buffers in Awkward Array
This PR introduces the
VirtualArray
class, which enables virtual buffers at the lowest level of an Awkward Array's structure. Awkward Arrays are fundamentally tree-like, with data stored in 1D array buffers (e.g., NumPy, CuPy, or JAX). TheVirtualArray
class acts as a virtual buffer, allowing deferred materialization of data when needed.Key Features
VirtualArray
requires a knowndtype
andshape
upon instantiation, along with a generator function responsible for producing the actual buffer data. This generator function is designed for use cases where materialization is expensive, such as disk reads.VirtualArray
is tied to a specificnplike
(NumPy or CuPy). Switching between differentnplike
backends is not allowed unless the array is first materialized.Adjustments to the Codebase
array_module
ensure that certain operations remain virtual when possible while others materialize data when necessary.ak.to_*
are updated to ensure that virtual arrays are materialized before conversion.This should be reviewed in combination with ikrommyd#1 which runs the whole awkward ci by creating a virtual array inside every
NumpyArray
and everyIndex
and materializing them on the spot and also in combination with this branch: https://github.com/ikrommyd/awkward/tree/test-virtual-arrays-without-materialize where most of the tests pass without even needing to materialize inside theNumpyArray
. You can compare this branch with the branch from ikrommyd#1 for the changes.