Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement virtual arrays #3364

Open
wants to merge 101 commits into
base: main
Choose a base branch
from

Conversation

ikrommyd
Copy link
Collaborator

@ikrommyd ikrommyd commented Jan 9, 2025

Implement Virtual Buffers in Awkward Array

This PR introduces the VirtualArray class, which enables virtual buffers at the lowest level of an Awkward Array's structure. Awkward Arrays are fundamentally tree-like, with data stored in 1D array buffers (e.g., NumPy, CuPy, or JAX). The VirtualArray class acts as a virtual buffer, allowing deferred materialization of data when needed.

Key Features

  • Lazy Data Generation: A VirtualArray requires a known dtype and shape upon instantiation, along with a generator function responsible for producing the actual buffer data. This generator function is designed for use cases where materialization is expensive, such as disk reads.
  • NPLike Consistency: Each VirtualArray is tied to a specific nplike (NumPy or CuPy). Switching between different nplike backends is not allowed unless the array is first materialized.
  • Optimized Materialization: Some operations, such as trivial slicing, can be performed without triggering materialization, preserving the virtual nature of the array. Other operations that inherently require data access will trigger materialization on demand.

Adjustments to the Codebase

  • Array Operations: Updates to array_module ensure that certain operations remain virtual when possible while others materialize data when necessary.
  • Kernel Execution: Awkward Array kernels now enforce materialization before processing virtual arrays.
  • Conversion Functions: Functions of the form ak.to_* are updated to ensure that virtual arrays are materialized before conversion.
  • Layout Enhancements: Additional helper methods are introduced in each layout to efficiently check whether an array contains any virtual buffers.

This should be reviewed in combination with ikrommyd#1 which runs the whole awkward ci by creating a virtual array inside every NumpyArray and every Index and materializing them on the spot and also in combination with this branch: https://github.com/ikrommyd/awkward/tree/test-virtual-arrays-without-materialize where most of the tests pass without even needing to materialize inside the NumpyArray. You can compare this branch with the branch from ikrommyd#1 for the changes.

@ikrommyd ikrommyd marked this pull request as draft January 9, 2025 21:48
@ikrommyd ikrommyd force-pushed the virtual-arrays branch 3 times, most recently from e76182b to 3b9ad2f Compare January 21, 2025 13:41
@ikrommyd ikrommyd requested a review from pfackeldey February 21, 2025 04:11
@ikrommyd
Copy link
Collaborator Author

Oh I take that back actually, it may be best to do the opposite. Raise in error in the lookup and prompt the user to intentionally materialize.

@pfackeldey
Copy link
Collaborator

Oh I take that back actually, it may be best to do the opposite. Raise in error in the lookup and prompt the user to intentionally materialize.

I like that! 👍

@ikrommyd ikrommyd marked this pull request as ready for review February 28, 2025 22:58
@ikrommyd
Copy link
Collaborator Author

@pfackeldey @ianna I marked this as ready for review 😃

@agoose77
Copy link
Collaborator

I'll try to get to this on Monday if it's still open!

Copy link
Collaborator

@pfackeldey pfackeldey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent!
From my side, this PR is ready to be merged 🎉

We have ideas for further improvements, especially related to the length handling of VirtualArrays during construction. These can be followed up in a separate PR - given the size of this PR already.

I should add that the level of testing is very sophisticated here:

  • You added >6k lines of tests alone for VirtualArrays
  • You ran the whole (other) test-suite of awkward (for both cpu & cuda backend) against VirtualArrays by auto-wrapping any buffer at construction in a VirtualArray
  • You ran the coffea adl-benchmarks with VirtualArrays
  • You ran the AGC (for 1 chunk) with VirtualArrays

All of these tests passed, and also didn't show signs of over-materialization. I wanted to highlight @ikrommyd's testing effort here! 👍

@agoose77
Copy link
Collaborator

agoose77 commented Mar 3, 2025

I'll actually take a look tomorrow (I'm on leave today). Really looking forward to it!

@ikrommyd
Copy link
Collaborator Author

ikrommyd commented Mar 3, 2025

Analysis-like benchmarks repo: https://github.com/ikrommyd/coffea-virtual-array-tests

Copy link
Collaborator

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikrommyd - Good job! I have a few questions, requests, and concerns. Please see the comments for details.

There are a few style changes in the code that are unrelated to the new feature. According to Google's best practices, it's recommended to keep modifications minimal and focused to improve code review efficiency. This helps reviewers concentrate on the new changes without unnecessary distractions.

To keep the Array API in line with the standard, I recommend removing tolist from it to maintain consistency and adherence to best practices.

It looks like we have an extra call to materialize the arrays. What is the performance overhead of this additional operation? Could it be optimized or deferred to the user to control when materialization occurs?

@@ -44,6 +44,9 @@ def nbytes(self) -> ShapeItem: ...
@abstractmethod
def T(self) -> Self: ...

@abstractmethod
def tolist(self) -> list: ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikrommyd - IMHO, we should stick to the Array API here. I think, we already have functions to convert an array to list. Why is it needed here? Thanks.

@@ -35,10 +36,7 @@ def _nplike_concatenate_has_casting(module: Any) -> bool:
return True


ArrayLikeT = TypeVar("ArrayLikeT", bound=ArrayLike)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we going to loose lose specific subclass information without TypeVar in functions that return ArrayLike?

return self.zeros(x.shape, dtype=dtype or x.dtype)
else:
return self._module.zeros_like(x, dtype=dtype)
return self._module.zeros_like(*materialize_if_virtual(x), dtype=dtype)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is materialization needed here given that zeros_like should return an array of zeros with the same shape and type as a given array?

return self.ones(x.shape, dtype=dtype or x.dtype)
else:
return self._module.ones_like(x, dtype=dtype)
return self._module.ones_like(*materialize_if_virtual(x), dtype=dtype)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same question: is materialization needed?

return self.full(x.shape, fill_value, dtype=dtype or x.dtype)
else:
return self._module.full_like(
x, self._module.array(fill_value), dtype=dtype
*materialize_if_virtual(x), self._module.array(fill_value), dtype=dtype
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same question: is materialization needed?

@@ -359,6 +373,8 @@ def _reconstitute(
backend.index_nplike.max(index) + 1
)
)
# free memory
free(index)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it needed?

@@ -405,6 +421,8 @@ def _reconstitute(
next_length = (
0 if len(starts) == 0 else backend.index_nplike.max(reduced_stops)
)
# free memory
free(starts, stops)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it needed?

@@ -437,6 +455,8 @@ def _reconstitute(
next_length = unknown_length
else:
next_length = 0 if len(offsets) == 1 else offsets[-1]
# free memory
free(offsets)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it needed?

@@ -522,6 +542,8 @@ def _reconstitute(
lengths.append(0)
else:
lengths.append(backend.index_nplike.max(selected_index) + 1)
# free memory
free(index, tags)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it needed?

Comment on lines +575 to +581
def free(*arrays: ArrayLike) -> None:
"""
Free any materialized resources associated with virtual array(s).
"""
for array in arrays:
if isinstance(array, VirtualArray):
array._array = UNMATERIALIZED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def free(*arrays: ArrayLike) -> None:
"""
Free any materialized resources associated with virtual array(s).
"""
for array in arrays:
if isinstance(array, VirtualArray):
array._array = UNMATERIALIZED
def free(*arrays: ArrayLike) -> None:
"""Free any materialized resources associated with virtual array(s)."""
for array in filter(lambda x: isinstance(x, VirtualArray), arrays):
array._array = UNMATERIALIZED

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants