feat: implement virtual arrays #3364

ikrommyd · 2025-01-09T21:48:09Z

Implement Virtual Buffers in Awkward Array

This PR introduces the VirtualArray class, which enables virtual buffers at the lowest level of an Awkward Array's structure. Awkward Arrays are fundamentally tree-like, with data stored in 1D array buffers (e.g., NumPy, CuPy, or JAX). The VirtualArray class acts as a virtual buffer, allowing deferred materialization of data when needed.

Key Features

Lazy Data Generation: A VirtualArray requires a known dtype and shape upon instantiation, along with a generator function responsible for producing the actual buffer data. This generator function is designed for use cases where materialization is expensive, such as disk reads.
NPLike Consistency: Each VirtualArray is tied to a specific nplike (NumPy or CuPy). Switching between different nplike backends is not allowed unless the array is first materialized.
Optimized Materialization: Some operations, such as trivial slicing, can be performed without triggering materialization, preserving the virtual nature of the array. Other operations that inherently require data access will trigger materialization on demand.

Adjustments to the Codebase

Array Operations: Updates to array_module ensure that certain operations remain virtual when possible while others materialize data when necessary.
Kernel Execution: Awkward Array kernels now enforce materialization before processing virtual arrays.
Conversion Functions: Functions of the form ak.to_* are updated to ensure that virtual arrays are materialized before conversion.
Layout Enhancements: Additional helper methods are introduced in each layout to efficiently check whether an array contains any virtual buffers.

This should be reviewed in combination with ikrommyd#1 which runs the whole awkward ci by creating a virtual array inside every NumpyArray and every Index and materializing them on the spot and also in combination with this branch: https://github.com/ikrommyd/awkward/tree/test-virtual-arrays-without-materialize where most of the tests pass without even needing to materialize inside the NumpyArray. You can compare this branch with the branch from ikrommyd#1 for the changes.

…level

…uffers working in a notebook

ikrommyd · 2025-02-21T04:13:48Z

Oh I take that back actually, it may be best to do the opposite. Raise in error in the lookup and prompt the user to intentionally materialize.

…uilder kernel

pfackeldey · 2025-02-21T12:56:50Z

Oh I take that back actually, it may be best to do the opposite. Raise in error in the lookup and prompt the user to intentionally materialize.

I like that! 👍

ikrommyd · 2025-02-28T22:58:34Z

@pfackeldey @ianna I marked this as ready for review 😃

agoose77 · 2025-02-28T23:01:20Z

I'll try to get to this on Monday if it's still open!

pfackeldey

This is excellent!
From my side, this PR is ready to be merged 🎉

We have ideas for further improvements, especially related to the length handling of VirtualArrays during construction. These can be followed up in a separate PR - given the size of this PR already.

I should add that the level of testing is very sophisticated here:

You added >6k lines of tests alone for VirtualArrays
You ran the whole (other) test-suite of awkward (for both cpu & cuda backend) against VirtualArrays by auto-wrapping any buffer at construction in a VirtualArray
You ran the coffea adl-benchmarks with VirtualArrays
You ran the AGC (for 1 chunk) with VirtualArrays

All of these tests passed, and also didn't show signs of over-materialization. I wanted to highlight @ikrommyd's testing effort here! 👍

agoose77 · 2025-03-03T14:32:40Z

I'll actually take a look tomorrow (I'm on leave today). Really looking forward to it!

ikrommyd · 2025-03-03T18:39:21Z

Analysis-like benchmarks repo: https://github.com/ikrommyd/coffea-virtual-array-tests

ianna

@ikrommyd - Good job! I have a few questions, requests, and concerns. Please see the comments for details.

There are a few style changes in the code that are unrelated to the new feature. According to Google's best practices, it's recommended to keep modifications minimal and focused to improve code review efficiency. This helps reviewers concentrate on the new changes without unnecessary distractions.

To keep the Array API in line with the standard, I recommend removing tolist from it to maintain consistency and adherence to best practices.

It looks like we have an extra call to materialize the arrays. What is the performance overhead of this additional operation? Could it be optimized or deferred to the user to control when materialization occurs?

ianna · 2025-03-03T15:18:50Z

src/awkward/_nplikes/array_like.py

@@ -44,6 +44,9 @@ def nbytes(self) -> ShapeItem: ...
    @abstractmethod
    def T(self) -> Self: ...

+    @abstractmethod
+    def tolist(self) -> list: ...
+


@ikrommyd - IMHO, we should stick to the Array API here. I think, we already have functions to convert an array to list. Why is it needed here? Thanks.

ianna · 2025-03-03T15:37:27Z

src/awkward/_nplikes/array_module.py

@@ -35,10 +36,7 @@ def _nplike_concatenate_has_casting(module: Any) -> bool:
        return True


-ArrayLikeT = TypeVar("ArrayLikeT", bound=ArrayLike)


Aren't we going to loose lose specific subclass information without TypeVar in functions that return ArrayLike?

ianna · 2025-03-03T16:24:50Z

src/awkward/_nplikes/array_module.py

            return self.zeros(x.shape, dtype=dtype or x.dtype)
        else:
-            return self._module.zeros_like(x, dtype=dtype)
+            return self._module.zeros_like(*materialize_if_virtual(x), dtype=dtype)


is materialization needed here given that zeros_like should return an array of zeros with the same shape and type as a given array?

ianna · 2025-03-03T16:25:16Z

src/awkward/_nplikes/array_module.py

            return self.ones(x.shape, dtype=dtype or x.dtype)
        else:
-            return self._module.ones_like(x, dtype=dtype)
+            return self._module.ones_like(*materialize_if_virtual(x), dtype=dtype)


the same question: is materialization needed?

ianna · 2025-03-03T16:26:38Z

src/awkward/_nplikes/array_module.py

            return self.full(x.shape, fill_value, dtype=dtype or x.dtype)
        else:
            return self._module.full_like(
-                x, self._module.array(fill_value), dtype=dtype
+                *materialize_if_virtual(x), self._module.array(fill_value), dtype=dtype


the same question: is materialization needed?

ianna · 2025-03-04T08:09:26Z