Adds partial cast support for run-end encoded arrays #6752

Closed
wants to merge 11 commits

Conversation

@RyanMarcus commented Nov 18, 2024

Which issue does this PR close?

Helps address part of "Support REE in cast kernels" in #3520

Rationale for this change

Adds support for (limited) casting/conversion of REE arrays.

What changes are included in this PR?

  • Adds casting to change the type of the run ends in an REE array (e.g., converting from storing each run end as an Int16 to an Int32)
  • Adds casting from REE arrays of all primitive types to a "flat" array (which is a PrimitiveArray under the hood).

We've added these specific casts because we needed them for our own project, and figured we could contribute them back.
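
For concreteness, here is a minimal sketch of how the two casts are invoked through arrow_cast::cast, assuming this PR's changes are applied (the exact field nullability is illustrative):

    use std::sync::Arc;
    use arrow_array::{Array, Int32Array, RunArray};
    use arrow_cast::cast;
    use arrow_schema::{DataType, Field};

    // Run-end encoded array with logical values [7, 7, 7, 9, 9]
    let run_ends = Int32Array::from(vec![3, 5]);
    let values = Int32Array::from(vec![7, 9]);
    let ree = RunArray::try_new(&run_ends, &values).unwrap();

    // Cast 1: change the run-end type from Int32 to Int16
    let narrower = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int16, false)),
        Arc::new(Field::new("values", DataType::Int32, true)),
    );
    let _ree16 = cast(&ree, &narrower).unwrap();

    // Cast 2: flatten to a plain PrimitiveArray
    let flat = cast(&ree, &DataType::Int32).unwrap();
    assert_eq!(flat.len(), 5); // one slot per logical value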

Are there any user-facing changes?

Currently, can_cast_types returns false for essentially anything involving an REE array. This PR implements a subset of important casts, but not all of them. Users may be surprised that, for example, an REE array of Int32 can be cast to a flat Int32 array, but an REE array of Utf8 can't be cast to a Utf8 array (yet).
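
A hedged sketch of the resulting behavior (again assuming this PR's changes; field nullability is illustrative):

    use std::sync::Arc;
    use arrow_cast::can_cast_types;
    use arrow_schema::{DataType, Field};

    let ree = |values: DataType| {
        DataType::RunEndEncoded(
            Arc::new(Field::new("run_ends", DataType::Int32, false)),
            Arc::new(Field::new("values", values, true)),
        )
    };

    // Supported by this PR: REE<Int32> -> flat Int32
    assert!(can_cast_types(&ree(DataType::Int32), &DataType::Int32));
    // Not (yet) supported: REE<Utf8> -> flat Utf8
    assert!(!can_cast_types(&ree(DataType::Utf8), &DataType::Utf8));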

@github-actions bot added the arrow (Changes to the arrow crate) label Nov 18, 2024
@RyanMarcus marked this pull request as ready for review November 18, 2024 21:30
@RyanMarcus marked this pull request as draft November 18, 2024 23:23
@tustvold (Contributor) left a comment

I worry a bit about the amount of codegen this approach will lead to, as it relies on downcasting the values. This also prevents it from generalising to nested types.

I wonder if you considered using MutableArrayData or one of the existing selection kernels (e.g. take)?

@RyanMarcus (Author) commented Nov 19, 2024

I know you just gave it as an example, but take would be significantly slower, since it would require materializing an index array, which can't be trivially computed from the run ends (unlike for dictionaries, where the stored keys already serve as take indices).
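
To make the cost concrete, here is a hedged sketch of the index materialization take would require (ree_take_indices is a hypothetical helper, not part of this PR or the crate):

    // For run ends [3, 5], the five logical rows map to physical value
    // slots [0, 0, 0, 1, 1]; building this walks every logical row.
    fn ree_take_indices(run_ends: &[i32]) -> Vec<u32> {
        let len = run_ends.last().map_or(0, |e| *e as usize);
        let mut indices = Vec::with_capacity(len);
        let mut prev = 0i32;
        for (phys, &end) in run_ends.iter().enumerate() {
            // Every logical row in this run points at the same physical slot.
            indices.extend(std::iter::repeat(phys as u32).take((end - prev) as usize));
            prev = end;
        }
        indices
    }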

I took a brief look at the other existing kernels and didn't see anything that looked promising. If you can point me to something specific I'm happy to take a look!

I'm not currently familiar with MutableArrayData; let me take a look and get back to you. Edit: I just looked at MutableArrayData, and it seems like a perfect candidate for Copy types. Let me try to change my implementation to use it; I agree it has the potential to cut down a lot on codegen.

One mitigating factor for the extra codegen is that it is limited to the cast subcrate. We could also add an arrow-cast-ree subcrate or similar, if that's a concern.

@RyanMarcus (Author) commented
As you suspected, @tustvold, MutableArrayData enabled an implementation that was much smaller, and it should reuse all the existing generics codegen from transform.

In order to make this efficient, we need an extend_n function on MutableArrayData that copies the same bits many times. Here is a dummy (working) implementation:

    /// Naive implementation: repeat `extend` `n` times. Correct, but it pays
    /// the per-call dynamic dispatch overhead on every iteration.
    pub fn extend_n(&mut self, index: usize, start: usize, end: usize, n: usize) {
        for _ in 0..n {
            self.extend(index, start, end);
        }
    }

I can see that MutableArrayData uses some kind of dynamic dispatch magic for each call to extend, which I don't fully understand, but I understand enough to know that my current extend_n is no good. I'll mark this PR as a draft until I figure it out.

@RyanMarcus marked this pull request as draft November 19, 2024 01:55
@RyanMarcus (Author) commented Nov 19, 2024

I've added a more efficient extend_n function that works by passing the count parameter n down to each closure. But I realized that in the worst-case scenario, where every run has length 1, this approach still does interpretation for every value.
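
For illustration, the shape of the change is roughly the following; this is a simplified sketch, not the actual arrow-data internals, and the closure type here is invented:

    // Sketch only: each per-array extend closure now takes the repeat count,
    // so the boxed-closure dispatch is paid once per run, not once per repeat.
    type ExtendN<'a> = Box<dyn Fn(&mut _MutableArrayData<'a>, usize, usize, usize) + 'a>;
    //                             (data,                     start, end,   n)

    pub fn extend_n(&mut self, index: usize, start: usize, end: usize, n: usize) {
        (self.extend_values[index])(&mut self.data, start, end, n);
    }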

A simple benchmark converting an REE array of i32s to a primitive array:

With PrimitiveBuilder:
cast run end to flat    time:   [34.395 ms 34.453 ms 34.519 ms]

With MutableArrayData:
cast run end to flat    time:   [48.710 ms 48.869 ms 49.067 ms]

... so avoiding the interpretation overhead seems to yield roughly a 30% speedup. Does a 30% performance improvement in the worst case justify the extra codegen, @tustvold?

Perhaps a reasonable middle ground would be to use the PrimitiveBuilder for primitive types, but use MutableArrayData for non-primitive types?
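
For reference, a hedged reconstruction of the benchmark setup (this is not the PR's actual bench file; the array size and names are illustrative):

    use arrow_array::{Int32Array, RunArray};
    use arrow_cast::cast;
    use arrow_schema::DataType;
    use criterion::{criterion_group, criterion_main, Criterion};

    fn bench_ree_to_flat(c: &mut Criterion) {
        // Worst case: every run has length 1, so the run ends are 1..=N.
        let n: i32 = 1_000_000;
        let run_ends = Int32Array::from_iter_values(1..=n);
        let values = Int32Array::from_iter_values(0..n);
        let ree = RunArray::try_new(&run_ends, &values).unwrap();

        c.bench_function("cast run end to flat", |b| {
            b.iter(|| cast(&ree, &DataType::Int32).unwrap())
        });
    }

    criterion_group!(benches, bench_ree_to_flat);
    criterion_main!(benches);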

@RyanMarcus marked this pull request as ready for review November 19, 2024 19:25
@tustvold (Contributor) commented Nov 19, 2024

What is the average run length in that case?

We could specialise primitives, and there are ways we can reduce codegen, e.g. only specialise Int32 run ends, and only specialise on ArrowNativeType rather than ArrowPrimitiveType, as we do for dictionaries.

However, given how few people use RunArray, there are relatively few scenarios where this makes sense; we need to err on the side of keeping codegen manageable.
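
A hypothetical sketch of what that specialisation could look like (the helper and its signature are invented for illustration); since e.g. Date32, Time32, and Int32 all share the i32 native type, specialising on ArrowNativeType yields one monomorphization per native width instead of one per logical type:

    use arrow_buffer::ArrowNativeType;

    // Hypothetical flatten kernel, specialised only on i32 run ends and on
    // the native value type T (i8..i64, u8..u64, f16/f32/f64).
    fn ree_to_flat<T: ArrowNativeType>(run_ends: &[i32], values: &[T]) -> Vec<T> {
        let mut out = Vec::with_capacity(run_ends.last().map_or(0, |e| *e as usize));
        let mut prev = 0i32;
        for (&end, &v) in run_ends.iter().zip(values) {
            // Repeat each physical value for the length of its run.
            out.resize(out.len() + (end - prev) as usize, v);
            prev = end;
        }
        out
    }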

@RyanMarcus (Author) commented Nov 19, 2024

My benchmark is a worst-case scenario, so every run length is 1, and thus the average is 1 as well. Not a realistic scenario, but illustrative of the worst case.

If we increase all run lengths to 10 (which is the average in my application, at least) while keeping the logical data size the same, the results are:

With PrimitiveBuilder:
cast run end to flat    time:   [11.740 ms 11.778 ms 11.818 ms]

With MutableArrayData:
cast run end to flat    time:   [21.837 ms 21.917 ms 22.000 ms]

Both approaches get faster, but the relative gap is larger.

The current version hits the compromise you mentioned: the specialized kernel is used for {i/u/f}{8/16/32/64}, and the interpretation-powered one is used for the rest.

I am not fully confident that I correctly modified all of the MutableArrayData internals to handle the extend_n function, but the tests at least pass. If you want, I can swap back to the "dummy" version I proposed earlier, since the specialized kernel should hit the most common cases.

@tustvold (Contributor) commented
> If you want, I can swap back to the "dummy" version I proposed earlier, since the specialized kernel should hit the most common cases.

Let's swap it back

@@ -759,6 +763,12 @@ pub fn cast_with_options(
"Casting from type {from_type:?} to dictionary type {to_type:?} not supported",
))),
},
(RunEndEncoded(re_t, _dt), _) => match re_t.data_type() {
@tustvold (Contributor):

Do we need all three, or could we just handle Int32 and cast Int8 and Int16 to it? That is what we do for DictionaryArray.

@RyanMarcus (Author):

I imagine copying Int16 data up to Int32 would have a pretty large performance penalty in this case. For DictionaryArray, each access to the values is potentially random and therefore a cache miss, but for REE we always read the run ends and the values sequentially. So my guess is that an extra data copy like that would roughly double the runtime. Does that make sense, or should I write a bench?

@tustvold (Contributor) commented Nov 20, 2024:

> I imagine copying Int16 data up to Int32 would have a pretty large performance penalty in this case

I think we have to be pragmatic here; ultimately Int8 and Int16 are somewhat useless, only able to support arrays of maximum length 127 and 32767 respectively. I'm honestly not entirely sure why they were standardised...

Another option might be to fall back to ArrayData for the Int8, Int16 and Int64 cases. We really need to try to keep this change small.

@RyanMarcus (Author):

Just to be clear, the standard allows only Int16, Int32, and Int64 run ends, no Int8 (which is reflected in the code). I think Int16 and the smaller integer types are actually the most common (this is also the case in Parquet's dictionary run-length encoding).

I didn't realize that even this amount of extra codegen was an issue. If I am literally the only user of this feature, we simply shouldn't do this at all. Honestly, this won't even solve my problem, since a 30%-2x slowdown on some types means I will just have to implement the full thing in my own code anyway. Luckily in my project, we are significantly less constrained by code size.

Probably the "right thing" to do is somehow split up cast into pieces, so people can opt into what they need, either with feature flags or with more subcrates, but I think that's a pretty large refactor.

@tustvold (Contributor) commented Nov 20, 2024:

> this is also the case in Parquet's dictionary run-length encoding

Parquet is run length, whereas this is run end, so the primitive type constrains the maximum length of the array rather than the maximum length of a run. For example, for logical values [7, 7, 7, 9, 9], run-length encoding stores the lengths [3, 2], while run-end encoding stores the cumulative ends [3, 5].

I'm sorry that we can't just add more codegen, but it has been a perennial issue for us that the full arrow specification is a combinatorial explosion, so we have to pick some compute-optimised versions and accept that space-optimised variants may incur a slight performance penalty. We should probably do a better job documenting this.

> code size.

It's actually build time that is the biggest pain point; there was a time when half the build time of the entire workspace was just the dictionary comparison kernels 😅

cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
let ree_array = array
.as_any()
@tustvold (Contributor):

Using AsArray may be more concise

@RyanMarcus (Author):

You mean this trait? I don't think there's an as_run_array or similar. I can add it, but the overall line count will go up.

@tustvold (Contributor):

I think that would be a good addition; happy for that to be a separate PR. There are many other places in this diff that could use AsArray, though.

// Potentially convert to a new value or run end type
(_, DataType::RunEndEncoded(re_t, dt)) => {
let values = cast_with_options(ree_array.values(), dt.data_type(), cast_options)?;
let re = PrimitiveArray::<K>::new(ree_array.run_ends().inner().clone(), None);
@tustvold (Contributor):

Why is this necessary?

@RyanMarcus (Author):

This enables converting REE arrays of one type to another, for example from Int32 runs and Float64 data to Int16 runs and Float32 data. See test_run_end_to_run_end for an example.

@tustvold (Contributor):

Suggested change:
- let re = PrimitiveArray::<K>::new(ree_array.run_ends().inner().clone(), None);
+ let re = ree_array.run_ends().clone();

?

@tustvold (Contributor):

Or even just pass ree_array.run_ends() to the cast kernel directly...

)))?,
};

Ok(result.slice(ree_array.run_ends().offset(), ree_array.run_ends().len()))
@tustvold (Contributor):

Why is this slicing needed?

@RyanMarcus (Author):

When we construct a PrimitiveArray from the run ends, we have to build it from the inner buffer, which doesn't carry the offset information, so we have to re-slice here. I added some code to test_run_end_to_run_end to confirm this.

arrow-cast/src/cast/runend.rs
result = PrimitiveArray::<T>::new(result.values().clone(), nb);
}

// TODO: this slice could be optimized by only copying the relevant parts of
@tustvold (Contributor):
I think this should be as simple as clamping the run end in the loop to be in the sliced range

@RyanMarcus (Author):

I was having a hard time getting the logic right for a case like this:

run ends: [5, 10]
slice offset: 3 len: 5

... because you have to take 2 of the first value and then 3 of the second value. I can give it a shot if you want, but I think it'll just be a source of bugs.

@tustvold (Contributor):

Not tested, but I think this should work. My concern with the slicing logic is that such things have a storied history of subtle bugs.

let min_end = arr.offset();
let max_end = arr.offset() + arr.len();
let mut last = min_end;
for (run_end, val) in arr.run_ends().values().iter().zip(null_buffer.iter()) {
    // Clamp each run end into the sliced range so runs that straddle the
    // slice boundaries contribute only their overlapping length.
    let run_end = run_end.as_usize().clamp(min_end, max_end);
    let run_length = run_end - last;
    if val {
        nbb.append_n_non_nulls(run_length);
    } else {
        nbb.append_n_nulls(run_length);
    }
    last = run_end;
}

@RyanMarcus (Author) commented
See #6752 (comment) -- I don't think this is worthwhile if minimizing codegen is such a high priority. If someone else is looking at this PR, feel free to use my kernels for your REE arrays!

@RyanMarcus closed this Nov 20, 2024