Adds partial cast support for run-end encoded arrays #6752
Conversation
…e values to primitive arrays Added copyright header
I worry a bit about the amount of codegen this approach will lead to, as it relies on downcasting the values. This also prevents it generalising to nested types.
I wonder if you considered using MutableArrayData or one of the existing selection kernels (e.g. take)?
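For reference, a minimal sketch (not the PR's implementation) of the take-based alternative: build one index per logical position pointing at its run's value, then let the existing take kernel gather from the type-erased values array. Crate paths and the helper name are assumptions, and slicing/offsets and nulls are ignored for brevity.

```rust
use arrow_array::{types::Int32Type, Array, ArrayRef, Int32Array, RunArray};
use arrow_schema::ArrowError;
use arrow_select::take::take;

// Expand an Int32-run-ended REE array by gathering from its values with `take`.
// This avoids downcasting the values, at the cost of materialising an index array.
fn expand_with_take(ree: &RunArray<Int32Type>) -> Result<ArrayRef, ArrowError> {
    let mut indices = Vec::with_capacity(ree.len());
    let mut start = 0i32;
    for (run_idx, &end) in ree.run_ends().values().iter().enumerate() {
        // every logical position inside this run points at the same physical value
        indices.extend(std::iter::repeat(run_idx as i32).take((end - start) as usize));
        start = end;
    }
    take(ree.values().as_ref(), &Int32Array::from(indices), None)
}
```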
I know you just gave it as an example, but I took a brief look at the other existing kernels and didn't see anything that looked promising. If you can point me to something specific I'm happy to take a look! I'm not currently familiar with MutableArrayData; let me take a look and get back to you. Edit: Just looked at MutableArrayData. One mitigating factor to the extra codegen is the fact that this is limited to the …
* Added `extend_n` (with dummy implementation) to MutableArrayData
As you suspected @tustvold, in order to make this efficient, we need a

    pub fn extend_n(&mut self, index: usize, start: usize, end: usize, n: usize) {
        for _ in 0..n {
            self.extend(index, start, end);
        }
    }

I can see that …
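For what it's worth, here is a rough sketch of how MutableArrayData could expand REE values without type-specific codegen, looping `extend` once per logical slot, which is exactly what the dummy `extend_n` above does. The function name is illustrative, and nulls and slicing are ignored.

```rust
use arrow_array::{make_array, Array, ArrayRef};
use arrow_data::transform::MutableArrayData;

// Expand run-end encoded values by copying each physical value once per
// logical position, using the type-erased MutableArrayData machinery.
fn expand_with_mutable(values: &dyn Array, run_ends: &[i32]) -> ArrayRef {
    let values_data = values.to_data();
    let logical_len = run_ends.last().copied().unwrap_or(0) as usize;
    let mut mutable = MutableArrayData::new(vec![&values_data], false, logical_len);
    let mut start = 0i32;
    for (i, &end) in run_ends.iter().enumerate() {
        for _ in start..end {
            // copy the value at physical index `i` once per logical slot in the run
            mutable.extend(0, i, i + 1);
        }
        start = end;
    }
    make_array(mutable.freeze())
}
```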
…ng an `n` (count) parameter to each fn ptr
I've added a more efficient `extend_n`. A simple benchmark converting a REE array of i32s to a primitive array:
… so avoiding the interpretation overhead seems to cause a 30% speedup. Does a 30% performance improvement in the worst case justify the extra codegen, @tustvold? Perhaps a reasonable middle ground would be to use the …
What is the average run length in that case? We could specialise primitives, and there are ways we can reduce codegen - e.g. only specialize i32 run ends, and only specialise on ArrowNativeType rather than ArrowPrimitiveType, like we do for dictionaries. However, given how few people use RunArray, there are relatively few scenarios where it makes sense, and we need to err on the side of keeping codegen manageable.
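A hedged sketch of the codegen-reduction idea: specialise only on the value's ArrowNativeType (its in-memory representation) rather than the full ArrowPrimitiveType, so logically distinct types with the same representation (e.g. Int32 and Date32) would share one instantiation. The helper name is illustrative; nulls and slicing are ignored.

```rust
use arrow_buffer::ArrowNativeType;

// Expand REE values generically over the native representation only.
fn expand_native<T: ArrowNativeType>(values: &[T], run_ends: &[i32]) -> Vec<T> {
    let logical_len = run_ends.last().copied().unwrap_or(0) as usize;
    let mut out = Vec::with_capacity(logical_len);
    let mut prev = 0i32;
    for (&end, &value) in run_ends.iter().zip(values) {
        // repeat each physical value for the length of its run
        out.extend(std::iter::repeat(value).take((end - prev) as usize));
        prev = end;
    }
    out
}
```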
My benchmark is a worst-case scenario, so every run length is 1, and thus the average is 1 as well. Not a realistic scenario, but illustrative of the worst case. If we increase all run lengths to 10 (which is the average in my application, at least) while keeping the logical data size the same, the results are:
Both approaches get faster, but the relative gap is larger. The current version hits the compromise you mentioned: the specialized kernel is used for {i/u/f}{8/16/32/64}, and the interpretation-powered one is used for the rest. I am not fully confident I correctly modified all of the …
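The benchmark figures themselves aren't reproduced here, but a minimal sketch of how an input like the one described might be constructed (illustrative names; assumes logical_len is a multiple of run_len):

```rust
use arrow_array::{types::Int32Type, Int32Array, RunArray};

// Build an REE array of i32 values where every run has the same length.
// run_len = 1 reproduces the worst case discussed above; run_len = 10 the other case.
fn ree_with_run_length(logical_len: i32, run_len: i32) -> RunArray<Int32Type> {
    let run_ends: Vec<i32> = (run_len..=logical_len).step_by(run_len as usize).collect();
    let values: Vec<i32> = (0..run_ends.len() as i32).collect();
    RunArray::try_new(&Int32Array::from(run_ends), &Int32Array::from(values)).unwrap()
}
```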
Let's swap it back
@@ -759,6 +763,12 @@ pub fn cast_with_options(
"Casting from type {from_type:?} to dictionary type {to_type:?} not supported",
))),
},
(RunEndEncoded(re_t, _dt), _) => match re_t.data_type() {
Do we need all 3, or could we just handle Int32 and cast Int8 and Int16 to this? This is what we do for DictionaryArray.
I imagine copying Int16 data up to Int32 would have a pretty large performance penalty in this case. For DictionaryArray, each access to the values is potentially random and therefore a cache miss, but for REE we will always read the run ends and the values sequentially. So my guess is that an extra data copy like that would 2x the runtime. Does that make sense or should I write a bench?
> I imagine copying Int16 data up to Int32 would have a pretty large performance penalty in this case
I think we have to be pragmatic here; ultimately Int8 and Int16 are somewhat useless, only able to support arrays of maximum length 128 and 32768 respectively. I'm honestly not entirely sure why they were standardised...
Another option might be to fall back to ArrayData for the Int8, Int16 and Int64 cases. We really need to try to keep this change small.
Just to be clear, the standard is only Int16, Int32, and Int64, no Int8 (which is reflected in the code). I think Int16 / the smaller integer types are actually the most common (this is also the case in Parquet's dictionary run length encoding).
I didn't realize that even this amount of extra codegen was an issue. If I am literally the only user of this feature, we simply shouldn't do this at all. Honestly, this won't even solve my problem, since a 30%-2x slowdown on some types means I will just have to implement the full thing in my own code anyway. Luckily in my project, we are significantly less constrained by code size.
Probably the "right thing" to do is somehow split up cast into pieces, so people can opt into what they need, either with feature flags or with more subcrates, but I think that's a pretty large refactor.
> this is also the case in Parquet's dictionary run length encoding
Parquet is run-length, this is run-end, and consequently the primitive type constrains the maximum length of the array, as opposed to the maximum length of a run.
I'm sorry that we can't just add more codegen, but it has been a perennial issue for us that the full arrow specification is a combinatorial explosion, and so we have to pick some compute-optimised versions, and accept that space-optimised variants may incur a slight performance penalty. We should probably do a better job documenting this.
> code size.
It's actually build time that is the biggest pain point; there was a time when half the build time of the entire workspace was just the dictionary comparison kernels 😅
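A tiny illustration of the run-length vs run-end distinction mentioned above (values chosen arbitrarily):

```rust
// Parquet-style RLE stores each run's length; Arrow REE stores cumulative run
// ends, so the run-end integer type bounds the whole array's length, not a run's.
fn main() {
    let run_lengths: [i16; 3] = [2, 3, 4]; // RLE: length of each run
    let run_ends: [i16; 3] = [2, 5, 9];    // REE: cumulative ends; the last equals the logical length
    let total: i16 = run_lengths.iter().sum();
    assert_eq!(total, *run_ends.last().unwrap());
}
```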
cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
let ree_array = array
.as_any()
Using AsArray may be more concise
You mean this trait? I don't think there's an as_run_array or similar. I can add it, but the overall line count will go up.
I think that would be a good addition; happy for that to be a separate PR though. There are many other places that could use AsArray in this diff.
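A hypothetical sketch of what such a helper could look like, following the pattern of the existing AsArray impls (this is not part of arrow-rs here, just an illustration of the addition being discussed):

```rust
use arrow_array::{types::RunEndIndexType, Array, RunArray};

// Illustrative extension trait: downcast a dyn Array to a RunArray<K>.
trait AsRunArray {
    fn as_run_array<K: RunEndIndexType>(&self) -> &RunArray<K>;
}

impl AsRunArray for dyn Array + '_ {
    fn as_run_array<K: RunEndIndexType>(&self) -> &RunArray<K> {
        self.as_any()
            .downcast_ref::<RunArray<K>>()
            .expect("run-end encoded array")
    }
}
```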
// Potentially convert to a new value or run end type
(_, DataType::RunEndEncoded(re_t, dt)) => {
let values = cast_with_options(ree_array.values(), dt.data_type(), cast_options)?;
let re = PrimitiveArray::<K>::new(ree_array.run_ends().inner().clone(), None);
Why is this necessary?
This enables converting REE arrays of one type to another, for example from Int32 runs and Float64 data to Int16 runs and Float32 data. See test_run_end_to_run_end for an example.
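A sketch of the conversion being described, assuming this PR's cast support (field names are only illustrative):

```rust
use std::sync::Arc;
use arrow_array::{types::Int32Type, ArrayRef, RunArray};
use arrow_cast::cast::{cast_with_options, CastOptions};
use arrow_schema::{ArrowError, DataType, Field};

// Re-encode an REE array with Int32 run ends / Float64 values as one with
// Int16 run ends / Float32 values; see test_run_end_to_run_end in the diff.
fn ree_f64_to_ree_f32(ree: &RunArray<Int32Type>) -> Result<ArrayRef, ArrowError> {
    let to_type = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int16, false)),
        Arc::new(Field::new("values", DataType::Float32, true)),
    );
    cast_with_options(ree, &to_type, &CastOptions::default())
}
```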
Suggested change:
- let re = PrimitiveArray::<K>::new(ree_array.run_ends().inner().clone(), None);
+ let re = ree_array.run_ends().clone();

?
Or even just pass ree_array.run_ends() to the cast kernel directly...
)))?,
};

Ok(result.slice(ree_array.run_ends().offset(), ree_array.run_ends().len())) |
Why is this slicing needed?
When we construct a PrimitiveArray from the run ends buffer, we have to construct the PrimitiveArray from the inner buffer, which doesn't have the offset information. So we have to re-slice it here. Added some code to test_run_end_to_run_end to confirm this.
result = PrimitiveArray::<T>::new(result.values().clone(), nb);
}

// TODO: this slice could be optimized by only copying the relevant parts of |
I think this should be as simple as clamping the run end in the loop to be in the sliced range
I was having a hard time getting the logic right for a case like this:
run ends: [5, 10]
slice offset: 3 len: 5
... because you have to take 2 of the first value and then 3 of the second value. I can give it a shot if you want but I think it'll just be a source of bugs.
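A quick sketch (not from the PR) of the arithmetic for the example above: intersect the logical window [offset, offset + len) with each run's [previous end, end).

```rust
// Per-run counts of logical positions that survive the slice.
fn run_counts(run_ends: &[usize], offset: usize, len: usize) -> Vec<usize> {
    let window_end = offset + len;
    let mut prev = 0usize;
    run_ends
        .iter()
        .map(|&end| {
            let count = end.min(window_end).saturating_sub(prev.max(offset));
            prev = end;
            count
        })
        .collect()
}

// run_counts(&[5, 10], 3, 5) == [2, 3]: take 2 of the first value, 3 of the second.
```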
Not tested, but I think this should work. My concern with the slicing logic is that such things have a storied history of subtle bugs.
    // Clamp each run end into the sliced window [offset, offset + len) so the
    // appended null counts only cover the logical range.
    let min_end = arr.offset();
    let max_end = arr.offset() + arr.len();
    let mut last = min_end;
    for (run_end, val) in arr.run_ends().values().iter().zip(null_buffer.iter()) {
        let run_end = run_end.as_usize().clamp(min_end, max_end);
        let run_length = run_end - last;
        if val {
            nbb.append_n_non_nulls(run_length);
        } else {
            nbb.append_n_nulls(run_length);
        }
        last = run_end;
    }
See #6752 (comment) -- I don't think this is worthwhile if minimizing codegen is such a high priority. If someone else is looking at this PR, feel free to use my kernels for your REE arrays!
Which issue does this PR close?
Helps address part of "Support REE in cast kernels" in #3520.
Rationale for this change
Adds support for (limited) casting/conversion of REE arrays.
What changes are included in this PR?
… PrimitiveArray under the hood). We've added these specific casts because we needed them for our own project and figured we could contribute them back.
Are there any user-facing changes?
Currently, can_cast_type basically returns false for anything involving an REE array. This PR implements a subset of important casts, but not all of them. Users may be surprised that, for example, an REE array of Int32 can be converted to an array of Int32, but an REE array of Utf8 can't be converted to a Utf8 array (yet).
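For concreteness, a hedged usage sketch of the supported path described above, assuming this PR's changes: casting an Int32 REE array to a plain Int32 array.

```rust
use arrow_array::{types::Int32Type, ArrayRef, RunArray};
use arrow_cast::cast::cast;
use arrow_schema::DataType;

// Int32 REE -> Int32 primitive is one of the casts this PR enables;
// Utf8 REE -> Utf8 is not covered yet.
fn ree_to_primitive(ree: &RunArray<Int32Type>) -> ArrayRef {
    cast(ree, &DataType::Int32).expect("cast supported by this PR")
}
```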