Optimize take/filter from multiple input arrays to a single large output array #6692

alamb · 2024-11-05T19:17:06Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Upstream in DataFusion, there is a common common pattern where we have multiple input RecordBatches and want to produce an output RecordBatch with some subset of the rows from the input batches. This happens in

FilterExec --> CoalesceBatchesExec when filtering
RepartitionExec --> CoalesceBatchesExec

The kernels used here are:

FilterExec uses filter, takes a single input Array and produces a single output Array
RepartitionExec uses take, which also takes a single input Array and produces a single output Array``RepartitionExeceach take a single input batch and produce a single output Array
CoalesceBatchesExec calls concat which takes multple Arrays and produces a single Array as output

The use of these kernels and patterns has two downsides:

Performance overhead due to a second copy: Calling filter/take immediately copies the data, which is copied again in CoalesceBatches (see illustration below)
Memory Overhead / Performance Overhead for GarbageCollecting StringView: Buffering up several RecordBatches with StringView may consume significant amounts of memory for mostly filtered rows, which requires us to run gc periodically which actually slows some things down (see Reduce copying in CoalesceBatchesExec for StringViews datafusion#11628)

Here is an ascii art picture (from apache/datafusion#7957) that shows the extra copy in action


┌────────────────────┐        Filter                                                                          
│                    │                    ┌────────────────────┐            Coalesce                          
│                    │    ─ ─ ─ ─ ─ ─ ▶   │    RecordBatch     │             Batches                          
│    RecordBatch     │                    │   num_rows = 234   │─ ─ ─ ─ ─ ┐                                   
│  num_rows = 8000   │                    └────────────────────┘                                              
│                    │                                                    │                                   
│                    │                                                                ┌────────────────────┐  
└────────────────────┘                                                    │           │                    │  
┌────────────────────┐                    ┌────────────────────┐                      │                    │  
│                    │        Filter      │                    │          │           │                    │  
│                    │                    │    RecordBatch     │           ─ ─ ─ ─ ─ ▶│                    │  
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   num_rows = 500   │─ ─ ─ ─ ─ ┐           │                    │  
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │  
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │  
│                    │                    └────────────────────┘                      │                    │  
└────────────────────┘                                                                │                    │  
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │  
          ...                   ...                                       │           │                    │  
                                                                                      │                    │  
┌────────────────────┐                                                    │           └────────────────────┘  
│                    │                    ┌────────────────────┐                                              
│                    │       Filter       │                    │          │                                   
│    RecordBatch     │                    │    RecordBatch     │                                              
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 333   │─ ─ ─ ─ ─ ┘                                   
│                    │                    │                    │                                              
│                    │                    └────────────────────┘                                              
└────────────────────┘                                                                                        
                                                                                                              
                      FilterExec                                          RepartitonExec copies the data      
                      creates output batches with copies                  *again* to form final large         
                      of  the matching rows (calls take()                 RecordBatches                       
                      to make a copy)

Describe the solution you'd like

I would like to apply filter/take to each incoming RecordBatch as it arrives, copying the data to an in progress output array, in a way that is as fast as the filter and take operations. This would reduce the extra copy that is currently required.

Note this is somewhat like the interleave kernel, except that

We only need the output rows to be in the same order as the input batches (so the second usize batch index is not needed)
We don't want to have to buffer all the input

Describe alternatives you've considered

One thing I have thought about is extending the builders so they can append more than one row at a time. For example:

Builder::append_filtered
Builder::append_take

So for example, to filter a stream of StringViewArrays I might do something like;

let mut builder = StringViewBuilder::new();
while let Some(input) = stream.next() {
  // compute some subset of input rows that make it to the output
  let filter: BooleanArray = compute_filter(&input, ....); 
  // append all rows from input where filter[i] is true
  builder.append_filtered(&input, &filter);
}

And also add an equivalent for append_take

I think if we did this right, it wouldn't be a lot of new code, we could just refactor the existing filter/take implementations. For example, I would expect that the filter kernel would then devolve into something like

fn filter(..) {
  match data_type {
    DataType::Int8 => Int8Builder::with_capacity(...)
      .append_filter(input, filter)
      .build()
...
}

Additional context

The text was updated successfully, but these errors were encountered:

alamb · 2024-11-05T19:36:31Z

FYI

@XiangpengHao who I think I mentioned this to and
@tustvold in case you have some thoughts about why this is crazy
@Rachelint / @jayzhan211 / @Dandandan because you guys were talking about this on Support vectorized append and compare for multi group by datafusion#12996

devanbenz · 2024-11-07T01:44:40Z

@alamb just to clarify your idea is to modify the existing take & filter kernels. Not create a new one right?

jayzhan211 · 2024-11-07T01:50:20Z

@alamb just to clarify your idea is to modify the existing take & filter kernels. Not create a new one right?

If creating a new one helps, no reason not to do it.

tustvold · 2024-11-07T04:39:40Z

I think the idea is sound in principle, but needs a concrete API proposal.

I'm not sure the proposed builder API makes sense, as the typing for nested types like ListBuilder and DictionaryBuilder is not what we want here, and they can't easily be type erased. We also ideally want to avoid overly bloating the arrow-array crate with kernel logic. This isn't even touching on the fact these kernels don't use the builders for performance reasons.

I think we'd need to introduce a new type-erased MutableArray abstraction or something, potentially replacing the rather problematic MutableArrayData.

The only remaining challenge concerns dictionaries, as the output dictionary needs to be computed up front. Simply not supporting dictonaries could potentially be a valid workaround though.

jayzhan211 · 2024-11-07T06:46:05Z

Not sure about other type but for StringView, I can only think of iterating all the filtered row and append_value one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion

devanbenz · 2024-11-07T12:38:15Z

Not sure about other type but for StringView, I can only think of iterating all the filtered row and append_value one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion

'append_value' does a copy though. Wouldn't that effectively still be a large amount of copies?

jayzhan211 · 2024-11-07T12:49:54Z

Not sure about other type but for StringView, I can only think of iterating all the filtered row and append_value one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion

'append_value' does a copy though. Wouldn't that effectively still be a large amount of copies?

With this approach, we reduce the copying to a single step. Compared to the current approach, where copying happens in multiple stages (filtering, garbage collection, and coalescing), my proposal combines these steps into one. While benchmarks are needed to confirm any performance gains, this method should, at the very least, not perform worse than the existing one

devanbenz · 2024-11-07T13:07:01Z

Not sure about other type but for StringView, I can only think of iterating all the filtered row and append_value one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion

'append_value' does a copy though. Wouldn't that effectively still be a large amount of copies?

With this approach, we reduce the copying to a single step. Compared to the current approach, where copying happens in multiple stages (filtering, garbage collection, and coalescing), my proposal combines these steps into one. While benchmarks are needed to confirm any performance gains, this method should, at the very least, not perform worse than the existing one

That makes sense. This would be a change downstream within Datafusion then correct?

jayzhan211 · 2024-11-07T13:25:16Z

Not sure about other type but for StringView, I can only think of iterating all the filtered row and append_value one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion

'append_value' does a copy though. Wouldn't that effectively still be a large amount of copies?

With this approach, we reduce the copying to a single step. Compared to the current approach, where copying happens in multiple stages (filtering, garbage collection, and coalescing), my proposal combines these steps into one. While benchmarks are needed to confirm any performance gains, this method should, at the very least, not perform worse than the existing one

That makes sense. This would be a change downstream within Datafusion then correct?

yes, you can work on it if you want to

alamb · 2024-11-07T15:31:13Z

So it sounds like the consensus is to work out how this might look downstream in DataFusion (maybe starting with StringView as that is what is giving us the most trouble now) and then use some of that knowledge to propose something upstream in Arrow -- sounds like a good idea to me

alamb · 2024-11-07T15:33:09Z

Not sure about other type but for StringView, I can only think of iterating all the filtered row and append_value one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion

@jayzhan211 yes I think this is effectively what would happen -- however the actual iteration over filtered values is quite optimized in the filter kernel (checkout what the FilterBuilder does) based on how many values are filtered and other aspect

The fact that filter is so fast in arrow means it is quite hard to get as good / faster :)

alamb · 2024-11-07T15:36:34Z

@tustvold

I'm not sure the proposed builder API makes sense, as the typing for nested types like ListBuilder and DictionaryBuilder is not what we want here, and they can't easily be type erased. We also ideally want to avoid overly bloating the arrow-array crate with kernel logic.

This is reasonable -- though I could imagine adding type erased builders like DynListBuilder for this usecase

This isn't even touching on the fact these kernels don't use the builders for performance reasons.

Is there some fundamental reason the builders can't made faster? If we could make the builders fast enough to use for filter that would seem to be valuable in its own right. But I am likely just dreaming here

The only remaining challenge concerns dictionaries, as the output dictionary needs to be computed up front. Simply not supporting dictonaries could potentially be a valid workaround though.

A builder based approach could help (e.g. optimize for the case where the input batches had the same dictionary and handle the case where they didn't -- either via deferred computation or on the fly or something else)

Rachelint · 2024-11-07T15:58:49Z

Not sure about other type but for StringView, I can only think of iterating all the filtered row and append_value one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion

Does approach seems like that for filter?

loop the input array and input predicate
get the value of needed row(true in input predicate), and append_value to the builder

And at least, we can avoid generate multiple small batches, and concat them(a ton of copies) when big enough.

tustvold · 2024-11-07T16:07:09Z

This is reasonable -- though I could imagine adding type erased builders like DynListBuilder for this usecase

This sort of partially type-erased API seems like the worst of both worlds, you either want something that is completely type-erased (e.g. MutableArrayData), or fully typed (e.g. ListBuilder).

I could see us adding some sort of MutableArray abstraction to arrow-select that allows appending values from arrays based on a mask and/or selection. This would be useful not just for this use-case, but potentially as a MutableBuffer abstraction for databases, etc... However, it would be very complex to implement, especially for dictionaries.

Is there some fundamental reason the builders can't made faster

Not without changing their APIs 😅. For the primitive builders one could simply move the current kernel implementations into the builders, but this doesn't really achieve much IMO.

A builder based approach could help (e.g. optimize for the case where the input batches had the same dictionary and handle the case where they didn't -- either via deferred computation or on the fly or something else)

Yeah, it gets very complicated and fiddly. A similar challenge likely exists for StringView, although I'm not sure what level of sophistication we've reached w.r.t automatic GC.

Does approach seems like that for filter?

That would be a very naive way to implement the filter kernel, I would encourage looking at what the selection kernels actually do.

Rachelint · 2024-11-07T16:44:28Z

That would be a very naive way to implement the filter kernel, I would encourage looking at what the selection kernels actually do.

I agree with it seems a naive version for filter.

Is it possible to public something like filter_native but return a iterator, then we can reduce some copied in downstream and reuse the well optimized filter in arrow:

// current
filter --> intemediate buffer in array --> final buffer

// optimized
filter --> final buffer

jayzhan211 · 2024-11-08T06:10:17Z

arrow-rs/arrow-select/src/filter.rs

Line 740 in e9bf8aa

.add_buffers(array.data_buffers().to_vec());

This line of code extends buffer (byte copied) regardless of the filtered result, I think it is the reason why we need gc. If we do append_value here, we have additional hash lookup and insert, but less buffer copied especially in low selectivity and no gc required later on.

alamb · 2024-11-08T12:30:11Z

arrow-rs/arrow-select/src/filter.rs

Line 740 in e9bf8aa

.add_buffers(array.data_buffers().to_vec());

This line of code extends buffer (byte copied) regardless of the filtered result, I think it is the reason why we need gc. If we do append_value here, we have additional hash lookup and insert, but less buffer copied especially in low selectivity and no gc required later on.

To be clear -- I think the copy is of a Vec<Buffer> (which is a Vec of pointers to the data)

jayzhan211 · 2024-11-13T05:17:05Z

This line of code extends buffer (byte copied) regardless of the filtered result, I think it is the reason why we need gc. If we do append_value here, we have additional hash lookup and insert, but less buffer copied especially in low selectivity and no gc required later on.

I would try to implement a builder like in datafusion

jayzhan211 · 2024-11-16T14:05:03Z

This line of code extends buffer (byte copied) regardless of the filtered result, I think it is the reason why we need gc. If we do append_value here, we have additional hash lookup and insert, but less buffer copied especially in low selectivity and no gc required later on.

Didn't see improvement on this approach apache/datafusion#13450

alamb · 2024-11-18T15:57:24Z

This line of code extends buffer (byte copied) regardless of the filtered result, I think it is the reason why we need gc. If we do append_value here, we have additional hash lookup and insert, but less buffer copied especially in low selectivity and no gc required later on.

Didn't see improvement on this approach apache/datafusion#13450

I didn't have a chance to full look at apache/datafusion#13450 -- can you summarize what approach it implemented?

jayzhan211 · 2024-11-19T00:26:42Z

The idea is to append the string view array as early as possible in the optimal memory usage to eliminate the need of garbage collection and aggregate those filtered rows in a single large batches instead of many small batches.

Current implementation

We first do Filter then Coalesce in two different operator. The coalescer push_batch try to do gc for string view type. And then concat those small batches again.

This PR

I try to combine Filter and Coalesce in Filter operator. And append string view type to the coalescer until the batch size (8192) or no more incoming batch. Then sent it to Coalesce which ideally there should do nothing and pass to the next operator.

This approach has additional cost of computing each view and lookup view's hash again, but eliminate the need of gc.

The result meet my assumption that has no much difference, but currently no further improvement in my mind too.

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Nov 5, 2024

alamb changed the title ~~Optimize take/filter from multiple input arrays to a single output array~~ Optimize take/filter from multiple input arrays to a single large output array Nov 5, 2024

This was referenced Nov 5, 2024

Potential performance regression for TPCH q18 apache/datafusion#13188

Open

Support vectorized append and compare for multi group by apache/datafusion#12996

Merged

alamb mentioned this issue Nov 19, 2024

[not improved] Filter coalesce apache/datafusion#13450

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize take/filter from multiple input arrays to a single large output array #6692

Optimize take/filter from multiple input arrays to a single large output array #6692

alamb commented Nov 5, 2024 •

edited

Loading

alamb commented Nov 5, 2024

devanbenz commented Nov 7, 2024

jayzhan211 commented Nov 7, 2024

tustvold commented Nov 7, 2024

jayzhan211 commented Nov 7, 2024 •

edited

Loading

devanbenz commented Nov 7, 2024

jayzhan211 commented Nov 7, 2024

devanbenz commented Nov 7, 2024

jayzhan211 commented Nov 7, 2024 •

edited

Loading

alamb commented Nov 7, 2024

alamb commented Nov 7, 2024

alamb commented Nov 7, 2024

Rachelint commented Nov 7, 2024 •

edited

Loading

tustvold commented Nov 7, 2024 •

edited

Loading

Rachelint commented Nov 7, 2024

jayzhan211 commented Nov 8, 2024 •

edited

Loading

alamb commented Nov 8, 2024

jayzhan211 commented Nov 13, 2024

jayzhan211 commented Nov 16, 2024

alamb commented Nov 18, 2024

jayzhan211 commented Nov 19, 2024 •

edited

Loading

Optimize take/filter from multiple input arrays to a single large output array #6692

Optimize take/filter from multiple input arrays to a single large output array #6692

Comments

alamb commented Nov 5, 2024 • edited Loading

alamb commented Nov 5, 2024

devanbenz commented Nov 7, 2024

jayzhan211 commented Nov 7, 2024

tustvold commented Nov 7, 2024

jayzhan211 commented Nov 7, 2024 • edited Loading

devanbenz commented Nov 7, 2024

jayzhan211 commented Nov 7, 2024

devanbenz commented Nov 7, 2024

jayzhan211 commented Nov 7, 2024 • edited Loading

alamb commented Nov 7, 2024

alamb commented Nov 7, 2024

alamb commented Nov 7, 2024

Rachelint commented Nov 7, 2024 • edited Loading

tustvold commented Nov 7, 2024 • edited Loading

Rachelint commented Nov 7, 2024

jayzhan211 commented Nov 8, 2024 • edited Loading

alamb commented Nov 8, 2024

jayzhan211 commented Nov 13, 2024

jayzhan211 commented Nov 16, 2024

alamb commented Nov 18, 2024

jayzhan211 commented Nov 19, 2024 • edited Loading

Current implementation

This PR

alamb commented Nov 5, 2024 •

edited

Loading

jayzhan211 commented Nov 7, 2024 •

edited

Loading

jayzhan211 commented Nov 7, 2024 •

edited

Loading

Rachelint commented Nov 7, 2024 •

edited

Loading

tustvold commented Nov 7, 2024 •

edited

Loading

jayzhan211 commented Nov 8, 2024 •

edited

Loading

jayzhan211 commented Nov 19, 2024 •

edited

Loading