Batching support for row-based bounded window functions #9973
Conversation
`GpuBatchedBoundedWindowExec` is currently identical to `GpuWindowExec`, in that it does no batching yet. After rerouting, the tests all seem to still pass.
Compiling. Yet to test. Signed-off-by: MithunR <[email protected]>
Also built safety guards to disable optimization for very large window extents.
Plus, some minor reformatting.
Still WIP. Need to sort out the …
```scala
override def next(): ColumnarBatch = {
  var outputBatch: ColumnarBatch = null
  while (outputBatch == null && hasNext) {
    withResource(getNextInputBatch) { inputCbSpillable =>
```
We need to have `withRetry` put in here somewhere. The hard part is making sure that we can roll back any of the caching. We can calculate/get the `inputRowCount`, `noMoreInput`, and `numUnprocessedInCache` without needing to get the input batch from `inputCbSpillable`, so that might make it simpler to add in the retry logic. I am fine if this is a follow-on issue, but we need it fixed at some point.
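A structural sketch of this suggestion (the toy `withRetry` below merely stands in for the plugin's real GPU-OOM retry helpers; all names are hypothetical):

```scala
// Structural sketch only (hypothetical names): derive the control state from
// cheap host-side metadata first, and keep the device work inside the retry
// block so a failed attempt has nothing to roll back.
object RetryPlacementSketch {
  // Toy stand-in for the plugin's retry helper: retry `fn` a few times.
  def withRetry[T](attempts: Int)(fn: => T): T =
    try fn catch {
      case _: RuntimeException if attempts > 1 => withRetry(attempts - 1)(fn)
    }

  def main(args: Array[String]): Unit = {
    // 1. Computed WITHOUT materializing the device batch:
    val inputRowCount = 1024L          // e.g. from batch metadata
    val noMoreInput = false
    val numUnprocessedInCache = 100L
    // 2. Device work (and any cache mutation) stays inside the retry block,
    //    materializing the spillable batch as late as possible:
    val result = withRetry(attempts = 3) {
      // val cb = inputCbSpillable.getColumnarBatch()  // late materialization
      inputRowCount + numUnprocessedInCache            // placeholder for GPU work
    }
    println(s"noMoreInput=$noMoreInput, result=$result")
  }
}
```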
I'll have to address this in a follow-up. I'm still trying to sort out the missing rows problem.
#10046 will address the `withRetry` part of the problem.
This now allows for `LEAD()`, `LAG()`, and regular window functions with negative values for `preceding`/`following` window bounds.
This commit fixes the batching. The new exec should not have to receive batched input.
Looks good to me
1. Renamed config '.extent' to '.max'. 2. Fixed documentation for said config. 3. Removed TODOs that were already handled.
I'll rebase and retest this shortly, after regenerating docs.
There seems to be a bug in the handling for …
The failing test is a weird one. I've boiled it down to the following:

```python
# (Assumes the usual spark-rapids integration-test imports: asserts, data_gen,
#  marks, and pyspark.sql's Window and functions-as-f.)
@ignore_order(local=True)
@approximate_float
@pytest.mark.parametrize('batch_size', ['1g'], ids=idfn)
@pytest.mark.parametrize('a_b_gen', [long_gen], ids=meta_idfn('partAndOrderBy:'))
@pytest.mark.parametrize('c_gen', [StructGen(children=[['child_int', IntegerGen()]])], ids=idfn)
@allow_non_gpu(*non_utc_allow)
def test_myth_repro(a_b_gen, c_gen, batch_size):
    conf = {'spark.rapids.sql.batchSizeBytes': batch_size}
    base_window_spec = Window.partitionBy('a').orderBy('b', 'c')
    def do_it(spark):
        # Repro FAIL!
        df = spark.read.parquet("/tmp/repro_input_cpu_parquet")
        df = df.withColumn('row_num', f.row_number().over(base_window_spec))
        df = df.withColumn('lead_def_c', f.lead('c', 2, None).over(base_window_spec))
        return df
        # Repro: WORKING!
        # return df.selectExpr(
        #     "ROW_NUMBER() OVER (PARTITION BY a ORDER BY b, c) row_num",
        #     "LEAD(C, 2, NULL) OVER (PARTITION BY a ORDER BY b, c) lead_def_c"
        # )
    assert_gpu_and_cpu_are_equal_collect(do_it, conf=conf)
```

Calling … This is very odd. Instrumentation in the … This does not happen from SQL, for the same query:

```sql
SELECT ROW_NUMBER() OVER (PARTITION BY a ORDER BY b, c) row_num,
       LEAD(c, 2, NULL) OVER (PARTITION BY a ORDER BY b, c) lead_def_c
FROM my_repro_table
```

Nor does this repro from the command line. Nor with …
This is really odd, because `ROW_NUMBER` is supposed to be a running window agg, so they should not be in the same window operation at all. Unless it is something to do with `LEAD` by itself being a problem.
You're right about that. And the plan does indicate that these operations are addressed in separate execs:

```
== Physical Plan ==
GpuColumnarToRow false
+- GpuBatchedBoundedWindow [a#212L, b#213L, c#214, gpulead(c#214, 2, null) gpuwindowspecdefinition(a#212L, b#213L ASC NULLS FIRST, c#214 ASC NULLS FIRST, gpuspecifiedwindowframe(RowFrame, 2, 2)) AS lead_def_c#225, row_num#219], [a#212L], [b#213L ASC NULLS FIRST, c#214 ASC NULLS FIRST]
   +- GpuRunningWindow [a#212L, b#213L, c#214, gpurownumber$() gpuwindowspecdefinition(a#212L, b#213L ASC NULLS FIRST, c#214 ASC NULLS FIRST, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(currentrow$()))) AS row_num#219], [a#212L], [b#213L ASC NULLS FIRST, c#214 ASC NULLS FIRST]
...
```

The progress has been slow, but I was wrong about the following:

> Nor does this repro from the command line.

I have a repro from the shell, not just from …. The operations work fine individually. 😕
@revans2 has cracked it: Looks like reordering the execs causes the output columns to be reordered as well. I'm testing out the fix. I should have an update to this PR shortly.
My only request is that we have a follow-on issue to add retry to this.
I'll raise one and start on it shortly. Edit: I have filed #10046 for the follow-on.
Thank you for the review and guidance, @revans2. This has been merged.
Fixes #1860.
This commit adds support for batched processing of window aggregations where the window-extents are row-based and (finitely) bounded.
Example query:
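An illustrative query of the kind this change targets (hypothetical table and column names; the frame is finite and row-based in both directions):

```sql
SELECT a, b,
       COUNT(1) OVER (PARTITION BY a ORDER BY b
                      ROWS BETWEEN 2 PRECEDING AND 3 FOLLOWING) AS cnt
FROM my_table
```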
The algorithm is described at length in #1860. In brief, `GpuBatchedBoundedWindowExec` is used to batch the input into chunks that fit into GPU memory.

Note that window bounds might be specified with negative offsets; these are also supported. As a consequence, `LEAD()` and `LAG()` are supported as well.
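For intuition, here is a toy, CPU-only sketch of the batching idea (simplifying assumptions: a single partition and a windowed sum; hypothetical structure, not the plugin's actual code):

```scala
// Toy sketch of batched, row-bounded window processing (NOT the plugin's code).
// Assumptions: one partition, frame = ROWS BETWEEN p PRECEDING AND f FOLLOWING.
object BoundedWindowBatchingSketch {
  def main(args: Array[String]): Unit = {
    val p = 1; val f = 2
    val batches = Seq(Seq(1, 2, 3, 4), Seq(5, 6), Seq(7, 8, 9))

    var cache = Seq.empty[Int]   // tail rows carried over from prior input
    var emitted = 0              // rows of `cache` already emitted (kept only as `p` context)
    val out = scala.collection.mutable.Buffer[Int]()

    // Windowed sum over one in-memory chunk.
    def windowSums(rows: Seq[Int]): Seq[Int] =
      rows.indices.map(i => rows.slice(math.max(0, i - p), math.min(rows.size, i + f + 1)).sum)

    for ((batch, idx) <- batches.zipWithIndex) {
      val combined = cache ++ batch
      val isLast = idx == batches.size - 1
      // The last `f` rows have incomplete frames until more input arrives.
      val completeUpTo = if (isLast) combined.size else math.max(emitted, combined.size - f)
      out ++= windowSums(combined).slice(emitted, completeUpTo)
      // Carry forward the unemitted tail, plus `p` rows of preceding context.
      val keepFrom = math.max(0, completeUpTo - p)
      cache = combined.drop(keepFrom)
      emitted = completeUpTo - keepFrom
    }
    println(out.mkString(", "))  // identical to computing over all rows at once
  }
}
```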
This implementation falls back to unbatched processing (via `GpuWindowExec`) if a window's preceding/following bounds exceed a configurable maximum (defaulting to 100 rows in either direction). This may be reconfigured via:
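`spark.rapids.sql.window.batched.bounded.row.max` (this key name is inferred from the `.extent` → `.max` rename discussed earlier in the thread; the plugin's generated config docs are authoritative).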