[SPARK-50788][TESTS] Add Benchmark for Large-Row Dataframe
### What changes were proposed in this pull request?

This PR introduces `LargeRowBenchmark`, a micro benchmark, to the `spark.sql.execution.benchmark` suite. A corresponding function is also added to create large-row dataframes at benchmark run time.

### Why are the changes needed?

Large-row dataframes, especially dataframes with large string cells, are becoming common with businesses like online customer chat. However, it is unknown how well (or how poorly) Spark supports them. This benchmark aims to provide a baseline for Spark's performance and limitations on large-row dataframes. It will also be included in future performance regression checks.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It was tested in GitHub Actions and manually reviewed.

https://github.com/yhuang-db/spark/actions/runs/12716337093 (Java 17)
https://github.com/yhuang-db/spark/actions/runs/12716339158 (Java 21)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49447 from yhuang-db/large-row-benchmark.

Authored-by: Yuchuan Huang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
1 parent 1fd8362, commit 488c362
Showing 3 changed files with 137 additions and 0 deletions.
================================================================================================
Large Row Benchmark
================================================================================================

OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-azure
AMD EPYC 7763 64-Core Processor
#rows: 100, #cols: 10, cell: 1.3 MB:      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
built-in UPPER                                     5909           6154        347         0.0     59088236.5       1.0X
udf UPPER                                          4106           4364        364         0.0     41062501.9       1.4X

OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-azure
AMD EPYC 7763 64-Core Processor
#rows: 1, #cols: 1, cell: 300.0 MB:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
built-in UPPER                                     1317           1319          3         0.0   1317449498.0       1.0X
udf UPPER                                           954            975         25         0.0    953744994.0       1.4X

OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.8.0-1017-azure
AMD EPYC 7763 64-Core Processor
#rows: 1, #cols: 200, cell: 1.0 MB:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
built-in UPPER                                     1118           1138         28         0.0   1117901962.0       1.0X
udf UPPER                                          1145           1210         91         0.0   1145234313.0       1.0X
================================================================================================
Large Row Benchmark
================================================================================================

OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.8.0-1017-azure
AMD EPYC 7763 64-Core Processor
#rows: 100, #cols: 10, cell: 1.3 MB:      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
built-in UPPER                                     6610           6651         58         0.0     66101681.9       1.0X
udf UPPER                                          4289           4291          3         0.0     42892607.0       1.5X

OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.8.0-1017-azure
AMD EPYC 7763 64-Core Processor
#rows: 1, #cols: 1, cell: 300.0 MB:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
built-in UPPER                                     1492           1510         26         0.0   1492292577.0       1.0X
udf UPPER                                          1033           1034          1         0.0   1032584220.0       1.4X

OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.8.0-1017-azure
AMD EPYC 7763 64-Core Processor
#rows: 1, #cols: 200, cell: 1.0 MB:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
built-in UPPER                                     1271           1290         28         0.0   1270654457.0       1.0X
udf UPPER                                          1397           1558        228         0.0   1396607518.0       0.9X
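A reading aid for the tables above: in Spark's benchmark output the `Relative` column is, to my understanding, the baseline case's best time divided by the given case's best time, so values above 1.0X mean the case ran faster than the baseline. A quick sketch using the JDK 17 numbers from the first configuration (the formula is assumed from the output convention, not part of this patch; the helper name is mine):

```scala
// Assumed formula for the "Relative" column: baseline best time / case best
// time. Values > 1.0X mean the case is faster than the baseline.
def relative(baselineBestMs: Double, caseBestMs: Double): Double =
  baselineBestMs / caseBestMs

// JDK 17, first configuration: built-in UPPER (baseline, 6610 ms best)
// vs udf UPPER (4289 ms best).
println(f"${relative(6610, 4289)}%.1fX") // 1.5X, matching the table
```

This is consistent with the other rows as well, e.g. 1271 / 1397 rounds to the reported 0.9X in the last configuration.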
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/LargeRowBenchmark.scala (85 additions, 0 deletions)
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.functions.lit

/**
 * Benchmark to measure performance for large-row tables.
 * {{{
 *   To run this benchmark:
 *   1. without sbt: bin/spark-submit --class <this class>
 *      --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
 *   2. build/sbt "sql/Test/runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
 *      Results will be written to "benchmarks/LargeRowBenchmark-results.txt".
 * }}}
 */
object LargeRowBenchmark extends SqlBasedBenchmark {

  /**
   * Prepares a table with large rows for benchmarking. The table will be written into
   * the given path.
   */
  private def writeLargeRow(path: String, rowsNum: Int, numCols: Int, cellSizeMb: Double): Unit = {
    val stringLength = (cellSizeMb * 1024 * 1024).toInt
    spark.range(rowsNum)
      .select(Seq.tabulate(numCols)(i => lit("a" * stringLength).as(s"col$i")): _*)
      .write.parquet(path)
  }

  private def runLargeRowBenchmark(rowsNum: Int, numCols: Int, cellSizeMb: Double): Unit = {
    withTempPath { path =>
      val benchmark = new Benchmark(
        s"#rows: $rowsNum, #cols: $numCols, cell: $cellSizeMb MB", rowsNum, output = output)
      writeLargeRow(path.getAbsolutePath, rowsNum, numCols, cellSizeMb)
      val df = spark.read.parquet(path.getAbsolutePath)
      df.createOrReplaceTempView("T")
      benchmark.addCase("built-in UPPER") { _ =>
        val sqlSelect = df.columns.map(c => s"UPPER($c) as $c").mkString(", ")
        spark.sql(s"SELECT $sqlSelect FROM T").noop()
      }
      benchmark.addCase("udf UPPER") { _ =>
        val sqlSelect = df.columns.map(c => s"udfUpper($c) as $c").mkString(", ")
        spark.sql(s"SELECT $sqlSelect FROM T").noop()
      }
      benchmark.run()
    }
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("Large Row Benchmark") {
      val udfUpper = (s: String) => s.toUpperCase()
      spark.udf.register("udfUpper", udfUpper(_: String): String)

      val benchmarks = Array(
        Map("rows" -> 100, "cols" -> 10, "cellSizeMb" -> 1.3), // OutOfMemory @ 100, 10, 1.4
        Map("rows" -> 1, "cols" -> 1, "cellSizeMb" -> 300.0),  // OutOfMemory @ 1, 1, 400
        Map("rows" -> 1, "cols" -> 200, "cellSizeMb" -> 1.0)   // OutOfMemory @ 1, 300, 1
      )

      benchmarks.foreach { b =>
        val rows = b("rows").asInstanceOf[Int]
        val cols = b("cols").asInstanceOf[Int]
        val cellSizeMb = b("cellSizeMb").asInstanceOf[Double]
        runLargeRowBenchmark(rows, cols, cellSizeMb)
      }
    }
  }
}
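The size arithmetic in `writeLargeRow` can be sanity-checked without Spark. The sketch below (helper names are mine, not part of the patch) mirrors the patch's conversion of `cellSizeMb` to a string length, and computes the approximate uncompressed string volume of each configuration: the three cases probe many mid-size cells (~1300 MB total), one huge cell (300 MB), and one very wide row (200 MB).

```scala
// Sketch of the size arithmetic in writeLargeRow (helper names are
// illustrative, not from the patch).

// Each cell is the character "a" repeated to the target size, so the
// character count is the same expression the patch uses for stringLength:
def cellChars(cellSizeMb: Double): Int = (cellSizeMb * 1024 * 1024).toInt

// Approximate uncompressed string volume of one configuration, in MB:
def totalMb(rows: Int, cols: Int, cellSizeMb: Double): Double =
  rows * cols * cellSizeMb

println(cellChars(1.3))        // 1363148 characters per 1.3 MB cell
println(totalMb(100, 10, 1.3)) // 1300.0 MB: many mid-size cells
println(totalMb(1, 1, 300.0))  // 300.0 MB: one huge cell
println(totalMb(1, 200, 1.0))  // 200.0 MB: one very wide row
```

The inline comments in the patch's `benchmarks` array suggest these sizes sit just below the OutOfMemory thresholds observed for each shape.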