yhuang-db/large-row

large-row

Test Data

Raw data: https://github.com/openai/gpt-2-output-dataset

Generating script: generate_string_table.py

# generate test parquet file with 100 rows, 10 columns, where each cell is a 10 MB string
python generate_string_table.py -r 100 -c 10 -s 10

Generated data: large_string_100row_10col_10m.parquet

| string_0 | string_1 | string_2 | ... |
|---|---|---|---|
| Is this restaurant family-friendly ? Yes No Unsure ... | House Majority Whip Steve Scalise has been ... | BY JENNIE MCNULTY\n\nLesbian.com\n\nYou know ... | ... |
| Clinton talks about her time of 'reflection' during ... | Insight Course: Lesson 14\n\nControl ... | The Buddha's Teaching As It Is\n\nIn the fall of 1979 ... | ... |
| ... | ... | ... | ... |
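
The generator script itself is not reproduced in this README. As a rough sketch of what its flags imply, assuming `-r`, `-c`, and `-s` mean rows, columns, and per-cell size in MB, and assuming the output name follows the pattern of the file listed above, the cell-building and naming steps might look like the following. The helper names (`build_cell`, `output_name`, `parse_args`) are hypothetical, and the Parquet write is left out.

```python
import argparse
from itertools import cycle

MB = 1024 * 1024  # assumption: one "MB" is 2**20 characters per cell


def build_cell(texts, size_chars):
    """Concatenate source texts (cycling if needed) and cut to exactly size_chars."""
    if not any(texts):
        raise ValueError("need at least one non-empty source text")
    parts, total = [], 0
    for text in cycle(texts):
        parts.append(text)
        total += len(text)
        if total >= size_chars:
            break
    return "".join(parts)[:size_chars]


def output_name(rows, cols, size_mb):
    """File name pattern inferred from the example file name (an assumption)."""
    return f"large_string_{rows}row_{cols}col_{size_mb}m.parquet"


def parse_args(argv=None):
    """Hypothetical stand-in for the argument parsing in generate_string_table.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-r", type=int, required=True, help="number of rows")
    parser.add_argument("-c", type=int, required=True, help="number of columns")
    parser.add_argument("-s", type=int, required=True, help="cell size in MB")
    return parser.parse_args(argv)


args = parse_args(["-r", "100", "-c", "10", "-s", "10"])
print(output_name(args.r, args.c, args.s))  # large_string_100row_10col_10m.parquet
cell = build_cell(["sample text one. ", "sample text two. "], 1 * MB)
print(len(cell))  # 1048576
```

Cutting the concatenated text at a fixed length keeps every cell the same size, which makes runs with different `-s` values directly comparable.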

Test Jobs

Test Scala file: scala/src/main/scala/UDFUpper.scala

import org.apache.spark.sql.SparkSession

object BuiltinUpper {
  // Scala UDF under test: upper-cases a single cell
  def udf_upper(text: String): String = text.toUpperCase()

  def main(args: Array[String]): Unit = {
    // dataset name (without extension), passed as the first program argument
    val dataset = args(0)
    val file = s"data/${dataset}.parquet"

    // read parquet file and expose it as temp view T
    val spark = SparkSession.builder.appName("UDF-Upper").getOrCreate()
    spark.udf.register("udf_upper", udf_upper(_: String): String)
    val df = spark.read.parquet(file)
    df.createOrReplaceTempView("T")
    df.printSchema()

    // apply the UDF to every column, run the SQL, and write to parquet
    val sql_projection = df.columns.map(c => s"udf_upper(${c}) as ${c}").mkString(", ")
    val sql = s"SELECT ${sql_projection} FROM T"
    val df_out = spark.sql(sql)
    df_out.write.mode("overwrite").parquet("output/udf_upper.parquet")
    spark.stop()
  }
}
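
The query the job runs is just a string built from the column names. As an illustration only, the same construction in plain Python could look like this; `build_projection_sql` is a hypothetical helper, and using Spark SQL's built-in `upper` for the built-in variant is an assumption about how that job differs from the UDF one.

```python
def build_projection_sql(columns, func, view="T"):
    """Apply `func` to every column, keeping the original column names."""
    projection = ", ".join(f"{func}({c}) as {c}" for c in columns)
    return f"SELECT {projection} FROM {view}"


columns = ["string_0", "string_1", "string_2"]

# UDF variant, matching the Scala job above
print(build_projection_sql(columns, "udf_upper"))
# SELECT udf_upper(string_0) as string_0, udf_upper(string_1) as string_1, udf_upper(string_2) as string_2 FROM T

# Built-in variant (assumed): Spark SQL's own upper()
print(build_projection_sql(columns, "upper"))
# SELECT upper(string_0) as string_0, upper(string_1) as string_1, upper(string_2) as string_2 FROM T
```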

Test spark-submit command line, modeled on https://github.com/dongjoon-hyun/spark/blob/master/.github/workflows/benchmark.yml

spark-3.5.3-bin-hadoop3-scala2.13/bin/spark-submit \
  --class "BuiltinUpper" \
  --master local[1] \
  --driver-memory 6g \
  target/scala-2.13/simple-project_2.13-1.0.jar large_string_100row_10col_10m
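
To repeat the same submission over several datasets, the argument list can be assembled programmatically. The sketch below only builds and prints the commands; it does not run them. The paths and flags are copied from the command above, while `submit_command` and the assumption that the other dataset files share the `large_string_` prefix are illustrative.

```python
import shlex

SPARK_SUBMIT = "spark-3.5.3-bin-hadoop3-scala2.13/bin/spark-submit"
JAR = "target/scala-2.13/simple-project_2.13-1.0.jar"


def submit_command(main_class, dataset, driver_memory="6g", master="local[1]"):
    """Return the spark-submit argv for one job/dataset pair (hypothetical helper)."""
    return [
        SPARK_SUBMIT,
        "--class", main_class,
        "--master", master,
        "--driver-memory", driver_memory,
        JAR,
        dataset,  # consumed by the job as its first program argument
    ]


# Assumed dataset names: same prefix as the generated file, varying cell size.
for size in ("1m", "5m", "10m"):
    cmd = submit_command("BuiltinUpper", f"large_string_100row_10col_{size}")
    print(shlex.join(cmd))  # shell-quoted, ready to paste
```

Passing the list to `subprocess.run(cmd, check=True)` would execute each job in turn; keeping the command as a list avoids shell-quoting issues with `local[1]`.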

Result

Jobs (columns): builtin_upper, udf_upper, builtin_length, udf_length

| dataset | result |
|---|---|
| 100row_10col_1m | |
| 100row_10col_5m | |
| 100row_10col_10m | |
| 1row_1col_500m | ✅ (failed w/ pyspark) |
| 1row_1col_1000m | |
| 200row_1col_10m | ✅ (failed w/ pyspark) |
| 250row_1col_10m | |
| 1row_50col_10m | |
| 1row_100col_10m | |
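
The dataset names encode their shape, so the nominal volume of string data in each can be worked out directly. This small sketch assumes the trailing `m` is the per-cell size in MB, as the 10 MB example in the Test Data section suggests; `nominal_size_mb` is a hypothetical helper, and the figure ignores Parquet compression and in-memory string overhead.

```python
import re

# <rows>row_<cols>col_<cell size in MB>m, optionally preceded by a prefix
NAME = re.compile(r"(\d+)row_(\d+)col_(\d+)m$")


def nominal_size_mb(dataset):
    """rows * cols * per-cell MB, parsed from the dataset name."""
    match = NAME.search(dataset)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {dataset}")
    rows, cols, cell_mb = map(int, match.groups())
    return rows * cols * cell_mb


for name in ["100row_10col_10m", "1row_1col_500m", "200row_1col_10m", "1row_100col_10m"]:
    print(f"{name}: {nominal_size_mb(name)} MB")
# 100row_10col_10m: 10000 MB
# 1row_1col_500m: 500 MB
# 200row_1col_10m: 2000 MB
# 1row_100col_10m: 1000 MB
```

These totals give a quick point of comparison against the `--driver-memory` value used for the run.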
