Skip to content

Latest commit

 

History

History
31 lines (21 loc) · 799 Bytes

TODO.md

File metadata and controls

31 lines (21 loc) · 799 Bytes

Known issues

  • Not tested with Spark 3.

  • Problem running with Spark 2.4.1 on some platforms: Pandas generates null values. See this example

import numpy as np
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()
  • Problem with Python (< 3.7): SyntaxError: more than 255 arguments if p is too large with pandas udf.

    • Update Python to 3.7 resolves this problem.