Use sparkMeasure to instrument Python/PySpark code

SparkMeasure can be used to instrument your Python code to measure Apache Spark workload. Use this for example for performance troubleshooting, application instrumentation, workload studies, etc.

Example code

You can find an example of how to instrument a Scala application running Apache Spark jobs at this link:
link to example Python application

How to run the example:

bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23 <path_to_examples>/test_sparkmeasure_python.py

Some relevant snippet of code are:

   from pyspark.sql import SparkSession
   spark = (SparkSession.builder
            .appName("test")
            .getOrCreate()
           )

   from sparkmeasure import StageMetrics
   stagemetrics = StageMetrics(spark)

   stagemetrics.begin()
   spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(1000)").show()
   stagemetrics.end()

   # print report to standard output
   stagemetrics.print_report()
   
   # get metric values as a dictionary
   metrics = stagemetrics.aggregate_stagemetrics()
   print(f"metrics elapsedTime = {metrics.get('elapsedTime')}")
   
   # Introduced in sparkMeasure v0.21, memory metrics report:
   stageMetrics.print_memory_report()

   # save session metrics data in json format (default)
   df = stagemetrics.create_stagemetrics_DF("PerfStageMetrics")
   stagemetrics.save_data(df.orderBy("jobId", "stageId"), "/tmp/stagemetrics_test1")

   aggregatedDF = stagemetrics.aggregate_stagemetrics_DF("PerfStageMetrics")
   stagemetrics.save_data(aggregatedDF, "/tmp/stagemetrics_report_test2")

Note: if you want to collect metrics at the task execution level, you can use TaskMetrics instead of StamgeMetrics. The details are discussed in the examples for Python shell and notebook.

Run sparkMeasure using the packaged version from Maven Central

This is how to run sparkMeasure using a packaged version in Maven Central

bin/spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:0.24 your_python_code.py

// alternative: just download and use the jar (it is only needed in the driver) as in:
bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.24.jar ...

Download and build sparkMeasure (optional)

If you want to build from the latest development version:

git clone https://github.com/lucacanali/sparkmeasure
cd sparkmeasure
sbt +package
ls -l target/scala-2.12/spark-measure*.jar  # location of the compiled jar

cd python
pip install .

# Run as in one of these examples:
bin/spark-submit --jars path>/spark-measure_2.12-0.25-SNAPSHOT.jar ...

# alternative, set classpath for the driver (sparkmeasure code runs only in the driver)
bin/spark-submit --conf spark.driver.extraClassPath=<path>/spark-measure_2.12-0.25-SNAPSHOT.jar ...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Instrument_Python_code.md

Instrument_Python_code.md

Use sparkMeasure to instrument Python/PySpark code

Example code

Run sparkMeasure using the packaged version from Maven Central

Download and build sparkMeasure (optional)

Files

Instrument_Python_code.md

Latest commit

History

Instrument_Python_code.md

File metadata and controls

Use sparkMeasure to instrument Python/PySpark code

Example code

Run sparkMeasure using the packaged version from Maven Central

Download and build sparkMeasure (optional)