Training material for the course "Introduction to Apache Spark APIs for Data Processing"

Launch badges: SWAN | Binder

Course website with videos and slides: https://sparktraining.web.cern.ch/

Contents

See also the notebooks on display in the CERN SWAN Gallery

Contact: [email protected]


Notebooks

Session 1

Tutorial-DataFrame.ipynb
Solutions-DataFrame.ipynb
Examples-Pandas on Spark
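
Session 1 introduces the Spark DataFrame API. As a taste of what the tutorial covers, here is a minimal, self-contained sketch; the rows and column names are made up for illustration:

```python
# A minimal DataFrame sketch in the spirit of Session 1; the data and
# column names below are illustrative, not taken from the course data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrame-demo").getOrCreate()

df = spark.createDataFrame(
    [("muon", 25.3), ("electron", 12.1), ("muon", 40.8)],
    ["particle", "pt"],
)

# Basic transformations: filter, group, aggregate
df.filter(df.pt > 20).groupBy("particle").count().show()
```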

Session 2

Tutorial-SparkSQL.ipynb
HandsOn-SparkSQL_exercises.ipynb
HandsOn-SparkSQL_with_solutions.ipynb
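
Session 2 focuses on Spark SQL. The notebooks build on the pattern sketched below: register a DataFrame as a temporary view, then query it with SQL (the "people" view and its columns are illustrative):

```python
# A minimal Spark SQL sketch in the spirit of Session 2; the table
# and columns are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQL-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34), (2, "bob", 45), (3, "carol", 29)],
    ["id", "name", "age"],
)
df.createOrReplaceTempView("people")

# SQL and the DataFrame API are two front ends to the same engine
spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age").show()
```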

Session 3

Tutorial-SparkStreaming.ipynb
ML_Demo1_Classifier.ipynb
ML_Demo2_Regression.ipynb
Spark_JDBC_Oracle.ipynb
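
Session 3 covers Spark Structured Streaming, machine learning, and JDBC access. As a sketch of the streaming part only, the example below uses Spark's built-in rate source and a windowed count; the tutorial notebooks use their own sources and sinks:

```python
# A minimal Structured Streaming sketch, assuming the built-in "rate"
# source, which emits (timestamp, value) rows at a fixed rate.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("Streaming-demo").getOrCreate()

stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Count events per 10-second window
counts = stream.groupBy(window(stream.timestamp, "10 seconds")).count()

# Print windowed counts to the console for a short while, then stop
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination(30)
query.stop()
```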

Session 4

Demo_Spark_on_Hadoop.ipynb
Demo_Dimuon_mass_spectrum.ipynb
NXCals-example.ipynb
NXCals-example_bis.ipynb
TPCDS_PySpark_CERN_SWAN_getstarted.ipynb
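
The dimuon demo in Session 4 computes an invariant-mass spectrum. Below is a sketch of the core calculation under the massless-muon approximation, where M = sqrt(2 * pt1 * pt2 * (cosh(eta1 - eta2) - cos(phi1 - phi2))); the column names and sample values are assumptions, not the actual open-data schema used in the notebook:

```python
# A sketch of the dimuon invariant-mass formula as a DataFrame
# expression; the column names (pt1, eta1, ...) are illustrative.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("Dimuon-demo").getOrCreate()

# One made-up row with the kinematics of a muon pair
muons = spark.createDataFrame(
    [(30.0, 1.2, 0.5, 28.0, -0.7, 2.9)],
    ["pt1", "eta1", "phi1", "pt2", "eta2", "phi2"],
)

mass = F.sqrt(
    2 * F.col("pt1") * F.col("pt2")
    * (F.cosh(F.col("eta1") - F.col("eta2"))
       - F.cos(F.col("phi1") - F.col("phi2")))
)
muons.withColumn("dimuon_mass", mass).show()
```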

Additional SWAN gallery notebooks

LHCb_OpenData_Spark.ipynb
Dimuon_Spark_ROOT_RDataFrame.ipynb


How to run the notebooks from the CERN SWAN Notebook Service

  • Open SWAN and clone the repo using the SWAN badge above
    • note that this can take a couple of minutes
    • alternatively, clone the repo from the SWAN GUI at https://swan.web.cern.ch:
      • find and click the button "Download project from git"
      • when prompted, clone the repo https://github.com/cerndb/SparkTraining.git
  • Open the tutorial notebooks under SparkTraining -> notebooks

How to run the notebooks from private Jupyter installations or other notebook services (Colab, Binder, etc.)

  • pip install pyspark (a quick verification sketch follows this list)
  • git clone https://github.com/cerndb/SparkTraining
    • or clone the mirror repository at https://gitlab.cern.ch/db/SparkTraining
  • Start Jupyter: jupyter-notebook
  • Run the notebooks on Colab:
    • with this option you will also need to download the data folder and pip install pyspark
  • Run on Binder using the Binder badge above
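
Once pyspark is installed, the sketch below can help verify that Spark runs locally before opening the notebooks. It assumes nothing beyond a plain pip installation:

```python
# Quick check that a pip-installed PySpark works; runs Spark in
# local mode, no cluster required.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")   # use all local cores
         .appName("install-check")
         .getOrCreate())

print(spark.version)
spark.range(5).show()  # tiny DataFrame with the numbers 0..4
spark.stop()
```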
