Code supporting JupyterCon 2020 talk "High performance Jupyter: Faster workloads with Dask and RAPIDS".
Slides are here
Dask is a parallel computing framework that scales from your laptop to a cluster of thousands of machines. RAPIDS is a GPU-computing framework that pushes traditional CPU workloads to the GPU. Dask and RAPIDS together allow you to scale both up and out! There are several notebooks in this repo that progressively tell the story of accelerating Jupyter with these two tools:
Notebook | Hardware | Tools | Data size, compute time |
---|---|---|---|
laptop.ipynb | Laptop | Pandas/Scikit | 1x, baseline 🔴 |
dask.ipynb | Laptop | Dask | 10x, slow 🟡 |
dask-cluster.ipynb | Cluster (CPU) | Dask | 10x, fast 🟢 |
rapids.ipynb | GPU | RAPIDS | 1x, super fast ⚡️⚡️ |
rapids-cluster.ipynb | Cluster (GPU) | RAPIDS+Dask | 10x, super fast 🤯🤯 |
The laptop.ipynb and dask.ipynb notebooks can be run on any machine with more than 4 GB of RAM. The rapids.ipynb notebook can be run on any machine with a CUDA-enabled GPU. The dask-cluster.ipynb and rapids-cluster.ipynb notebooks need to be run on clusters of machines (most easily obtained by renting from a cloud provider).
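As a rough pre-flight check, the hardware requirements above can be sketched in a few lines of standard-library Python. This is only a heuristic and not part of the repo: the helper names are hypothetical, the RAM lookup relies on `os.sysconf` (unavailable on Windows), and finding `nvidia-smi` on the PATH only suggests that an NVIDIA driver is installed, not that RAPIDS will work.

```python
import os
import shutil

MIN_RAM_GB = 4  # threshold taken from the paragraph above


def total_ram_gb():
    """Return physical RAM in GiB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, where os.sysconf is unavailable
    return page_size * page_count / 1024 ** 3


def has_nvidia_driver():
    """Heuristic: nvidia-smi on PATH suggests an NVIDIA driver is installed."""
    return shutil.which("nvidia-smi") is not None


def runnable_notebooks(ram_gb, gpu):
    """Map detected hardware to the single-machine notebooks it can likely run."""
    notebooks = []
    if ram_gb is not None and ram_gb > MIN_RAM_GB:
        notebooks += ["laptop.ipynb", "dask.ipynb"]
    if gpu:
        notebooks.append("rapids.ipynb")
    return notebooks


print(runnable_notebooks(total_ram_gb(), has_nvidia_driver()))
```

The cluster notebooks are left out on purpose, since a single-machine check cannot say anything about a cluster.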
Here are some timing comparisons from the different notebooks. Please note that repeated experiments were not performed, and hardware specifications were different for each notebook. This is meant to serve as a rough overview of the speedups from the different tools. Also note that the multi-node environments would continue to show speed improvements by adding more machines.
(all times reported in seconds)
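For readers who want to collect comparable numbers on their own hardware, here is a minimal sketch of a wall-clock timer. It is an assumption about how such timings could be taken, not a description of how the figures below were measured; the `timed` helper and the stand-in workload are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent in the with-block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with timed("sum of squares", results):
    total = sum(i * i for i in range(100_000))  # stand-in for a notebook cell

for label, seconds in results.items():
    print(f"{label}: {seconds:.4f} s")
```

With lazy frameworks such as Dask, the timed block should include the step that triggers the computation, otherwise only the graph construction is measured.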
Small data size (~7 million rows)
Task | Single-node CPU (Pandas/Scikit) | Single GPU (RAPIDS) |
---|---|---|
.read_csv() | 80 | 8.85 |
.describe() | 4.33 | 0.82 |
train_test_split() | 2.77 | 0.64 |
Random forest | 149 | 1.31 |
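To turn the small-data table into speedup factors, the CPU time can be divided by the GPU time for each task. The sketch below copies the numbers from the table as-is; since each figure comes from a single run on different hardware, the ratios are only indicative.

```python
# (task, CPU seconds, GPU seconds), copied from the small-data table above
timings = [
    (".read_csv()", 80, 8.85),
    (".describe()", 4.33, 0.82),
    ("train_test_split()", 2.77, 0.64),
    ("Random forest", 149, 1.31),
]


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


speedups = {task: speedup(cpu, gpu) for task, cpu, gpu in timings}
for task, factor in speedups.items():
    print(f"{task:<20} {factor:6.1f}x")
```

On these numbers the gain is largest for the random forest (over 100x) and closer to 4x to 9x for the data loading and preparation steps.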
Large data size (~85 million rows)
Task | CPU cluster (Dask) | GPU cluster (RAPIDS+Dask) |
---|---|---|
Row count/size | 70 | 32.6 |
.describe() | 48.7 | |
Feature eng (persist()) | 40.5 | 17.6 |
Random forest | 3.82 |
Grid search (~300,000 rows)
Task | Single-node CPU (Pandas/Scikit) | CPU cluster (Dask) |
---|---|---|
Grid search | 226 | 17.2 |
The environment.yml file lists the packages required to run the laptop.ipynb and dask.ipynb notebooks. After creating the environment, a couple of extra commands are needed to initialize the Dask extension for JupyterLab, and then you can fire up JupyterLab!
conda env create -f environment.yml
conda activate dask
jupyter labextension install dask-labextension
jupyter serverextension enable dask_labextension
jupyter lab
RAPIDS requires a Linux OS and a CUDA-enabled GPU, so the installation will differ depending on your hardware. RAPIDS has a handy guide here that gives you the conda install command to run! There is also a JupyterLab extension for monitoring GPU usage, jupyterlab-nvdashboard, which you can install and enable with the commands below.
conda activate dask
conda install ... # command from RAPIDS guide
conda install -c conda-forge jupyterlab-nvdashboard
jupyter labextension install jupyterlab-nvdashboard
jupyter lab
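Before launching the notebooks, it can help to confirm that the expected packages are importable from the active environment. The sketch below uses only the standard library; the module lists are assumptions about what the notebooks import (the contents of environment.yml are not reproduced here), so adjust them as needed.

```python
from importlib.util import find_spec

# Assumed top-level module names; adjust to match what the notebooks import.
CPU_MODULES = ["pandas", "sklearn", "dask", "distributed"]
GPU_MODULES = ["cudf", "cuml"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


for group, names in [("CPU", CPU_MODULES), ("GPU", GPU_MODULES)]:
    missing = missing_modules(names)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{group} notebooks: {status}")
```

`find_spec` only locates the modules without importing them, so the check stays fast and does not initialize the GPU.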
In the talk I use Saturn Cloud for the rapids.ipynb, dask-cluster.ipynb, and rapids-cluster.ipynb notebooks (disclosure: I work at Saturn Cloud). Saturn makes it easy to configure Python environments and launch machines (and clusters!) that support Dask and RAPIDS. You can get going pretty quickly with a free trial on the Hosted version and run the notebooks there.
You need two separate Projects, one for the CPU (Dask) notebooks and one for the GPU (RAPIDS) notebooks:
- Name: "dask"
- Size: "Large - 2 cores - 16 GB RAM"
- Image: "saturncloud/saturn:*"
- Name: "rapids"
- Size: "T4-XLarge - 4 cores - 16 GB RAM - 1 GPU"
- Image: "saturncloud/saturn-gpu:*"
- Start script:
conda install -c conda-forge -n base -y jupyterlab-nvdashboard
jupyter labextension install jupyterlab-nvdashboard
Once you start the Jupyter server and jump into JupyterLab, open a new Terminal window to grab the code:
git clone https://github.com/rikturr/high-performance-jupyter /tmp/high-performance-jupyter
cp -r /tmp/high-performance-jupyter/* /home/jovyan/project/
The notebooks will take care of the rest!