@@ -14,55 +14,26 @@ The number of GPUs per node dictates the number of Spark executors that can run
Start A Databricks Cluster
--------------------------
-
- Create a Databricks cluster by going to "Compute", then clicking `+ Create compute`. Ensure the
- cluster meets the prerequisites above by configuring it as follows:
+ Before creating the cluster, we will need to create an [initialization script](https://docs.databricks.com/clusters/init-scripts.html) for the
+ cluster to install the RAPIDS jars. Databricks recommends storing all cluster-scoped init scripts using workspace files.
+ Each user has a Home directory configured under the /Users directory in the workspace.
+ Navigate to your home directory in the UI, select **Create** > **File** from the menu,
+ and create an `init.sh` script with the following contents:
+ ```bash
+ #!/bin/bash
+ sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar
+ ```
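+ As an optional sanity check (not part of the original steps), you can confirm the jar URL above is
+ reachable before attaching the script to a cluster, and, once the cluster is running, verify from a
+ `%sh` notebook cell that the jar landed in `/databricks/jars/`. A minimal sketch, assuming `wget`
+ is available where you run it:
+ ```bash
+ # Check that the Maven URL resolves without downloading the jar (run anywhere with network access).
+ wget --spider https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar
+
+ # After the cluster starts, confirm the init script placed the jar (run in a %sh notebook cell).
+ ls -l /databricks/jars/rapids-4-spark_2.12-23.06.0.jar
+ ```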
1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
Prerequisites section.
2. Choose the number of workers that matches the number of GPUs you want to use.
3. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
- p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker
- (although they can be used for the driver node). For Azure, choose GPU nodes such as
- Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs.
+ For Azure, choose GPU nodes such as Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs.
4. Select the driver type. Generally this can be set to be the same as the worker.
- 5. Start the cluster.
-
- Advanced Cluster Configuration
- --------------------------
-
- We will need to create an initialization script for the cluster that installs the RAPIDS jars to the
- cluster.
-
- 1. To create the initialization script, import the initialization script notebook from the repo to
- your workspace. See [Managing
- Notebooks](https://docs.databricks.com/notebooks/notebooks-manage.html#id2) for instructions on
- how to import a notebook.
- Select the version of the RAPIDS Accelerator for Apache Spark based on the Databricks runtime
- version:
- - [Databricks 10.4 LTS
- ML](https://docs.databricks.com/release-notes/runtime/10.4ml.html#system-environment) has CUDA 11
- installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML.
- - [Databricks 11.3 LTS
- ML](https://docs.databricks.com/release-notes/runtime/11.3ml.html#system-environment) has CUDA 11
- installed. Users will need to use 23.04.0 or later on Databricks 11.3 LTS ML.
-
- In both cases use
- [generate-init-script.ipynb](./generate-init-script.ipynb) which will install
- the RAPIDS Spark plugin.
-
- 2. Once you are in the notebook, click the “Run All” button.
- 3. Ensure that the newly created init.sh script is present in the output from cell 2 and that the
- contents of the script are correct.
- 4. Go back and edit your cluster to configure it to use the init script. To do this, click the
- “Compute” button on the left panel, then select your cluster.
- 5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init
- Scripts” tab in the advanced options section, and paste the initialization script:
- `dbfs:/databricks/init_scripts/init.sh`, then click “Add”.
-
- ![Init Script](../../../../img/databricks/initscript.png)
-
+ 5. Navigate down to the “Advanced Options” section, select the “Init Scripts” tab, and paste the
+ workspace path to the initialization script: `/Users/user@domain/init.sh`, then click “Add”.
+ ![Init Script](../../../../img/databricks/initscript.png)
6. Now select the “Spark” tab, and paste the following config options into the Spark Config section.
- Change the config values based on the workers you choose. See Apache Spark
+ Change the config values based on the workers you choose. See Apache Spark
[configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator
for Apache Spark [descriptions](https://nvidia.github.io/spark-rapids/docs/configs.html) for each config.
@@ -74,18 +45,36 @@ cluster.
like the CPU side. Having the value smaller is fine as well.
Note: Please remove the `spark.task.resource.gpu.amount` config for a single-node Databricks
cluster because Spark local mode does not support GPU scheduling.
-
+
```bash
- spark.plugins com.nvidia.spark.SQLPlugin
- spark.task.resource.gpu.amount 0.1
- spark.rapids.memory.pinnedPool.size 2G
- spark.rapids.sql.concurrentGpuTasks 2
+ spark.plugins com.nvidia.spark.SQLPlugin
+ spark.task.resource.gpu.amount 0.1
+ spark.rapids.memory.pinnedPool.size 2G
+ spark.rapids.sql.concurrentGpuTasks 2
```
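+ As an illustration only (not from the original guide): on a hypothetical worker with 8 CPU cores
+ and 1 GPU, `spark.task.resource.gpu.amount` could be set to 0.125 (1/8) so that up to eight
+ concurrent tasks share the GPU, matching the CPU-side task parallelism described above; adjust the
+ value to the core count of the instance type you actually choose.
+ ```bash
+ spark.plugins com.nvidia.spark.SQLPlugin
+ spark.task.resource.gpu.amount 0.125
+ spark.rapids.memory.pinnedPool.size 2G
+ spark.rapids.sql.concurrentGpuTasks 2
+ ```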

![Spark Config](../../../../img/databricks/sparkconfig.png)

- 7. Once you’ve added the Spark config, click “Confirm and Restart”.
- 8. Once the cluster comes back up, it is now enabled for GPU-accelerated Spark with RAPIDS and cuDF.
+ If running Pandas UDFs with GPU support from the plugin, at least three additional options,
+ shown below, are required. The `spark.python.daemon.module` option selects the correct Python
+ daemon module for Databricks. On Databricks, the Python runtime requires different parameters than the
+ Spark one, so a dedicated Python daemon module, `rapids.daemon_databricks`, is provided and should
+ be specified here. Set the config
+ [`spark.rapids.sql.python.gpu.enabled`](https://nvidia.github.io/spark-rapids/docs/configs.html#sql.python.gpu.enabled) to `true` to
+ enable GPU support for Python. Add the path of the plugin jar (assuming it is placed under
+ `/databricks/jars/`) to the `spark.executorEnv.PYTHONPATH` option. For more details, see
+ [GPU Scheduling For Pandas UDF](https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html#gpu-support-for-pandas-udf).
+
+ ```bash
+ spark.rapids.sql.python.gpu.enabled true
+ spark.python.daemon.module rapids.daemon_databricks
+ spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.06.0.jar:/databricks/spark/python
+ ```
+ Note that the Python memory pool requires the cuDF library, so you need to install cuDF on
+ each worker node (`pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com`) or disable the Python memory pool with
+ `spark.rapids.python.memory.gpu.pooling.enabled=false`.
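+ One way to install cuDF on every worker is to extend the `init.sh` script created earlier. This is
+ a sketch, not part of the original steps, and it assumes the `pip` available to init scripts points
+ at the cluster's Python environment (adjust the pip invocation to your runtime if it does not):
+ ```bash
+ #!/bin/bash
+ # Install the RAPIDS plugin jar (same as the original init.sh).
+ sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar
+ # Install cuDF on the node so the GPU-backed Python memory pool can be used for Pandas UDFs.
+ pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com
+ ```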
+
+ 7. Click `Create Cluster`. The cluster is now enabled for GPU-accelerated Spark.

Install the xgboost4j_spark jar in the cluster
---------------------------