Commit 0cb527c

Merge pull request #306 from nvliyuan/main-v2306-release

merge branch-23.06 to main branch

2 parents 3cff617 + 5a69221

70 files changed (+17633 -613 lines)

.github/workflows/auto-merge.yml (+4 -4)
````diff
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
 on:
   pull_request_target:
     branches:
-      - branch-23.04
+      - branch-23.06
     types: [closed]
 
 jobs:
@@ -29,14 +29,14 @@ jobs:
     steps:
       - uses: actions/checkout@v3
         with:
-          ref: branch-23.04 # force to fetch from latest upstream instead of PR ref
+          ref: branch-23.06 # force to fetch from latest upstream instead of PR ref
 
       - name: auto-merge job
         uses: ./.github/workflows/auto-merge
         env:
           OWNER: NVIDIA
           REPO_NAME: spark-rapids-examples
-          HEAD: branch-23.04
-          BASE: branch-23.06
+          HEAD: branch-23.06
+          BASE: branch-23.08
           AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
````

docs/get-started/xgboost-examples/csp/databricks/databricks.md (+39 -50)
````diff
@@ -14,55 +14,26 @@ The number of GPUs per node dictates the number of Spark executors that can run
 
 Start A Databricks Cluster
 --------------------------
-
-Create a Databricks cluster by going to "Compute", then clicking `+ Create compute`. Ensure the
-cluster meets the prerequisites above by configuring it as follows:
+Before creating the cluster, we will need to create an [initialization script](https://docs.databricks.com/clusters/init-scripts.html) for the
+cluster to install the RAPIDS jars. Databricks recommends storing all cluster-scoped init scripts using workspace files.
+Each user has a Home directory configured under the /Users directory in the workspace.
+Navigate to your home directory in the UI, select **Create** > **File** from the menu, and
+create an `init.sh` script with the following contents:
+```bash
+#!/bin/bash
+sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar
+```
 1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
    Prerequisites section.
 2. Choose the number of workers that matches the number of GPUs you want to use.
 3. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
-   p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker
-   (although they can be used for the driver node). For Azure, choose GPU nodes such as
-   Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs.
+   For Azure, choose GPU nodes such as Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs.
 4. Select the driver type. Generally this can be set to be the same as the worker.
-5. Start the cluster.
-
-Advanced Cluster Configuration
---------------------------
-
-We will need to create an initialization script for the cluster that installs the RAPIDS jars to the
-cluster.
-
-1. To create the initialization script, import the initialization script notebook from the repo to
-   your workspace. See [Managing Notebooks](https://docs.databricks.com/notebooks/notebooks-manage.html#id2)
-   for instructions on how to import a notebook.
-   Select the version of the RAPIDS Accelerator for Apache Spark based on the Databricks runtime
-   version:
-   - [Databricks 10.4 LTS ML](https://docs.databricks.com/release-notes/runtime/10.4ml.html#system-environment)
-     has CUDA 11 installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML.
-   - [Databricks 11.3 LTS ML](https://docs.databricks.com/release-notes/runtime/11.3ml.html#system-environment)
-     has CUDA 11 installed. Users will need to use 23.04.0 or later on Databricks 11.3 LTS ML.
-
-   In both cases use [generate-init-script.ipynb](./generate-init-script.ipynb) which will install
-   the RAPIDS Spark plugin.
-
-2. Once you are in the notebook, click the “Run All” button.
-3. Ensure that the newly created init.sh script is present in the output from cell 2 and that the
-   contents of the script are correct.
-4. Go back and edit your cluster to configure it to use the init script. To do this, click the
-   “Compute” button on the left panel, then select your cluster.
-5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init
-   Scripts” tab in the advanced options section, and paste the initialization script:
-   `dbfs:/databricks/init_scripts/init.sh`, then click “Add”.
-
-   ![Init Script](../../../../img/databricks/initscript.png)
-
+5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init Scripts” tab in
+   the advanced options section, and paste the workspace path to the initialization script: `/Users/user@domain/init.sh`, then click “Add”.
+   ![Init Script](../../../../img/databricks/initscript.png)
 6. Now select the “Spark” tab, and paste the following config options into the Spark Config section.
-   Change the config values based on the workers you choose. See Apache Spark
+   Change the config values based on the workers you choose. See Apache Spark
    [configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator
    for Apache Spark [descriptions](https://nvidia.github.io/spark-rapids/docs/configs.html) for each config.
````
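The workspace-file flow introduced above can also be scripted. Below is a minimal sketch (not part of this commit) that uploads the same `init.sh` through the Databricks Workspace API 2.0; the `DATABRICKS_HOST`/`DATABRICKS_TOKEN` environment variables and the `/Users/user@domain/init.sh` path are illustrative assumptions, so verify them against your workspace:

```python
# Hypothetical helper: upload the init script above as a workspace file
# using the Databricks Workspace API 2.0 (POST /api/2.0/workspace/import).
import base64
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token (assumption)

# Same script body as the init.sh in the diff above.
INIT_SH = (
    "#!/bin/bash\n"
    "sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.06.0.jar "
    "https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/"
    "rapids-4-spark_2.12-23.06.0.jar\n"
)

resp = requests.post(
    f"{HOST}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "path": "/Users/user@domain/init.sh",  # placeholder path from the doc
        "format": "AUTO",                      # import as a plain workspace file
        "content": base64.b64encode(INIT_SH.encode()).decode(),
        "overwrite": True,
    },
)
resp.raise_for_status()
```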

````diff
@@ -74,18 +45,36 @@ cluster.
    like the CPU side. Having the value smaller is fine as well.
    Note: Please remove the `spark.task.resource.gpu.amount` config for a single-node Databricks
    cluster because Spark local mode does not support GPU scheduling.
-
+
 ```bash
-spark.plugins com.nvidia.spark.SQLPlugin
-spark.task.resource.gpu.amount 0.1
-spark.rapids.memory.pinnedPool.size 2G
-spark.rapids.sql.concurrentGpuTasks 2
+spark.plugins com.nvidia.spark.SQLPlugin
+spark.task.resource.gpu.amount 0.1
+spark.rapids.memory.pinnedPool.size 2G
+spark.rapids.sql.concurrentGpuTasks 2
 ```
 
 ![Spark Config](../../../../img/databricks/sparkconfig.png)
 
-7. Once you’ve added the Spark config, click “Confirm and Restart”.
-8. Once the cluster comes back up, it is now enabled for GPU-accelerated Spark with RAPIDS and cuDF.
+   If running Pandas UDFs with GPU support from the plugin, at least three additional options are
+   required, as shown below. The `spark.python.daemon.module` option selects the right Python daemon
+   module for Databricks. On Databricks, the Python runtime requires different parameters than the
+   Spark one, so a dedicated Python daemon module `rapids.daemon_databricks` is created and should
+   be specified here. Set the config
+   [`spark.rapids.sql.python.gpu.enabled`](https://nvidia.github.io/spark-rapids/docs/configs.html#sql.python.gpu.enabled) to `true` to
+   enable GPU support for Python. Add the path of the plugin jar (assuming it is placed under
+   `/databricks/jars/`) to the `spark.executorEnv.PYTHONPATH` option. For more details see
+   [GPU Scheduling For Pandas UDF](https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html#gpu-support-for-pandas-udf).
+
+```bash
+spark.rapids.sql.python.gpu.enabled true
+spark.python.daemon.module rapids.daemon_databricks
+spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.06.0.jar:/databricks/spark/python
+```
+   Note that the Python memory pool requires the cudf library to be installed on each worker node
+   (`pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com`); alternatively, disable the
+   Python memory pool with `spark.rapids.python.memory.gpu.pooling.enabled=false`.
+
+7. Click `Create Cluster`; the cluster is now enabled for GPU-accelerated Spark.
 
 Install the xgboost4j_spark jar in the cluster
 ---------------------------
````
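To make the Pandas UDF options above concrete, here is a generic PySpark sketch (not from this repo) of the kind of Pandas UDF those three configs accelerate. With `spark.rapids.sql.python.gpu.enabled=true` and the `rapids.daemon_databricks` daemon module set, the plugin coordinates the Python workers that execute such a function with GPU task scheduling:

```python
# Generic Pandas UDF example (illustrative; runs on any PySpark 3.x session).
# The RAPIDS configs above control how the plugin schedules the Python
# workers that execute this function alongside GPU tasks.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # Runs in a Python worker process launched via spark.python.daemon.module.
    return v + 1.0

df = spark.range(1_000_000).select(col("id").cast("double").alias("x"))
df.select(plus_one(col("x")).alias("x_plus_one")).show(5)
```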

docs/get-started/xgboost-examples/csp/databricks/generate-init-script-10.4.ipynb (-166)

This file was deleted.
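For completeness, the setup this commit documents (a GPU ML runtime, the workspace-file init script, and the Spark configs above) can also be expressed as a single Clusters API 2.0 payload instead of UI steps. This is a sketch under stated assumptions: the cluster name, node type, and runtime version are placeholders, and the field names follow the public `clusters/create` API but should be verified against your Databricks API version:

```python
# Sketch: create a GPU cluster wired to the init script and Spark configs
# above via the Databricks Clusters API 2.0 (POST /api/2.0/clusters/create).
# Node/runtime values below are illustrative assumptions, not from the commit.
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]

payload = {
    "cluster_name": "rapids-xgboost-demo",       # hypothetical name
    "spark_version": "12.2.x-gpu-ml-scala2.12",  # assumed GPU ML runtime
    "node_type_id": "g4dn.xlarge",               # AWS example from the doc
    "num_workers": 2,
    "spark_conf": {
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.task.resource.gpu.amount": "0.1",
        "spark.rapids.memory.pinnedPool.size": "2G",
        "spark.rapids.sql.concurrentGpuTasks": "2",
    },
    "init_scripts": [
        # Workspace-file init script, matching the path used in the doc.
        {"workspace": {"destination": "/Users/user@domain/init.sh"}}
    ],
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```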
