Commit 239389e

Merge remote-tracking branch 'origin/branch-22.12' into main-2212-release
# Conflicts:
#   .github/workflows/auto-merge.yml
#   docs/get-started/xgboost-examples/csp/databricks/generate-init-script-10.4.ipynb
#   docs/get-started/xgboost-examples/csp/databricks/generate-init-script.ipynb
#   docs/get-started/xgboost-examples/on-prem-cluster/kubernetes-scala.md
#   docs/get-started/xgboost-examples/prepare-package-data/preparation-python.md
#   docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md
#   examples/ML+DL-Examples/Spark-cuML/pca/Dockerfile
#   examples/ML+DL-Examples/Spark-cuML/pca/README.md
#   examples/ML+DL-Examples/Spark-cuML/pca/pom.xml
#   examples/ML+DL-Examples/Spark-cuML/pca/spark-submit.sh
#   examples/SQL+DF-Examples/micro-benchmarks/notebooks/micro-benchmarks-gpu.ipynb
#   examples/UDF-Examples/RAPIDS-accelerated-UDFs/README.md
#   examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml
#   examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt
#   examples/UDF-Examples/Spark-cuSpatial/Dockerfile
#   examples/UDF-Examples/Spark-cuSpatial/Dockerfile.awsdb
#   examples/UDF-Examples/Spark-cuSpatial/README.md
#   examples/UDF-Examples/Spark-cuSpatial/gpu-run.sh
#   examples/UDF-Examples/Spark-cuSpatial/notebooks/cuspatial_sample_standalone.ipynb
#   examples/UDF-Examples/Spark-cuSpatial/pom.xml
#   examples/UDF-Examples/Spark-cuSpatial/src/main/native/CMakeLists.txt
#   examples/XGBoost-Examples/README.md
#   examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL+XGBoost.ipynb
#   examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb
#   examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-ETL.ipynb
#   examples/XGBoost-Examples/taxi/notebooks/python/taxi-ETL.ipynb
#   examples/XGBoost-Examples/taxi/notebooks/scala/taxi-ETL.ipynb
2 parents c1af0cd + 1bea5c9 commit 239389e

File tree

65 files changed, +2239 -1038 lines changed

.github/workflows/auto-merge.yml

+5-5
Original file line number | Diff line number | Diff line change
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
1818
on:
1919
pull_request_target:
2020
branches:
21-
- branch-22.10
21+
- branch-22.12
2222
types: [closed]
2323

2424
jobs:
@@ -27,15 +27,15 @@ jobs:
2727
runs-on: ubuntu-latest
2828

2929
steps:
30-
- uses: actions/checkout@v2
30+
- uses: actions/checkout@v3
3131
with:
32-
ref: branch-22.10 # force to fetch from latest upstream instead of PR ref
32+
ref: branch-22.12 # force to fetch from latest upstream instead of PR ref
3333

3434
- name: auto-merge job
3535
uses: ./.github/workflows/auto-merge
3636
env:
3737
OWNER: NVIDIA
3838
REPO_NAME: spark-rapids-examples
39-
HEAD: branch-22.10
40-
BASE: branch-22.12
39+
HEAD: branch-22.12
40+
BASE: branch-23.02
4141
AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR

README.md

+1
@@ -10,6 +10,7 @@ There are broadly four categories of examples in this repo:
1010
2. [Spark XGBoost](./examples/XGBoost-Examples)
1111
3. [Deep Learning/Machine Learning](./examples/ML+DL-Examples)
1212
4. [RAPIDS UDF](./examples/UDF-Examples)
13+
5. [Databricks Tools demo notebooks](./tools/databricks)
1314

1415
For more information on each of the examples please look into respective categories.
1516

docs/get-started/xgboost-examples/csp/aws/ec2.md

+2-2
@@ -177,8 +177,8 @@ spark-submit --master spark://$HOSTNAME:7077 \
177177
${SAMPLE_JAR} \
178178
-num_workers=${NUM_EXECUTORS} \
179179
-format=csv \
180-
-dataPath="train::s3a://spark-xgboost-mortgage-dataset/csv/train/2000Q1" \
181-
-dataPath="trans::s3a://spark-xgboost-mortgage-dataset/csv/eval/2000Q1" \
180+
-dataPath="train::your-train-data-path" \
181+
-dataPath="trans::your-eval-data-path" \
182182
-numRound=100 -max_depth=8 -nthread=$NUM_EXECUTOR_CORES -showFeatures=0 \
183183
-tree_method=gpu_hist
184184
```

docs/get-started/xgboost-examples/csp/databricks/generate-init-script-10.4.ipynb

+8-8
@@ -24,9 +24,9 @@
2424
"source": [
2525
"%sh\n",
2626
"cd ../../dbfs/FileStore/jars/\n",
27-
"sudo wget -O rapids-4-spark_2.12-22.10.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.10.0/rapids-4-spark_2.12-22.10.0.jar\n",
28-
"sudo wget -O xgboost4j-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.6.1/xgboost4j-gpu_2.12-1.6.1.jar\n",
29-
"sudo wget -O xgboost4j-spark-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.6.1/xgboost4j-spark-gpu_2.12-1.6.1.jar\n",
27+
"sudo wget -O rapids-4-spark_2.12-22.12.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.12.0/rapids-4-spark_2.12-22.12.0.jar\n",
28+
"sudo wget -O xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar\n",
29+
"sudo wget -O xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar\n",
3030
"ls -ltr\n",
3131
"\n",
3232
"# Your Jars are downloaded in dbfs:/FileStore/jars directory"
@@ -59,9 +59,9 @@
5959
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar\n",
6060
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\n",
6161
"\n",
62-
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.6.1.jar /databricks/jars/\n",
63-
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.10.0.jar /databricks/jars/\n",
64-
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar /databricks/jars/\"\"\", True)"
62+
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.7.1.jar /databricks/jars/\n",
63+
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.12.0.jar /databricks/jars/\n",
64+
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar /databricks/jars/\"\"\", True)"
6565
]
6666
},
6767
{
@@ -132,8 +132,8 @@
132132
"\n",
133133
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
134134
"2. Reboot the cluster\n",
135-
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
136-
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.10/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
135+
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
136+
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.12/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
137137
"5. Inside the mortgage example notebook, update the data paths\n",
138138
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
139139
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"

docs/get-started/xgboost-examples/csp/databricks/generate-init-script.ipynb

+8-8
@@ -24,9 +24,9 @@
2424
"source": [
2525
"%sh\n",
2626
"cd ../../dbfs/FileStore/jars/\n",
27-
"sudo wget -O rapids-4-spark_2.12-22.10.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.10.0/rapids-4-spark_2.12-22.10.0.jar\n",
28-
"sudo wget -O xgboost4j-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.6.1/xgboost4j-gpu_2.12-1.6.1.jar\n",
29-
"sudo wget -O xgboost4j-spark-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.6.1/xgboost4j-spark-gpu_2.12-1.6.1.jar\n",
27+
"sudo wget -O rapids-4-spark_2.12-22.12.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.12.0/rapids-4-spark_2.12-22.12.0.jar\n",
28+
"sudo wget -O xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar\n",
29+
"sudo wget -O xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar\n",
3030
"ls -ltr\n",
3131
"\n",
3232
"# Your Jars are downloaded in dbfs:/FileStore/jars directory"
@@ -59,9 +59,9 @@
5959
"sudo rm -f /databricks/jars/spark--maven-trees--ml--9.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.4.1.jar\n",
6060
"sudo rm -f /databricks/jars/spark--maven-trees--ml--9.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.4.1.jar\n",
6161
"\n",
62-
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.6.1.jar /databricks/jars/\n",
63-
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.10.0.jar /databricks/jars/\n",
64-
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar /databricks/jars/\"\"\", True)"
62+
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.7.1.jar /databricks/jars/\n",
63+
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.12.0.jar /databricks/jars/\n",
64+
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar /databricks/jars/\"\"\", True)"
6565
]
6666
},
6767
{
@@ -132,8 +132,8 @@
132132
"\n",
133133
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
134134
"2. Reboot the cluster\n",
135-
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
136-
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.10/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
135+
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
136+
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.12/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
137137
"5. Inside the mortgage example notebook, update the data paths\n",
138138
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
139139
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"

docs/get-started/xgboost-examples/notebook/python-notebook.md

-4
@@ -67,7 +67,3 @@ and the home directory for Apache Spark respectively.
6767
- Mortgage ETL Notebook: [Python](../../../../examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb)
6868
- Taxi ETL Notebook: [Python](../../../../examples/XGBoost-Examples/taxi/notebooks/python/taxi-ETL.ipynb)
6969
- Note: Agaricus does not have ETL part.
70-
71-
For PySpark based XGBoost, please refer to the
72-
[Spark-RAPIDS-examples 22.04 branch](https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.04/docs/get-started/xgboost-examples/notebook/python-notebook.md)
73-
that uses [NVIDIA’s Spark XGBoost version](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/).

docs/get-started/xgboost-examples/on-prem-cluster/kubernetes-scala.md

+1-1
@@ -40,7 +40,7 @@ export SPARK_DOCKER_IMAGE=<gpu spark docker image repo and name>
4040
export SPARK_DOCKER_TAG=<spark docker image tag>
4141

4242
pushd ${SPARK_HOME}
43-
wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-22.10/dockerfile/Dockerfile
43+
wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-22.12/dockerfile/Dockerfile
4444

4545
# Optionally install additional jars into ${SPARK_HOME}/jars/
4646

docs/get-started/xgboost-examples/on-prem-cluster/standalone-python.md

+26-6
@@ -12,11 +12,13 @@ Prerequisites
1212
* Multi-node clusters with homogenous GPU configuration
1313
* Software Requirements
1414
* Ubuntu 18.04, 20.04/CentOS7, CentOS8
15-
* CUDA 11.0+
15+
* CUDA 11.5+
1616
* NVIDIA driver compatible with your CUDA
1717
* NCCL 2.7.8+
18-
* Python 3.6+
18+
* Python 3.8 or 3.9
1919
* NumPy
20+
* XGBoost 1.7.0+
21+
* cudf-cu11
2022

2123
The number of GPUs in each host dictates the number of Spark executors that can run there.
2224
Additionally, cores per Spark executor and cores per Spark task must match, such that each executor can run 1 task at any given time.
@@ -47,6 +49,14 @@ And here are the steps to enable the GPU resources discovery for Spark 3.1+.
4749
spark.worker.resource.gpu.amount 1
4850
spark.worker.resource.gpu.discoveryScript ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh
4951
```
52+
3. Install the XGBoost, cudf-cu11, numpy libraries on all nodes before running XGBoost application.
53+
54+
``` bash
55+
pip install xgboost
56+
pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
57+
pip install numpy
58+
pip install scikit-learn
59+
```
5060

5161
Get Application Files, Jar and Dataset
5262
-------------------------------
@@ -182,6 +192,10 @@ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.gpu_main
182192
183193
# tree construction algorithm
184194
export TREE_METHOD=gpu_hist
195+
196+
# if you enable archive python environment
197+
export PYSPARK_DRIVER_PYTHON=python
198+
export PYSPARK_PYTHON=./environment/bin/python
185199
```
186200

187201
Run spark-submit:
@@ -197,8 +211,9 @@ ${SPARK_HOME}/bin/spark-submit
197211
--driver-memory ${SPARK_DRIVER_MEMORY} \
198212
--executor-memory ${SPARK_EXECUTOR_MEMORY} \
199213
--conf spark.cores.max=${TOTAL_CORES} \
200-
--jars ${RAPIDS_JAR},${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
201-
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
214+
--archives your_pyspark_venv.tar.gz#environment #if you enabled archive python environment \
215+
--jars ${RAPIDS_JAR} \
216+
--py-files ${SAMPLE_ZIP} \
202217
${MAIN_PY} \
203218
--mainClass=${EXAMPLE_CLASS} \
204219
--dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/ \
@@ -261,6 +276,10 @@ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.cpu_main
261276
262277
# tree construction algorithm
263278
export TREE_METHOD=hist
279+
280+
# if you enable archive python environment
281+
export PYSPARK_DRIVER_PYTHON=python
282+
export PYSPARK_PYTHON=./environment/bin/python
264283
```
265284

266285
This is the same command as for the GPU example, repeated for convenience:
@@ -271,8 +290,9 @@ ${SPARK_HOME}/bin/spark-submit
271290
--driver-memory ${SPARK_DRIVER_MEMORY} \
272291
--executor-memory ${SPARK_EXECUTOR_MEMORY} \
273292
--conf spark.cores.max=${TOTAL_CORES} \
274-
--jars ${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
275-
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
293+
--archives your_pyspark_venv.tar.gz#environment #if you enabled archive python environment \
294+
--jars ${RAPIDS_JAR} \
295+
--py-files ${SAMPLE_ZIP} \
276296
${SPARK_PYTHON_ENTRYPOINT} \
277297
--mainClass=${EXAMPLE_CLASS} \
278298
--dataPath=train::${DATA_PATH}/mortgage/output/train/ \
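
The `--archives` line added in the hunks above carries an inline `#if you enabled archive python environment` comment before the trailing backslash. In a POSIX shell a comment runs to the end of the line, so that backslash no longer continues the command. One way to keep the flag optional without inline comments is to assemble the argument list programmatically. The sketch below is only an illustration: `build_spark_submit` and its parameters are hypothetical names, and the file names in the usage lines are placeholders.

```python
import shlex
from typing import List, Optional


def build_spark_submit(
    main_py: str,
    jars: List[str],
    py_files: List[str],
    archive: Optional[str] = None,
) -> List[str]:
    """Assemble a spark-submit argument list; --archives is added only on request."""
    args = ["spark-submit"]
    if archive:
        # '#environment' is the unpack alias used in the docs, matching
        # PYSPARK_PYTHON=./environment/bin/python.
        args += ["--archives", f"{archive}#environment"]
    args += ["--jars", ",".join(jars)]
    args += ["--py-files", ",".join(py_files)]
    args.append(main_py)
    return args


if __name__ == "__main__":
    cmd = build_spark_submit(
        "main.py",
        jars=["rapids-4-spark_2.12-22.12.0.jar"],
        py_files=["samples.zip"],
        archive="pyspark_venv.tar.gz",
    )
    # shlex.join quotes each argument so the printed line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` avoids shell parsing altogether, which sidesteps the comment and line-continuation problem.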

docs/get-started/xgboost-examples/on-prem-cluster/yarn-python.md

+45-7
@@ -12,12 +12,14 @@ Prerequisites
1212
* Multi-node clusters with homogenous GPU configuration
1313
* Software Requirements
1414
* Ubuntu 18.04, 20.04/CentOS7, CentOS8
15-
* CUDA 11.0+
15+
* CUDA 11.5+
1616
* NVIDIA driver compatible with your CUDA
1717
* NCCL 2.7.8+
18-
* Python 3.6+
18+
* Python 3.8 or 3.9
1919
* NumPy
20-
20+
* XGBoost 1.7.0+
21+
* cudf-cu11
22+
2123
The number of GPUs per NodeManager dictates the number of Spark executors that can run in that NodeManager.
2224
Additionally, cores per Spark executor and cores per Spark task must match, such that each executor can run 1 task at any given time.
2325

@@ -32,6 +34,32 @@ We use `SPARK_HOME` environment variable to point to the Apache Spark cluster.
3234
And as to how to enable GPU scheduling and isolation for Yarn,
3335
please refer to [here](https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html).
3436

37+
Please make sure to install the XGBoost, cudf-cu11, numpy libraries on all nodes before running XGBoost application.
38+
``` bash
39+
pip install xgboost
40+
pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
41+
pip install numpy
42+
pip install scikit-learn
43+
```
44+
You can also create an isolated python environment by using (Virtualenv)[https://virtualenv.pypa.io/en/latest/],
45+
and then directly pass/unpack the archive file and enable the environment on executors
46+
by leveraging the --archives option or spark.archives configuration.
47+
``` bash
48+
# create an isolated python environment and install libraries
49+
python -m venv pyspark_venv
50+
source pyspark_venv/bin/activate
51+
pip install xgboost
52+
pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
53+
pip install numpy
54+
pip install scikit-learn
55+
venv-pack -o pyspark_venv.tar.gz
56+
57+
# enable archive python environment on executors
58+
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
59+
export PYSPARK_PYTHON=./environment/bin/python
60+
spark-submit --archives pyspark_venv.tar.gz#environment app.py
61+
```
62+
3563
Get Application Files, Jar and Dataset
3664
-------------------------------
3765

@@ -114,6 +142,10 @@ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.gpu_main
114142

115143
# tree construction algorithm
116144
export TREE_METHOD=gpu_hist
145+
146+
# if you enable archive python environment
147+
export PYSPARK_DRIVER_PYTHON=python
148+
export PYSPARK_PYTHON=./environment/bin/python
117149
```
118150

119151
Run spark-submit:
@@ -129,11 +161,12 @@ ${SPARK_HOME}/bin/spark-submit
129161
--files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh \
130162
--master yarn \
131163
--deploy-mode ${SPARK_DEPLOY_MODE} \
164+
--archives your_pyspark_venv.tar.gz#environment #if you enabled archive python environment \
132165
--num-executors ${SPARK_NUM_EXECUTORS} \
133166
--driver-memory ${SPARK_DRIVER_MEMORY} \
134167
--executor-memory ${SPARK_EXECUTOR_MEMORY} \
135-
--jars ${RAPIDS_JAR},${XGBOOST4J_JAR} \
136-
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
168+
--jars ${RAPIDS_JAR} \
169+
--py-files ${SAMPLE_ZIP} \
137170
${MAIN_PY} \
138171
--mainClass=${EXAMPLE_CLASS} \
139172
--dataPath=train::${DATA_PATH}/mortgage/out/train/ \
@@ -190,19 +223,24 @@ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.cpu_main
190223

191224
# tree construction algorithm
192225
export TREE_METHOD=hist
226+
227+
# if you enable archive python environment
228+
export PYSPARK_DRIVER_PYTHON=python
229+
export PYSPARK_PYTHON=./environment/bin/python
193230
```
194231

195232
This is the same command as for the GPU example, repeated for convenience:
196233

197234
``` bash
198235
${SPARK_HOME}/bin/spark-submit \
199236
--master yarn \
237+
--archives your_pyspark_venv.tar.gz#environment #if you enabled archive python environment \
200238
--deploy-mode ${SPARK_DEPLOY_MODE} \
201239
--num-executors ${SPARK_NUM_EXECUTORS} \
202240
--driver-memory ${SPARK_DRIVER_MEMORY} \
203241
--executor-memory ${SPARK_EXECUTOR_MEMORY} \
204-
--jars ${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
205-
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
242+
--jars ${RAPIDS_JAR} \
243+
--py-files ${SAMPLE_ZIP} \
206244
${MAIN_PY} \
207245
--mainClass=${EXAMPLE_CLASS} \
208246
--dataPath=train::${DATA_PATH}/mortgage/output/train/ \

docs/get-started/xgboost-examples/prepare-package-data/preparation-python.md

+1-12
@@ -9,7 +9,7 @@ For simplicity export the location to these jars. All examples assume the packag
99
* [XGBoost4j-Spark Package](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.4.2-0.3.0/)
1010

1111
2. Download the RAPIDS Accelerator for Apache Spark plugin jar
12-
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.10.0/rapids-4-spark_2.12-22.10.0.jar)
12+
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.12.0/rapids-4-spark_2.12-22.12.0.jar)
1313

1414
### Build XGBoost Python Examples
1515

@@ -21,14 +21,3 @@ You need to copy the dataset to `/opt/xgboost`. Use the following links to downl
2121
1. [Mortgage dataset](/docs/get-started/xgboost-examples/dataset/mortgage.md)
2222
2. [Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
2323
3. [Agaricus dataset](https://gust.dev/r/xgboost-agaricus)
24-
25-
### Setup environments
26-
27-
``` bash
28-
export SPARK_XGBOOST_DIR=/opt/xgboost
29-
export RAPIDS_JAR=${SPARK_XGBOOST_DIR}/rapids-4-spark_2.12-22.10.0.jar
30-
export XGBOOST4J_JAR=${SPARK_XGBOOST_DIR}/xgboost4j_3.0-1.4.2-0.3.0.jar
31-
export XGBOOST4J_SPARK_JAR=${SPARK_XGBOOST_DIR}/xgboost4j-spark_3.0-1.4.2-0.3.0.jar
32-
export SAMPLE_ZIP=${SPARK_XGBOOST_DIR}/samples.zip
33-
export MAIN_PY=${SPARK_XGBOOST_DIR}/main.py
34-
```

docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md

+1-9
@@ -5,7 +5,7 @@ For simplicity export the location to these jars. All examples assume the packag
55
### Download the jars
66

77
1. Download the RAPIDS Accelerator for Apache Spark plugin jar
8-
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.10.0/rapids-4-spark_2.12-22.10.0.jar)
8+
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.12.0/rapids-4-spark_2.12-22.12.0.jar)
99

1010
### Build XGBoost Scala Examples
1111

@@ -17,11 +17,3 @@ You need to copy the dataset to `/opt/xgboost`. Use the following links to downl
1717
1. [Mortgage dataset](/docs/get-started/xgboost-examples/dataset/mortgage.md)
1818
2. [Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
1919
3. [Agaricus dataset](https://gust.dev/r/xgboost-agaricus)
20-
21-
### Setup environments
22-
23-
``` bash
24-
export SPARK_XGBOOST_DIR=/opt/xgboost
25-
export RAPIDS_JAR=${SPARK_XGBOOST_DIR}/rapids-4-spark_2.12-22.10.0.jar
26-
export SAMPLE_JAR=${SPARK_XGBOOST_DIR}/sample_xgboost_apps-0.2.3-jar-with-dependencies.jar
27-
```
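
Most of this commit is version bumps inside Maven Central download URLs (`22.10.0` to `22.12.0` for the RAPIDS jar, `1.6.1` to `1.7.1` for the XGBoost jars in the Databricks notebooks). The URLs follow the standard Maven repository layout, so they can be derived from the coordinates rather than edited by hand. A minimal sketch, in which `maven_jar_url` is a hypothetical helper and not something the repository provides:

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the Maven Central URL of a jar from its coordinates."""
    # Dots in the group id become path separators in the repository layout.
    group_path = group_id.replace(".", "/")
    return (
        f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


if __name__ == "__main__":
    # Coordinates taken from the URLs that appear in this diff.
    for coords in [
        ("com.nvidia", "rapids-4-spark_2.12", "22.12.0"),
        ("ml.dmlc", "xgboost4j-gpu_2.12", "1.7.1"),
        ("ml.dmlc", "xgboost4j-spark-gpu_2.12", "1.7.1"),
    ]:
        print(maven_jar_url(*coords))
```

Keeping the version in one variable means a release bump touches a single value instead of every hard-coded URL.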

docs/img/guides/mortgage-perf.png

15.7 KB

examples/ML+DL-Examples/Spark-cuML/pca/Dockerfile

+1-1
@@ -17,7 +17,7 @@
1717

1818
ARG CUDA_VER=11.5.1
1919
FROM nvidia/cuda:${CUDA_VER}-devel-ubuntu20.04
20-
ARG BRANCH_VER=22.10
20+
ARG BRANCH_VER=22.12
2121

2222
RUN apt-get update
2323
RUN apt-get install -y wget ninja-build git
