Commit d056339

Merge pull request #180 from NVIDIA/branch-22.06
merge branch 22.06 to main branch

2 parents: d2cf00b + b16fed6

File tree: 219 files changed, +8188 -2246 lines changed

Some content is hidden: large commits have some content hidden by default.

.github/workflows/auto-merge.yml (+4 -4)

@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
 on:
   pull_request_target:
     branches:
-      - branch-22.04
+      - branch-22.06
     types: [closed]
 
 jobs:
@@ -29,13 +29,13 @@ jobs:
     steps:
       - uses: actions/checkout@v2
         with:
-          ref: branch-22.04 # force to fetch from latest upstream instead of PR ref
+          ref: branch-22.06 # force to fetch from latest upstream instead of PR ref
 
       - name: auto-merge job
         uses: ./.github/workflows/auto-merge
         env:
           OWNER: NVIDIA
           REPO_NAME: spark-rapids-examples
-          HEAD: branch-22.04
-          BASE: branch-22.06
+          HEAD: branch-22.06
+          BASE: branch-22.08
           AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
New workflow file (+35; filename collapsed in this view)

@@ -0,0 +1,35 @@
+# Copyright (c) 2022, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# A workflow to check if PR got broken hyperlinks
+name: Check Markdown links
+
+on:
+  pull_request:
+    types: [opened, synchronize, reopened]
+
+jobs:
+  markdown-link-check:
+    runs-on: ubuntu-latest
+    steps:
+    - name: work around permission issue
+      run: git config --global --add safe.directory /github/workspace
+    - uses: actions/checkout@master
+    - uses: gaurav-nelson/github-action-markdown-link-check@v1
+      with:
+        max-depth: -1
+        use-verbose-mode: 'yes'
+        check-modified-files-only: 'yes'
+        config-file: '.github/workflows/markdown-links-check/markdown-links-check-config.json'
+        base-branch: 'main'
.github/workflows/markdown-links-check/markdown-links-check-config.json (new file, +6; path per the config-file setting above)

@@ -0,0 +1,6 @@
+{
+    "timeout": "15s",
+    "retryOn429": true,
+    "retryCount": 30,
+    "aliveStatusCodes": [200, 403]
+}
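The action above wraps the markdown-link-check npm CLI, which reads this same config file, so the check can be reproduced before opening a PR. A minimal local sketch, assuming Node.js/npx is available (the local invocation is not part of this commit):

``` bash
# Reproduce the CI link check locally with the repo's config.
# npx fetches the markdown-link-check CLI on demand (requires Node.js).
npx markdown-link-check \
  -c .github/workflows/markdown-links-check/markdown-links-check-config.json \
  -v README.md
```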

.gitignore (+1)

@@ -21,3 +21,4 @@
 .scala_dependencies
 .settings
 hs_err*.log
+target

README.md (+40 -77)

@@ -1,79 +1,42 @@
 # spark-rapids-examples
 
-A repo for Spark related utilities and examples using the Rapids Accelerator,including ETL, ML/DL, etc.
-
-Enterprise AI is built on ETL pipelines and relies on AI infrastructure to effectively integrate and
-process large amounts of data. One of the fundamental purposes of
-[RAPIDS Accelerator](https://nvidia.github.io/spark-rapids/Getting-Started/)
-is to effectively integrate large ETL and ML/DL pipelines. Rapids Accelerator for [Apache Spark](https://spark.apache.org/)
-offers seamless integration with Machine learning frameworks such XGBoost, PCA. Users can leverage the Apache Spark cluster
-with NVIDIA GPUs to accelerate the ETL pipelines and then use the same infrastructure to load the data frame
-into single or multiple GPUs across multiple nodes to train with GPU accelerated XGBoost or a PCA.
-In addition, if you are using a Deep learning framework to train your tabular data with the same Apache Spark cluster,
-we have leveraged NVIDIA’s NVTabular library to load and train the data across multiple nodes with GPUs.
-NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and
-easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
-We also add MIG support to YARN to allow CSPs to split an A100/A30 into multiple MIG
-devices and have them appear like a normal GPU.
-
-Please see the [Rapids Accelerator for Spark documentation](https://nvidia.github.io/spark-rapids/Getting-Started/) for supported
-Spark versions and requirements. It is recommended to set up Spark Cluster with JDK8.
-
-## Getting Started Guides
-
-### 1. Microbenchmark guide
-
-The microbenchmark on [RAPIDS Accelerator For Apache Spark](https://nvidia.github.io/spark-rapids/) is to identify,
-test and analyze the best queries which can be accelerated on the GPU. For detail information please refer to this
-[guide](/examples/micro-benchmarks).
-
-### 2. Xgboost examples guide
-
-We provide three similar Xgboost benchmarks, Mortgage, Taxi and Agaricus.
-Try one of the ["Getting Started Guides"](/examples/Spark-ETL+XGBoost).
-Please note that they target the Mortgage dataset as written with a few changes
-to `EXAMPLE_CLASS` and `dataPath`, they can be easily adapted with each other with different datasets.
-
-### 3. TensorFlow training on Horovod Spark example guide
-
-We provide a Criteo Benchmark to demo ETL and deep learning training on Horovod Spark, please refer to
-this [guide](/examples/Spark-DL/criteo_train).
-
-### 4. PCA example guide
-
-This is an example of the GPU accelerated PCA algorithm running on Spark. For detail information please refer to this
-[guide](/examples/Spark-cuML/pca).
-
-### 5. MIG support
-We provide some [guides](/examples/MIG-Support) about the Multi-Instance GPU (MIG) feature based on
-the NVIDIA Ampere architecture (such as NVIDIA A100 and A30) GPU.
-
-### 6. Spark Rapids UDF examples
-This is examples of the GPU accelerated UDF.
-refer to this
-[guide](/examples/RAPIDS-accelerated-UDFs).
-
-### 7. Spark cuSpatial
-This is a RapidsUDF examples to use [cuSpatial](https://github.com/rapidsai/cuspatial) library to solve the point-in-polygon problem. For detail information please refer to this [guide](/examples/Spark-cuSpatial).
-
-## API
-### 1. Xgboost examples API
-
-These guides focus on GPU related Scala and python API interfaces.
-- [Scala API](/docs/api-docs/xgboost-examples-api-docs/scala.md)
-- [Python API](/docs/api-docs/xgboost-examples-api-docs/python.md)
-
-## Troubleshooting
-You can trouble-shooting issues according to following guides.
-- [Trouble Shooting XGBoost](/docs/trouble-shooting/xgboost-examples-trouble-shooting.md)
-
-## Contributing
-See the [Contributing guide](CONTRIBUTING.md).
-
-## Contact Us
-
-Please see the [RAPIDS](https://rapids.ai/community.html) website for contact information.
-
-## License
-
-This content is licensed under the [Apache License 2.0](/LICENSE)
+This is the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) examples repo.
+RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code changes.
+You can download the latest version of RAPIDS Accelerator [here](https://nvidia.github.io/spark-rapids/docs/download.html).
+This repo contains examples and applications that showcases the performance and benefits of using
+RAPIDS Accelerator in data processing and machine learning pipelines.
+There are broadly four categories of examples in this repo:
+1. [SQL/Dataframe](./examples/SQL+DF-Examples)
+2. [Spark XGBoost](./examples/XGBoost-Examples)
+3. [Deep Learning/Machine Learning](./examples/ML+DL-Examples)
+4. [RAPIDS UDF](./examples/UDF-Examples)
+
+For more information on each of the examples please look into respective categories.
+
+Here is the list of notebooks in this repo:
+
+| | Category | Notebook Name | Description
+| ------------- | ------------- | ------------- | -------------
+| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins with up to 20x performance benefits
+| 2 | SQL/DF | Customer Churn | Data federation for modeling customer Churn with a sample telco customer data
+| 3 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
+| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
+| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
+| 6 | ML/DL | Criteo Training | ETL and deep learning training of the Criteo 1TB Click Logs dataset
+| 7 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
+| 8 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for Point in Polygon function using NYC Taxi pickup location dataset
+
+Here is the list of Apache Spark applications (Scala and PySpark) that
+can be built for running on GPU with RAPIDS Accelerator in this repo:
+
+| | Category | Notebook Name | Description
+| ------------- | ------------- | ------------- | -------------
+| 1 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
+| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
+| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
+| 4 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
+| 5 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for Point in Polygon function using NYC Taxi pickup location dataset
+| 6 | UDF | URL Decode | Decodes URL-encoded strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable/)
+| 7 | UDF | URL Encode | URL-encodes strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable/)
+| 8 | UDF | [CosineSimilarity](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java) | Computes the cosine similarity between two float vectors using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src)
+| 9 | UDF | [StringWordCount](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java) | Implements a Hive simple UDF using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src) to count words in strings
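As context for the new README's "no code changes" claim: the accelerator is enabled at submit time by attaching the plugin jar and setting the plugin config. A minimal sketch, using the 22.06.0 jar version referenced elsewhere in this commit; `your_app.py` is a hypothetical placeholder for an existing Spark job:

``` bash
# Attach the RAPIDS Accelerator plugin to an unmodified Spark application.
# rapids-4-spark_2.12-22.06.0.jar matches the version used in this commit;
# your_app.py is a placeholder for any existing Spark job.
spark-submit \
  --jars rapids-4-spark_2.12-22.06.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  your_app.py
```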

datasets/customer-churn.tar.gz (162 KB, binary file not shown)

docs/get-started/xgboost-examples/building-sample-apps/python.md (+1 -1)

@@ -21,4 +21,4 @@ Two files are required by PySpark:
 
 + *main.py*
 
-   entrypoint for PySpark, you can find it in 'spark-rapids-examples/Spark-ETL+XGBoost/examples' folder
+   entrypoint for PySpark, you can find it in 'spark-rapids-examples/examples/XGBoost-Examples' folder

docs/get-started/xgboost-examples/building-sample-apps/scala.md (+2 -2)

@@ -8,13 +8,13 @@ Follow these steps to build the Scala jars:
 
 ``` bash
 git clone https://github.com/NVIDIA/spark-rapids-examples.git
-cd spark-rapids-examples/examples/Spark-ETL+XGBoost
+cd spark-rapids-examples/examples/XGBoost-Examples
 mvn package
 ```
 
 ## The generated Jars
 
-Let's assume LATEST_VERSION is **0.2.2**. The build process will generate two jars as belows,
+Let's assume LATEST_VERSION is **0.2.3**. The build process will generate two jars as belows,
 
 + *aggregator/target/sample_xgboost_apps-${LATEST_VERSION}.jar*
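Once built, the sample apps jar is run through spark-submit. A hedged sketch under the assumptions that LATEST_VERSION is 0.2.3 as above and that EXAMPLE_CLASS stands in for one of the example main classes (the XGBoost-Examples guides list the real class names):

``` bash
# Sketch: submit one of the built XGBoost sample apps on a Spark cluster.
# EXAMPLE_CLASS is a placeholder; see the XGBoost-Examples guides for the
# actual main class of the Mortgage, Taxi, or Agaricus app.
LATEST_VERSION=0.2.3
spark-submit \
  --class ${EXAMPLE_CLASS} \
  aggregator/target/sample_xgboost_apps-${LATEST_VERSION}.jar
```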

docs/get-started/xgboost-examples/csp/aws/ec2.md (+1 -2)

@@ -132,10 +132,9 @@ $SPARK_HOME/sbin/start-slave.sh <master-spark-URL>
 
 Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md)
 
-Copy cudf and rapids jars to `$SPARK_HOME/jars`
+Copy rapids jars to `$SPARK_HOME/jars`
 
 ``` bash
-cp $CUDF_JAR $SPARK_HOME/jars/
 cp $RAPIDS_JAR $SPARK_HOME/jars/
 ```
 
docs/get-started/xgboost-examples/csp/databricks/databricks.md (+3 -3)

@@ -49,7 +49,7 @@ cluster.
 
 - [Databricks 10.4 LTS
   ML](https://docs.databricks.com/release-notes/runtime/9.1ml.html#system-environment) has CUDA 11
-  installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML. In this case use
+  installed. Users will need to use 22.06.0 or later on Databricks 10.4 LTS ML. In this case use
   [generate-init-script-10.4.ipynb](generate-init-script-10.4.ipynb) which will install
   the RAPIDS Spark plugin.
 
@@ -108,13 +108,13 @@ Import the GPU Mortgage Example Notebook
 ---------------------------
 
 1. See [Managing Notebooks](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html) on how to import a notebook.
-2. Import the example notebook: [XGBoost4j-Spark mortgage notebook](/examples/Spark-ETL+XGBoost/mortgage/notebooks/scala/mortgage-gpu.ipynb)
+2. Import the example notebook: [XGBoost4j-Spark mortgage notebook](../../../../../examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-gpu.ipynb)
 3. Inside the mortgage example notebook, update the data paths from
    "/data/datasets/mortgage-small/train" to "dbfs:/FileStore/tables/mortgage/csv/train/mortgage_train_merged.csv"
    "/data/datasets/mortgage-small/eval" to "dbfs:/FileStore/tables/mortgage/csv/test/mortgage_eval_merged.csv"
 
 The example notebook comes with the following configuration, you can adjust this according to your setup.
-See supported configuration options here: [xgboost parameters](/examples/Spark-ETL+XGBoost/app-parameters/supported_xgboost_parameters_python.md)
+See supported configuration options here: [xgboost parameters](../../../../../examples/XGBoost-Examples/app-parameters/supported_xgboost_parameters_python.md)
 
 ``` bash
 params = {

docs/get-started/xgboost-examples/csp/databricks/generate-init-script-10.4.ipynb (+11 -10)

@@ -24,10 +24,9 @@
 "source": [
 "%sh\n",
 "cd ../../dbfs/FileStore/jars/\n",
-"sudo wget -O cudf-22.04.0-cuda11.jar https://repo1.maven.org/maven2/ai/rapids/cudf/22.04.0/cudf-22.04.0-cuda11.jar\n",
-"sudo wget -O rapids-4-spark_2.12-22.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.04.0/rapids-4-spark_2.12-22.04.0.jar\n",
-"sudo wget -O xgboost4j_3.0-1.4.2-0.3.0.jar https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/1.4.2-0.3.0/xgboost4j_3.0-1.4.2-0.3.0.jar\n",
-"sudo wget -O xgboost4j-spark_3.0-1.4.2-0.3.0.jar https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.4.2-0.3.0/xgboost4j-spark_3.0-1.4.2-0.3.0.jar\n",
+"sudo wget -O rapids-4-spark_2.12-22.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar\n",
+"sudo wget -O xgboost4j-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.6.1/xgboost4j-gpu_2.12-1.6.1.jar\n",
+"sudo wget -O xgboost4j-spark-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.6.1/xgboost4j-spark-gpu_2.12-1.6.1.jar\n",
 "ls -ltr\n",
 "\n",
 "# Your Jars are downloaded in dbfs:/FileStore/jars directory"
@@ -57,10 +56,12 @@
 "source": [
 "dbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n",
 "#!/bin/bash\n",
-"sudo cp /dbfs/FileStore/jars/xgboost4j_3.0-1.4.2-0.3.0.jar /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar\n",
-"sudo cp /dbfs/FileStore/jars/cudf-22.04.0-cuda11.jar /databricks/jars/\n",
-"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.04.0.jar /databricks/jars/\n",
-"sudo cp /dbfs/FileStore/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\"\"\", True)"
+"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar\n",
+"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\n",
+"\n",
+"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.6.1.jar /databricks/jars/\n",
+"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.06.0.jar /databricks/jars/\n",
+"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar /databricks/jars/\"\"\", True)"
 ]
 },
 {
@@ -131,8 +132,8 @@
 "\n",
 "1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
 "2. Reboot the cluster\n",
-"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
-"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.04/examples/Spark-ETL+XGBoost/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
+"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
+"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
 "5. Inside the mortgage example notebook, update the data paths\n",
 " `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
 " `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
