Commit 8ae5e2b

Merge pull request #431 from nvliyuan/main-v2408-release
update the main branch for 2408 release
2 parents: d920adb + 3f57ee8

File tree

74 files changed: +360, -17106 lines

```diff
@@ -1,5 +1,4 @@
-#!/bin/bash
-# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2024, NVIDIA CORPORATION.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,8 +11,25 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-#
 
-# This script is used to convert a ".cny" points file into parquet.
-
-python to_parquet.py /data/cuspatial/points.cny /data/cuspatial/points/points.parquet
+name: Add new issues and pull requests to project
+
+on:
+  issues:
+    types:
+      - opened
+  pull_request_target:
+    types:
+      - opened
+
+jobs:
+  add-to-project:
+    if: github.repository == 'NVIDIA/spark-rapids-examples'
+    name: Add new issues and pull requests to project
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/add-to-project@<version>
+        with:
+          project-url: https://github.com/orgs/NVIDIA/projects/4
+          github-token: ${{ secrets.PROJECT_TOKEN }}
+
```
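The first file in the tree (its name is collapsed in this view) swaps an old cuSpatial conversion script for a workflow that auto-adds new issues and pull requests to an NVIDIA project board; the page's email obfuscation also mangled the `uses:` line, so the `actions/add-to-project` version above is a placeholder. A minimal sketch for sanity-checking such a workflow before and after merge, assuming the file lands under `.github/workflows/` and that `actionlint` and the `gh` CLI are installed:

```bash
# Statically lint all workflow files (catches bad keys, wrong event names, etc.)
actionlint .github/workflows/*.yml

# After the merge, confirm GitHub registered the workflow
gh workflow list -R NVIDIA/spark-rapids-examples
```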

.github/workflows/auto-merge.yml

+4 -4

```diff
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
 on:
   pull_request_target:
     branches:
-      - branch-24.06
+      - branch-24.08
     types: [closed]
 
 jobs:
@@ -29,14 +29,14 @@ jobs:
     steps:
      - uses: actions/checkout@v4
        with:
-         ref: branch-24.06 # force to fetch from latest upstream instead of PR ref
+         ref: branch-24.08 # force to fetch from latest upstream instead of PR ref
 
      - name: auto-merge job
        uses: ./.github/workflows/auto-merge
        env:
          OWNER: NVIDIA
          REPO_NAME: spark-rapids-examples
-         HEAD: branch-24.06
-         BASE: branch-24.08
+         HEAD: branch-24.08
+         BASE: branch-24.10
          AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
 
```
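Each release cycle this auto-merge config rotates forward one branch: HEAD becomes the branch that just shipped and BASE the next one, two months later under the even-month RAPIDS versioning. An illustrative sketch of that rotation (a helper written for this note, not part of the repo):

```bash
# Compute the branch that follows a given release branch under the
# two-month RAPIDS cadence (e.g. branch-24.08 -> branch-24.10).
next_branch() {
  local ver=${1#branch-}              # strip the "branch-" prefix, e.g. 24.08
  local yy=${ver%%.*} mm=${ver##*.}
  mm=$((10#$mm + 2))                  # releases land every two months
  if (( mm > 12 )); then mm=$((mm - 12)); yy=$((yy + 1)); fi
  printf 'branch-%02d.%02d\n' "$yy" "$mm"
}

next_branch branch-24.08   # prints: branch-24.10
```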

README.md

+5 -8

```diff
@@ -23,9 +23,7 @@ Here is the list of notebooks in this repo:
 | 3 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
 | 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
 | 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-| 6 | ML/DL | Criteo Training | ETL and deep learning training of the Criteo 1TB Click Logs dataset
-| 7 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
-| 8 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for Point in Polygon function using NYC Taxi pickup location dataset
+| 6 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
 
 Here is the list of Apache Spark applications (Scala and PySpark) that
 can be built for running on GPU with RAPIDS Accelerator in this repo:
@@ -36,8 +34,7 @@ can be built for running on GPU with RAPIDS Accelerator in this repo:
 | 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
 | 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
 | 4 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
-| 5 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for Point in Polygon function using NYC Taxi pickup location dataset
-| 6 | UDF | URL Decode | Decodes URL-encoded strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
-| 7 | UDF | URL Encode | URL-encodes strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
-| 8 | UDF | [CosineSimilarity](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java) | Computes the cosine similarity between two float vectors using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src)
-| 9 | UDF | [StringWordCount](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java) | Implements a Hive simple UDF using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src) to count words in strings
+| 5 | UDF | URL Decode | Decodes URL-encoded strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
+| 6 | UDF | URL Encode | URL-encodes strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
+| 7 | UDF | [CosineSimilarity](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java) | Computes the cosine similarity between two float vectors using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src)
+| 8 | UDF | [StringWordCount](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java) | Implements a Hive simple UDF using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src) to count words in strings
```

datasets/tpcds-small.tar.gz

77.5 KB, binary file not shown.

docs/get-started/xgboost-examples/csp/databricks/databricks.md

+2 -2

````diff
@@ -21,7 +21,7 @@ Navigate to your home directory in the UI and select **Create** > **File** from
 create an `init.sh` scripts with contents:
    ```bash
    #!/bin/bash
-   sudo wget -O /databricks/jars/rapids-4-spark_2.12-24.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0.jar
+   sudo wget -O /databricks/jars/rapids-4-spark_2.12-24.08.1.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar
    ```
 1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
    Prerequisites section.
@@ -68,7 +68,7 @@ create an `init.sh` scripts with contents:
    ```bash
    spark.rapids.sql.python.gpu.enabled true
    spark.python.daemon.module rapids.daemon_databricks
-   spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-24.06.0.jar:/databricks/spark/python
+   spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-24.08.1.jar:/databricks/spark/python
    ```
 Note that since python memory pool require installing the cudf library, so you need to install cudf library in
 each worker nodes `pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com` or disable python memory pool
````
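The jar filename now appears in two places: the init script's download target and `spark.executorEnv.PYTHONPATH`; if they drift apart, the Python workers silently lose the plugin from their import path. A small, hedged check (run from a `%sh` notebook cell once the cluster is up, against the path the diff configures):

```bash
# Verify the jar the init script downloaded is the one PYTHONPATH references
ls -l /databricks/jars/rapids-4-spark_2.12-24.08.1.jar

# And confirm no stale plugin versions linger alongside it
ls /databricks/jars | grep rapids-4-spark
```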

docs/get-started/xgboost-examples/csp/databricks/init.sh

+1 -1

```diff
@@ -1,7 +1,7 @@
 sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar
 sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar
 
-sudo wget -O /databricks/jars/rapids-4-spark_2.12-24.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0.jar
+sudo wget -O /databricks/jars/rapids-4-spark_2.12-24.08.1.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar
 sudo wget -O /databricks/jars/xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar
 sudo wget -O /databricks/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar
 ls -ltr
```
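A hedged hardening of the wget step above (not part of the commit): Maven Central publishes a `.sha1` sidecar next to each artifact, so the init script can fail fast on a truncated or corrupted download instead of leaving a broken jar on the cluster.

```bash
set -euo pipefail

JAR=rapids-4-spark_2.12-24.08.1.jar
URL=https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/${JAR}

sudo wget -O /databricks/jars/${JAR} ${URL}
# Maven Central's .sha1 file holds the bare hex digest of the artifact
echo "$(wget -qO- ${URL}.sha1)  /databricks/jars/${JAR}" | sha1sum -c -
```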

docs/get-started/xgboost-examples/on-prem-cluster/kubernetes-scala.md

+1 -1

```diff
@@ -40,7 +40,7 @@ export SPARK_DOCKER_IMAGE=<gpu spark docker image repo and name>
 export SPARK_DOCKER_TAG=<spark docker image tag>
 
 pushd ${SPARK_HOME}
-wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-24.06/dockerfile/Dockerfile
+wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-24.08/dockerfile/Dockerfile
 
 # Optionally install additional jars into ${SPARK_HOME}/jars/
```
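Since this URL is the only place the guide pins a repository branch, a hedged variant that hoists the branch into a variable keeps future release bumps to a one-line change (the variable name is illustrative, and it assumes the `dockerfile/Dockerfile` path stays stable across release branches):

```bash
# Pin the examples branch once, then reuse it in the fetch below
EXAMPLES_BRANCH=branch-24.08

pushd ${SPARK_HOME}
wget https://github.com/NVIDIA/spark-rapids-examples/raw/${EXAMPLES_BRANCH}/dockerfile/Dockerfile
popd
```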

docs/get-started/xgboost-examples/prepare-package-data/preparation-python.md

+2 -2

```diff
@@ -5,7 +5,7 @@ For simplicity export the location to these jars. All examples assume the packag
 ### Download the jars
 
 Download the RAPIDS Accelerator for Apache Spark plugin jar
-* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0.jar)
+* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar)
 
 ### Build XGBoost Python Examples
 
@@ -16,4 +16,4 @@ Following this [guide](/docs/get-started/xgboost-examples/building-sample-apps/p
 You need to copy the dataset to `/opt/xgboost`. Use the following links to download the data.
 1. [Mortgage dataset](/docs/get-started/xgboost-examples/dataset/mortgage.md)
 2. [Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-3. [Agaricus dataset](https://gust.dev/r/xgboost-agaricus)
+3. [Agaricus dataset](https://github.com/dmlc/xgboost/tree/master/demo/data)
```
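The guide this file belongs to asks readers to "export the location to these jars"; a minimal sketch of that step with the updated 24.08.1 package (the `RAPIDS_JAR` variable name is an assumption for illustration):

```bash
# Fetch the plugin jar the guide now links to and record its location
wget -P /opt/xgboost https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar
export RAPIDS_JAR=/opt/xgboost/rapids-4-spark_2.12-24.08.1.jar
```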

docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md

+2 -2

```diff
@@ -5,7 +5,7 @@ For simplicity export the location to these jars. All examples assume the packag
 ### Download the jars
 
 1. Download the RAPIDS Accelerator for Apache Spark plugin jar
-   * [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.06.0/rapids-4-spark_2.12-24.06.0.jar)
+   * [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.08.1/rapids-4-spark_2.12-24.08.1.jar)
 
 ### Build XGBoost Scala Examples
 
@@ -16,4 +16,4 @@ Following this [guide](/docs/get-started/xgboost-examples/building-sample-apps/s
 You need to copy the dataset to `/opt/xgboost`. Use the following links to download the data.
 1. [Mortgage dataset](/docs/get-started/xgboost-examples/dataset/mortgage.md)
 2. [Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-3. [Agaricus dataset](https://gust.dev/r/xgboost-agaricus)
+3. [Agaricus dataset](https://github.com/dmlc/xgboost/tree/master/demo/data)
```

docs/img/guides/microbm.png

50.4 KB

examples/ML+DL-Examples/Spark-DL/criteo_train/Dockerfile

-229
This file was deleted.
