Commit d056339

Merge pull request #180 from NVIDIA/branch-22.06
merge branch 22.06 to main branch

2 parents: d2cf00b + b16fed6

File tree: 219 files changed, +8188 -2246 lines changed

Some content is hidden: large commits have some content hidden by default.

.github/workflows/auto-merge.yml (+4 -4)

@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
 on:
   pull_request_target:
     branches:
-      - branch-22.04
+      - branch-22.06
     types: [closed]
 
 jobs:
@@ -29,13 +29,13 @@ jobs:
     steps:
       - uses: actions/checkout@v2
         with:
-          ref: branch-22.04 # force to fetch from latest upstream instead of PR ref
+          ref: branch-22.06 # force to fetch from latest upstream instead of PR ref
 
       - name: auto-merge job
         uses: ./.github/workflows/auto-merge
         env:
           OWNER: NVIDIA
           REPO_NAME: spark-rapids-examples
-          HEAD: branch-22.04
-          BASE: branch-22.06
+          HEAD: branch-22.06
+          BASE: branch-22.08
           AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
New workflow file (+35; filename collapsed in this view)

@@ -0,0 +1,35 @@
+# Copyright (c) 2022, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# A workflow to check if PR got broken hyperlinks
+name: Check Markdown links
+
+on:
+  pull_request:
+    types: [opened, synchronize, reopened]
+
+jobs:
+  markdown-link-check:
+    runs-on: ubuntu-latest
+    steps:
+    - name: work around permission issue
+      run: git config --global --add safe.directory /github/workspace
+    - uses: actions/checkout@master
+    - uses: gaurav-nelson/github-action-markdown-link-check@v1
+      with:
+        max-depth: -1
+        use-verbose-mode: 'yes'
+        check-modified-files-only: 'yes'
+        config-file: '.github/workflows/markdown-links-check/markdown-links-check-config.json'
+        base-branch: 'main'
.github/workflows/markdown-links-check/markdown-links-check-config.json (new file, +6; path per the config-file setting above)

@@ -0,0 +1,6 @@
+{
+    "timeout": "15s",
+    "retryOn429": true,
+    "retryCount": 30,
+    "aliveStatusCodes": [200, 403]
+}
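The action above wraps the markdown-link-check npm CLI, which reads this same config file, so the check can be reproduced before opening a PR. A minimal local sketch, assuming Node.js/npx is available (the local invocation is not part of this commit):

``` bash
# Reproduce the CI link check locally with the repo's config.
# npx fetches the markdown-link-check CLI on demand (requires Node.js).
npx markdown-link-check \
  -c .github/workflows/markdown-links-check/markdown-links-check-config.json \
  -v README.md
```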

.gitignore (+1)

@@ -21,3 +21,4 @@
 .scala_dependencies
 .settings
 hs_err*.log
+target

README.md (+40 -77)

@@ -1,79 +1,42 @@
 # spark-rapids-examples
 
-A repo for Spark related utilities and examples using the Rapids Accelerator,including ETL, ML/DL, etc.
-
-Enterprise AI is built on ETL pipelines and relies on AI infrastructure to effectively integrate and
-process large amounts of data. One of the fundamental purposes of
-[RAPIDS Accelerator](https://nvidia.github.io/spark-rapids/Getting-Started/)
-is to effectively integrate large ETL and ML/DL pipelines. Rapids Accelerator for [Apache Spark](https://spark.apache.org/)
-offers seamless integration with Machine learning frameworks such XGBoost, PCA. Users can leverage the Apache Spark cluster
-with NVIDIA GPUs to accelerate the ETL pipelines and then use the same infrastructure to load the data frame
-into single or multiple GPUs across multiple nodes to train with GPU accelerated XGBoost or a PCA.
-In addition, if you are using a Deep learning framework to train your tabular data with the same Apache Spark cluster,
-we have leveraged NVIDIA’s NVTabular library to load and train the data across multiple nodes with GPUs.
-NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and
-easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
-We also add MIG support to YARN to allow CSPs to split an A100/A30 into multiple MIG
-devices and have them appear like a normal GPU.
-
-Please see the [Rapids Accelerator for Spark documentation](https://nvidia.github.io/spark-rapids/Getting-Started/) for supported
-Spark versions and requirements. It is recommended to set up Spark Cluster with JDK8.
-
-## Getting Started Guides
-
-### 1. Microbenchmark guide
-
-The microbenchmark on [RAPIDS Accelerator For Apache Spark](https://nvidia.github.io/spark-rapids/) is to identify,
-test and analyze the best queries which can be accelerated on the GPU. For detail information please refer to this
-[guide](/examples/micro-benchmarks).
-
-### 2. Xgboost examples guide
-
-We provide three similar Xgboost benchmarks, Mortgage, Taxi and Agaricus.
-Try one of the ["Getting Started Guides"](/examples/Spark-ETL+XGBoost).
-Please note that they target the Mortgage dataset as written with a few changes
-to `EXAMPLE_CLASS` and `dataPath`, they can be easily adapted with each other with different datasets.
-
-### 3. TensorFlow training on Horovod Spark example guide
-
-We provide a Criteo Benchmark to demo ETL and deep learning training on Horovod Spark, please refer to
-this [guide](/examples/Spark-DL/criteo_train).
-
-### 4. PCA example guide
-
-This is an example of the GPU accelerated PCA algorithm running on Spark. For detail information please refer to this
-[guide](/examples/Spark-cuML/pca).
-
-### 5. MIG support
-We provide some [guides](/examples/MIG-Support) about the Multi-Instance GPU (MIG) feature based on
-the NVIDIA Ampere architecture (such as NVIDIA A100 and A30) GPU.
-
-### 6. Spark Rapids UDF examples
-This is examples of the GPU accelerated UDF.
-refer to this
-[guide](/examples/RAPIDS-accelerated-UDFs).
-
-### 7. Spark cuSpatial
-This is a RapidsUDF examples to use [cuSpatial](https://github.com/rapidsai/cuspatial) library to solve the point-in-polygon problem. For detail information please refer to this [guide](/examples/Spark-cuSpatial).
-
-## API
-### 1. Xgboost examples API
-
-These guides focus on GPU related Scala and python API interfaces.
-- [Scala API](/docs/api-docs/xgboost-examples-api-docs/scala.md)
-- [Python API](/docs/api-docs/xgboost-examples-api-docs/python.md)
-
-## Troubleshooting
-You can trouble-shooting issues according to following guides.
-- [Trouble Shooting XGBoost](/docs/trouble-shooting/xgboost-examples-trouble-shooting.md)
-
-## Contributing
-See the [Contributing guide](CONTRIBUTING.md).
-
-## Contact Us
-
-Please see the [RAPIDS](https://rapids.ai/community.html) website for contact information.
-
-## License
-
-This content is licensed under the [Apache License 2.0](/LICENSE)
+This is the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) examples repo.
+RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code changes.
+You can download the latest version of RAPIDS Accelerator [here](https://nvidia.github.io/spark-rapids/docs/download.html).
+This repo contains examples and applications that showcases the performance and benefits of using
+RAPIDS Accelerator in data processing and machine learning pipelines.
+There are broadly four categories of examples in this repo:
+1. [SQL/Dataframe](./examples/SQL+DF-Examples)
+2. [Spark XGBoost](./examples/XGBoost-Examples)
+3. [Deep Learning/Machine Learning](./examples/ML+DL-Examples)
+4. [RAPIDS UDF](./examples/UDF-Examples)
+
+For more information on each of the examples please look into respective categories.
+
+Here is the list of notebooks in this repo:
+
+| | Category | Notebook Name | Description
+| ------------- | ------------- | ------------- | -------------
+| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins with up to 20x performance benefits
+| 2 | SQL/DF | Customer Churn | Data federation for modeling customer Churn with a sample telco customer data
+| 3 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
+| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
+| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
+| 6 | ML/DL | Criteo Training | ETL and deep learning training of the Criteo 1TB Click Logs dataset
+| 7 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
+| 8 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for Point in Polygon function using NYC Taxi pickup location dataset
+
+Here is the list of Apache Spark applications (Scala and PySpark) that
+can be built for running on GPU with RAPIDS Accelerator in this repo:
+
+| | Category | Notebook Name | Description
+| ------------- | ------------- | ------------- | -------------
+| 1 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
+| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
+| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
+| 4 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
+| 5 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for Point in Polygon function using NYC Taxi pickup location dataset
+| 6 | UDF | URL Decode | Decodes URL-encoded strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable/)
+| 7 | UDF | URL Encode | URL-encodes strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable/)
+| 8 | UDF | [CosineSimilarity](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java) | Computes the cosine similarity between two float vectors using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src)
+| 9 | UDF | [StringWordCount](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java) | Implements a Hive simple UDF using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src) to count words in strings
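As context for the new README's "no code changes" claim: the accelerator is enabled at submit time by attaching the plugin jar and setting the plugin config. A minimal sketch, using the 22.06.0 jar version referenced elsewhere in this commit; `your_app.py` is a hypothetical placeholder for an existing Spark job:

``` bash
# Attach the RAPIDS Accelerator plugin to an unmodified Spark application.
# rapids-4-spark_2.12-22.06.0.jar matches the version used in this commit;
# your_app.py is a placeholder for any existing Spark job.
spark-submit \
  --jars rapids-4-spark_2.12-22.06.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  your_app.py
```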

datasets/customer-churn.tar.gz (162 KB, binary file not shown)

docs/get-started/xgboost-examples/building-sample-apps/python.md (+1 -1)

@@ -21,4 +21,4 @@ Two files are required by PySpark:
 
 + *main.py*
 
-   entrypoint for PySpark, you can find it in 'spark-rapids-examples/Spark-ETL+XGBoost/examples' folder
+   entrypoint for PySpark, you can find it in 'spark-rapids-examples/examples/XGBoost-Examples' folder

docs/get-started/xgboost-examples/building-sample-apps/scala.md (+2 -2)

@@ -8,13 +8,13 @@ Follow these steps to build the Scala jars:
 
 ``` bash
 git clone https://github.com/NVIDIA/spark-rapids-examples.git
-cd spark-rapids-examples/examples/Spark-ETL+XGBoost
+cd spark-rapids-examples/examples/XGBoost-Examples
 mvn package
 ```
 
 ## The generated Jars
 
-Let's assume LATEST_VERSION is **0.2.2**. The build process will generate two jars as belows,
+Let's assume LATEST_VERSION is **0.2.3**. The build process will generate two jars as belows,
 
 + *aggregator/target/sample_xgboost_apps-${LATEST_VERSION}.jar*
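Once built, the sample apps jar is run through spark-submit. A hedged sketch under the assumptions that LATEST_VERSION is 0.2.3 as above and that EXAMPLE_CLASS stands in for one of the example main classes (the XGBoost-Examples guides list the real class names):

``` bash
# Sketch: submit one of the built XGBoost sample apps on a Spark cluster.
# EXAMPLE_CLASS is a placeholder; see the XGBoost-Examples guides for the
# actual main class of the Mortgage, Taxi, or Agaricus app.
LATEST_VERSION=0.2.3
spark-submit \
  --class ${EXAMPLE_CLASS} \
  aggregator/target/sample_xgboost_apps-${LATEST_VERSION}.jar
```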

docs/get-started/xgboost-examples/csp/aws/ec2.md (+1 -2)

@@ -132,10 +132,9 @@ $SPARK_HOME/sbin/start-slave.sh <master-spark-URL>
 
 Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md)
 
-Copy cudf and rapids jars to `$SPARK_HOME/jars`
+Copy rapids jars to `$SPARK_HOME/jars`
 
 ``` bash
-cp $CUDF_JAR $SPARK_HOME/jars/
 cp $RAPIDS_JAR $SPARK_HOME/jars/
 ```
 
docs/get-started/xgboost-examples/csp/databricks/databricks.md (+3 -3)

@@ -49,7 +49,7 @@ cluster.
 
 - [Databricks 10.4 LTS
   ML](https://docs.databricks.com/release-notes/runtime/9.1ml.html#system-environment) has CUDA 11
-  installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML. In this case use
+  installed. Users will need to use 22.06.0 or later on Databricks 10.4 LTS ML. In this case use
   [generate-init-script-10.4.ipynb](generate-init-script-10.4.ipynb) which will install
   the RAPIDS Spark plugin.
 
@@ -108,13 +108,13 @@ Import the GPU Mortgage Example Notebook
 ---------------------------
 
 1. See [Managing Notebooks](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html) on how to import a notebook.
-2. Import the example notebook: [XGBoost4j-Spark mortgage notebook](/examples/Spark-ETL+XGBoost/mortgage/notebooks/scala/mortgage-gpu.ipynb)
+2. Import the example notebook: [XGBoost4j-Spark mortgage notebook](../../../../../examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-gpu.ipynb)
 3. Inside the mortgage example notebook, update the data paths from
    "/data/datasets/mortgage-small/train" to "dbfs:/FileStore/tables/mortgage/csv/train/mortgage_train_merged.csv"
    "/data/datasets/mortgage-small/eval" to "dbfs:/FileStore/tables/mortgage/csv/test/mortgage_eval_merged.csv"
 
 The example notebook comes with the following configuration, you can adjust this according to your setup.
-See supported configuration options here: [xgboost parameters](/examples/Spark-ETL+XGBoost/app-parameters/supported_xgboost_parameters_python.md)
+See supported configuration options here: [xgboost parameters](../../../../../examples/XGBoost-Examples/app-parameters/supported_xgboost_parameters_python.md)
 
 ``` bash
 params = {

docs/get-started/xgboost-examples/csp/databricks/generate-init-script-10.4.ipynb (+11 -10)

@@ -24,10 +24,9 @@
 "source": [
 "%sh\n",
 "cd ../../dbfs/FileStore/jars/\n",
-"sudo wget -O cudf-22.04.0-cuda11.jar https://repo1.maven.org/maven2/ai/rapids/cudf/22.04.0/cudf-22.04.0-cuda11.jar\n",
-"sudo wget -O rapids-4-spark_2.12-22.04.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.04.0/rapids-4-spark_2.12-22.04.0.jar\n",
-"sudo wget -O xgboost4j_3.0-1.4.2-0.3.0.jar https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/1.4.2-0.3.0/xgboost4j_3.0-1.4.2-0.3.0.jar\n",
-"sudo wget -O xgboost4j-spark_3.0-1.4.2-0.3.0.jar https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.4.2-0.3.0/xgboost4j-spark_3.0-1.4.2-0.3.0.jar\n",
+"sudo wget -O rapids-4-spark_2.12-22.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar\n",
+"sudo wget -O xgboost4j-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.6.1/xgboost4j-gpu_2.12-1.6.1.jar\n",
+"sudo wget -O xgboost4j-spark-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.6.1/xgboost4j-spark-gpu_2.12-1.6.1.jar\n",
 "ls -ltr\n",
 "\n",
 "# Your Jars are downloaded in dbfs:/FileStore/jars directory"
@@ -57,10 +56,12 @@
 "source": [
 "dbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n",
 "#!/bin/bash\n",
-"sudo cp /dbfs/FileStore/jars/xgboost4j_3.0-1.4.2-0.3.0.jar /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar\n",
-"sudo cp /dbfs/FileStore/jars/cudf-22.04.0-cuda11.jar /databricks/jars/\n",
-"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.04.0.jar /databricks/jars/\n",
-"sudo cp /dbfs/FileStore/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\"\"\", True)"
+"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar\n",
+"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\n",
+"\n",
+"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.6.1.jar /databricks/jars/\n",
+"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.06.0.jar /databricks/jars/\n",
+"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar /databricks/jars/\"\"\", True)"
 ]
 },
 {
@@ -131,8 +132,8 @@
 "\n",
 "1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
 "2. Reboot the cluster\n",
-"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
-"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.04/examples/Spark-ETL+XGBoost/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
+"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
+"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
 "5. Inside the mortgage example notebook, update the data paths\n",
 " `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
 " `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
