|
# spark-rapids-examples

|
This is the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) examples repo.
The RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code changes.
You can download the latest version of the RAPIDS Accelerator [here](https://nvidia.github.io/spark-rapids/docs/download.html).
This repo contains examples and applications that showcase the performance and benefits of using
the RAPIDS Accelerator in data processing and machine learning pipelines.
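
Because the RAPIDS Accelerator ships as a Spark plugin, enabling it is a configuration change rather than a code change. A minimal `spark-submit` sketch is shown below; the jar file name, version, and application name are placeholders, so substitute the jar you downloaded for your Spark and CUDA setup:

```shell
# Sketch of enabling the RAPIDS Accelerator on spark-submit.
# The jar name below is a placeholder -- use the jar downloaded
# from the link above that matches your environment.
spark-submit \
  --jars rapids-4-spark_2.12-<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  your_spark_app.py
```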
There are broadly four categories of examples in this repo:
1. [SQL/Dataframe](./examples/SQL+DF-Examples)
2. [Spark XGBoost](./examples/XGBoost-Examples)
3. [Deep Learning/Machine Learning](./examples/ML+DL-Examples)
4. [RAPIDS UDF](./examples/UDF-Examples)

For more information on each example, please look into the respective category.

Here is the list of notebooks in this repo:

|   | Category | Notebook Name | Description |
| ------------- | ------------- | ------------- | ------------- |
| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins with up to 20x performance benefits |
| 2 | SQL/DF | Customer Churn | Data federation for modeling customer churn with a sample telco customer dataset |
| 3 | XGBoost | Agaricus (Scala) | Uses the XGBoost classifier to train a model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom) |
| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with the [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) |
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with the [NYC taxi trips dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) |
| 6 | ML/DL | Criteo Training | ETL and deep learning training of the Criteo 1TB Click Logs dataset |
| 7 | ML/DL | PCA End-to-End | Spark MLlib-based PCA example to train and transform with a synthetic dataset |
| 8 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for the point-in-polygon function using the NYC Taxi pickup location dataset |

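The PCA End-to-End notebook trains and transforms a synthetic dataset with Spark MLlib. Purely for intuition about what that computation does, here is a CPU-only NumPy sketch of PCA on a comparable synthetic dataset; the variable names and data here are illustrative, not taken from the repo's code:

```python
import numpy as np

# CPU-only sketch of PCA; the repo's example runs Spark MLlib PCA
# accelerated on GPUs by the RAPIDS Accelerator.
rng = np.random.default_rng(0)

# Synthetic dataset: 200 samples that live (up to small noise)
# in a 2-D subspace of a 5-D space.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]            # top-2 principal axes (rows)
projected = Xc @ components.T  # reduced representation, shape (200, 2)

# Fraction of total variance captured by each component.
explained = (S ** 2) / (S ** 2).sum()
```

Because the data is nearly rank-2 by construction, the first two components capture almost all of the variance.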
Here is the list of Apache Spark applications (Scala and PySpark) in this repo that
can be built to run on GPUs with the RAPIDS Accelerator:

|   | Category | Application Name | Description |
| ------------- | ------------- | ------------- | ------------- |
| 1 | XGBoost | Agaricus (Scala) | Uses the XGBoost classifier to train a model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom) |
| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with the [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) |
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with the [NYC taxi trips dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) |
| 4 | ML/DL | PCA End-to-End | Spark MLlib-based PCA example to train and transform with a synthetic dataset |
| 5 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for the point-in-polygon function using the NYC Taxi pickup location dataset |
| 6 | UDF | URL Decode | Decodes URL-encoded strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable/) |
| 7 | UDF | URL Encode | URL-encodes strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable/) |
| 8 | UDF | [CosineSimilarity](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java) | Computes the cosine similarity between two float vectors using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src) |
| 9 | UDF | [StringWordCount](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java) | Implements a Hive simple UDF using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src) to count words in strings |
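
The cuSpatial application answers the point-in-polygon question at scale on GPUs. For intuition only, the classic CPU-side ray-casting test the problem is named for can be sketched in a few lines of plain Python; this is an illustrative sketch, not the cuSpatial implementation:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: a point is inside iff a horizontal ray from it
    crosses the polygon boundary an odd number of times."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y level?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A unit square as a ring of vertices
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

Running this per taxi pickup point against polygon sets is exactly the workload the GPU version parallelizes.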