From e000fcd18a8dd8ba3c08c3a30180e0e8a96e5aa4 Mon Sep 17 00:00:00 2001
From: Andrea Francis Soria Jimenez
Date: Mon, 19 Aug 2024 08:05:17 -0400
Subject: [PATCH] doc: Read parquet files with PySpark (#3020)

* Basic doc for pyspark

* Fix toctree
---
 docs/source/_toctree.yml       |  2 ++
 docs/source/parquet_process.md |  1 +
 docs/source/pyspark.md         | 59 ++++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+)
 create mode 100644 docs/source/pyspark.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 2c8ddcc30..64b7ab6a2 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -46,6 +46,8 @@
     title: Polars
   - local: mlcroissant
     title: mlcroissant
+  - local: pyspark
+    title: PySpark
 - title: Conceptual Guides
   sections:
   - local: configs_and_splits
diff --git a/docs/source/parquet_process.md b/docs/source/parquet_process.md
index 78683d9ea..9a7c5602a 100644
--- a/docs/source/parquet_process.md
+++ b/docs/source/parquet_process.md
@@ -12,3 +12,4 @@ There are several different libraries you can use to work with the published Par
 - [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
 - [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library
 - [mlcroissant](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant), a library for loading datasets from Croissant metadata
+- [pyspark](https://spark.apache.org/docs/latest/api/python), the Python API for Apache Spark
diff --git a/docs/source/pyspark.md b/docs/source/pyspark.md
new file mode 100644
index 000000000..fb20b052a
--- /dev/null
+++ b/docs/source/pyspark.md
@@ -0,0 +1,59 @@
+# PySpark
+
+[pyspark](https://spark.apache.org/docs/latest/api/python) is the Python API for Apache Spark, enabling large-scale data processing and real-time analytics in a distributed environment.
+
+<Tip>
+
+For a detailed guide on how to analyze datasets on the Hub with PySpark, check out this [blog post](https://huggingface.co/blog/asoria/pyspark-hugging-face-datasets).
+
+</Tip>
+
+To start working with Parquet files in PySpark, you'll first need to add the file(s) to a Spark context. Below is an example of how to read a single Parquet file:
+
+```py
+from pyspark import SparkFiles
+from pyspark.sql import SparkSession
+
+# Initialize a Spark session
+spark = SparkSession.builder.appName("WineReviews").getOrCreate()
+
+# Add the Parquet file to the Spark context
+spark.sparkContext.addFile("https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet/default/train/0.parquet")
+
+# Read the Parquet file into a DataFrame
+df = spark.read.parquet(SparkFiles.get("0.parquet"))
+```
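+
+Before moving on to sharded datasets, you may want a quick sanity check that the file loaded correctly. This is just an optional preview of the `df` created above:
+
+```py
+# Preview a few rows to confirm the Parquet file was read correctly
+df.show(5, truncate=60)
+```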
+
+If your dataset is sharded into multiple Parquet files, you'll need to add each file to the Spark context individually. Here's how to do it:
+
+```py
+import requests
+
+# Fetch the URLs of the Parquet files for the train split
+r = requests.get('https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet')
+train_parquet_files = r.json()['default']['train']
+
+# Add each Parquet file to the Spark context
+for url in train_parquet_files:
+    spark.sparkContext.addFile(url)
+
+# Read all Parquet files into a single DataFrame
+df = spark.read.parquet(SparkFiles.getRootDirectory() + "/*.parquet")
+```
+
+Once you've loaded the data into a PySpark DataFrame, you can perform various operations to explore and analyze it:
+
+```py
+# Show the shape of the dataset
+print(f"Shape of the dataset: {df.count()}, {len(df.columns)}")
+
+# Display the first 10 rows
+df.show(n=10)
+
+# Get a statistical summary of the data
+df.describe().show()
+
+# Print the schema of the DataFrame
+df.printSchema()
+```
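+
+From here you can go beyond basic inspection and run regular DataFrame transformations or Spark SQL queries on the data. The example below is only a sketch: the `country` and `points` columns are assumptions about this particular dataset, so check the output of `df.printSchema()` and adjust the column names to match the actual schema:
+
+```py
+from pyspark.sql import functions as F
+
+# Example aggregation: average points per country
+# (column names are assumed; verify them with df.printSchema() first)
+result = (
+    df.groupBy("country")
+    .agg(F.avg("points").alias("avg_points"), F.count("*").alias("num_reviews"))
+    .orderBy(F.desc("avg_points"))
+)
+result.show(10)
+
+# The same query with Spark SQL, using a temporary view
+df.createOrReplaceTempView("wine_reviews")
+spark.sql("""
+    SELECT country, AVG(points) AS avg_points, COUNT(*) AS num_reviews
+    FROM wine_reviews
+    GROUP BY country
+    ORDER BY avg_points DESC
+""").show(10)
+```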