From e000fcd18a8dd8ba3c08c3a30180e0e8a96e5aa4 Mon Sep 17 00:00:00 2001
From: Andrea Francis Soria Jimenez
Date: Mon, 19 Aug 2024 08:05:17 -0400
Subject: [PATCH] doc: Read parquet files with PySpark (#3020)

* Basic doc for pyspark

* Fix toctree
---
 docs/source/_toctree.yml       |  2 ++
 docs/source/parquet_process.md |  1 +
 docs/source/pyspark.md         | 59 ++++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+)
 create mode 100644 docs/source/pyspark.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 2c8ddcc30..64b7ab6a2 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -46,6 +46,8 @@
     title: Polars
   - local: mlcroissant
     title: mlcroissant
+  - local: pyspark
+    title: PySpark
 - title: Conceptual Guides
   sections:
   - local: configs_and_splits
diff --git a/docs/source/parquet_process.md b/docs/source/parquet_process.md
index 78683d9ea..9a7c5602a 100644
--- a/docs/source/parquet_process.md
+++ b/docs/source/parquet_process.md
@@ -12,3 +12,4 @@ There are several different libraries you can use to work with the published Par
 - [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
 - [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library
 - [mlcroissant](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant), a library for loading datasets from Croissant metadata
+- [pyspark](https://spark.apache.org/docs/latest/api/python), the Python API for Apache Spark
diff --git a/docs/source/pyspark.md b/docs/source/pyspark.md
new file mode 100644
index 000000000..fb20b052a
--- /dev/null
+++ b/docs/source/pyspark.md
@@ -0,0 +1,59 @@
+# PySpark
+
+[pyspark](https://spark.apache.org/docs/latest/api/python) is the Python API for Apache Spark, enabling large-scale data processing and real-time analytics in a distributed environment.
+
+<Tip>
+
+For a detailed guide on how to analyze datasets on the Hub with PySpark, check out this [blog post](https://huggingface.co/blog/asoria/pyspark-hugging-face-datasets).
+
+</Tip>
+
+To start working with Parquet files in PySpark, you'll first need to add the file(s) to a Spark context. Below is an example of how to read a single Parquet file:
+
+```py
+from pyspark import SparkFiles
+from pyspark.sql import SparkSession
+
+# Initialize a Spark session
+spark = SparkSession.builder.appName("WineReviews").getOrCreate()
+
+# Add the Parquet file to the Spark context
+spark.sparkContext.addFile("https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet/default/train/0.parquet")
+
+# Read the Parquet file into a DataFrame
+df = spark.read.parquet(SparkFiles.get("0.parquet"))
+```
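+
+Before moving on to sharded datasets, you may want a quick sanity check that the file loaded correctly. This is just an optional preview of the `df` created above:
+
+```py
+# Preview a few rows to confirm the Parquet file was read correctly
+df.show(5, truncate=60)
+```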
+
+If your dataset is sharded into multiple Parquet files, you'll need to add each file to the Spark context individually. Here's how to do it:
+
+```py
+import requests
+
+# Fetch the URLs of the Parquet files for the train split
+r = requests.get('https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet')
+train_parquet_files = r.json()['default']['train']
+
+# Add each Parquet file to the Spark context
+for url in train_parquet_files:
+    spark.sparkContext.addFile(url)
+
+# Read all Parquet files into a single DataFrame
+df = spark.read.parquet(SparkFiles.getRootDirectory() + "/*.parquet")
+```
+
+Once you've loaded the data into a PySpark DataFrame, you can perform various operations to explore and analyze it:
+
+```py
+# Show the shape of the dataset
+print(f"Shape of the dataset: {df.count()}, {len(df.columns)}")
+
+# Display the first 10 rows
+df.show(n=10)
+
+# Get a statistical summary of the data
+df.describe().show()
+
+# Print the schema of the DataFrame
+df.printSchema()
+```
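+
+From here you can go beyond basic inspection and run regular DataFrame transformations or Spark SQL queries on the data. The example below is only a sketch: the `country` and `points` columns are assumptions about this particular dataset, so check the output of `df.printSchema()` and adjust the column names to match the actual schema:
+
+```py
+from pyspark.sql import functions as F
+
+# Example aggregation: average points per country
+# (column names are assumed; verify them with df.printSchema() first)
+result = (
+    df.groupBy("country")
+    .agg(F.avg("points").alias("avg_points"), F.count("*").alias("num_reviews"))
+    .orderBy(F.desc("avg_points"))
+)
+result.show(10)
+
+# The same query with Spark SQL, using a temporary view
+df.createOrReplaceTempView("wine_reviews")
+spark.sql("""
+    SELECT country, AVG(points) AS avg_points, COUNT(*) AS num_reviews
+    FROM wine_reviews
+    GROUP BY country
+    ORDER BY avg_points DESC
+""").show(10)
+```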