From 09c35408dcbdff6f373d33a4169d3c1e6791f762 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Tue, 9 Jul 2024 18:59:14 +0200
Subject: [PATCH] Docs: mention partial parquet conversion for big row groups
 (#2982)

mention partial parquet conversion for big row groups
---
 docs/source/parquet.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/docs/source/parquet.md b/docs/source/parquet.md
index 77c3ee7529..a5340ea9fe 100644
--- a/docs/source/parquet.md
+++ b/docs/source/parquet.md
@@ -199,10 +199,18 @@ To read and query the Parquet files, take a look at the [Query datasets from the
 
 ## Partially converted datasets
 
-The Parquet version can be partial if the dataset is not already in Parquet format or if it is bigger than 5GB.
+The Parquet version can be partial in two cases:
+- if the dataset is already in Parquet format but it contains row groups bigger than the recommended size (100-300MB uncompressed)
+- if the dataset is not already in Parquet format or if it is bigger than 5GB.
 
 In that case the Parquet files are generated up to 5GB and placed in a split directory prefixed with "partial", e.g. "partial-train" instead of "train".
 
+You can check the row groups size directly on Hugging Face using the Parquet metadata sidebar, for example [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main/data/CC-MAIN-2013-20?show_file_info=data%2FCC-MAIN-2013-20%2Ftrain-00000-of-00014.parquet):
+
+![clic-parquet-metadata-sidebar](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets-server/clic-parquet-metadata-sidebar.png)
+
+![parquet-metadata-sidebar-total-byte-size](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets-server/parquet-metadata-sidebar-total-byte-size.png)
+
 ## Parquet-native datasets
 
 When the dataset is already in Parquet format, the data are not converted and the files in `refs/convert/parquet` are links to the original files. This rule suffers an exception to ensure the dataset viewer API to stay fast: if the [row group](https://parquet.apache.org/docs/concepts/) size of the original Parquet files is too big, new Parquet files are generated.
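
For reference, a minimal sketch (not part of the patch) of how the row group sizes mentioned in the added docs could be inspected locally with pyarrow, as an alternative to the Parquet metadata sidebar; the file name is taken from the linked example and is purely illustrative:

```python
# Sketch: inspect Parquet row group sizes locally with pyarrow.
# The file path is a hypothetical example; point it at any Parquet shard.
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("train-00000-of-00014.parquet")
metadata = parquet_file.metadata

for i in range(metadata.num_row_groups):
    row_group = metadata.row_group(i)
    # total_byte_size is the uncompressed size of the row group's column data
    size_mb = row_group.total_byte_size / (1024 * 1024)
    print(f"row group {i}: {row_group.num_rows} rows, {size_mb:.1f} MB uncompressed")
```

Row groups reporting well above the 100-300MB uncompressed range mentioned in the docs are the ones that would trigger the partial re-generation described in this patch.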