Add parquet to api (#907)

* docs: ✏️ add /parquet API endpoints
* docs: ✏️ add a sentence about accessing the parquet through API
huggingface · Aug 1, 2023 · 54d819f · julien-c · Aug 22, 2023 · 54d819f
1 parent 4aba098
commit 54d819f

Showing 2 changed files with 4 additions and 0 deletions.
diff --git a/docs/hub/api.md b/docs/hub/api.md
@@ -13,6 +13,8 @@ The base URL for those endpoints below is `https://huggingface.co`. For example,
 | /api/models-tags-by-type   GET                                               	| Gets all the available model tags hosted in the Hub                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    	| `get_model_tags()`                	|                                                                                                                                                                       	|   	|
 | /api/datasets     GET                                                        	| Get information from all datasets in the Hub. The response is paginated, use the [`Link` header](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28#link-header) to get the following pages. You can specify additional parameters to have more specific results. - `search`: Filter based on substrings for repos and their usernames, such as `pets` or `microsoft`   - `author`: Filter datasets by an other or organization, such as `huggingface` or `microsoft` - `filter`: Filter based on tags, such as `task_categories:text-classification` or `languages:en`. - `sort`: Property to use when sorting, such as `downloads` or `author`. - `direction`: Direction in which to sort, such as `-1` for descending, and anything else for ascending. - `limit`: Limit the number of datasets fetched.  - `full`: Whether to fetch most dataset data, such as all tags, the files, etc.                         	| `list_datasets()`                 	| ```params= {   "search":"search", "author":"author", "filter":"filter", "sort":"sort", "direction":"direction", "limit":"limit", "full":"full", "config":"config"}``` 	|   	|
 | /api/datasets/{repo_id}   /api/datasets/{repo_id}/revision/{revision}    GET 	| Get all information for a specific dataset.   - `full`: Whether to fetch most dataset data, such as all tags, the files, etc.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          	| `dataset_info(repo_id, revision)` 	| ```headers = { "authorization" :  "Bearer $token", "full" : "full"  }```                                                                                              	|   	|
+| /api/datasets/{repo_id}/parquet   /api/datasets/{repo_id}/parquet/{config}/{split}       GET 	| Get the list of auto-converted parquet files.                                                                                                                                                                                                                                                                                                                                                                                                                                                             	|  	| ```headers = { "authorization" :  "Bearer $token", "full" : "full"  }```                                                                                              	|   	|
+| /api/datasets/{repo_id}/parquet/{config}/{split}/{n}.parquet       GET 	| Get the nth shard of the auto-converted parquet files.                                                                                                                                                                                                                                                                                                                                                                                                                                                             	|  	| ```headers = { "authorization" :  "Bearer $token", "full" : "full"  }```                                                                                              	|   	|
 | /api/datasets-tags-by-type   GET                                             	| Gets all the available dataset tags hosted in the Hub                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  	| `get_dataset_tags()`              	|                                                                                                                                                                       	|   	|
 | /api/spaces     GET                                                          	| Get information from all Spaces in the Hub. The response is paginated, use the [`Link` header](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28#link-header) to get the following pages. You can specify additional parameters to have more specific results.   - `search`: Filter based on substrings for repos and their usernames, such as `resnet` or `microsoft` - `author`: Filter models by an author or organization, such as `huggingface` or `microsoft` - `filter`: Filter based on tags, such as `text-classification` or `spacy`. - `sort`: Property to use when sorting, such as `downloads` or `author`.  - `direction`: Direction in which to sort, such as `-1` for descending, and anything else for ascending. - `limit`: Limit the number of models fetched.  - `full`: Whether to fetch most model data, such as all tags, the files, etc.  - `config`: Whether to also fetch the repo config. 	| `list_spaces()`                   	| ```params= {   "search":"search", "author":"author", "filter":"filter", "sort":"sort", "direction":"direction", "limit":"limit", "full":"full", "config":"config"}``` 	|   	|
 | /api/spaces/{repo_id}   /api/spaces/{repo_id}/revision/{revision}    GET     	| Get all information for a specific model.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              	| `space_info(repo_id, revision)`   	| ```headers = { "authorization" :  "Bearer $token" }```                                                                                                                	|   	|

diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md
@@ -15,6 +15,8 @@ You can share a specific row by clicking on it, and then copying the URL in the
 
 Every dataset is auto-converted to the Parquet format. Click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset with libraries such as Polars, Pandas or DuckDB.
 
+You can also access the list of Parquet files programmatically using the [API](./api#endpoints-table): https://huggingface.co/api/datasets/glue/parquet.
+
 ## Very large datasets
 
 For datasets >5GB, we only auto-convert to Parquet the first ~5GB of the dataset.
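
As an illustration of the endpoints this commit documents, here is a minimal sketch in Python that builds the `/parquet` URLs and fetches the file list. The helper names `parquet_url` and `list_parquet_files` are hypothetical and exist only for this example. The URL patterns and the `authorization: Bearer` header come from the table rows added above. The shape of the JSON response is not described in this commit, so the sketch returns the decoded JSON without interpreting it.

```python
import json
import urllib.request
from urllib.parse import quote

# Base URL stated in docs/hub/api.md.
BASE_URL = "https://huggingface.co"


def parquet_url(repo_id, config=None, split=None, shard=None):
    """Build a /parquet endpoint URL (hypothetical helper, not part of any library).

    - repo_id only             -> list of auto-converted Parquet files
    - repo_id + config + split -> list restricted to one config and split
    - ... + shard              -> the nth shard, as `{n}.parquet`
    """
    if (config is None) != (split is None):
        raise ValueError("config and split must be given together")
    if shard is not None and config is None:
        raise ValueError("shard requires config and split")

    # Keep "/" unescaped so namespaced ids such as "user/dataset" stay intact.
    url = f"{BASE_URL}/api/datasets/{quote(repo_id, safe='/')}/parquet"
    if config is not None:
        url += f"/{quote(config)}/{quote(split)}"
    if shard is not None:
        url += f"/{shard}.parquet"
    return url


def list_parquet_files(repo_id, token=None):
    """GET the Parquet file list and return the decoded JSON as-is.

    The response structure is an unknown here; inspect it before relying on it.
    """
    request = urllib.request.Request(parquet_url(repo_id))
    if token:
        # Header format taken from the table rows in this diff.
        request.add_header("authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Building URLs needs no network access.
    print(parquet_url("glue"))
    print(parquet_url("glue", "cola", "train", 0))
```

Calling `list_parquet_files("glue")` performs the same request as opening `https://huggingface.co/api/datasets/glue/parquet` in a browser. A token should only be needed for datasets that require authentication.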