Add parquet to api (#907)
* docs: ✏️ add /parquet API endpoints

* docs: ✏️ add a sentence about accessing the parquet through API
severo authored Aug 1, 2023
1 parent 4aba098 commit 54d819f
Showing 2 changed files with 4 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/hub/api.md
@@ -13,6 +13,8 @@ The base URL for those endpoints below is `https://huggingface.co`. For example,
| /api/models-tags-by-type GET | Gets all the available model tags hosted in the Hub | `get_model_tags()` | | |
| /api/datasets GET | Get information from all datasets in the Hub. The response is paginated, use the [`Link` header](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28#link-header) to get the following pages. You can specify additional parameters to have more specific results. - `search`: Filter based on substrings for repos and their usernames, such as `pets` or `microsoft` - `author`: Filter datasets by an author or organization, such as `huggingface` or `microsoft` - `filter`: Filter based on tags, such as `task_categories:text-classification` or `languages:en`. - `sort`: Property to use when sorting, such as `downloads` or `author`. - `direction`: Direction in which to sort, such as `-1` for descending, and anything else for ascending. - `limit`: Limit the number of datasets fetched. - `full`: Whether to fetch most dataset data, such as all tags, the files, etc. | `list_datasets()` | ```params= { "search":"search", "author":"author", "filter":"filter", "sort":"sort", "direction":"direction", "limit":"limit", "full":"full", "config":"config"}``` | |
| /api/datasets/{repo_id} /api/datasets/{repo_id}/revision/{revision} GET | Get all information for a specific dataset. - `full`: Whether to fetch most dataset data, such as all tags, the files, etc. | `dataset_info(repo_id, revision)` | ```headers = { "authorization" : "Bearer $token", "full" : "full" }``` | |
| /api/datasets/{repo_id}/parquet /api/datasets/{repo_id}/parquet/{config}/{split} GET | Get the list of auto-converted parquet files. | | ```headers = { "authorization" : "Bearer $token", "full" : "full" }``` | |
| /api/datasets/{repo_id}/parquet/{config}/{split}/{n}.parquet GET | Get the nth shard of the auto-converted parquet files. | | ```headers = { "authorization" : "Bearer $token", "full" : "full" }``` | |
| /api/datasets-tags-by-type GET | Gets all the available dataset tags hosted in the Hub | `get_dataset_tags()` | | |
| /api/spaces GET | Get information from all Spaces in the Hub. The response is paginated, use the [`Link` header](https://docs.github.com/en/rest/guides/using-pagination-in-the-rest-api?apiVersion=2022-11-28#link-header) to get the following pages. You can specify additional parameters to have more specific results. - `search`: Filter based on substrings for repos and their usernames, such as `resnet` or `microsoft` - `author`: Filter models by an author or organization, such as `huggingface` or `microsoft` - `filter`: Filter based on tags, such as `text-classification` or `spacy`. - `sort`: Property to use when sorting, such as `downloads` or `author`. - `direction`: Direction in which to sort, such as `-1` for descending, and anything else for ascending. - `limit`: Limit the number of models fetched. - `full`: Whether to fetch most model data, such as all tags, the files, etc. - `config`: Whether to also fetch the repo config. | `list_spaces()` | ```params= { "search":"search", "author":"author", "filter":"filter", "sort":"sort", "direction":"direction", "limit":"limit", "full":"full", "config":"config"}``` | |
| /api/spaces/{repo_id} /api/spaces/{repo_id}/revision/{revision} GET | Get all information for a specific Space. | `space_info(repo_id, revision)` | ```headers = { "authorization" : "Bearer $token" }``` | |
2 changes: 2 additions & 0 deletions docs/hub/datasets-viewer.md
@@ -15,6 +15,8 @@ You can share a specific row by clicking on it, and then copying the URL in the

Every dataset is auto-converted to the Parquet format. Click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset with libraries such as Polars, Pandas or DuckDB.

You can also access the list of Parquet files programmatically using the [API](./api#endpoints-table): https://huggingface.co/api/datasets/glue/parquet.

julien-c Aug 22, 2023

Member

Hub API, to differentiate it vs. the datasets-server?


## Very large datasets

For datasets >5GB, we only auto-convert to Parquet the first ~5GB of the dataset.
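
As an illustration of the `/parquet` endpoints added in this commit, here is a minimal Python sketch that builds the three URL shapes listed in the table. The helper names `parquet_url` and `fetch_json` are hypothetical and not part of any Hub library. The sketch also assumes that the listing endpoints return JSON and that the optional token is sent as a `Bearer` value in the `authorization` header, as the table's header column suggests. Whether shard numbering starts at 0 is not stated in the diff.

```python
import json
import urllib.request
from typing import Optional

# Base URL stated in docs/hub/api.md.
BASE_URL = "https://huggingface.co"


def parquet_url(
    repo_id: str,
    config: Optional[str] = None,
    split: Optional[str] = None,
    shard: Optional[int] = None,
) -> str:
    """Build a URL for one of the /parquet endpoints (hypothetical helper)."""
    url = f"{BASE_URL}/api/datasets/{repo_id}/parquet"
    if config is None and split is None and shard is None:
        # List the Parquet files for the whole dataset.
        return url
    if config is None or split is None:
        raise ValueError("config and split must be given together")
    url += f"/{config}/{split}"
    if shard is not None:
        # Address a single shard: .../{config}/{split}/{n}.parquet
        url += f"/{shard}.parquet"
    return url


def fetch_json(url: str, token: Optional[str] = None):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    headers = {"authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URLs; no network request is made here.
    print(parquet_url("glue"))
    print(parquet_url("glue", "cola", "train"))
    print(parquet_url("glue", "cola", "train", 0))
```

Calling `fetch_json(parquet_url("glue"))` would request the same URL quoted in the `datasets-viewer.md` change. The structure of the response is not described in this commit, so inspect it before relying on particular keys.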
