From c2252ebd2c5e1d520f6a168b4d94ae52f348ace6 Mon Sep 17 00:00:00 2001 From: Julien Chaumond Date: Thu, 11 Apr 2024 10:04:45 +0200 Subject: [PATCH] Datasets Server => Dataset Viewer (#1979) * Datasets Server => Dataset Viewer * Apply suggestions from code review Co-authored-by: Sylvain Lesage --------- Co-authored-by: Sylvain Lesage --- _blog.yml | 2 +- duckdb-nsql-7b.md | 10 +++++----- hub-duckdb.md | 6 +++--- huggy-lingo.md | 6 +++--- huggylingo.md | 6 +++--- zh/huggy-lingo.md | 6 +++--- 6 files changed, 18 insertions(+), 18 deletions(-) diff --git a/_blog.yml b/_blog.yml index cce1000f4a..15ea8989d8 100644 --- a/_blog.yml +++ b/_blog.yml @@ -3716,7 +3716,7 @@ - open-source-collab - local: duckdb-nsql-7b - title: "Text2SQL using Hugging Face Datasets Server API and Motherduck DuckDB-NSQL-7B" + title: "Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B" author: asoria thumbnail: /blog/assets/duckdb-nsql-7b/thumbnail.png date: April 4, 2024 diff --git a/duckdb-nsql-7b.md b/duckdb-nsql-7b.md index 987491fe45..3be45834dc 100644 --- a/duckdb-nsql-7b.md +++ b/duckdb-nsql-7b.md @@ -1,5 +1,5 @@ --- -title: "Text2SQL using Hugging Face Datasets Server API and Motherduck DuckDB-NSQL-7B" +title: "Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B" thumbnail: /blog/assets/duckdb-nsql-7b/thumbnail.png authors: - user: asoria @@ -11,7 +11,7 @@ authors: guest: true --- -# Text2SQL using Hugging Face Datasets Server API and Motherduck DuckDB-NSQL-7B +# Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B Today, integrating AI-powered features, particularly leveraging Large Language Models (LLMs), has become increasingly prevalent across various tasks such as text generation, classification, image-to-text, image-to-image transformations, etc. @@ -27,7 +27,7 @@ In recent months, significant strides have been made in this arena. [MotherDuck] Initially fine-tuned from Meta’s original [Llama-2–7b](https://huggingface.co/meta-llama/Llama-2-7b) model using a broad dataset covering general SQL queries, DuckDB-NSQL-7B underwent further refinement with DuckDB text-to-SQL pairs. Notably, its capabilities extend beyond crafting `SELECT` statements; it can generate a wide range of valid DuckDB SQL statements, including official documentation and extensions, making it a versatile tool for data exploration and analysis. -In this article, we will learn how to deal with text2sql tasks using the DuckDB-NSQL-7B model, Hugging Face datasets server API for parquet files and duckdb for data retrieval. +In this article, we will learn how to deal with text2sql tasks using the DuckDB-NSQL-7B model, the Hugging Face dataset viewer API for parquet files and duckdb for data retrieval.

*Figure: text2sql flow*
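In code, the flow sketched above boils down to a handful of calls: ask the dataset viewer API for the Parquet files behind a Hub dataset, point DuckDB at one of them, and hand the schema plus a natural-language question to DuckDB-NSQL-7B. The snippet below is only a rough sketch of that loop: the dataset id, the GGUF filename, the prompt format, and the exact shape of the `/parquet` response are all assumptions, and the rest of the post walks through each step properly.

```py
import duckdb
import requests
from llama_cpp import Llama

# 1. Ask the dataset viewer API which Parquet files back the dataset.
#    The dataset id is a placeholder, and the response is assumed to expose
#    a "parquet_files" list with "url" fields.
resp = requests.get(
    "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<dataset>"
).json()
parquet_url = resp["parquet_files"][0]["url"]

# 2. Let DuckDB read the remote Parquet file in place and describe its schema.
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute(f"CREATE VIEW data AS SELECT * FROM '{parquet_url}';")
schema = con.sql("DESCRIBE data").df().to_string()

# 3. Ask DuckDB-NSQL-7B for a query. The GGUF filename and the prompt format
#    are assumptions; check the model card for the real ones.
llama = Llama(model_path="DuckDB-NSQL-7B-v0.1-q8_0.gguf", n_ctx=2048)
prompt = f"Schema of the table `data`:\n{schema}\n\nQuestion: How many rows are there?\nSQL:"
sql = llama(prompt, max_tokens=256, temperature=0.1)["choices"][0]["text"]

# 4. Execute the generated SQL against the same view.
print(con.sql(sql).df())
```

Nothing here requires downloading or importing the dataset first: the dataset viewer API hands out Parquet URLs and DuckDB queries them in place.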
@@ -66,7 +66,7 @@ llama = Llama(
 
 The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud. We will use this approach.
 
-### Hugging Face Datasets Server API for more than 120K datasets
+### Hugging Face Dataset Viewer API for more than 120K datasets
 
 Data is a crucial component in any Machine Learning endeavor. Hugging Face is a valuable resource, offering access to over 120,000 free and open datasets spanning various formats, including CSV, Parquet, JSON, audio, and image files.
 
@@ -79,7 +79,7 @@ For this demo, we will be using the [world-cities-geo](https://huggingface.co/da
 
*Figure: Dataset viewer of world-cities-geo dataset*
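As a taste of what sits behind that viewer, a couple of plain HTTP calls are enough to list a dataset's splits and page through its rows. The sketch below is illustrative only: the dataset id is a placeholder for the dataset linked above, and the `default` config name and the exact response fields are assumptions.

```py
import requests

API = "https://datasets-server.huggingface.co"
dataset = "<namespace>/world-cities-geo"  # placeholder for the dataset linked above

# List the dataset's configs and splits.
splits = requests.get(f"{API}/splits", params={"dataset": dataset}).json()
print(splits["splits"])

# View rows at an arbitrary offset (the "default" config name is an assumption).
rows = requests.get(
    f"{API}/rows",
    params={
        "dataset": dataset,
        "config": "default",
        "split": "train",
        "offset": 0,
        "length": 10,
    },
).json()
print(rows["rows"][0])
```

These are the same endpoints that power the viewer UI, so anything you can browse on a dataset page can also be fetched from a script.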

-Behind the scenes, each dataset in the Hub is processed by the [Hugging Face datasets server API](https://huggingface.co/docs/datasets-server/index), which gets useful information and serves functionalities like: +Behind the scenes, each dataset in the Hub is processed by the [Hugging Face dataset viewer API](https://huggingface.co/docs/datasets-server/index), which gets useful information and serves functionalities like: - List the dataset **splits, column names and data types** - Get the dataset **size** (in number of rows or bytes) - Download and view **rows at any index** in the dataset diff --git a/hub-duckdb.md b/hub-duckdb.md index c8e8306359..392417bd94 100644 --- a/hub-duckdb.md +++ b/hub-duckdb.md @@ -21,7 +21,7 @@ We are happy to share that we recently added another feature to help you analyze ## TLDR -[Datasets Server](https://huggingface.co/docs/datasets-server/index) **automatically converts all public datasets on the Hub to Parquet files**, that you can see by clicking on the "Auto-converted to Parquet" button at the top of a dataset page. You can also access the list of the Parquet files URLs with a simple HTTP call. +The [dataset viewer](https://huggingface.co/docs/datasets-server/index) **automatically converts all public datasets on the Hub to Parquet files**, that you can see by clicking on the "Auto-converted to Parquet" button at the top of a dataset page. You can also access the list of the Parquet files URLs with a simple HTTP call. ```py r = requests.get("https://datasets-server.huggingface.co/parquet?dataset=blog_authorship_corpus") @@ -61,11 +61,11 @@ To learn more, check out the [documentation](https://huggingface.co/docs/dataset ## From dataset to Parquet -[Parquet](https://parquet.apache.org/docs/) files are columnar, making them more efficient to store, load and analyze. This is especially important when you're working with large datasets, which we’re seeing more and more of in the LLM era. To support this, Datasets Server automatically converts and publishes any public dataset on the Hub as Parquet files. The URL to the Parquet files can be retrieved with the [`/parquet`](https://huggingface.co/docs/datasets-server/quick_start#access-parquet-files) endpoint. +[Parquet](https://parquet.apache.org/docs/) files are columnar, making them more efficient to store, load and analyze. This is especially important when you're working with large datasets, which we’re seeing more and more of in the LLM era. To support this, the dataset viewer automatically converts and publishes any public dataset on the Hub as Parquet files. The URL to the Parquet files can be retrieved with the [`/parquet`](https://huggingface.co/docs/datasets-server/quick_start#access-parquet-files) endpoint. ## Analyze with DuckDB -DuckDB offers super impressive performance for running complex analytical queries. It is able to execute a SQL query directly on a remote Parquet file without any overhead. With the [`httpfs`](https://duckdb.org/docs/extensions/httpfs) extension, DuckDB is able to query remote files such as datasets stored on the Hub using the URL provided from the `/parquet` endpoint. DuckDB also supports querying multiple Parquet files which is really convenient because Datasets Server shards big datasets into smaller 500MB chunks. +DuckDB offers super impressive performance for running complex analytical queries. It is able to execute a SQL query directly on a remote Parquet file without any overhead. 
With the [`httpfs`](https://duckdb.org/docs/extensions/httpfs) extension, DuckDB is able to query remote files such as datasets stored on the Hub using the URL provided from the `/parquet` endpoint. DuckDB also supports querying multiple Parquet files which is really convenient because the dataset viewer shards big datasets into smaller 500MB chunks. ## Looking forward diff --git a/huggy-lingo.md b/huggy-lingo.md index 4ae9c00191..4f2a8857a0 100644 --- a/huggy-lingo.md +++ b/huggy-lingo.md @@ -80,9 +80,9 @@ dataset = load_dataset("biglam/on_the_books") However, for some of the datasets on the Hub, we might be keen not to download the whole dataset. We could instead try to load a sample of the dataset. However, depending on how the dataset was created, we might still end up downloading more data than we’d need onto the machine we’re working on. -Luckily, many datasets on the Hub are available via the [datasets server](https://huggingface.co/docs/datasets-server/index). The datasets server is an API that allows us to access datasets hosted on the Hub without downloading the dataset locally. The Datasets Server powers the Datasets Viewer preview you will see for many datasets hosted on the Hub. +Luckily, many datasets on the Hub are available via the [dataset viewer API](https://huggingface.co/docs/datasets-server/index). It allows us to access datasets hosted on the Hub without downloading the dataset locally. The API powers the dataset viewer you will see for many datasets hosted on the Hub. -For this first experiment with predicting language for datasets, we define a list of column names and data types likely to contain textual content i.e. `text` or `prompt` column names and `string` features are likely to be relevant `image` is not. This means we can avoid predicting the language for datasets where language information is less relevant, for example, image classification datasets. We use the Datasets Server to get 20 rows of text data to pass to a machine learning model (we could modify this to take more or fewer examples from the dataset). +For this first experiment with predicting language for datasets, we define a list of column names and data types likely to contain textual content i.e. `text` or `prompt` column names and `string` features are likely to be relevant `image` is not. This means we can avoid predicting the language for datasets where language information is less relevant, for example, image classification datasets. We use the dataset viewer API to get 20 rows of text data to pass to a machine learning model (we could modify this to take more or fewer examples from the dataset). This approach means that for the majority of datasets on the Hub we can quickly request the contents of likely text columns for the first 20 rows in a dataset. @@ -122,7 +122,7 @@ This system not only updates the datasets with language information, but also do As the number of datasets on the Hub grows, metadata becomes increasingly important. Language metadata, in particular, can be incredibly valuable for identifying the correct dataset for your use case. -With the assistance of the Datasets Server and the [Librarian-Bots](https://huggingface.co/librarian-bots), we can update our dataset metadata at a scale that wouldn't be possible manually. As a result, we're enriching the Hub and making it an even more powerful tool for data scientists, linguists, and AI enthusiasts around the world. 
+With the assistance of the dataset viewer API and the [Librarian-Bots](https://huggingface.co/librarian-bots), we can update our dataset metadata at a scale that wouldn't be possible manually. As a result, we're enriching the Hub and making it an even more powerful tool for data scientists, linguists, and AI enthusiasts around the world. As the machine learning librarian at Hugging Face, I continue exploring opportunities for automatic metadata enrichment for machine learning artefacts hosted on the Hub. Feel free to reach out (daniel at thiswebsite dot co) if you have ideas or want to collaborate on this effort! diff --git a/huggylingo.md b/huggylingo.md index 25001f4a54..b193e4c5c6 100644 --- a/huggylingo.md +++ b/huggylingo.md @@ -75,9 +75,9 @@ dataset = load_dataset("biglam/on_the_books") However, for some of the datasets on the Hub, we might be keen not to download the whole dataset. We could instead try to load a sample of the dataset. However, depending on how the dataset was created, we might still end up downloading more data than we’d need onto the machine we’re working on. -Luckily, many datasets on the Hub are available via the [datasets server](https://huggingface.co/docs/datasets-server/index). The datasets server is an API that allows us to access datasets hosted on the Hub without downloading the dataset locally. The Datasets Server powers the Datasets Viewer preview you will see for many datasets hosted on the Hub. +Luckily, many datasets on the Hub are available via the [dataset viewer API](https://huggingface.co/docs/datasets-server/index). It allows us to access datasets hosted on the Hub without downloading the dataset locally. The API powers the dataset viewer you will see for many datasets hosted on the Hub. -For this first experiment with predicting language for datasets, we define a list of column names and data types likely to contain textual content i.e. `text` or `prompt` column names and `string` features are likely to be relevant `image` is not. This means we can avoid predicting the language for datasets where language information is less relevant, for example, image classification datasets. We use the Datasets Server to get 20 rows of text data to pass to a machine learning model (we could modify this to take more or fewer examples from the dataset). +For this first experiment with predicting language for datasets, we define a list of column names and data types likely to contain textual content i.e. `text` or `prompt` column names and `string` features are likely to be relevant `image` is not. This means we can avoid predicting the language for datasets where language information is less relevant, for example, image classification datasets. We use the dataset viewer API to get 20 rows of text data to pass to a machine learning model (we could modify this to take more or fewer examples from the dataset). This approach means that for the majority of datasets on the Hub we can quickly request the contents of likely text columns for the first 20 rows in a dataset. @@ -112,7 +112,7 @@ This automated system not only updates the datasets with language information, b As the number of datasets on the Hub grows, metadata becomes increasingly important. Language metadata, in particular, can be incredibly valuable for identifying the correct dataset for your use case. -With the assistance of the Datasets Server and the [Librarian-Bots](https://huggingface.co/librarian-bots), we can update our dataset metadata at a scale that wouldn't be possible manually. 
As a result, we're enriching the Hub and making it an even more powerful tool for data scientists, linguists, and AI enthusiasts around the world. +With the assistance of the dataset viewer API and the [Librarian-Bots](https://huggingface.co/librarian-bots), we can update our dataset metadata at a scale that wouldn't be possible manually. As a result, we're enriching the Hub and making it an even more powerful tool for data scientists, linguists, and AI enthusiasts around the world. As the machine learning librarian at Hugging Face, I continue exploring opportunities for automatic metadata enrichment for machine learning artefacts hosted on the Hub. Feel free to reach out (daniel at thiswebsite dot co) if you have ideas or want to collaborate on this effort! diff --git a/zh/huggy-lingo.md b/zh/huggy-lingo.md index 3a4dec8c4d..3e84fab594 100644 --- a/zh/huggy-lingo.md +++ b/zh/huggy-lingo.md @@ -79,9 +79,9 @@ dataset = load_dataset("biglam/on_the_books") 对于 Hub 上的某些数据集,我们可能不希望下载整个数据集。我们可以尝试加载数据集的部分样本。然而,根据数据集的创建方式不同,对某些数据集,我们最终下载到机器上的数据可能仍比我们实际需要的多。 -幸运的是,Hub 上的许多数据集都可以通过 [datasets server](https://huggingface.co/docs/datasets-server/index) 获得。Datasets server 是一个 API,其允许我们无需下载到本地即可访问 Hub 上托管的数据集。Datasets server 已被应用于数据集查看器预览功能,Hub 上托管的许多数据集都支持数据集查看器预览功能。 +幸运的是,Hub 上的许多数据集都可以通过 [dataset viewer](https://huggingface.co/docs/datasets-server/index) 获得。Dataset viewer 是一个 API,其允许我们无需下载到本地即可访问 Hub 上托管的数据集。Dataset viewer 已被应用于数据集查看器预览功能,Hub 上托管的许多数据集都支持数据集查看器预览功能。 -为了给语言检测实验准备数据,我们首先定义了一个白名单,其中包含了可能包含文本的列名及数据类型,如名字为 `text` 或 `prompt` 的列以及数据类型为 `string` 的特征可能包含文本,但名字为 `image` 的列大概率是不相关的。这意味着我们可以避免为不相关的数据集预测其语言,例如为图像分类数据集预测语言。我们用 datasets server 获取 20 行文本数据并传给机器学习模型 (具体用多少行数据可以根据实际情况修改)。 +为了给语言检测实验准备数据,我们首先定义了一个白名单,其中包含了可能包含文本的列名及数据类型,如名字为 `text` 或 `prompt` 的列以及数据类型为 `string` 的特征可能包含文本,但名字为 `image` 的列大概率是不相关的。这意味着我们可以避免为不相关的数据集预测其语言,例如为图像分类数据集预测语言。我们用 dataset viewer 获取 20 行文本数据并传给机器学习模型 (具体用多少行数据可以根据实际情况修改)。 这么做的话,我们可以对 Hub 上的大多数数据集,快速获取它们前 20 行数据的文本内容。 @@ -120,4 +120,4 @@ dataset = load_dataset("biglam/on_the_books") 随着 Hub 上的数据集越来越多,元数据变得越来越重要,而其中语言元数据可以帮助用户甄别出合适自己场景的数据集。 -在 Datasets server 和 [Librarian-Bots](https://huggingface.co/librarian-bots) 的帮助下,我们可以大规模地自动更新数据集元数据,这是手动更新无法企及的。我们正在用这种方法不断丰富 Hub,进而使 Hub 成为服务世界各地的数据科学家、语言学家和人工智能爱好者的强大工具。 \ No newline at end of file +在 Dataset viewer 和 [Librarian-Bots](https://huggingface.co/librarian-bots) 的帮助下,我们可以大规模地自动更新数据集元数据,这是手动更新无法企及的。我们正在用这种方法不断丰富 Hub,进而使 Hub 成为服务世界各地的数据科学家、语言学家和人工智能爱好者的强大工具。 \ No newline at end of file