Format conversion tools for post tuning datasets #514

Merged
merged 17 commits into from
Dec 26, 2024
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -55,7 +55,7 @@ In this new version, we support more features for **multimodal data (including v
- [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
- [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information.
- [2024-01-05] We release **Data-Juicer v0.1.3** now!
In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/fmt_conversion/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
- [2023-10-13] Our first data-centric LLM competition begins! Please
visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
2 changes: 1 addition & 1 deletion README_ZH.md
Expand Up @@ -47,7 +47,7 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多
- [2024-02-05] 我们的论文被SIGMOD'24 industrial track接收!
- [2024-01-10] 开启“数据混合”新视界——第二届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532174),了解赛事详情。
- [2024-01-05] **Data-Juicer v0.1.3** 版本发布了。
在这个新版本中,我们支持了**更多Python版本**(3.8-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)!
在这个新版本中,我们支持了**更多Python版本**(3.8-3.10),同时支持了**多模态**数据集的[转换](tools/fmt_conversion/multimodal/README_ZH.md)[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)!
此外,我们的论文也更新到了[第三版](https://arxiv.org/abs/2309.02033)
- [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了!
请访问大赛官网,FT-Data Ranker([1B赛道](https://tianchi.aliyun.com/competition/entrance/532157)[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ,了解更多信息。
54 changes: 54 additions & 0 deletions tools/fmt_conversion/README.md
@@ -0,0 +1,54 @@
# Format Conversion Tools

Data-Juicer provides more than ten format conversion tools for diverse datasets, including multimodal datasets and post-tuning datasets.
These tools convert a dataset from its original format into a unified intermediate format used by Data-Juicer, which we call the "DJ format".
An overview of the DJ format is shown below:

```python
{
    # >>> core contents: texts, dialogs, ...
    "text": "xxx",
    "query": "xxx",
    "response": "xxx",
    # ...
    # <<< core contents

    # >>> extra data contents: multimodal data paths, ...
    "images": [
        "path/to/the/image/of/antarctica_snowfield",
        "path/to/the/image/of/antarctica_map",
        "path/to/the/image/of/europe_map"
    ],
    "audios": [
        "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
    ],
    "videos": [
        "path/to/the/video/of/remote_sensing_view_of_antarctica"
    ],
    # <<< extra data contents

    # >>> meta infos and stats, which could be primitive or produced by Data-Juicer
    "meta": {
        "src": "customized",
        "version": "0.1",
        "author": "xxx"
    },
    "stats": {
        "lang": "en",
        "image_widths": [224, 336, 512],
        # ...
    },
    # <<< meta infos and stats
}
```

The DJ format consists of roughly three parts:
1. Core contents: for example, texts in LLM pretraining datasets or dialogs in post-tuning datasets. They are directly related to the training or fine-tuning procedures in which the dataset is used downstream.
2. Extra data contents: for example, the paths to the multimodal data in multimodal datasets. They are organized as path lists.
3. Meta infos & Stats: for example, version or source information inherited from the original dataset, or category tags and stats produced by Data-Juicer OPs.

The second and third parts are commonly used and are organized in nearly the same structure across diverse datasets.
In contrast, the first part, the core contents, can differ considerably between kinds of datasets.
The following documents describe this part in more detail for each kind of dataset:
- [Multimodal datasets](multimodal/README.md)
- [Post Tuning](post_tuning_dialog/README.md)
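
As a rough illustration of the three parts described above, the sketch below groups the fields of a DJ-format sample into core contents, extra data contents, and meta infos/stats. The helper `split_dj_sample` and the two key sets are hypothetical names introduced here for illustration, not part of Data-Juicer; the sketch only assumes the field names shown in the overview.

```python
# Hypothetical helper for illustration only; not a Data-Juicer API.
# The key sets below are assumptions taken from the overview example.
EXTRA_DATA_KEYS = {"images", "audios", "videos"}
META_KEYS = {"meta", "stats"}


def split_dj_sample(sample):
    """Group the top-level fields of a DJ-format sample into three parts."""
    parts = {"core": {}, "extra": {}, "meta_and_stats": {}}
    for key, value in sample.items():
        if key in EXTRA_DATA_KEYS:
            parts["extra"][key] = value
        elif key in META_KEYS:
            parts["meta_and_stats"][key] = value
        else:
            # Everything else is treated as core content (text, query, ...).
            parts["core"][key] = value
    return parts


sample = {
    "text": "xxx",
    "images": ["path/to/the/image/of/antarctica_map"],
    "meta": {"src": "customized"},
    "stats": {"lang": "en"},
}
parts = split_dj_sample(sample)
print(parts["core"])  # {'text': 'xxx'}
```

Treating every unknown key as core content is a simplification; a real converter would likely validate the core fields per dataset type.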
54 changes: 54 additions & 0 deletions tools/fmt_conversion/README_ZH.md
@@ -0,0 +1,54 @@
# 格式转换工具

在这里,Data-Juicer 为各式各样的数据集提供了十数种格式转换工具,包括多模态数据集,后微调数据集等等。
这些工具帮助我们将原始格式的数据集转换为 Data-Juicer 使用的一种统一的、中间的格式表示,我们将其称为"DJ 格式"。
DJ 格式的一个示例如下所示:

```python
{
// >>> 核心内容:文本,对话,......
"text": "xxx",
"query": "xxx",
"response": "xxx",
......
// <<< 核心内容

// >>> 额外数据内容:多模态数据路径,......
"images": [
"path/to/the/image/of/antarctica_snowfield",
"path/to/the/image/of/antarctica_map",
"path/to/the/image/of/europe_map"
],
"audios": [
"path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
],
"videos": [
"path/to/the/video/of/remote_sensing_view_of_antarctica"
],
// <<< 额外数据内容

// >>> meta 信息和 stats,它们可能是数据集原生的,也可以由 Data-Juicer 产出
"meta": {
"src": "customized",
"version": "0.1",
"author": "xxx"
},
"stats": {
"lang": "en",
"image_widths": [224, 336, 512],
...
},
// <<< meta 信息和 stats
}
```

在 DJ 格式中大概包括三个部分:
1. 核心内容:例如 LLM 的预训练数据集中的文本内容,后微调数据集中的对话内容等。它们与数据集的下游使用的训练或者微调过程直接相关。
2. 额外数据内容:例如多模态数据集中的多模态数据路径。它们被组织为路径列表。
3. Meta 信息和 Stats:例如从原始数据集中继承而来的数据集版本或来源信息,或者由 Data-Juicer 的算子产出的类别 tags 和 stats 信息。

其中,第 2 和第 3 部分对于不同的数据集来说是通用的,而且都会被组织为几乎相同的结构。
作为对比,第 1 部分,也就是核心内容部分,对于各种数据集来说可能非常不同。
这里列举了针对不同种类数据集介绍这个部分更多细节的对应的文档:
- [多模态数据集](multimodal/README_ZH.md)
- [后微调数据集](post_tuning_dialog/README_ZH.md)
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ Both input and output of this utility conform to Data-Juicer's data format. If y
To learn more about the usage of the absolute to relative path conversion tool, you can execute the following command:

```shell
python tools/multimodal/absolute_path_to_relative_path.py --help
python tools/fmt_conversion/multimodal/absolute_path_to_relative_path.py --help
```

## Dataset Format Conversion
Expand Down Expand Up @@ -94,7 +94,7 @@ For all tools, you can run the following command to find out the usage of them:

```shell
# e.g. llava_to_dj.py
python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
python tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
```

Before using these tools, you might need to take a glance at the reference
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
可以运行以下命令来了解绝对路径转化相对路径工具的详细用法:

```shell
python tools/multimodal/absolute_path_to_relative_path.py --help
python tools/fmt_conversion/multimodal/absolute_path_to_relative_path.py --help
```

## 数据集格式转换
Expand Down Expand Up @@ -86,7 +86,7 @@ python tools/multimodal/absolute_path_to_relative_path.py --help

```shell
# 例如:llava_to_dj.py
python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
python tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
```
在使用这些工具之前,您可能需要查看上表中每个格式的参考资料,以更好地了解详细的格式信息,并理解每个工具的参数含义。

Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@
from tqdm import tqdm

from data_juicer.utils.mm_utils import SpecialTokens
from tools.multimodal.utils import remove_dj_special_tokens
from tools.fmt_conversion.multimodal.utils import remove_dj_special_tokens


def main(
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@
from tqdm import tqdm

from data_juicer.utils.mm_utils import SpecialTokens
from tools.multimodal.utils import remove_dj_special_tokens
from tools.fmt_conversion.multimodal.utils import remove_dj_special_tokens


def main(
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@
from tqdm import tqdm

from data_juicer.utils.mm_utils import SpecialTokens
from tools.multimodal.utils import remove_dj_special_tokens
from tools.fmt_conversion.multimodal.utils import remove_dj_special_tokens


def main(
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@
from tqdm import tqdm

from data_juicer.utils.mm_utils import SpecialTokens
from tools.multimodal.utils import remove_dj_special_tokens
from tools.fmt_conversion.multimodal.utils import remove_dj_special_tokens


def main(
Original file line number Diff line number Diff line change
Expand Up @@ -42,8 +42,8 @@
from data_juicer.utils.file_utils import add_suffix_to_filename
from data_juicer.utils.mm_utils import (SpecialTokens, cut_video_by_seconds,
timecode_string_to_seconds)
from tools.multimodal.utils import (check_args_load_to_dj_data,
convert_text_to_dj)
from tools.fmt_conversion.multimodal.utils import (check_args_load_to_dj_data,
convert_text_to_dj)


def main(
Original file line number Diff line number Diff line change
Expand Up @@ -43,8 +43,8 @@
from tqdm import tqdm

from data_juicer.utils.mm_utils import SpecialTokens
from tools.multimodal.utils import (check_args_load_to_dj_data,
convert_text_to_dj)
from tools.fmt_conversion.multimodal.utils import (check_args_load_to_dj_data,
convert_text_to_dj)


def main(
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,8 @@
from tqdm import tqdm

from data_juicer.utils.mm_utils import SpecialTokens
from tools.multimodal.utils import (check_args_load_to_dj_data,
convert_text_to_dj)
from tools.fmt_conversion.multimodal.utils import (check_args_load_to_dj_data,
convert_text_to_dj)


@logger.catch(reraise=True)
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,8 @@
from tqdm import tqdm

from data_juicer.utils.mm_utils import SpecialTokens
from tools.multimodal.utils import (check_args_load_to_dj_data,
convert_text_to_dj)
from tools.fmt_conversion.multimodal.utils import (check_args_load_to_dj_data,
convert_text_to_dj)


@logger.catch(reraise=True)
File renamed without changes.
96 changes: 96 additions & 0 deletions tools/fmt_conversion/post_tuning_dialog/README.md
@@ -0,0 +1,96 @@
# Post Tuning Tools

For post-tuning datasets, we mainly consider four formats, in order to support [ModelScope-Swift](https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Customization/Custom-dataset.md) and [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md).

- Swift's Messages format (very similar to LLaMA-Factory's ShareGPT format, but with different key names):

```python
{
    "messages": [
        {
            "role": "system",
            "content": "<system>"
        },
        {
            "role": "user",
            "content": "<query1>"
        },
        {
            "role": "assistant",
            "content": "<response1>"
        },
        {
            "role": "user",
            "content": "<query2>"
        },
        {
            "role": "assistant",
            "content": "<response2>"
        }
    ]
}
```

- Swift's ShareGPT format:

```python
{
    "system": "<system>",
    "conversation": [
        {
            "human": "<query1>",
            "assistant": "<response1>"
        },
        {
            "human": "<query2>",
            "assistant": "<response2>"
        }
    ]
}
```

- Alpaca format (defined identically in Swift and LLaMA-Factory):

```python
{
    "system": "<system>",
    "instruction": "<query-inst>",
    "input": "<query-input>",
    "output": "<response>"
}
```

- Swift's Query-Response format:

```python
{
    "system": "<system>",
    "query": "<query2>",
    "response": "<response2>",
    "history": [
        [
            "<query1>",
            "<response1>"
        ]
    ]
}
```
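
The ShareGPT and Query-Response layouts shown above carry the same information, so one can be derived from the other. The following is a minimal sketch, assuming (as in the examples above) that the last turn becomes `query`/`response` and all earlier turns go into `history`. The function name `sharegpt_to_query_response` is hypothetical; it is not a Swift or Data-Juicer API.

```python
# Hypothetical converter for illustration only; not a Swift or Data-Juicer API.
def sharegpt_to_query_response(sample):
    """Convert a Swift ShareGPT-style sample to the Query-Response layout."""
    # Each conversation item holds one human/assistant turn.
    turns = [[turn["human"], turn["assistant"]]
             for turn in sample.get("conversation", [])]
    if not turns:
        raise ValueError("conversation must contain at least one turn")
    # Assumption: the final turn is the current query/response pair,
    # and every earlier turn is kept as history.
    query, response = turns[-1]
    return {
        "system": sample.get("system", ""),
        "query": query,
        "response": response,
        "history": turns[:-1],
    }


sharegpt_sample = {
    "system": "<system>",
    "conversation": [
        {"human": "<query1>", "assistant": "<response1>"},
        {"human": "<query2>", "assistant": "<response2>"},
    ],
}
converted = sharegpt_to_query_response(sharegpt_sample)
print(converted["history"])  # [['<query1>', '<response1>']]
```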

In Data-Juicer, we pre-set fields that align with the last two formats (Alpaca and Query-Response), and these fields serve as our intermediate format for post-tuning dialog datasets. Correspondingly, we provide several tools to convert datasets in other formats to the following DJ format and vice versa.

- DJ default format for post-tuning OPs:

```python
{
    "system": "<system>",
    "instruction": "<query-inst>",
    "query": "<query2>",
    "response": "<response2>",
    "history": [
        [
            "<query1>",
            "<response1>"
        ]
    ]
}
```
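
To make the relationship between the Messages format and the DJ default format concrete, here is a minimal sketch of such a conversion. It is not the conversion tool shipped in this directory: the function name `messages_to_dj` is hypothetical, and leaving `instruction` empty is an assumption, since the Messages format has no separate instruction field.

```python
# Hypothetical converter for illustration only; not the tool provided here.
def messages_to_dj(sample):
    """Convert a Messages-style sample to the DJ default post-tuning layout."""
    system = ""
    turns = []
    pending_query = None
    for message in sample["messages"]:
        role, content = message["role"], message["content"]
        if role == "system":
            system = content
        elif role == "user":
            pending_query = content
        elif role == "assistant":
            if pending_query is None:
                raise ValueError("assistant message without a preceding user message")
            turns.append([pending_query, content])
            pending_query = None
        else:
            raise ValueError(f"unsupported role: {role!r}")
    if not turns:
        raise ValueError("no complete user/assistant turn found")
    # A trailing user message without a reply is ignored in this sketch.
    query, response = turns[-1]
    return {
        "system": system,
        "instruction": "",  # assumption: no instruction in Messages format
        "query": query,
        "response": response,
        "history": turns[:-1],
    }


messages_sample = {
    "messages": [
        {"role": "system", "content": "<system>"},
        {"role": "user", "content": "<query1>"},
        {"role": "assistant", "content": "<response1>"},
        {"role": "user", "content": "<query2>"},
        {"role": "assistant", "content": "<response2>"},
    ]
}
dj_sample = messages_to_dj(messages_sample)
print(dj_sample["query"])  # <query2>
```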