Format conversion tools for post tuning datasets (#514)
* + add sharegpt <--> dj format conversion tools
* - move multimodal into fmt_conversion
* + add basic docs for format conversion tools and post tuning dialog format conversion tools
* * rename tools
* + add messages <--> dj conversion tools
* + add messages <--> dj conversion tools
* - reorganize the directory
* * rename functions
* + add conversion tools for ModelScope-Swift ShareGPT format
* + add conversion tools for Alpaca format
* * fix typos in doc strings
* Update post_tuning_dialog/README.md
* Update pos_tuning_dialog/README_ZH.md align with en version
* clearly point out the DJ format
* clearly point out the DJ format in zh
* minor typo fix

---------

Co-authored-by: Daoyuan Chen <[email protected]>
Showing 32 changed files with 1,490 additions and 18 deletions.
@@ -0,0 +1,54 @@
# Format Conversion Tools

Data-Juicer provides more than ten format conversion tools for diverse datasets, including multimodal datasets and post-tuning datasets.
These tools convert a dataset from its original format into a unified intermediate format used in Data-Juicer, which we call the "DJ format".
An overview of the DJ format is shown below:

```python
{
    # >>> core contents: texts, dialogs, ...
    "text": "xxx",
    "query": "xxx",
    "response": "xxx",
    ......
    # <<< core contents

    # >>> extra data contents: multimodal data paths, ...
    "images": [
        "path/to/the/image/of/antarctica_snowfield",
        "path/to/the/image/of/antarctica_map",
        "path/to/the/image/of/europe_map"
    ],
    "audios": [
        "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
    ],
    "videos": [
        "path/to/the/video/of/remote_sensing_view_of_antarctica"
    ],
    # <<< extra data contents

    # >>> meta infos and stats, which could be primitive or produced by Data-Juicer
    "meta": {
        "src": "customized",
        "version": "0.1",
        "author": "xxx"
    },
    "stats": {
        "lang": "en",
        "image_widths": [224, 336, 512],
        ...
    },
    # <<< meta infos and stats
}
```

The DJ format consists of three parts:
1. Core contents: such as the texts in LLM pretraining datasets or the dialogs in post-tuning datasets. They are directly related to the training or fine-tuning procedures in the downstream usage of the dataset.
2. Extra data contents: such as the paths to the multimodal data in multimodal datasets. They are organized as path lists.
3. Meta infos & Stats: such as version or source information inherited from the original dataset, or category tags and stats produced by Data-Juicer OPs.

The second and third parts are commonly used and are organized in nearly the same structure across diverse datasets.
In contrast, the first part, the core contents, can differ considerably between kinds of datasets.
The following documents describe this part in more detail for each kind of dataset:
- [Multimodal datasets](multimodal/README.md)
- [Post Tuning](post_tuning_dialog/README.md)
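
To make the three-part layout concrete, here is a minimal sketch that groups the top-level keys of a DJ-format sample into core contents, extra data contents, and meta infos and stats. The function name `split_dj_sample` and the two key sets are hypothetical and chosen only for illustration; they are not part of Data-Juicer.

```python
# Assumed key sets for illustration only; a real dataset may use other keys.
EXTRA_DATA_KEYS = {"images", "audios", "videos"}
META_STATS_KEYS = {"meta", "stats"}


def split_dj_sample(sample: dict) -> dict:
    """Group the top-level keys of a DJ-format sample into its three parts."""
    parts = {"core": {}, "extra": {}, "meta_stats": {}}
    for key, value in sample.items():
        if key in EXTRA_DATA_KEYS:
            parts["extra"][key] = value
        elif key in META_STATS_KEYS:
            parts["meta_stats"][key] = value
        else:
            # Anything else is treated as core content (text, query, response, ...).
            parts["core"][key] = value
    return parts


sample = {
    "text": "xxx",
    "images": ["path/to/the/image/of/antarctica_map"],
    "meta": {"src": "customized"},
    "stats": {"lang": "en"},
}
parts = split_dj_sample(sample)
print(sorted(parts["core"]))        # ['text']
print(sorted(parts["extra"]))       # ['images']
print(sorted(parts["meta_stats"]))  # ['meta', 'stats']
```

Only the core part changes shape from one kind of dataset to another, which is why the catch-all branch collects it.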
@@ -0,0 +1,54 @@
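# Format Conversion Tools

Data-Juicer provides more than ten format conversion tools for all kinds of datasets, including multimodal datasets, post-tuning datasets, and so on.
These tools help convert datasets from their original formats into a unified intermediate format used by Data-Juicer, which we call the "DJ format".
An example of the DJ format is shown below:
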
```python
{
    # >>> core contents: texts, dialogs, ...
    "text": "xxx",
    "query": "xxx",
    "response": "xxx",
    ......
    # <<< core contents

    # >>> extra data contents: multimodal data paths, ...
    "images": [
        "path/to/the/image/of/antarctica_snowfield",
        "path/to/the/image/of/antarctica_map",
        "path/to/the/image/of/europe_map"
    ],
    "audios": [
        "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
    ],
    "videos": [
        "path/to/the/video/of/remote_sensing_view_of_antarctica"
    ],
    # <<< extra data contents

    # >>> meta info and stats, which may be native to the dataset or produced by Data-Juicer
    "meta": {
        "src": "customized",
        "version": "0.1",
        "author": "xxx"
    },
    "stats": {
        "lang": "en",
        "image_widths": [224, 336, 512],
        ...
    },
    # <<< meta info and stats
}
```
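
The DJ format consists of roughly three parts:
1. Core contents: for example, the texts in LLM pretraining datasets or the dialogs in post-tuning datasets. They are directly related to the training or fine-tuning process in the downstream use of the dataset.
2. Extra data contents: for example, the paths to multimodal data in multimodal datasets. They are organized as path lists.
3. Meta info and Stats: for example, dataset version or source information inherited from the original dataset, or category tags and stats produced by Data-Juicer operators.

Of these, the second and third parts are common across datasets and are organized in nearly the same structure.
In contrast, the first part, the core contents, can differ greatly between kinds of datasets.
The following documents describe this part in more detail for each kind of dataset:
- [Multimodal datasets](multimodal/README_ZH.md)
- [Post-tuning datasets](post_tuning_dialog/README_ZH.md)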
File renamed without changes.
@@ -0,0 +1,96 @@
# Post Tuning Tools

For post-tuning datasets, we mainly consider four formats, chosen to support [ModelScope-Swift](https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Customization/Custom-dataset.md) and [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md).

- Swift's Messages format (very similar to LLaMA-Factory's ShareGPT format, but with different key names):

```python
{
    "messages": [
        {
            "role": "system",
            "content": "<system>"
        },
        {
            "role": "user",
            "content": "<query1>"
        },
        {
            "role": "assistant",
            "content": "<response1>"
        },
        {
            "role": "user",
            "content": "<query2>"
        },
        {
            "role": "assistant",
            "content": "<response2>"
        }
    ]
}
```

- Swift's ShareGPT format:

```python
{
    "system": "<system>",
    "conversation": [
        {
            "human": "<query1>",
            "assistant": "<response1>"
        },
        {
            "human": "<query2>",
            "assistant": "<response2>"
        }
    ]
}
```
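
The two formats above carry the same information under different keys, so one can be rewritten as the other. The following is a minimal sketch of that mapping, assuming every `conversation` entry holds exactly one `human` and one `assistant` string; the function name `sharegpt_to_messages` is hypothetical and is not the tool shipped in this commit.

```python
def sharegpt_to_messages(sample: dict) -> dict:
    """Rewrite a Swift ShareGPT-style sample in the Messages layout."""
    messages = []
    # The system prompt is optional; emit it first when present and non-empty.
    if sample.get("system"):
        messages.append({"role": "system", "content": sample["system"]})
    # Each conversation entry is one full turn: a human query and its reply.
    for turn in sample.get("conversation", []):
        messages.append({"role": "user", "content": turn["human"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    return {"messages": messages}


sample = {
    "system": "<system>",
    "conversation": [
        {"human": "<query1>", "assistant": "<response1>"},
        {"human": "<query2>", "assistant": "<response2>"},
    ],
}
converted = sharegpt_to_messages(sample)
print(converted["messages"][0])  # {'role': 'system', 'content': '<system>'}
print(len(converted["messages"]))  # 5
```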

- Alpaca format (defined identically in Swift and LLaMA-Factory):

```python
{
    "system": "<system>",
    "instruction": "<query-inst>",
    "input": "<query-input>",
    "output": "<response>"
}
```

- Swift's Query-Response format:

```python
{
    "system": "<system>",
    "query": "<query2>",
    "response": "<response2>",
    "history": [
        [
            "<query1>",
            "<response1>"
        ]
    ]
}
```
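
In the Query-Response format, the last turn lives in `query` and `response`, and all earlier turns move into `history`. As a rough sketch of how a Messages-style sample could be folded into this shape, consider the function below. The name `messages_to_query_response` and its error handling are assumptions for illustration, not the converter provided by Data-Juicer.

```python
def messages_to_query_response(sample: dict) -> dict:
    """Fold a Messages-style sample into the Query-Response layout."""
    system = ""
    turns = []
    pending_query = None
    for message in sample["messages"]:
        role, content = message["role"], message["content"]
        if role == "system":
            system = content
        elif role == "user":
            pending_query = content
        elif role == "assistant":
            if pending_query is None:
                raise ValueError("assistant message without a preceding user message")
            turns.append([pending_query, content])
            pending_query = None
        else:
            raise ValueError(f"unexpected role: {role!r}")
    if not turns:
        raise ValueError("sample contains no complete user/assistant turn")
    # The final turn becomes query/response; earlier turns stay in history.
    # Assumption: a trailing user message with no reply is silently dropped.
    query, response = turns.pop()
    return {"system": system, "query": query, "response": response, "history": turns}


sample = {
    "messages": [
        {"role": "system", "content": "<system>"},
        {"role": "user", "content": "<query1>"},
        {"role": "assistant", "content": "<response1>"},
        {"role": "user", "content": "<query2>"},
        {"role": "assistant", "content": "<response2>"},
    ]
}
print(messages_to_query_response(sample)["history"])  # [['<query1>', '<response1>']]
```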

Data-Juicer presets fields that align with the last two formats (Alpaca and Query-Response); together they serve as our intermediate format for post-tuning dialog datasets. Accordingly, we provide several tools that convert datasets in other formats to the DJ format below, and vice versa.

- DJ default format for post-tuning OPs:

```python
{
    "system": "<system>",
    "instruction": "<query-inst>",
    "query": "<query2>",
    "response": "<response2>",
    "history": [
        [
            "<query1>",
            "<response1>"
        ]
    ]
}
```
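
As an illustration of how an Alpaca-style sample might land in the DJ default format, here is a minimal sketch. It assumes that Alpaca's `input` corresponds to the DJ `query` field and `output` to `response`; that field mapping and the function name `alpaca_to_dj` are our assumptions for this example, so check the actual conversion tools for the exact behavior.

```python
def alpaca_to_dj(sample: dict) -> dict:
    """Map an Alpaca-style sample onto the DJ default post-tuning fields."""
    return {
        "system": sample.get("system", ""),
        "instruction": sample.get("instruction", ""),
        # Assumption: Alpaca's "input" plays the role of the DJ "query" field.
        "query": sample.get("input", ""),
        # Assumption: Alpaca's "output" plays the role of the DJ "response" field.
        "response": sample.get("output", ""),
        # The Alpaca sample shown above is single-turn, so history starts empty.
        "history": [],
    }


alpaca_sample = {
    "system": "<system>",
    "instruction": "<query-inst>",
    "input": "<query-input>",
    "output": "<response>",
}
print(alpaca_to_dj(alpaca_sample)["query"])  # <query-input>
```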