release dj v0.2.0 (dj_video) (#227)
* release dj v0.2.0 (dj_video)
* authored by data-juicer team
yxdyc authored Mar 7, 2024
1 parent 475c52b commit 2720113
Showing 172 changed files with 11,515 additions and 1,040 deletions.
152 changes: 89 additions & 63 deletions README.md

Large diffs are not rendered by default.

144 changes: 82 additions & 62 deletions README_ZH.md

Large diffs are not rendered by default.

139 changes: 116 additions & 23 deletions configs/config_all.yaml

Large diffs are not rendered by default.

16 changes: 15 additions & 1 deletion configs/data_juicer_recipes/README.md
@@ -4,7 +4,7 @@ We found that there are still some "bad" samples in existing processed datasets

We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe.
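
As a rough illustration of the 3-σ rule mentioned above, the sketch below derives filter bounds from per-sample statistics as mean ± 3 standard deviations. The function name `three_sigma_bounds` and the toy values are hypothetical; how the recipes actually compute their statistics and thresholds is not shown in this diff.

```python
import statistics


def three_sigma_bounds(values, k=3.0):
    """Return (lower, upper) = mean -/+ k * population std of `values`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - k * std, mean + k * std


# Toy per-sample statistic, e.g. a special-character ratio (made-up numbers).
ratios = [0.20, 0.22, 0.25, 0.27, 0.30, 0.31, 0.33, 0.35, 0.28, 0.29]

low, high = three_sigma_bounds(ratios)
# Keep only samples whose statistic falls inside the derived range.
kept = [r for r in ratios if low <= r <= high]
print(f"min_ratio={low:.4f} max_ratio={high:.4f} kept={len(kept)}/{len(ratios)}")
```

Bounds obtained this way could then be written into a recipe as `min_ratio` / `max_ratio` style hyperparameters.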

-## Before and after refining for Pretraining Dataset
+## Before and after refining for Pretraining Text Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|----------------------|:---------------------------:|:--------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
@@ -35,3 +35,17 @@ We use simple 3-σ rule to set the hyperparameters for ops in each recipe.
|------------------|:-------------------------:|:--------------------------------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |

## Before and after refining for Multimodal Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
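
The keep ratio in the row above is simply the number of samples after refining divided by the number before. A minimal check, with a hypothetical helper name `keep_ratio`:

```python
def keep_ratio(before: int, after: int) -> float:
    """Percentage of samples kept after refining."""
    return 100.0 * after / before


# Sample counts copied from the LLaVA pretrain (LCS-558k) row.
ratio = keep_ratio(558_128, 500_380)
print(f"{ratio:.2f}%")  # 89.65%
```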

### Evaluation Results
- LLaVA pretrain (LCS-558k): the model **pretrained with the refined dataset** and fine-tuned with the original instruction dataset outperforms the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.

| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
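
The "10 out of 12" figure can be re-derived from the two table rows. The snippet below is only a tally of the numbers shown above; the variable names are illustrative.

```python
benchmarks = ["VQAv2", "GQA", "VizWiz", "SQA", "TextVQA", "POPE", "MME",
              "MM-Bench", "MM-Bench-CN", "SEED", "LLaVA-Bench-Wild", "MM-Vet"]
# Scores copied from the table: baseline row, then refined-pretrain row.
baseline = [80.0, 63.3, 53.6, 71.6, 61.3, 85.9, 1531.3, 67.7, 63.6, 61.6, 72.5, 36.1]
refined = [79.94, 63.5, 54.09, 74.20, 60.82, 86.67, 1565.53, 68.2, 63.9, 61.8, 75.9, 37.4]

# A benchmark counts as a win when the refined model scores strictly higher.
wins = [n for n, b, r in zip(benchmarks, baseline, refined) if r > b]
losses = [n for n, b, r in zip(benchmarks, baseline, refined) if r <= b]
print(f"refined wins on {len(wins)} of {len(benchmarks)}; baseline ahead on {losses}")
```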
14 changes: 14 additions & 0 deletions configs/data_juicer_recipes/README_ZH.md
@@ -35,3 +35,17 @@
|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 Subsets of Alpaca-CoT](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 Subsets of Alpaca-CoT](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |

## Before and after refining for Multimodal Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): the model **pretrained with the refined dataset** and fine-tuned with the original instruction dataset outperforms the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.

| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
60 changes: 60 additions & 0 deletions configs/data_juicer_recipes/llava-pretrain-refine.yaml
@@ -0,0 +1,60 @@
project_name: 'llava-1.5-pretrain-dataset-refine-recipe'
dataset_path: 'blip_laion_cc_sbu_558k_dj_fmt_only_caption.jsonl' # the LLaVA pretrain dataset converted to Data-Juicer format with only_keep_caption set to True. See tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py
export_path: 'blip_laion_cc_sbu_558k_dj_fmt_only_caption_refined.jsonl'

np: 42 # number of subprocesses used to process your dataset
text_keys: 'text' # the key name of the field holding the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...

# for multimodal data processing
image_key: 'images' # Key name of the field that stores the list of sample image paths.
image_special_token: '<image>' # The special token that represents an image in the text. For LLaVA, it's "<image>". Should be aligned with the args used when running the conversion tools.
eoc_special_token: '<|__dj__eoc|>' # The special token that represents the end of a chunk in the text. By default, it's "<|__dj__eoc|>". You can specify your own special token according to your input dataset. Should be aligned with the args used when running the conversion tools.

open_tracer: true

# process schedule: a list of several process operators with their arguments
process:
  - fix_unicode_mapper: # fix unicode errors in text.
  - punctuation_normalization_mapper: # normalize unicode punctuations to English punctuations.

  # 558128
  # Filter ops
  - alphanumeric_filter: #558087 # filter text with alphabet/numeric ratio out of specific range.
      tokenization: false # Whether to count the ratio of alphanumeric to the total number of tokens.
      min_ratio: 0.60 # the min ratio of filter range
  - character_repetition_filter: #546105 # filter text with the character repetition ratio out of specific range
      rep_len: 10 # repetition length for char-level n-gram
      max_ratio: 0.09373663 # the max ratio of filter range
  - flagged_words_filter: #543960 # filter text with the flagged-word ratio larger than a specific max value
      lang: en # consider flagged words in what language
      tokenization: false # whether to use model to tokenize documents
      max_ratio: 0.0 # the max ratio to filter text
  - perplexity_filter: #532029 # filter text with perplexity score out of specific range
      lang: en # compute perplexity in what language
      max_ppl: 14435.5806 # the max perplexity score to filter text
  - special_characters_filter: #531968 # filter text with special-char ratio out of specific range
      min_ratio: 0.16534802 # the min ratio of filter range
      max_ratio: 0.42023757 # the max ratio of filter range
  - word_repetition_filter: # 530773 # filter text with the word repetition ratio out of specific range
      lang: en # sample in which language
      tokenization: false # whether to use model to tokenize documents
      rep_len: 10 # repetition length for word-level n-gram
      max_ratio: 0.03085751 # the max ratio of filter range

  - image_aspect_ratio_filter: #542389 # filter samples according to the aspect ratios of images (a fraction of width by height, r=w/h) in them
      min_ratio: 0.333 # the min aspect ratio of filter range
      max_ratio: 3.0 # the max aspect ratio of filter range
      any_or_all: any # keep this sample when any/all images meet the filter condition
  - image_shape_filter: #533966 # filter samples according to the widths and heights of images in them
      max_width: 727.8798422276 # the max width of width filter range
      max_height: 606.2421072264 # the max height of height filter range
      any_or_all: any # keep this sample when any/all images meet the filter condition
  - image_size_filter: # 533966 # filter samples according to the size of images (in bytes) within them
      max_size: "124KB" # the max size of filter range
      any_or_all: any # keep this sample when any/all images meet the filter condition
  - image_text_similarity_filter: #544202 # filter samples according to the similarity between text and images.
      hf_clip: openai/clip-vit-base-patch32 # name of used Hugging Face clip
      min_score: 0.20315419 # the min similarity of filter range
  - image_text_matching_filter: # filter samples according to the matching score between image and text.
      hf_blip: Salesforce/blip-itm-base-coco # name of used Hugging Face blip
      min_score: 0.44930778 # the min matching score of filter range
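
To make the `any_or_all` semantics of the image filters concrete, here is a minimal sketch of an aspect-ratio check over a sample's images. It is not Data-Juicer's implementation: the function name `keep_by_aspect_ratio`, the `(width, height)` input format, and the choice to keep samples that contain no images are all assumptions made for illustration. The default bounds mirror the `image_aspect_ratio_filter` settings in the recipe.

```python
def keep_by_aspect_ratio(image_sizes, min_ratio=0.333, max_ratio=3.0,
                         any_or_all="any"):
    """Decide whether to keep a sample given its images' (width, height) pairs.

    The aspect ratio is r = w / h, as described in the recipe comment.
    """
    if any_or_all not in ("any", "all"):
        raise ValueError("any_or_all must be 'any' or 'all'")
    in_range = [min_ratio <= w / h <= max_ratio for w, h in image_sizes]
    if not in_range:
        # Assumption: a sample without images is not rejected by this check.
        return True
    return any(in_range) if any_or_all == "any" else all(in_range)


normal, panorama = (640, 480), (4000, 500)  # ratios of about 1.33 and 8.0
print(keep_by_aspect_ratio([normal]))                              # True
print(keep_by_aspect_ratio([panorama]))                            # False
print(keep_by_aspect_ratio([normal, panorama], any_or_all="any"))  # True
print(keep_by_aspect_ratio([normal, panorama], any_or_all="all"))  # False
```

With `any`, one acceptable image is enough to keep the sample; with `all`, a single out-of-range image drops it.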
12 changes: 12 additions & 0 deletions data_juicer/config/config.py
@@ -149,6 +149,18 @@ def init_configs(args=None):
        help='The special token that represents an audio in the text. In '
        'default, it\'s "<__dj__audio>". You can specify your own special'
        ' token according to your input dataset.')
    parser.add_argument(
        '--video_key',
        type=str,
        default='videos',
        help='Key name of field to store the list of sample video paths.')
    parser.add_argument(
        '--video_special_token',
        type=str,
        default=SpecialTokens.video,
        help='The special token that represents a video in the text. In '
        'default, it\'s "<__dj__video>". You can specify your own special'
        ' token according to your input dataset.')
    parser.add_argument(
        '--eoc_special_token',
        type=str,
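
The two new options can be tried in isolation with a small stand-alone parser. This sketch uses the standard-library `argparse` rather than the project's own parser setup, which is not visible in this hunk; `VIDEO_SPECIAL_TOKEN` is a hypothetical stand-in for `SpecialTokens.video`, with its value taken from the help text above.

```python
import argparse

# Stand-in for SpecialTokens.video; value taken from the help text in the diff.
VIDEO_SPECIAL_TOKEN = "<__dj__video>"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--video_key',
        type=str,
        default='videos',
        help='Key name of field to store the list of sample video paths.')
    parser.add_argument(
        '--video_special_token',
        type=str,
        default=VIDEO_SPECIAL_TOKEN,
        help='The special token that represents a video in the text.')
    return parser


# With no arguments, the defaults from the diff apply.
defaults = build_parser().parse_args([])
# Both options can be overridden to match a dataset's own conventions.
custom = build_parser().parse_args(
    ['--video_key', 'clips', '--video_special_token', '<video>'])
print(defaults.video_key, defaults.video_special_token)
print(custom.video_key, custom.video_special_token)
```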