This folder contains configuration files that let users quickly refine Alpaca-CoT.
The raw data files can be downloaded from Alpaca-CoT on HuggingFace.
Use raw_alpaca_cot_merge_add_meta.py to select the `instruction`, `input`, and `output` columns, merge them into the `text` field separated by a space, and add extra META info to the dataset:
python tools/preprocess/raw_alpaca_cot_merge_add_meta.py \
--src_dir <Alpaca-CoT_src_dir> \
--target_dir <target_dir> \
--num_proc <num_proc>
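The merge step can be pictured with a minimal Python sketch. This is not the code of raw_alpaca_cot_merge_add_meta.py: the function name `merge_sample`, the `meta` key, and the choice to skip empty columns are assumptions made for illustration only.

```python
def merge_sample(sample, meta_info):
    """Join instruction/input/output with a space and attach meta info.

    Hypothetical helper, not part of the actual preprocessing script.
    """
    columns = ("instruction", "input", "output")
    # Assumption: missing or empty columns are skipped so no double spaces appear.
    parts = [str(sample[c]).strip() for c in columns if sample.get(c)]
    return {"text": " ".join(parts), "meta": dict(meta_info)}


sample = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
merged = merge_sample(sample, {"Dataset": "alpaca", "Lang": "EN"})
print(merged["text"])  # Translate to French. Good morning Bonjour
```

The real script may treat empty columns or separators differently, so check its source before relying on exact output.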
Use dataset_split_by_language.py to split the dataset into EN and ZH sub-datasets:
python tools/preprocess/dataset_split_by_language.py \
--src_dir <src_dir> \
--target_dir <target_dir> \
--suffixes jsonl \
--num_proc <num_proc>
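Conceptually, the split groups samples by detected language. The sketch below illustrates only that grouping. The detector `guess_language` is a hypothetical stand-in that checks for CJK ideographs; the actual script may use a trained language identification model, and its output layout is not shown here.

```python
from collections import defaultdict


def guess_language(text):
    """Hypothetical detector: 'ZH' if any CJK ideograph appears, else 'EN'."""
    return "ZH" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "EN"


def split_by_language(samples, detect=guess_language):
    """Group samples into per-language lists keyed by the detected tag."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[detect(sample["text"])].append(sample)
    return dict(buckets)


samples = [
    {"text": "What is the capital of France? Paris"},
    {"text": "法国的首都是哪里？巴黎"},
]
subsets = split_by_language(samples)
print({lang: len(items) for lang, items in subsets.items()})
```

Passing the detector as a parameter keeps the grouping logic independent of how the language is identified.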
After preprocessing, modify the dataset path in alpaca-cot-en-refine.yaml and alpaca-cot-zh-refine.yaml, then run the following commands to reproduce the processing flow of the refined Alpaca-CoT.
# refine English dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml
# refine Chinese dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml
Each sample in the refined Alpaca-CoT data contains the meta info listed below:
- Language Tags:
  - EN: Instruction datasets in English
  - CN: Instruction datasets in Chinese
  - ML: [Multi-lingual] Instruction datasets in multiple languages
- Task Tags:
  - MT: [Multi-task] Datasets containing multiple tasks
  - TS: [Task-specific] Datasets tailored for specific tasks
- Generation-method:
  - HG: [Human Generated Dataset] Datasets created by humans
  - SI: [Self-Instruct] Datasets generated using self-instruct methods
  - MIX: [Mixed Dataset] Dataset contains both human and machine generated data
  - COL: [Collection of Dataset] Dataset made from a collection of other datasets
- `Dataset`: dataset name in Alpaca-CoT
- `origin_path`: original file path in Alpaca-CoT
- `IFT`: tagged as Instruct Fine-Tuning datasets
- `CFT`: tagged as Chat Fine-Tuning datasets
  - `CFT-SR`: tagged as Single-round Dialog datasets
  - `CFT-MR`: tagged as Multi-round Dialog datasets
  - `CFT-P`: tagged as Preference datasets
| | Task | Gen | Lang | Dataset | IFT | CFT-SR | CFT-MR | CFT-P |
|---|---|---|---|---|---|---|---|---|
Chain-of-Thought | MT | HG | EN/CN | Chain-of-Thought | ✅ | |||
GPT4all | MT | COL | EN | GPT4all | ✅ | ✅ | ||
GPTeacher | MT | SI | EN | GPTeacher | ✅ | |||
Guanaco | MT | SI | ML | Guanaco | ✅ | |||
HC3 | TS | MIX | EN/CN | HC3 | ✅ | ✅ | ||
alpaca | MT | SI | EN | alpaca | ✅ | |||
Natural-Instructions | MT | COL | ML | Natural-Instructions | ✅ | |||
belle_cn | TS/MT | SI | CN | belle_cn | ✅ | |||
instinwild | MT | SI | EN/CN | instinwild | ✅ | |||
prosocial-dialog | TS | MIX | EN | prosocial-dialog | ✅ | |||
finance | TS | COL | EN | finance | ✅ | |||
xP3 | MT | COL | ML | xP3 | ✅ | |||
firefly | MT | COL | CN | firefly | ✅ | |||
instruct | MT | COL | EN | instruct | ✅ | |||
CodeAlpaca | TS | SI | EN | CodeAlpaca | ✅ | |||
alpacaGPT4 | MT | SI | EN/CN | alpacaGPT4 | ✅ | ✅ | ||
webGPT | TS | MIX | EN | webGPT | ✅ | ✅ | ||
dolly | TS | HG | EN | dolly | ✅ | |||
baize | MT | COL | EN | baize | ✅ | |||
hh-rlhf | TS | MIX | EN | hh-rlhf | ✅ | ✅ | ✅ | |
OIG | MT | COL | EN | OIG | ✅ | |||
GAOKAO | MT | COL | CN | GAOKAO | ✅ | |||
camel | MT | SI | EN | camel | ✅ | |||
FLAN-Muffin | MT | COL | EN | FLAN-Muffin | ✅ | |||
COIG | MT | COL | CN | COIG | ✅ | |||
gpt4tools | MT | SI | EN | gpt4tools | ✅ | |||
ShareGPT | MT | MIX | EN | ShareGPT | ✅ | ✅ | ||
Auto-CoT | MT | COL | EN | Auto-CoT | ✅ | |||
MOSS | TS | SI | EN/CN | MOSS | ✅ | |||
ultrachat | TS | SI | EN | ultrachat | ✅ | |||
Chinese-medical | TS | COL | CN | Chinese-medical | ✅ | |||
CSL | MT | COL | CN | CSL | ✅ | |||
pCLUE | MT | COL | CN | pCLUE | ✅ | |||
news_commentary | TS | COL | CN | news_commentary | ✅ | |||
StackExchange | MT | COL | EN | StackExchange | ✅ | ✅ | ||
ConvAI2 | TS | HG | EN | ConvAI2 | ✅ | |||
FastChat | MT | SI | EN | FastChat | ✅ | |||
Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data | ✅ |
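
The tags above can be used to pick out subsets of the refined data. The following is a minimal sketch that assumes each sample stores its tags in a `meta` dict with keys named like the table columns. The key names, the `meta` layout, and the helper `select` are assumptions, not the confirmed export format. The tag values are copied from the table rows for alpaca, dolly, and firefly.

```python
def select(samples, **tags):
    """Keep samples whose meta matches every given tag value (hypothetical helper)."""
    return [
        s for s in samples
        if all(s["meta"].get(key) == value for key, value in tags.items())
    ]


# Illustrative samples; the text is elided and the meta layout is an assumption.
samples = [
    {"text": "...", "meta": {"Dataset": "alpaca", "Task": "MT", "Gen": "SI", "Lang": "EN"}},
    {"text": "...", "meta": {"Dataset": "dolly", "Task": "TS", "Gen": "HG", "Lang": "EN"}},
    {"text": "...", "meta": {"Dataset": "firefly", "Task": "MT", "Gen": "COL", "Lang": "CN"}},
]

english_human = select(samples, Lang="EN", Gen="HG")
print([s["meta"]["Dataset"] for s in english_human])  # ['dolly']
```

Flags such as `IFT` or `CFT-SR` could be filtered the same way once their stored representation is known.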