Refine Alpaca-CoT Config Files

This folder contains configuration files for refining the Alpaca-CoT dataset quickly and easily.

Preprocess

The raw data files can be downloaded from Alpaca-CoT on HuggingFace.

Convert raw Alpaca-CoT data to jsonl

Use raw_alpaca_cot_merge_add_meta.py to select the instruction, input and output columns, merge them into a single text field separated by spaces, and add extra meta info to the dataset:

python tools/preprocess/raw_alpaca_cot_merge_add_meta.py    \
    --src_dir             <Alpaca-CoT_src_dir>              \
    --target_dir          <target_dir>                      \
    --num_proc            <num_proc>
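
As a rough illustration of the merge step described above, the sketch below joins the three columns of one record with a space and attaches a meta dictionary. It is not the code of raw_alpaca_cot_merge_add_meta.py: the function name `merge_record`, the `meta` field name, and the choice to skip empty columns are assumptions made for this example.

```python
import json


def merge_record(record, extra_meta):
    """Join instruction, input and output into one `text` field.

    Assumption: empty or missing columns are skipped so the merged
    text has no doubled spaces. The real script may behave differently.
    """
    parts = [record.get(key, "") for key in ("instruction", "input", "output")]
    text = " ".join(part.strip() for part in parts if part and part.strip())
    # `meta` is a hypothetical field name used here for illustration.
    return {"text": text, "meta": dict(extra_meta)}


raw = {"instruction": "Translate to French:", "input": "Good morning", "output": "Bonjour"}
merged = merge_record(raw, {"Dataset": "alpaca", "Lang": "EN"})

# Each merged record becomes one line of the jsonl output.
print(json.dumps(merged, ensure_ascii=False))
```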

Split datasets to sub-datasets by language

Use dataset_split_by_language.py to split the dataset into EN and ZH sub-datasets:

python tools/preprocess/dataset_split_by_language.py    \
    --src_dir             <src_dir>                     \
    --target_dir          <target_dir>                  \
    --suffixes            jsonl                         \
    --num_proc            <num_proc>
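
The split can be pictured as routing each jsonl record into a per-language bucket. The sketch below uses a crude share-of-CJK-characters heuristic purely as a stand-in; dataset_split_by_language.py may well rely on a proper language-identification model. The names `guess_language` and `split_by_language` and the 0.3 threshold are assumptions.

```python
from collections import defaultdict


def guess_language(text, threshold=0.3):
    """Return 'ZH' if enough characters are CJK ideographs, else 'EN'.

    Stand-in heuristic for illustration only; the 0.3 threshold is arbitrary.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return "EN"
    cjk = sum(1 for ch in chars if "\u4e00" <= ch <= "\u9fff")
    return "ZH" if cjk / len(chars) >= threshold else "EN"


def split_by_language(records):
    """Group records (dicts with a `text` field) by guessed language."""
    buckets = defaultdict(list)
    for record in records:
        buckets[guess_language(record["text"])].append(record)
    return dict(buckets)


samples = [
    {"text": "Explain how photosynthesis works."},
    {"text": "请解释光合作用的原理。"},
]
subsets = split_by_language(samples)
print({lang: len(rows) for lang, rows in subsets.items()})
```

In practice each bucket would be written to its own jsonl file under the target directory.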

Process

After preprocessing, set the dataset path in alpaca-cot-en-refine.yaml and alpaca-cot-zh-refine.yaml to the preprocessed data, then run the following commands to reproduce the processing flow of the refined Alpaca-CoT dataset.

# refine English dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml

# refine Chinese dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml
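
Pointing a recipe at the preprocessed data amounts to changing one value in the YAML file. The sketch below rewrites that value with a regular expression over the file's text. It assumes the path is stored under a top-level key named `dataset_path`, which should be checked against the recipe itself; the helper `set_dataset_path`, the recipe fragment, and the example path are all hypothetical.

```python
import re


def set_dataset_path(yaml_text, new_path):
    """Replace the value of a top-level `dataset_path:` key in YAML text.

    Assumption: the recipe stores the input location under `dataset_path`.
    Raises ValueError if no such key is found.
    """
    pattern = re.compile(r"^dataset_path:.*$", flags=re.MULTILINE)
    replacement = f"dataset_path: '{new_path}'"
    # A callable replacement avoids treating backslashes in the path as escapes.
    updated, count = pattern.subn(lambda _match: replacement, yaml_text, count=1)
    if count == 0:
        raise ValueError("no top-level dataset_path key found")
    return updated


# Hypothetical recipe fragment, not the contents of the real YAML file.
recipe = "project_name: 'example'\ndataset_path: '/path/to/old.jsonl'\nnp: 4\n"
print(set_dataset_path(recipe, "/data/alpaca-cot/en.jsonl"))
```

Editing the file by hand works just as well; a helper like this is mainly useful when the same recipe is run against several preprocessed directories.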

Meta Info

Each sample in the refined Alpaca-CoT data carries the meta info listed below:

Alpaca-CoT original meta info

  • Language Tags:
    • EN: Instruction datasets in English
    • CN: Instruction datasets in Chinese
    • ML: [Multi-lingual] Instruction datasets in multiple languages
  • Task Tags:
    • MT: [Multi-task] Datasets containing multiple tasks
    • TS: [Task-specific] Datasets tailored for specific tasks
  • Generation-method:
    • HG: [Human Generated Dataset] Datasets created by humans
    • SI: [Self-Instruct] Datasets generated using self-instruct methods
    • MIX: [Mixed Dataset] Datasets containing both human- and machine-generated data
    • COL: [Collection of Dataset] Datasets made from a collection of other datasets

Data-Juicer Meta info

  • Dataset: dataset name in Alpaca-CoT

  • origin_path: original file path in Alpaca-CoT

  • IFT: tagged as Instruct Fine-Tuning datasets

  • CFT: tagged as Chat Fine-Tuning datasets

    • CFT-SR: tagged as Single-round Dialog datasets

    • CFT-MR: tagged as Multi-round Dialog datasets

    • CFT-P: tagged as Preference datasets
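
To make the tag scheme concrete, the sketch below shows how refined samples might carry these fields and how they could be selected by tag. The layout (a nested `meta` dictionary with boolean flags, keyed by the short names `Lang`, `Task` and `Gen`), the helper `has_tag`, and both sample records are assumptions for illustration; inspect a real record for the actual field names and types.

```python
# Hypothetical refined samples; tag values follow the definitions above.
samples = [
    {
        "text": "Give three tips for staying healthy. Eat well, sleep, exercise.",
        "meta": {
            "Lang": "EN",    # language tag: EN, CN or ML
            "Task": "MT",    # task tag: MT or TS
            "Gen": "SI",     # generation method: HG, SI, MIX or COL
            "Dataset": "example-ift",
            "origin_path": "<path in Alpaca-CoT>",
            "IFT": True,
            "CFT-SR": False,
            "CFT-MR": False,
            "CFT-P": False,
        },
    },
    {
        "text": "Hello! Hi, how can I help? Tell me a joke.",
        "meta": {
            "Lang": "EN",
            "Task": "TS",
            "Gen": "MIX",
            "Dataset": "example-chat",
            "origin_path": "<path in Alpaca-CoT>",
            "IFT": False,
            "CFT-SR": False,
            "CFT-MR": True,
            "CFT-P": False,
        },
    },
]


def has_tag(sample, tag):
    """True if the sample's meta marks `tag` (e.g. 'IFT', 'CFT-MR') as set."""
    return bool(sample.get("meta", {}).get(tag, False))


# Keep only English instruction fine-tuning samples.
ift_english = [s for s in samples if has_tag(s, "IFT") and s["meta"]["Lang"] == "EN"]
print([s["meta"]["Dataset"] for s in ift_english])
```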

Refined Alpaca-CoT dataset Meta info

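| | Task | Gen | Lang | Dataset | IFT | CFT-SR | CFT-MR | CFT-P |
|---|---|---|---|---|---|---|---|---|
| Chain-of-Thought | MT | HG | EN/CN | Chain-of-Thought | | | | |
| GPT4all | MT | COL | EN | GPT4all | | | | |
| GPTeacher | MT | SI | EN | GPTeacher | | | | |
| Guanaco | MT | SI | ML | Guanaco | | | | |
| HC3 | TS | MIX | EN/CN | HC3 | | | | |
| alpaca | MT | SI | EN | alpaca | | | | |
| Natural-Instructions | MT | COL | ML | Natural-Instructions | | | | |
| belle_cn | TS/MT | SI | CN | belle_cn | | | | |
| instinwild | MT | SI | EN/CN | instinwild | | | | |
| prosocial-dialog | TS | MIX | EN | prosocial-dialog | | | | |
| finance | TS | COL | EN | finance | | | | |
| xP3 | MT | COL | ML | xP3 | | | | |
| firefly | MT | COL | CN | firefly | | | | |
| instruct | MT | COL | EN | instruct | | | | |
| CodeAlpaca | TS | SI | EN | CodeAlpaca | | | | |
| alpacaGPT4 | MT | SI | EN/CN | alpacaGPT4 | | | | |
| webGPT | TS | MIX | EN | webGPT | | | | |
| dolly | TS | HG | EN | dolly | | | | |
| baize | MT | COL | EN | baize | | | | |
| hh-rlhf | TS | MIX | EN | hh-rlhf | | | | |
| OIG | MT | COL | EN | OIG | | | | |
| GAOKAO | MT | COL | CN | GAOKAO | | | | |
| camel | MT | SI | EN | camel | | | | |
| FLAN-Muffin | MT | COL | EN | FLAN-Muffin | | | | |
| COIG | MT | COL | CN | COIG | | | | |
| gpt4tools | MT | SI | EN | gpt4tools | | | | |
| ShareGPT | MT | MIX | EN | ShareGPT | | | | |
| Auto-CoT | MT | COL | EN | Auto-CoT | | | | |
| MOSS | TS | SI | EN/CN | MOSS | | | | |
| ultrachat | TS | SI | EN | ultrachat | | | | |
| Chinese-medical | TS | COL | CN | Chinese-medical | | | | |
| CSL | MT | COL | CN | CSL | | | | |
| pCLUE | MT | COL | CN | pCLUE | | | | |
| news_commentary | TS | COL | CN | news_commentary | | | | |
| StackExchange | MT | COL | EN | StackExchange | | | | |
| ConvAI2 | TS | HG | EN | ConvAI2 | | | | |
| FastChat | MT | SI | EN | FastChat | | | | |
| Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data | | | | |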