# Postprocess tools

This folder contains postprocess scripts for additional processing of datasets that have already been processed by Data-Juicer.

## Usage

### Count tokens for datasets

Use `count_token.py` to count tokens for datasets.

```shell
python tools/postprocess/count_token.py        \
    --data_path            <data_path>         \
    --text_keys            <text_keys>         \
    --tokenizer_method     <tokenizer_method>  \
    --num_proc             <num_proc>

# get help
python tools/postprocess/count_token.py --help
```

- `data_path`: path to the input dataset. Only jsonl is supported for now.
- `text_keys`: field keys whose contents will be included in the token counts.
- `tokenizer_method`: name of the Hugging Face tokenizer to use.
- `num_proc`: number of processes used to count tokens.
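For example, a hypothetical invocation (the dataset path, tokenizer name, and process count below are illustrative placeholders, not defaults) might look like:

```shell
# Hypothetical example: count tokens in the `text` field of a processed
# jsonl dataset with the GPT-2 tokenizer, using 4 worker processes.
python tools/postprocess/count_token.py            \
    --data_path          demo-processed.jsonl      \
    --text_keys          text                      \
    --tokenizer_method   gpt2                      \
    --num_proc           4
```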

### Mix multiple datasets with optional weights

Use `data_mixture.py` to mix multiple datasets.

This script randomly selects samples from every input dataset, mixes these samples, and exports them to a new dataset.

```shell
python tools/postprocess/data_mixture.py        \
    --data_path             <data_path>         \
    --export_path           <export_path>       \
    --export_shard_size     <export_shard_size> \
    --num_proc              <num_proc>

# get help
python tools/postprocess/data_mixture.py --help
```

- `data_path`: a dataset file, a list of dataset files, or a mix of both, each optionally preceded by a weight. If a weight is not set, it defaults to 1.0. E.g., `python tools/postprocess/data_mixture.py --data_path <w1> ds.jsonl <w2> ds_dir <w3> ds_file.json`
- `export_path`: file name for exporting the mixed dataset; json / jsonl / parquet formats are supported.
- `export_shard_size`: size of each exported dataset shard, in bytes. If not set, the mixed dataset will be exported as a single file.
- `num_proc`: number of processes used to load and export datasets.

Note: All datasets must have the same meta fields, so that Hugging Face Datasets can be used to align their features.
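For instance, a hypothetical weighted mixture (the file names, weights, and shard size below are placeholders) could be run as:

```shell
# Hypothetical example: sample from ds1.jsonl with twice the weight of
# ds2.jsonl, exporting the mixture as jsonl shards of roughly 1 GB each.
python tools/postprocess/data_mixture.py                 \
    --data_path          2.0 ds1.jsonl 1.0 ds2.jsonl     \
    --export_path        mixed.jsonl                     \
    --export_shard_size  1000000000                      \
    --num_proc           4
```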

### Deserialize meta fields in jsonl file

This tool is usually used together with `serialize_meta.py` to deserialize the specified field back into its original format.

```shell
python tools/postprocess/deserialize_meta.py           \
    --src_dir           <src_dir>         \
    --target_dir        <target_dir>      \
    --serialized_key    <serialized_key>  \
    --num_proc          <num_proc>

# get help
python tools/postprocess/deserialize_meta.py --help
```

- `src_dir`: path to the directory that stores the jsonl files.
- `target_dir`: path to save the converted jsonl files.
- `serialized_key`: the key corresponding to the field that will be deserialized. Defaults to `'source_info'`.
- `num_proc` (optional): number of process workers. Defaults to 1.

Note: After deserialization, all fields that were serialized in the original file are placed under `serialized_key`. This ensures that fields generated during Data-Juicer processing do not conflict with the original meta fields.
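As a concrete illustration, a hypothetical run (the directory names below are placeholders) could look like:

```shell
# Hypothetical example: deserialize the default 'source_info' field in
# every jsonl file under outputs/ and write the results to outputs-deser/.
python tools/postprocess/deserialize_meta.py    \
    --src_dir         outputs/                  \
    --target_dir      outputs-deser/            \
    --serialized_key  source_info               \
    --num_proc        4
```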