This folder contains some postprocess scripts for additional processing of your processed dataset using Data-Juicer.
Use `count_token.py` to count tokens for datasets.
```shell
python tools/postprocess/count_token.py       \
    --data_path           <data_path>         \
    --text_keys           <text_keys>         \
    --tokenizer_method    <tokenizer_method>  \
    --num_proc            <num_proc>

# get help
python tools/postprocess/count_token.py --help
```
- `data_path`: path to the input dataset. Only supports `jsonl` now.
- `text_keys`: field keys that will be included in the token count.
- `tokenizer_method`: name of the Hugging Face tokenizer.
- `num_proc`: number of processes used to count tokens.
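Conceptually, the script tokenizes each configured text field with the given Hugging Face tokenizer and sums the token counts. A minimal per-sample sketch, assuming the `transformers` package is installed; the tokenizer name and the `text` field below are illustrative, not fixed by the script:

```python
# Minimal sketch of per-sample token counting; the tokenizer name and
# the "text" field are illustrative, not fixed by count_token.py.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def count_tokens(sample, text_keys=("text",)):
    # Sum token counts over all configured text fields.
    return sum(len(tokenizer.tokenize(sample[key])) for key in text_keys)

print(count_tokens({"text": "Data-Juicer makes data processing easier."}))
```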
Use `data_mixture.py` to mix multiple datasets.
This script randomly selects samples from every dataset, mixes these samples, and exports them to a new dataset.
```shell
python tools/postprocess/data_mixture.py        \
    --data_path           <data_path>           \
    --export_path         <export_path>         \
    --export_shard_size   <export_shard_size>   \
    --num_proc            <num_proc>

# get help
python tools/postprocess/data_mixture.py --help
```
- `data_path`: a dataset file, a list of dataset files, or a mixed list of both. Each path can be preceded by an optional weight; if not set, 1.0 is used as default.
- `export_path`: file name for exporting the mixed dataset; supports `json` / `jsonl` / `parquet`.
- `export_shard_size`: shard size of the exported dataset files in bytes. If not set, the mixed dataset will be exported into a single file.
- `num_proc`: number of processes used to load and export the datasets.
e.g., `python tools/postprocess/data_mixture.py --data_path <w1> ds.jsonl <w2> ds_dir <w3> ds_file.json`
Note: All datasets must have the same meta fields, so we can use HuggingFace Datasets to align their features.
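A rough sketch of what such weighted mixing amounts to with HuggingFace Datasets; the file names and weights below are illustrative, and the `concatenate_datasets` step is also why the meta fields must match:

```python
# Rough sketch of weighted dataset mixing with HuggingFace Datasets;
# file names and weights are illustrative, and the real script also
# supports directories, multiple formats, and sharded export.
import random
from datasets import load_dataset, concatenate_datasets

sources = [("ds1.jsonl", 1.0), ("ds2.jsonl", 0.5)]
parts = []
for path, weight in sources:
    ds = load_dataset("json", data_files=path, split="train")
    k = int(len(ds) * weight)  # how many samples to draw from this source
    parts.append(ds.select(random.sample(range(len(ds)), k)))

# Concatenation requires identical features across all datasets.
mixed = concatenate_datasets(parts).shuffle(seed=42)
mixed.to_json("mixed.jsonl")
```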
Use `deserialize_meta.py`, usually together with `serialize_meta.py`, to deserialize the specified field back into its original format.
```shell
python tools/postprocess/deserialize_meta.py  \
    --src_dir           <src_dir>             \
    --target_dir        <target_dir>          \
    --serialized_key    <serialized_key>      \
    --num_proc          <num_proc>

# get help
python tools/postprocess/deserialize_meta.py --help
```
- `src_dir`: path where the jsonl files are stored.
- `target_dir`: path to save the converted jsonl files.
- `serialized_key`: the key corresponding to the field that will be deserialized. It's `'source_info'` by default.
- `num_proc` (optional): number of process workers. It's 1 by default.
Note: After deserialization, all serialized fields in the original file will be placed under `serialized_key`. This ensures that the fields generated by Data-Juicer processing will not conflict with the original meta fields.
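Concretely, deserialization just parses the JSON string stored under `serialized_key` back into a structured object. A minimal sketch, where the record contents and field values are illustrative:

```python
# Minimal sketch of deserializing the meta field of one jsonl record;
# the record contents are illustrative.
import json

line = '{"text": "a sample", "source_info": "{\\"url\\": \\"https://example.com\\"}"}'
sample = json.loads(line)

# Parse the serialized string under `serialized_key` back into an object.
sample["source_info"] = json.loads(sample["source_info"])
print(sample["source_info"]["url"])  # -> https://example.com
```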