Operator Schemas

Operators are a collection of basic processes for data modification, cleaning, filtering, deduplication, and similar tasks. We support a wide range of data sources and file formats, and allow for flexible extension to custom datasets.

This page gives a brief description of the operators (OPs) in Data-Juicer. For the specific parameters of each operator, refer to the API documentation. For operator-wise usage examples and the effect of each operator on the built-in test data samples, refer to and run the unit tests (tests/ops/...).

Overview

The operators in Data-Juicer are categorized into 7 types.

Type	Number	Description
Formatter	9	Discovers, loads, and canonicalizes source data
Mapper	64	Edits and transforms samples
Filter	44	Filters out low-quality samples
Deduplicator	8	Detects and removes duplicate samples
Selector	4	Selects top samples based on ranking
Grouper	2	Groups samples into batched samples
Aggregator	3	Aggregates batched samples, for example into a summary or conclusion
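
To make the division of labor between operator types concrete, here is a minimal plain-Python sketch of how a mapper, a filter, and a selector might compose. It does not use the Data-Juicer API: the function names, the `text` field, and the thresholds are all assumptions made for illustration.

```python
# Illustrative stand-ins only; these are not Data-Juicer classes.

def strip_mapper(sample):
    """Mapper: edits a sample and returns the transformed copy."""
    return {**sample, "text": sample["text"].strip()}

def min_length_filter(sample, min_len=5):
    """Filter: returns True to keep a sample, False to drop it."""
    return len(sample["text"]) >= min_len

def longest_selector(samples, k=2):
    """Selector: ranks samples by text length and keeps the top k."""
    return sorted(samples, key=lambda s: len(s["text"]), reverse=True)[:k]

def run_pipeline(samples):
    mapped = [strip_mapper(s) for s in samples]
    kept = [s for s in mapped if min_length_filter(s)]
    return longest_selector(kept)

samples = [
    {"text": "  hello world  "},
    {"text": "hi"},
    {"text": "a longer example sentence"},
    {"text": "medium"},
]
result = run_pipeline(samples)
```

In this sketch the mapper changes sample content, the filter only decides whether a sample stays, and the selector looks at the whole dataset to rank samples.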

All the specific operators are listed below, each labeled with several capability tags.

Domain Tags
- general purpose
- specific to LaTeX source files
- specific to program code
- closely related to the financial sector
Modality Tags
- specific to text
- specific to images
- specific to audio
- specific to video
- specific to multimodal data
Language Tags
- English
- Chinese
Resource Tags
- only requires CPU resources (default)
- requires GPU/CUDA resources as well

Formatter

Operator	Description	Source code	Unit tests
local_formatter	Prepares datasets from local files	code	tests
remote_formatter	Prepares datasets from remote (e.g., HuggingFace)	code	tests
csv_formatter	Prepares local `.csv` files	code	tests
tsv_formatter	Prepares local `.tsv` files	code	tests
json_formatter	Prepares local `.json`, `.jsonl`, `.jsonl.zst` files	code	-
parquet_formatter	Prepares local `.parquet` files	code	tests
text_formatter	Prepares other local text files (complete list)	code	-
empty_formatter	Prepares an empty dataset	code	tests
mixture_formatter	Handles a mixture of all the supported local file types	code	tests
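
As a rough illustration of the suffix-to-formatter mapping in the table above, the sketch below picks a formatter name from a file path. `pick_formatter` is a hypothetical helper, not part of Data-Juicer, and the fallback to `text_formatter` is a simplification: the real operator supports a specific list of suffixes.

```python
# Suffixes taken from the table above; the helper itself is hypothetical.
SUFFIX_TO_FORMATTER = {
    ".csv": "csv_formatter",
    ".tsv": "tsv_formatter",
    ".json": "json_formatter",
    ".jsonl": "json_formatter",
    ".jsonl.zst": "json_formatter",
    ".parquet": "parquet_formatter",
}

def pick_formatter(path):
    """Return the name of the formatter that matches the file suffix."""
    name = path.lower()
    # Try longer suffixes first so ".jsonl.zst" is matched as a whole.
    for suffix in sorted(SUFFIX_TO_FORMATTER, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_TO_FORMATTER[suffix]
    # Simplification: treat everything else as a plain text file.
    return "text_formatter"

choices = {p: pick_formatter(p) for p in ["a.csv", "b.jsonl.zst", "c.md"]}
```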

Mapper

Operator	Description	Source code	Unit tests
audio_ffmpeg_wrapped_mapper	Simple wrapper to run a FFmpeg audio filter	code	tests
calibrate_qa_mapper	Calibrate question-answer pairs based on reference text	code	tests
calibrate_query_mapper	Calibrate query in question-answer pairs based on reference text	code	tests
calibrate_response_mapper	Calibrate response in question-answer pairs based on reference text	code	tests
chinese_convert_mapper	Converts Chinese text between Traditional Chinese, Simplified Chinese, and Japanese Kanji (using opencc)	code	tests
clean_copyright_mapper	Removes copyright notice at the beginning of code files (must contain the word copyright)	code	tests
clean_email_mapper	Removes email information	code	tests
clean_html_mapper	Removes HTML tags and returns plain text of all the nodes	code	tests
clean_ip_mapper	Removes IP addresses	code	tests
clean_links_mapper	Removes links, such as those starting with http or ftp	code	tests
expand_macro_mapper	Expands macros usually defined at the top of TeX documents	code	tests
extract_entity_attribute_mapper	Extract attributes for given entities from the text.	code	tests
extract_entity_relation_mapper	Extract entities and relations in the text for knowledge graph.	code	tests
extract_event_mapper	Extract events and relevant characters in the text.	code	tests
extract_keyword_mapper	Generate keywords for the text.	code	tests
extract_nickname_mapper	Extract nickname relationship in the text.	code	tests
extract_support_text_mapper	Extract support sub text for a summary.	code	tests
fix_unicode_mapper	Fixes broken Unicode text (using ftfy)	code	tests
generate_qa_from_examples_mapper	Generate question and answer pairs based on examples.	code	tests
generate_qa_from_text_mapper	Generate question and answer pairs from text.	code	tests
image_blur_mapper	Blur images	code	tests
image_captioning_from_gpt4v_mapper	Generates samples whose texts are produced by gpt-4-vision from the image	code	-
image_captioning_mapper	Generates samples whose captions are produced by another model (such as blip2) from the image in the original sample	code	tests
image_diffusion_mapper	Generate and augment images by stable diffusion model	code	tests
image_face_blur_mapper	Blur faces detected in images	code	tests
image_tagging_mapper	Generates image tags from the input images.	code	tests
nlpaug_en_mapper	Simply augments texts in English based on the `nlpaug` library	code	tests
nlpcda_zh_mapper	Simply augments texts in Chinese based on the `nlpcda` library	code	tests
optimize_qa_mapper	Optimize both the query and response in question-answering samples.	code	tests
optimize_query_mapper	Optimize the query in question-answering samples.	code	tests
optimize_response_mapper	Optimize the response in question-answering samples.	code	tests
pair_preference_mapper	Construct paired preference samples.	code	tests
punctuation_normalization_mapper	Normalizes various Unicode punctuations to their ASCII equivalents	code	tests
python_file_mapper	Executes a Python function defined in a file	code	tests
python_lambda_mapper	Executes a Python lambda function on data samples	code	tests
relation_identity_mapper	Identifies the relation between two entities in the text.	code	tests
remove_bibliography_mapper	Removes the bibliography of TeX documents	code	tests
remove_comments_mapper	Removes the comments of TeX documents	code	tests
remove_header_mapper	Removes the running headers of TeX documents, e.g., titles, chapter or section numbers/names	code	tests
remove_long_words_mapper	Removes words with length outside the specified range	code	tests
remove_non_chinese_character_mapper	Removes non-Chinese characters from text samples.	code	tests
remove_repeat_sentences_mapper	Removes repeated sentences from text samples.	code	tests
remove_specific_chars_mapper	Removes any user-specified characters or substrings	code	tests
remove_table_text_mapper	Detects and removes possible table contents (warning: relies on regular expression matching and is therefore fragile)	code	tests
remove_words_with_incorrect_substrings_mapper	Removes words containing specified substrings	code	tests
replace_content_mapper	Replace all content in the text that matches a specific regular expression pattern with a designated replacement string	code	tests
sentence_split_mapper	Splits and reorganizes sentences according to semantics	code	tests
text_chunk_mapper	Splits input text into chunks.	code	tests
video_captioning_from_audio_mapper	Caption a video according to its audio streams based on Qwen-Audio model	code	tests
video_captioning_from_frames_mapper	Generates samples whose captions are produced by an image-to-text model from sampled video frames; captions from different frames are concatenated into a single string	code	tests
video_captioning_from_summarizer_mapper	Generate video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...)	code	tests
video_captioning_from_video_mapper	Generates samples whose captions are produced by another model (video-blip) from video frames sampled from the original sample	code	tests
video_extract_frames_mapper	Extracts frames from video files according to specified methods	code	tests
video_face_blur_mapper	Blur faces detected in videos	code	tests
video_ffmpeg_wrapped_mapper	Simple wrapper to run a FFmpeg video filter	code	tests
video_remove_watermark_mapper	Removes watermarks from the given regions of videos	code	tests
video_resize_aspect_ratio_mapper	Resizes the video aspect ratio to a specified range	code	tests
video_resize_resolution_mapper	Resizes videos to fall within the given resolution range	code	tests
video_split_by_duration_mapper	Splits videos by duration	code	tests
video_split_by_key_frame_mapper	Splits videos by key frame	code	tests
video_split_by_scene_mapper	Splits videos into scene clips	code	tests
video_tagging_from_audio_mapper	Generates video tags from audio streams extracted from the video.	code	tests
video_tagging_from_frames_mapper	Generates video tags from frames extracted from the video.	code	tests
whitespace_normalization_mapper	Normalizes various Unicode whitespaces to the normal ASCII space (U+0020)	code	tests
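
The following is a minimal sketch of what two of the simpler text mappers above might do, and how mappers can be chained. It is plain Python rather than the Data-Juicer implementation: the whitespace set is deliberately incomplete, the email pattern is a simple assumption, and the `text` field name is illustrative.

```python
import re

# A small, non-exhaustive set of Unicode whitespace characters (assumption).
UNICODE_SPACES = {"\u00a0", "\u2002", "\u2003", "\u2009", "\u200a", "\u3000"}

# A deliberately simple email pattern; real-world addresses vary more.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def normalize_whitespace(sample):
    """Map each listed Unicode whitespace character to an ASCII space."""
    text = "".join(" " if ch in UNICODE_SPACES else ch for ch in sample["text"])
    return {**sample, "text": text}

def clean_email(sample):
    """Remove anything that looks like an email address."""
    return {**sample, "text": EMAIL_PATTERN.sub("", sample["text"])}

def apply_mappers(sample, mappers):
    """Apply mappers in order; each returns a new sample."""
    for mapper in mappers:
        sample = mapper(sample)
    return sample

sample = {"text": "Contact\u00a0me at alice@example.com\u3000today"}
cleaned = apply_mappers(sample, [normalize_whitespace, clean_email])
```

Each mapper returns a new dictionary instead of mutating its input, which keeps the original sample available for comparison.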

Filter

Operator	Description	Source code	Unit tests
alphanumeric_filter	Keeps samples with alphanumeric ratio within the specified range	code	tests
audio_duration_filter	Keeps samples whose audio durations are within the specified range	code	tests
audio_nmf_snr_filter	Keeps samples whose audio signal-to-noise ratios (SNRs, computed using non-negative matrix factorization, NMF) are within the specified range	code	tests
audio_size_filter	Keeps samples whose audio sizes are within the specified range	code	tests
average_line_length_filter	Keeps samples with average line length within the specified range	code	tests
character_repetition_filter	Keeps samples with char-level n-gram repetition ratio within the specified range	code	tests
flagged_words_filter	Keeps samples with flagged-word ratio below the specified threshold	code	tests
image_aesthetics_filter	Keeps samples containing images whose aesthetics scores are within the specified range	code	tests
image_aspect_ratio_filter	Keeps samples containing images with aspect ratios within the specified range	code	tests
image_face_count_filter	Keeps samples containing images with face counts within the specified range	code	tests
image_face_ratio_filter	Keeps samples containing images with face area ratios within the specified range	code	tests
image_nsfw_filter	Keeps samples containing images with NSFW scores below the threshold	code	tests
image_pair_similarity_filter	Keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model	code	tests
image_shape_filter	Keeps samples containing images with widths and heights within the specified range	code	tests
image_size_filter	Keeps samples containing images whose sizes in bytes are within the specified range	code	tests
image_text_matching_filter	Keeps samples with image-text classification matching score within the specified range based on a BLIP model	code	tests
image_text_similarity_filter	Keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model	code	tests
image_watermark_filter	Keeps samples containing images with predicted watermark probabilities below the threshold	code	tests
language_id_score_filter	Keeps samples of the specified language, judged by a predicted confidence score	code	tests
maximum_line_length_filter	Keeps samples with maximum line length within the specified range	code	tests
perplexity_filter	Keeps samples with perplexity score below the specified threshold	code	tests
phrase_grounding_recall_filter	Keeps samples whose locating recalls of phrases extracted from text in the images are within a specified range	code	tests
special_characters_filter	Keeps samples with special-char ratio within the specified range	code	tests
specified_field_filter	Keeps samples whose value in the specified field is among the specified targets	code	tests
specified_numeric_field_filter	Keeps samples whose value in the specified field is within the specified range (for numeric types)	code	tests
stopwords_filter	Keeps samples with stopword ratio above the specified threshold	code	tests
suffix_filter	Keeps samples with specified suffixes	code	tests
text_action_filter	Keeps samples containing action verbs in their texts	code	tests
text_entity_dependency_filter	Keeps samples containing dependency edges for an entity in the dependency tree of the texts	code	tests
text_length_filter	Keeps samples with total text length within the specified range	code	tests
token_num_filter	Keeps samples with token count within the specified range	code	tests
video_aesthetics_filter	Keeps samples whose specified frames have aesthetics scores within the specified range	code	tests
video_aspect_ratio_filter	Keeps samples containing videos with aspect ratios within the specified range	code	tests
video_duration_filter	Keeps samples whose video durations are within the specified range	code	tests
video_frames_text_similarity_filter	Keeps samples whose similarities between sampled video frame images and text are within the specified range	code	tests
video_motion_score_filter	Keeps samples with video motion scores within the specified range	code	tests
video_motion_score_raft_filter	Keeps samples with video motion scores (based on the RAFT model) within the specified range	code	tests
video_nsfw_filter	Keeps samples containing videos with NSFW scores below the threshold	code	tests
video_ocr_area_ratio_filter	Keeps samples whose detected text area ratios for specified frames in the video are within the specified range	code	tests
video_resolution_filter	Keeps samples containing videos with horizontal and vertical resolutions within the specified range	code	tests
video_watermark_filter	Keeps samples containing videos with predicted watermark probabilities below the threshold	code	tests
video_tagging_from_frames_filter	Keeps samples containing videos with the given tags	code	tests
words_num_filter	Keeps samples with word count within the specified range	code	tests
word_repetition_filter	Keeps samples with word-level n-gram repetition ratio within the specified range	code	tests
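
Most filters above follow the same pattern: compute a statistic per sample, then keep the sample if the statistic falls within a range. Below is a minimal sketch of that pattern for an alphanumeric ratio. The function names and the default thresholds are assumptions for illustration, not Data-Juicer defaults.

```python
def alphanumeric_ratio(text):
    """Fraction of characters that are letters or digits (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)

def keep_by_alnum_ratio(sample, min_ratio=0.25, max_ratio=1.0):
    """Return True if the sample's ratio lies within [min_ratio, max_ratio]."""
    ratio = alphanumeric_ratio(sample["text"])
    return min_ratio <= ratio <= max_ratio

samples = [
    {"text": "Hello world 123"},
    {"text": "#### ---- !!!!"},
    {"text": ""},
]
kept = [s for s in samples if keep_by_alnum_ratio(s)]
```

Separating the statistic from the keep decision makes it possible to inspect the computed values before choosing thresholds.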

Deduplicator

Operator	Description	Source code	Unit tests
document_deduplicator	Deduplicates samples at document-level by comparing MD5 hash	code	tests
document_minhash_deduplicator	Deduplicates samples at document-level using MinHashLSH	code	tests
document_simhash_deduplicator	Deduplicates samples at document-level using SimHash	code	tests
image_deduplicator	Deduplicates samples at document-level using exact matching of images between documents	code	tests
video_deduplicator	Deduplicates samples at document-level using exact matching of videos between documents	code	tests
ray_document_deduplicator	Deduplicates samples at document-level by comparing MD5 hash on Ray	code	-
ray_image_deduplicator	Deduplicates samples at document-level using exact matching of images between documents on Ray	code	-
ray_video_deduplicator	Deduplicates samples at document-level using exact matching of videos between documents on Ray	code	-
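
The idea behind exact, hash-based document deduplication can be sketched in a few lines. This is a simplified illustration using the standard `hashlib` module, assuming each sample has a `text` field; it keeps the first occurrence of each document and is not the Data-Juicer implementation.

```python
import hashlib

def md5_of(text):
    """Hex MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def deduplicate(samples):
    """Keep the first sample for each distinct text hash, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        digest = md5_of(sample["text"])
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique

samples = [
    {"text": "same document"},
    {"text": "another document"},
    {"text": "same document"},
]
unique = deduplicate(samples)
```

Exact hashing only catches byte-identical texts; near-duplicates need a fuzzy method such as the MinHashLSH or SimHash operators listed above.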

Selector

Operator	Description	Source code	Unit tests
frequency_specified_field_selector	Selects top samples by comparing the frequency of the specified field	code	tests
random_selector	Selects samples randomly	code	tests
range_specified_field_selector	Selects samples within a specified range by comparing the values of the specified field	code	tests
topk_specified_field_selector	Selects top samples by comparing the values of the specified field	code	tests
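
Unlike filters, selectors operate on the dataset as a whole. The sketch below shows a top-k selection by field value and a seeded random selection in plain Python. The function names, the `score` field, and the `seed` parameter are assumptions made for illustration.

```python
import random

def select_topk(samples, field, k, descending=True):
    """Rank samples by `field` and keep the first k."""
    ranked = sorted(samples, key=lambda s: s[field], reverse=descending)
    return ranked[:k]

def select_random(samples, k, seed=0):
    """Pick up to k samples uniformly at random; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return rng.sample(samples, min(k, len(samples)))

samples = [
    {"text": "a", "score": 0.2},
    {"text": "b", "score": 0.9},
    {"text": "c", "score": 0.5},
]
top2 = select_topk(samples, "score", k=2)
picked = select_random(samples, k=2)
```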

Grouper

Operator	Description	Source code	Unit tests
key_value_grouper	Groups samples into batched samples according to the values of the given keys.	code	tests
naive_grouper	Groups all samples into one batched sample.	code	tests
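
A minimal sketch of the two grouping strategies follows. It assumes that a batched sample is a dictionary mapping each field to a list of values, which is an illustrative choice rather than a statement about Data-Juicer's internal format. The function names are hypothetical.

```python
from collections import defaultdict

def to_batched(group):
    """Turn a non-empty list of samples into one batched sample (a dict of lists)."""
    return {field: [s[field] for s in group] for field in group[0]}

def naive_group(samples):
    """Put all samples into a single batched sample."""
    return [to_batched(samples)] if samples else []

def group_by_key(samples, key):
    """Create one batched sample per distinct value of `key`."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    return [to_batched(group) for group in groups.values()]

samples = [
    {"lang": "en", "text": "hello"},
    {"lang": "zh", "text": "你好"},
    {"lang": "en", "text": "world"},
]
by_lang = group_by_key(samples, "lang")
one_batch = naive_group(samples)
```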

Aggregator

Operator	Description	Source code	Unit tests
entity_attribute_aggregator	Returns a conclusion about the given entity's attribute drawn from a set of documents.	code	tests
most_relavant_entities_aggregator	Extract entities closely related to a given entity from some texts, and sort them in descending order of importance.	code	tests
nested_aggregator	Aggregates contents in a nested manner, a given number of samples at a time, to stay within the input length limit.	code	tests
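
Nested aggregation can be pictured as repeatedly combining small groups until one result remains, so that no single step sees more than a fixed number of inputs. The sketch below is one way to express that idea. `combine` stands in for whatever produces a summary (a hypothetical callable here, a plain sum in the example), and `group_size` is an assumed parameter name.

```python
def nested_aggregate(items, combine, group_size):
    """Combine items in groups of at most `group_size` until one result is left."""
    if not items:
        raise ValueError("nothing to aggregate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    while len(items) > 1:
        items = [
            combine(items[i:i + group_size])
            for i in range(0, len(items), group_size)
        ]
    return items[0]

chunk_sizes = []  # records how many inputs each combine call received

def summing_combine(chunk):
    chunk_sizes.append(len(chunk))
    return sum(chunk)

total = nested_aggregate([1, 2, 3, 4, 5], summing_combine, group_size=2)
```

With a model-based `combine`, the group size would be chosen so that each group fits within the model's input length. A single input item is returned as-is without calling `combine`.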

Contributing

We welcome contributions of new operators. Please refer to the How-to Guide for Developers.
