Operators are a collection of basic processes that assist in data modification, cleaning, filtering, deduplication, etc. We support a wide range of data sources and file formats, and allow for flexible extension to custom datasets.
This page offers a basic description of the operators (OPs) in Data-Juicer. Refer to the API documentation for the specific parameters of each operator. To see operator-wise usage and the effect of each operator on built-in test data samples, refer to and run the unit tests (tests/ops/...).
The operators in Data-Juicer are categorized into 7 types.
Type | Number | Description |
---|---|---|
Formatter | 9 | Discovers, loads, and canonicalizes source data |
Mapper | 63 | Edits and transforms samples |
Filter | 44 | Filters out low-quality samples |
Deduplicator | 8 | Detects and removes duplicate samples |
Selector | 4 | Selects top samples based on ranking |
Grouper | 2 | Groups samples into batched samples |
Aggregator | 3 | Aggregates batched samples, e.g., into a summary or conclusion |
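To make the roles in the table above concrete, here is a minimal sketch in plain Python of how a Mapper, Filter, Deduplicator, and Selector might be chained over a list of samples. It does not use the Data-Juicer API; the function `run_pipeline`, the `text` and `score` fields, and the thresholds are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Sample = Dict[str, object]


def run_pipeline(samples: List[Sample]) -> List[Sample]:
    """Illustrative chain of four operator types (not Data-Juicer code)."""
    # Mapper: edit each sample (here: strip surrounding whitespace).
    mapped = [{**s, "text": str(s["text"]).strip()} for s in samples]

    # Filter: drop samples that fail a quality check (here: too short).
    filtered = [s for s in mapped if len(s["text"]) >= 5]

    # Deduplicator: keep only the first occurrence of each exact text.
    seen = set()
    deduped = []
    for s in filtered:
        if s["text"] not in seen:
            seen.add(s["text"])
            deduped.append(s)

    # Selector: keep the top 2 samples ranked by a numeric field.
    return sorted(deduped, key=lambda s: s["score"], reverse=True)[:2]


samples = [
    {"text": "  hello world ", "score": 0.9},
    {"text": "hello world", "score": 0.8},
    {"text": "hi", "score": 1.0},
    {"text": "data cleaning", "score": 0.5},
    {"text": "operators compose", "score": 0.7},
]
result = run_pipeline(samples)
print([s["text"] for s in result])
```

The order of the stages matters in this sketch: stripping whitespace first is what lets the deduplication step recognize the first two samples as identical.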
All the specific operators are listed below, each featured with several capability tags.
- Domain Tags
- Modality Tags
- Language Tags
- Resource Tags
Operator | Tags | Description | Source code | Unit tests |
---|---|---|---|---|
local_formatter | Prepares datasets from local files | code | tests | |
remote_formatter | Prepares datasets from remote (e.g., HuggingFace) | code | tests | |
csv_formatter | Prepares local .csv files | code | tests | |
tsv_formatter | Prepares local .tsv files | code | tests | |
json_formatter | Prepares local .json, .jsonl, .jsonl.zst files | code | - | |
parquet_formatter | Prepares local .parquet files | code | tests | |
text_formatter | Prepares other local text files (complete list) | code | - | |
empty_formatter | Prepares an empty dataset | code | tests | |
mixture_formatter | Handles a mixture of all the supported local file types | code | tests |
Operator | Tags | Description | Source code | Unit tests |
---|---|---|---|---|
audio_ffmpeg_wrapped_mapper | Simple wrapper to run a FFmpeg audio filter | code | tests | |
calibrate_qa_mapper | Calibrate question-answer pairs based on reference text | code | tests | |
calibrate_query_mapper | Calibrate query in question-answer pairs based on reference text | code | tests | |
calibrate_response_mapper | Calibrate response in question-answer pairs based on reference text | code | tests | |
chinese_convert_mapper | Converts Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji (by opencc) | code | tests | |
clean_copyright_mapper | Removes copyright notice at the beginning of code files (must contain the word copyright) | code | tests | |
clean_email_mapper | Removes email information | code | tests | |
clean_html_mapper | Removes HTML tags and returns plain text of all the nodes | code | tests | |
clean_ip_mapper | Removes IP addresses | code | tests | |
clean_links_mapper | Removes links, such as those starting with http or ftp | code | tests | |
expand_macro_mapper | Expands macros usually defined at the top of TeX documents | code | tests | |
extract_entity_attribute_mapper | Extract attributes for given entities from the text. | code | tests | |
extract_entity_relation_mapper | Extract entities and relations in the text for knowledge graph. | code | tests | |
extract_event_mapper | Extract events and relevant characters in the text. | code | tests | |
extract_keyword_mapper | Generate keywords for the text. | code | tests | |
extract_nickname_mapper | Extract nickname relationship in the text. | code | tests | |
extract_support_text_mapper | Extract support sub text for a summary. | code | tests | |
fix_unicode_mapper | Fixes broken Unicodes (by ftfy) | code | tests | |
generate_qa_from_examples_mapper | Generate question and answer pairs based on examples. | code | tests | |
generate_qa_from_text_mapper | Generate question and answer pairs from text. | code | tests | |
image_blur_mapper | Blur images | code | tests | |
image_captioning_from_gpt4v_mapper | Generates samples whose texts are generated based on gpt-4-vision and the image | code | - | |
image_captioning_mapper | Generates samples whose captions are generated by another model (such as blip2) from the figure within the original sample | code | tests | |
image_diffusion_mapper | Generate and augment images by stable diffusion model | code | tests | |
image_face_blur_mapper | Blur faces detected in images | code | tests | |
image_tagging_mapper | Mapper to generate image tags from the input images. | code | tests | |
nlpaug_en_mapper | Simply augments texts in English based on the nlpaug library | code | tests | |
nlpcda_zh_mapper | Simply augments texts in Chinese based on the nlpcda library | code | tests | |
optimize_qa_mapper | Optimize both the query and response in question-answering samples. | code | tests | |
optimize_query_mapper | Optimize the query in question-answering samples. | code | tests | |
optimize_response_mapper | Optimize the response in question-answering samples. | code | tests | |
pair_preference_mapper | Construct paired preference samples. | code | tests | |
punctuation_normalization_mapper | Normalizes various Unicode punctuations to their ASCII equivalents | code | tests | |
python_file_mapper | Executing Python function defined in a file | code | tests | |
python_lambda_mapper | Executing Python lambda function on data samples | code | tests | |
relation_identity_mapper | Identifies the relation between two entities in the text. | code | tests | |
remove_bibliography_mapper | Removes the bibliography of TeX documents | code | tests | |
remove_comments_mapper | Removes the comments of TeX documents | code | tests | |
remove_header_mapper | Removes the running headers of TeX documents, e.g., titles, chapter or section numbers/names | code | tests | |
remove_long_words_mapper | Removes words with length outside the specified range | code | tests | |
remove_non_chinese_character_mapper | Removes non-Chinese characters in text samples. | code | tests | |
remove_repeat_sentences_mapper | Removes repeated sentences in text samples. | code | tests | |
remove_specific_chars_mapper | Removes any user-specified characters or substrings | code | tests | |
remove_table_text_mapper | Detects and removes possible table contents (:warning: relies on regular expression matching and thus fragile) | code | tests | |
remove_words_with_incorrect_substrings_mapper | Removes words containing specified substrings | code | tests | |
replace_content_mapper | Replace all content in the text that matches a specific regular expression pattern with a designated replacement string | code | tests | |
sentence_split_mapper | Splits and reorganizes sentences according to semantics | code | tests | |
text_chunk_mapper | Splits input text into chunks. | code | tests | |
video_captioning_from_audio_mapper | Caption a video according to its audio streams based on Qwen-Audio model | code | tests | |
video_captioning_from_frames_mapper | Generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames are concatenated into a single string | code | tests | |
video_captioning_from_summarizer_mapper | Generate video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...) | code | tests | |
video_captioning_from_video_mapper | Generates samples whose captions are generated by another model (video-blip) from sampled video frames within the original sample | code | tests | |
video_extract_frames_mapper | Extracts frames from video files according to specified methods | code | tests | |
video_face_blur_mapper | Blur faces detected in videos | code | tests | |
video_ffmpeg_wrapped_mapper | Simple wrapper to run a FFmpeg video filter | code | tests | |
video_remove_watermark_mapper | Remove the watermarks in videos given regions | code | tests | |
video_resize_aspect_ratio_mapper | Resize video aspect ratio to a specified range | code | tests | |
video_resize_resolution_mapper | Map videos to ones with given resolution range | code | tests | |
video_split_by_duration_mapper | Mapper to split video by duration | code | tests | |
video_split_by_key_frame_mapper | Mapper to split video by key frame | code | tests | |
video_split_by_scene_mapper | Split videos into scene clips | code | tests | |
video_tagging_from_audio_mapper | Mapper to generate video tags from audio streams extracted from the video. | code | tests | |
video_tagging_from_frames_mapper | Mapper to generate video tags from frames extracted from the video. | code | tests | |
whitespace_normalization_mapper | Normalizes various Unicode whitespaces to the normal ASCII space (U+0020) | code | tests |
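As an illustration of what a Mapper does to a single sample, the following sketch mimics the behavior described for whitespace normalization: each of several Unicode whitespace characters is replaced with the ASCII space (U+0020). The function name `normalize_whitespace` and the character set are assumptions made for this example; the set is deliberately partial and is not taken from the operator's source code.

```python
# A partial set of Unicode whitespace code points, chosen for illustration.
VARIOUS_WHITESPACES = {
    "\u00a0",  # no-break space
    "\u2002",  # en space
    "\u2003",  # em space
    "\u2009",  # thin space
    "\u200a",  # hair space
    "\u3000",  # ideographic space
}


def normalize_whitespace(text: str) -> str:
    """Replace each listed Unicode whitespace with an ASCII space (U+0020)."""
    return "".join(" " if ch in VARIOUS_WHITESPACES else ch for ch in text)


sample = {"text": "a\u00a0b\u3000c"}
sample["text"] = normalize_whitespace(sample["text"])
print(repr(sample["text"]))
```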
Operator | Tags | Description | Source code | Unit tests |
---|---|---|---|---|
alphanumeric_filter | Keeps samples with alphanumeric ratio within the specified range | code | tests | |
audio_duration_filter | Keep data samples whose audios' durations are within a specified range | code | tests | |
audio_nmf_snr_filter | Keep data samples whose audios' Signal-to-Noise Ratios (SNRs, computed based on Non-Negative Matrix Factorization, NMF) are within a specified range | code | tests | |
audio_size_filter | Keep data samples whose audios' sizes are within a specified range | code | tests | |
average_line_length_filter | Keeps samples with average line length within the specified range | code | tests | |
character_repetition_filter | Keeps samples with char-level n-gram repetition ratio within the specified range | code | tests | |
flagged_words_filter | Keeps samples with flagged-word ratio below the specified threshold | code | tests | |
image_aesthetics_filter | Keeps samples containing images whose aesthetics scores are within the specified range | code | tests | |
image_aspect_ratio_filter | Keeps samples containing images with aspect ratios within the specified range | code | tests | |
image_face_count_filter | Keeps samples containing images with face counts within the specified range | code | tests | |
image_face_ratio_filter | Keeps samples containing images with face area ratios within the specified range | code | tests | |
image_nsfw_filter | Keeps samples containing images with NSFW scores below the threshold | code | tests | |
image_pair_similarity_filter | Keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model | code | tests | |
image_shape_filter | Keeps samples containing images with widths and heights within the specified range | code | tests | |
image_size_filter | Keeps samples containing images whose size in bytes are within the specified range | code | tests | |
image_text_matching_filter | Keeps samples with image-text classification matching score within the specified range based on a BLIP model | code | tests | |
image_text_similarity_filter | Keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model | code | tests | |
image_watermark_filter | Keeps samples containing images with predicted watermark probabilities below the threshold | code | tests | |
language_id_score_filter | Keeps samples of the specified language, judged by a predicted confidence score | code | tests | |
maximum_line_length_filter | Keeps samples with maximum line length within the specified range | code | tests | |
perplexity_filter | Keeps samples with perplexity score below the specified threshold | code | tests | |
phrase_grounding_recall_filter | Keeps samples whose locating recalls of phrases extracted from text in the images are within a specified range | code | tests | |
special_characters_filter | Keeps samples with special-char ratio within the specified range | code | tests | |
specified_field_filter | Filters samples based on a field, keeping those whose value lies in the specified targets | code | tests | |
specified_numeric_field_filter | Filters samples based on a field, keeping those whose value lies in the specified range (for numeric types) | code | tests | |
stopwords_filter | Keeps samples with stopword ratio above the specified threshold | code | tests | |
suffix_filter | Keeps samples with specified suffixes | code | tests | |
text_action_filter | Keeps samples containing action verbs in their texts | code | tests | |
text_entity_dependency_filter | Keeps samples containing dependency edges for an entity in the dependency tree of the texts | code | tests | |
text_length_filter | Keeps samples with total text length within the specified range | code | tests | |
token_num_filter | Keeps samples with token count within the specified range | code | tests | |
video_aesthetics_filter | Keeps samples whose specified frames have aesthetics scores within the specified range | code | tests | |
video_aspect_ratio_filter | Keeps samples containing videos with aspect ratios within the specified range | code | tests | |
video_duration_filter | Keep data samples whose videos' durations are within a specified range | code | tests | |
video_frames_text_similarity_filter | Keep data samples whose similarities between sampled video frame images and text are within a specific range | code | tests | |
video_motion_score_filter | Keep samples with video motion scores within a specific range | code | tests | |
video_motion_score_raft_filter | Keep samples with video motion scores (based on RAFT model) within a specific range | code | tests | |
video_nsfw_filter | Keeps samples containing videos with NSFW scores below the threshold | code | tests | |
video_ocr_area_ratio_filter | Keep data samples whose detected text area ratios for specified frames in the video are within a specified range | code | tests | |
video_resolution_filter | Keeps samples containing videos with horizontal and vertical resolutions within the specified range | code | tests | |
video_watermark_filter | Keeps samples containing videos with predicted watermark probabilities below the threshold | code | tests | |
video_tagging_from_frames_filter | Keep samples containing videos with given tags | code | tests | |
words_num_filter | Keeps samples with word count within the specified range | code | tests | |
word_repetition_filter | Keeps samples with word-level n-gram repetition ratio within the specified range | code | tests |
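Most of the Filters above follow the same pattern: compute a statistic for a sample, then keep the sample only if the statistic falls within a range. The sketch below shows that pattern for an alphanumeric ratio. It is plain Python, not the Data-Juicer implementation; `alnum_ratio`, `keep_sample`, and the default bounds are hypothetical.

```python
def alnum_ratio(text: str) -> float:
    """Fraction of characters in the text that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_sample(sample: dict, min_ratio: float = 0.25, max_ratio: float = 1.0) -> bool:
    """Keep the sample only if its statistic lies within [min_ratio, max_ratio]."""
    return min_ratio <= alnum_ratio(sample["text"]) <= max_ratio


samples = [{"text": "abc123"}, {"text": "!!!!a"}, {"text": ""}]
kept = [s for s in samples if keep_sample(s)]
print(kept)
```

Separating the statistic from the keep/drop decision means the same statistic can be reused with different thresholds without recomputing it.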
Operator | Tags | Description | Source code | Unit tests |
---|---|---|---|---|
document_deduplicator | Deduplicates samples at document-level by comparing MD5 hash | code | tests | |
document_minhash_deduplicator | Deduplicates samples at document-level using MinHashLSH | code | tests | |
document_simhash_deduplicator | Deduplicates samples at document-level using SimHash | code | tests | |
image_deduplicator | Deduplicates samples at document-level using exact matching of images between documents | code | tests | |
video_deduplicator | Deduplicates samples at document-level using exact matching of videos between documents | code | tests | |
ray_document_deduplicator | Deduplicates samples at document-level by comparing MD5 hash on ray | code | - | |
ray_image_deduplicator | Deduplicates samples at document-level using exact matching of images between documents on ray | code | - | |
ray_video_deduplicator | Deduplicates samples at document-level using exact matching of videos between documents on ray | code | - |
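The idea of deduplicating at document level by comparing MD5 hashes can be sketched with the standard library alone. This is an illustrative version under simple assumptions (each sample is a dict with a `text` field, and the first occurrence is the one kept); the helper `md5_dedup` is hypothetical and not part of Data-Juicer.

```python
import hashlib


def md5_dedup(samples: list) -> list:
    """Keep the first sample for each distinct MD5 digest of its text."""
    seen = set()
    unique = []
    for s in samples:
        digest = hashlib.md5(s["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique


docs = [{"text": "same doc"}, {"text": "other doc"}, {"text": "same doc"}]
print(md5_dedup(docs))
```

Exact hashing only catches byte-identical texts; near-duplicates are what the MinHashLSH and SimHash variants listed above are meant for.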
Operator | Tags | Description | Source code | Unit tests |
---|---|---|---|---|
frequency_specified_field_selector | Selects top samples by comparing the frequency of the specified field | code | tests | |
random_selector | Selects samples randomly | code | tests | |
range_specified_field_selector | Selects samples within a specified range by comparing the values of the specified field | code | tests | |
topk_specified_field_selector | Selects top samples by comparing the values of the specified field | code | tests |
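Unlike a Filter, which judges each sample on its own, a Selector ranks samples against each other. A minimal sketch of top-k selection by a field follows; `topk_by_field` and the `score` field are hypothetical names for this example and do not reflect the operator's actual parameters.

```python
def topk_by_field(samples: list, field: str, k: int, reverse: bool = True) -> list:
    """Return the k samples with the largest (or smallest) values of `field`."""
    return sorted(samples, key=lambda s: s[field], reverse=reverse)[:k]


samples = [
    {"id": "a", "score": 0.2},
    {"id": "b", "score": 0.9},
    {"id": "c", "score": 0.5},
]
top2 = topk_by_field(samples, "score", k=2)
print([s["id"] for s in top2])
```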
Operator | Tags | Description | Source code | Unit tests |
---|---|---|---|---|
key_value_grouper | Groups samples into batched samples according to the values of the given keys. | code | tests | |
naive_grouper | Group all samples to one batched sample. | code | tests |
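The two grouping strategies can be sketched as follows, assuming a batched sample is represented as a dict of lists (one list per field). That representation, and the helpers `to_batched` and `group_by_key`, are assumptions made for illustration rather than the Data-Juicer implementation.

```python
from collections import defaultdict


def to_batched(samples: list) -> dict:
    """Naive grouping: turn a list of dict samples into one dict of lists."""
    keys = samples[0].keys() if samples else []
    return {k: [s[k] for s in samples] for k in keys}


def group_by_key(samples: list, key: str) -> list:
    """Key-value grouping: one batched sample per distinct value of `key`."""
    groups = defaultdict(list)
    for s in samples:
        groups[s[key]].append(s)
    return [to_batched(g) for g in groups.values()]


samples = [
    {"lang": "en", "text": "a"},
    {"lang": "zh", "text": "b"},
    {"lang": "en", "text": "c"},
]
print(to_batched(samples))
print(group_by_key(samples, "lang"))
```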
Operator | Tags | Description | Source code | Unit tests |
---|---|---|---|---|
entity_attribute_aggregator | Return conclusion of the given entity's attribute from some docs. | code | tests | |
most_relavant_entities_aggregator | Extract entities closely related to a given entity from some texts, and sort them in descending order of importance. | code | tests | |
nested_aggregator | Given the limitation on input length, aggregates contents in a nested manner, a given number of samples at a time. | code | tests | |
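Nested aggregation can be pictured as repeatedly combining small chunks until a single result remains, so that no single combine call sees more than a fixed number of inputs. The sketch below shows one way to do this; the function `nested_aggregate` and its `combine` and `max_items` parameters are hypothetical, and `sum` stands in here for whatever summarization step a real aggregator would use.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def nested_aggregate(items: List[T], combine: Callable[[List[T]], T], max_items: int) -> T:
    """Combine at most `max_items` items at a time until one result remains."""
    if not items:
        raise ValueError("items must not be empty")
    if max_items < 2:
        raise ValueError("max_items must be at least 2 to make progress")
    while len(items) > 1:
        # Each pass replaces every chunk of up to max_items items with one item.
        items = [
            combine(items[i:i + max_items])
            for i in range(0, len(items), max_items)
        ]
    return items[0]


total = nested_aggregate([1, 2, 3, 4, 5], combine=sum, max_items=2)
print(total)
```

Note that with a single input item the loop body never runs, so the item is returned without calling `combine`.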
We welcome contributions of new operators. Please refer to the How-to Guide for Developers.