Operator Schemas

Operators are a collection of basic processes that assist in data modification, cleaning, filtering, deduplication, etc. We support a wide range of data sources and file formats, and allow for flexible extension to custom datasets.

This page gives a brief description of each operator (OP) in Data-Juicer. For the specific parameters of each operator, refer to the API documentation. For examples of operator-wise usage and the effect of each operator on the built-in test data samples, refer to and run the unit tests (tests/ops/...).

Overview

The operators in Data-Juicer are categorized into 7 types.

Type Number Description
Formatter 9 Discovers, loads, and canonicalizes source data
Mapper 64 Edits and transforms samples
Filter 44 Filters out low-quality samples
Deduplicator 8 Detects and removes duplicate samples
Selector 4 Selects top samples based on ranking
Grouper 2 Groups samples into batched samples
Aggregator 3 Aggregates batched samples, e.g., into a summary or conclusion
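
To make the roles of the main operator types concrete, here is a minimal plain-Python sketch. It does not use the Data-Juicer API: the function names (`strip_mapper`, `length_filter`, `exact_deduplicator`, `topk_selector`) and the `{"text": ...}` sample layout are hypothetical and chosen only for illustration.

```python
# Illustrative only: plain-Python stand-ins for four operator types.
# None of these functions are part of Data-Juicer.

def strip_mapper(sample):
    """Mapper: edits a sample and returns the transformed sample."""
    return {**sample, "text": sample["text"].strip()}

def length_filter(sample, min_len=5):
    """Filter: returns True if the sample should be kept."""
    return len(sample["text"]) >= min_len

def exact_deduplicator(samples):
    """Deduplicator: keeps the first occurrence of each distinct text."""
    seen, kept = set(), []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            kept.append(s)
    return kept

def topk_selector(samples, k=2):
    """Selector: keeps the k highest-ranked samples (ranked here by text length)."""
    return sorted(samples, key=lambda s: len(s["text"]), reverse=True)[:k]

raw = [
    {"text": "  hello world "},
    {"text": "hi"},
    {"text": "hello world"},
    {"text": "a longer sentence here"},
]
mapped = [strip_mapper(s) for s in raw]              # per-sample transformation
filtered = [s for s in mapped if length_filter(s)]   # drops "hi"
deduped = exact_deduplicator(filtered)               # drops the second "hello world"
selected = topk_selector(deduped, k=1)               # keeps the longest text
```

Mappers and filters act on one sample at a time, while deduplicators and selectors need to see the whole dataset, which is why they take a list here.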

All the specific operators are listed below, each featured with several capability tags.

  • Domain Tags
    • General: general purpose
    • LaTeX: specific to LaTeX source files
    • Code: specific to programming code
    • Financial: closely related to the financial sector
  • Modality Tags
    • Text: specific to text
    • Image: specific to images
    • Audio: specific to audio
    • Video: specific to videos
    • Multimodal: specific to multimodal data
  • Language Tags
    • en: English
    • zh: Chinese
  • Resource Tags
    • CPU: only requires CPU resource (default)
    • GPU: requires GPU/CUDA resource as well

Formatter

Operator Tags Description Source code Unit tests
local_formatter General en zh Prepares datasets from local files code tests
remote_formatter General en zh Prepares datasets from remote sources (e.g., HuggingFace) code tests
csv_formatter General en zh Prepares local .csv files code tests
tsv_formatter General en zh Prepares local .tsv files code tests
json_formatter General en zh Prepares local .json, .jsonl, .jsonl.zst files code -
parquet_formatter General en zh Prepares local .parquet files code tests
text_formatter General en zh Prepares other local text files (complete list) code -
empty_formatter General Prepares an empty dataset code tests
mixture_formatter General en zh Handles a mixture of all the supported local file types code tests
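
As a rough illustration of what "loading and canonicalizing" local files can mean, the sketch below dispatches on the file suffix and always returns a list of dict samples. This is an assumption-laden stand-in using only the Python standard library, not the Data-Juicer formatter implementation; `load_samples` and the `"text"` fallback key are hypothetical names.

```python
import csv
import json
from pathlib import Path

def load_samples(path):
    """Load a local file into a list of dict samples, chosen by file suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix in {".csv", ".tsv"}:
        delimiter = "," if suffix == ".csv" else "\t"
        with path.open(newline="", encoding="utf-8") as f:
            return [dict(row) for row in csv.DictReader(f, delimiter=delimiter)]
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        return data if isinstance(data, list) else [data]
    # Fallback: treat any other file as a single plain-text sample.
    return [{"text": path.read_text(encoding="utf-8")}]
```

Returning the same shape (a list of dicts) for every input type is what lets downstream operators ignore where the data came from.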

Mapper

Operator Tags Description Source code Unit tests
audio_ffmpeg_wrapped_mapper Audio Simple wrapper to run a FFmpeg audio filter code tests
calibrate_qa_mapper General Text en zh Calibrate question-answer pairs based on reference text code tests
calibrate_query_mapper General Text en zh Calibrate query in question-answer pairs based on reference text code tests
calibrate_response_mapper General Text en zh Calibrate response in question-answer pairs based on reference text code tests
chinese_convert_mapper General Text zh Converts Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji (by opencc) code tests
clean_copyright_mapper Code Text en zh Removes copyright notice at the beginning of code files (must contain the word copyright) code tests
clean_email_mapper General Text en zh Removes email information code tests
clean_html_mapper General Text en zh Removes HTML tags and returns plain text of all the nodes code tests
clean_ip_mapper General Text en zh Removes IP addresses code tests
clean_links_mapper General Text Code en zh Removes links, such as those starting with http or ftp code tests
expand_macro_mapper LaTeX Text en zh Expands macros usually defined at the top of TeX documents code tests
extract_entity_attribute_mapper General Text en zh Extract attributes for given entities from the text. code tests
extract_entity_relation_mapper General Text en zh Extract entities and relations in the text for knowledge graph. code tests
extract_event_mapper General Text en zh Extract events and relevant characters in the text. code tests
extract_keyword_mapper General Text en zh Generate keywords for the text. code tests
extract_nickname_mapper General Text en zh Extract nickname relationship in the text. code tests
extract_support_text_mapper General Text en zh Extract support sub text for a summary. code tests
fix_unicode_mapper General Text en zh Fixes broken Unicodes (by ftfy) code tests
generate_qa_from_examples_mapper General Text en zh GPU Generate question and answer pairs based on examples. code tests
generate_qa_from_text_mapper General Text en zh GPU Generate question and answer pairs from text. code tests
image_blur_mapper Image Blur images code tests
image_captioning_from_gpt4v_mapper Multimodal Generates samples whose texts are produced by gpt-4-vision from the image code -
image_captioning_mapper Multimodal GPU Generates samples whose captions are produced by another model (such as blip2) from the image within the original sample code tests
image_diffusion_mapper Multimodal GPU Generate and augment images by stable diffusion model code tests
image_face_blur_mapper Image Blur faces detected in images code tests
image_tagging_mapper Multimodal GPU Mapper to generate image tags from the input images. code tests
nlpaug_en_mapper General Text en Simply augments texts in English based on the nlpaug library code tests
nlpcda_zh_mapper General Text zh Simply augments texts in Chinese based on the nlpcda library code tests
optimize_qa_mapper General Text en zh GPU Optimize both the query and response in question-answering samples. code tests
optimize_query_mapper General Text en zh GPU Optimize the query in question-answering samples. code tests
optimize_response_mapper General Text en zh GPU Optimize the response in question-answering samples. code tests
pair_preference_mapper General Text en zh Construct paired preference samples. code tests
punctuation_normalization_mapper General Text en zh Normalizes various Unicode punctuations to their ASCII equivalents code tests
python_file_mapper General Text en zh Executing Python function defined in a file code tests
python_lambda_mapper General Text en zh Executing Python lambda function on data samples code tests
relation_identity_mapper General Text en zh Identifies the relation between two entities in the text. code tests
remove_bibliography_mapper LaTeX Text en zh Removes the bibliography of TeX documents code tests
remove_comments_mapper LaTeX Text en zh Removes the comments of TeX documents code tests
remove_header_mapper LaTeX Text en zh Removes the running headers of TeX documents, e.g., titles, chapter or section numbers/names code tests
remove_long_words_mapper General Text en zh Removes words with length outside the specified range code tests
remove_non_chinese_character_mapper General Text en zh Removes non-Chinese characters in text samples. code tests
remove_repeat_sentences_mapper General Text en zh Removes repeated sentences in text samples. code tests
remove_specific_chars_mapper General Text en zh Removes any user-specified characters or substrings code tests
remove_table_text_mapper General Text Financial en Detects and removes possible table contents (warning: relies on regular-expression matching and is therefore fragile) code tests
remove_words_with_incorrect_substrings_mapper General Text en zh Removes words containing specified substrings code tests
replace_content_mapper General Text en zh Replace all content in the text that matches a specific regular expression pattern with a designated replacement string code tests
sentence_split_mapper General Text en Splits and reorganizes sentences according to semantics code tests
text_chunk_mapper General Text en zh Split input text to chunks. code tests
video_captioning_from_audio_mapper Multimodal GPU Caption a video according to its audio streams based on Qwen-Audio model code tests
video_captioning_from_frames_mapper Multimodal GPU Generates samples whose captions are produced by an image-to-text model from sampled video frames; captions from different frames are concatenated into a single string code tests
video_captioning_from_summarizer_mapper Multimodal GPU Generate video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...) code tests
video_captioning_from_video_mapper Multimodal GPU Generates samples whose captions are produced by another model (video-blip) from sampled video frames within the original sample code tests
video_extract_frames_mapper Multimodal GPU Extracts frames from video files according to specified methods code tests
video_face_blur_mapper Video Blur faces detected in videos code tests
video_ffmpeg_wrapped_mapper Video Simple wrapper to run a FFmpeg video filter code tests
video_remove_watermark_mapper Video Remove the watermarks in videos given regions code tests
video_resize_aspect_ratio_mapper Video Resize video aspect ratio to a specified range code tests
video_resize_resolution_mapper Video Map videos to ones with given resolution range code tests
video_split_by_duration_mapper Video Mapper to split video by duration code tests
video_split_by_key_frame_mapper Video Mapper to split video by key frame code tests
video_split_by_scene_mapper Video Split videos into scene clips code tests
video_tagging_from_audio_mapper Multimodal GPU Mapper to generate video tags from audio streams extracted from the video. code tests
video_tagging_from_frames_mapper Multimodal GPU Mapper to generate video tags from frames extracted from the video. code tests
whitespace_normalization_mapper General Text en zh Normalizes various Unicode whitespaces to the normal ASCII space (U+0020) code tests
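
Several of the text mappers above are simple string transformations. The following sketch shows what three of them might look like in plain Python. The regular expression, the whitespace list, and the function names are simplified assumptions for illustration and are not taken from the Data-Juicer source.

```python
import re

# Simplified email pattern; real-world address matching is more involved.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# A few non-ASCII whitespace characters; a real implementation would cover more.
UNICODE_SPACES = "\u00a0\u2002\u2003\u2009\u3000"
SPACE_TABLE = str.maketrans({ch: " " for ch in UNICODE_SPACES})

def clean_email(text, repl=""):
    """Replace anything that looks like an email address with `repl`."""
    return EMAIL_PATTERN.sub(repl, text)

def normalize_whitespace(text):
    """Map the listed Unicode whitespace characters to ASCII space (U+0020)."""
    return text.translate(SPACE_TABLE)

def remove_long_words(text, min_len=1, max_len=10):
    """Drop space-separated words whose length is outside [min_len, max_len]."""
    return " ".join(w for w in text.split(" ") if min_len <= len(w) <= max_len)
```

Each function takes a string and returns a string, so such mappers can be chained in any order without touching the rest of the sample.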

Filter

Operator Tags Description Source code Unit tests
alphanumeric_filter General Text en zh Keeps samples with alphanumeric ratio within the specified range code tests
audio_duration_filter Audio Keep data samples whose audios' durations are within a specified range code tests
audio_nmf_snr_filter Audio Keep data samples whose audios' Signal-to-Noise Ratios (SNRs, computed based on Non-Negative Matrix Factorization, NMF) are within a specified range code tests
audio_size_filter Audio Keep data samples whose audios' sizes are within a specified range code tests
average_line_length_filter Code Text en zh Keeps samples with average line length within the specified range code tests
character_repetition_filter General Text en zh Keeps samples with char-level n-gram repetition ratio within the specified range code tests
flagged_words_filter General Text en zh Keeps samples with flagged-word ratio below the specified threshold code tests
image_aesthetics_filter Image GPU Keeps samples containing images whose aesthetics scores are within the specified range code tests
image_aspect_ratio_filter Image Keeps samples containing images with aspect ratios within the specified range code tests
image_face_count_filter Image Keeps samples containing images with face counts within the specified range code tests
image_face_ratio_filter Image Keeps samples containing images with face area ratios within the specified range code tests
image_nsfw_filter Image GPU Keeps samples containing images with NSFW scores below the threshold code tests
image_pair_similarity_filter Image GPU Keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model code tests
image_shape_filter Image Keeps samples containing images with widths and heights within the specified range code tests
image_size_filter Image Keeps samples containing images whose size in bytes are within the specified range code tests
image_text_matching_filter Multimodal GPU Keeps samples with image-text classification matching score within the specified range based on a BLIP model code tests
image_text_similarity_filter Multimodal GPU Keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model code tests
image_watermark_filter Image GPU Keeps samples containing images with predicted watermark probabilities below the threshold code tests
language_id_score_filter General Text en zh Keeps samples of the specified language, judged by a predicted confidence score code tests
maximum_line_length_filter Code Text en zh Keeps samples with maximum line length within the specified range code tests
perplexity_filter General Text en zh Keeps samples with perplexity score below the specified threshold code tests
phrase_grounding_recall_filter Multimodal GPU Keeps samples whose locating recalls of phrases extracted from text in the images are within a specified range code tests
special_characters_filter General Text en zh Keeps samples with special-char ratio within the specified range code tests
specified_field_filter General Text en zh Keeps samples whose specified field has a value among the specified targets code tests
specified_numeric_field_filter General Text en zh Keeps samples whose specified numeric field has a value within the specified range code tests
stopwords_filter General Text en zh Keeps samples with stopword ratio above the specified threshold code tests
suffix_filter General en zh Keeps samples with specified suffixes code tests
text_action_filter General Text en zh Keeps samples containing action verbs in their texts code tests
text_entity_dependency_filter General Text en zh Keeps samples containing dependency edges for an entity in the dependency tree of the texts code tests
text_length_filter General Text en zh Keeps samples with total text length within the specified range code tests
token_num_filter General Text en zh GPU Keeps samples with token count within the specified range code tests
video_aesthetics_filter Video GPU Keeps samples whose specified frames have aesthetics scores within the specified range code tests
video_aspect_ratio_filter Video Keeps samples containing videos with aspect ratios within the specified range code tests
video_duration_filter Video Keep data samples whose videos' durations are within a specified range code tests
video_frames_text_similarity_filter Multimodal GPU Keep data samples whose similarities between sampled video frame images and text are within a specific range code tests
video_motion_score_filter Video Keep samples with video motion scores within a specific range code tests
video_motion_score_raft_filter Video Keep samples with video motion scores (based on RAFT model) within a specific range code tests
video_nsfw_filter Video GPU Keeps samples containing videos with NSFW scores below the threshold code tests
video_ocr_area_ratio_filter Video GPU Keep data samples whose detected text area ratios for specified frames in the video are within a specified range code tests
video_resolution_filter Video Keeps samples containing videos with horizontal and vertical resolutions within the specified range code tests
video_watermark_filter Video GPU Keeps samples containing videos with predicted watermark probabilities below the threshold code tests
video_tagging_from_frames_filter Multimodal GPU Keep samples containing videos with given tags code tests
words_num_filter General Text en zh Keeps samples with word count within the specified range code tests
word_repetition_filter General Text en zh Keeps samples with word-level n-gram repetition ratio within the specified range code tests
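
Most filters in this table follow the same pattern: compute a statistic for each sample, then keep the sample if the statistic falls within a range. The sketch below illustrates that pattern for two statistics. The exact definitions used here (in particular, how the repetition ratio is computed) are assumptions for illustration and may differ from the Data-Juicer implementation.

```python
from collections import Counter

def alphanumeric_ratio(text):
    """Fraction of characters that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)

def word_ngram_repetition_ratio(text, n=2):
    """Fraction of word-level n-grams that occur more than once (assumed definition)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def in_range(value, min_value, max_value):
    """Keep decision shared by range-based filters."""
    return min_value <= value <= max_value

def filter_samples(samples, stat_fn, min_value, max_value):
    """Keep samples whose statistic on the "text" field lies in the range."""
    return [s for s in samples if in_range(stat_fn(s["text"]), min_value, max_value)]
```

For example, `filter_samples(samples, alphanumeric_ratio, 0.5, 1.0)` would keep only samples in which at least half of the characters are alphanumeric.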

Deduplicator

Operator Tags Description Source code Unit tests
document_deduplicator General Text en zh Deduplicates samples at document-level by comparing MD5 hash code tests
document_minhash_deduplicator General Text en zh Deduplicates samples at document-level using MinHashLSH code tests
document_simhash_deduplicator General Text en zh Deduplicates samples at document-level using SimHash code tests
image_deduplicator Image Deduplicates samples at document-level using exact matching of images between documents code tests
video_deduplicator Video Deduplicates samples at document-level using exact matching of videos between documents code tests
ray_document_deduplicator General Text en zh Deduplicates samples at document-level by comparing MD5 hash on ray code -
ray_image_deduplicator Image Deduplicates samples at document-level using exact matching of images between documents on ray code -
ray_video_deduplicator Video Deduplicates samples at document-level using exact matching of videos between documents on ray code -
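
Exact document-level deduplication by MD5 hash can be sketched with the standard library as follows. The function names, the `text_key` parameter, and the `lowercase` option are hypothetical and only illustrate the idea; the near-duplicate methods (MinHashLSH, SimHash) are considerably more involved and are not shown.

```python
import hashlib

def md5_of(text, lowercase=False):
    """Hex MD5 digest of the UTF-8 encoded text."""
    if lowercase:
        text = text.lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def dedup_by_md5(samples, text_key="text", lowercase=False):
    """Keep the first sample for each distinct MD5 digest of its text."""
    seen = set()
    kept = []
    for sample in samples:
        digest = md5_of(sample[text_key], lowercase)
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept
```

Storing digests rather than full texts keeps the memory cost per document fixed, regardless of document length.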

Selector

Operator Tags Description Source code Unit tests
frequency_specified_field_selector General en zh Selects top samples by comparing the frequency of the specified field code tests
random_selector General en zh Selects samples randomly code tests
range_specified_field_selector General en zh Selects samples within a specified range by comparing the values of the specified field code tests
topk_specified_field_selector General en zh Selects top samples by comparing the values of the specified field code tests
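
The two ranking-based selectors can be illustrated with a short sketch: one ranks samples by the value of a field, the other by how frequent each field value is in the dataset. The function names and parameters below are hypothetical and do not mirror the Data-Juicer signatures.

```python
from collections import Counter

def topk_by_field(samples, field, k, reverse=True):
    """Keep the k samples with the largest (or smallest) value of `field`."""
    return sorted(samples, key=lambda s: s[field], reverse=reverse)[:k]

def select_by_field_frequency(samples, field, top_values):
    """Keep samples whose `field` value is among the `top_values` most frequent values."""
    counts = Counter(s[field] for s in samples)
    frequent = {value for value, _ in counts.most_common(top_values)}
    return [s for s in samples if s[field] in frequent]
```

Unlike filters, both functions need the whole dataset before they can decide about any single sample.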

Grouper

Operator Tags Description Source code Unit tests
key_value_grouper General Text en zh Groups samples into batched samples according to the values of the given keys. code tests
naive_grouper General Text en zh Groups all samples into one batched sample. code tests
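
A minimal sketch of the two grouping strategies is shown below. It assumes that a "batched sample" is a dict mapping each field to the list of values from the grouped samples; that representation and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def to_batched_sample(samples):
    """Turn a non-empty list of dict samples into one dict of lists."""
    keys = samples[0].keys()
    return {k: [s[k] for s in samples] for k in keys}

def naive_group(samples):
    """Group all samples into a single batched sample."""
    return [to_batched_sample(samples)] if samples else []

def key_value_group(samples, group_key):
    """Group samples that share the same value of `group_key`."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[s[group_key]].append(s)
    return [to_batched_sample(group) for group in buckets.values()]
```

Groups come out in order of first appearance, because Python dicts preserve insertion order.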

Aggregator

Operator Tags Description Source code Unit tests
entity_attribute_aggregator General Text en zh Returns a conclusion about the given entity's attribute, drawn from several documents. code tests
most_relavant_entities_aggregator General Text en zh Extract entities closely related to a given entity from some texts, and sort them in descending order of importance. code tests
nested_aggregator General Text en zh Aggregates contents in a nested manner, a given number of samples at a time, to stay within input length limits. code tests
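
The idea of nested aggregation can be sketched as a repeated reduce: aggregate a fixed number of texts at a time, then aggregate the intermediate results, until a single text remains. In the sketch below, `aggregate_fn` stands in for whatever summarizer is used (in practice probably a model call), and `join_summarizer` is a toy placeholder; both names, and `group_size`, are hypothetical.

```python
def nested_aggregate(texts, aggregate_fn, group_size=2):
    """Reduce `texts` to one text by aggregating `group_size` items at a time."""
    if group_size < 2:
        # With fewer than 2 items per group the list would never shrink.
        raise ValueError("group_size must be at least 2")
    if not texts:
        return ""
    level = list(texts)
    while len(level) > 1:
        level = [
            aggregate_fn(level[i:i + group_size])
            for i in range(0, len(level), group_size)
        ]
    return level[0]

def join_summarizer(chunk):
    """Toy aggregator that only shows the nesting structure."""
    return "(" + " + ".join(chunk) + ")"

result = nested_aggregate(["a", "b", "c"], join_summarizer, group_size=2)
# result == "((a + b) + (c))"
```

Because each call to `aggregate_fn` sees at most `group_size` items, the input passed to the summarizer stays bounded even when the number of samples is large.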

Contributing

We welcome contributions of new operators. Please refer to the How-to Guide for Developers.