Filter very slow #270

Open · hiennm15 opened this issue Aug 19, 2024 · 6 comments
hiennm15 commented Aug 19, 2024

I am using 4×H100 GPUs, 100 CPU cores, and 1000 GB of RAM to filter 1 TB of Japanese data. Although the GPUs are at 50% utilization and the CPUs are running at 100%, only 3 MB of data is processed per minute. I suspect the tokenizer might be the bottleneck.
What is actually causing the bottleneck, and is there a way to improve the filtering speed?

from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.filters import (
    C4BadWordsFilter,
    C4QualityFilter,
    # FineWebQualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
    # LanguageFilter,
    URLFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

# Read the raw JSONL dataset from disk
INPUT_READER = JsonlReader(
    data_folder="/home/altai/hiennm/data/pretrain",
    # recursive=True
)
TOTAL_TASKS = 4000   # number of shards the input is split into
NUM_WORKERS = 400    # number of tasks run concurrently on this machine
FILTERING_OUTPUT_PATH = "/home/altai/hiennm/data/remove_"

stage = LocalPipelineExecutor(
    pipeline=[
        INPUT_READER,
        # each filter writes its rejected documents to a separate folder
        GopherRepetitionFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/GopherRepetitionFilter")),
        C4QualityFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/C4QualityFilter")),
        # LanguageFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/LanguageFilter")),
        GopherQualityFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/GopherQualityFilter")),
        C4BadWordsFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/C4BadWordsFilter")),
        URLFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/URLFilter")),
        # FineWebQualityFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/FineWebQualityFilter")),
        JsonlWriter(output_folder="filter_/output"),  # documents that survive all filters
    ],
    tasks=TOTAL_TASKS,
    workers=NUM_WORKERS,
    logging_dir="filter_/log",
)

if __name__ == '__main__':
    stage.run()


@justHungryMan (Contributor)

I think you are using an English tokenizer to handle Japanese.
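Something like this, as a minimal sketch (assumes your datatrove version exposes Languages.japanese and that these filters take a language argument):

from datatrove.pipeline.filters import GopherQualityFilter, GopherRepetitionFilter
from datatrove.utils.typeshelper import Languages

# Pass the language explicitly so word/sentence tokenization uses a
# Japanese tokenizer instead of the English default.
GopherRepetitionFilter(language=Languages.japanese)
GopherQualityFilter(language=Languages.japanese)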


hiennm15 commented Aug 20, 2024

@justHungryMan No, I definitely switched the tokenizer to Japanese. I also adjusted the code accordingly for processing Japanese (for example, Japanese doesn't use spaces to separate words, etc.). I even set the default from en to ja to make sure it applies correctly to Japanese.

@pengwenzhi

How many CPU cores do you have in a single machine? LocalPipelineExecutor only supports one machine, so the speed depends on your core count.
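As a minimal sketch of sizing the worker count to the machine (the one-worker-per-core choice here is only a hypothetical starting point):

import os

# Running more workers than available cores mostly adds scheduling
# overhead; start from one worker per core and tune from there.
NUM_WORKERS = os.cpu_count() or 1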

guipenedo (Collaborator) commented Aug 28, 2024

Odd that your GPUs are at 50%: the pipeline you show shouldn't be using the GPUs at all.

To debug, you can pass limit= to the reader to process only a small amount of data, then check the stats.json file to see which block is taking the longest.
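Something like this, as a minimal sketch (the limit value is a hypothetical sample size):

from datatrove.pipeline.readers import JsonlReader

# Cap the number of documents read per task so the run finishes quickly;
# afterwards, inspect the per-block timings in logging_dir's stats.json.
DEBUG_READER = JsonlReader(
    data_folder="/home/altai/hiennm/data/pretrain",
    limit=1000,
)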

@NazimHAli

> Odd that your GPUs are at 50%: the pipeline you show shouldn't be using the GPUs at all.
>
> To debug, you can pass limit= to the reader to process only a small amount of data, then check the stats.json file to see which block is taking the longest.

Dang, I was about to ask how he used a GPU. It would be nice to incorporate CUDA libraries for filtering.

@BramVanroy (Contributor)

Since you are writing every component's exclusions to disk as well as reading your dataset from disk, you might be IO-bottlenecked. It's worth checking whether you get an improvement without the exclusion writers.
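A minimal sketch of the same pipeline with the exclusion writers dropped (rejected documents are then simply discarded instead of being written to disk):

from datatrove.pipeline.filters import (
    C4BadWordsFilter,
    C4QualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
    URLFilter,
)
from datatrove.pipeline.readers import JsonlReader

# Without exclusion_writer, each filter drops rejected documents without
# writing them out, removing five extra JSONL output streams per task.
pipeline = [
    JsonlReader(data_folder="/home/altai/hiennm/data/pretrain"),
    GopherRepetitionFilter(),
    C4QualityFilter(),
    GopherQualityFilter(),
    C4BadWordsFilter(),
    URLFilter(),
]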
