Filter very slow #270

Open · hiennm15 opened this issue Aug 19, 2024 · 6 comments
hiennm15 commented Aug 19, 2024

I am using 4×H100 GPUs, 100 CPU cores, and 1000 GB of RAM to filter 1 TB of Japanese data. Although the GPUs are at 50% utilization and the CPUs are running at 100%, only 3 MB of data is processed per minute. I suspect the tokenizer might be the bottleneck.
What is actually causing the bottleneck, and is there a way to improve the filtering speed?

from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.filters import (
    C4BadWordsFilter,
    C4QualityFilter,
    # FineWebQualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
    # LanguageFilter,
    URLFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

# Read the raw JSONL dataset from disk
INPUT_READER = JsonlReader(
    data_folder="/home/altai/hiennm/data/pretrain",
    # recursive=True
)
TOTAL_TASKS = 4000   # number of shards the input is split into
NUM_WORKERS = 400    # number of tasks run concurrently on this machine
FILTERING_OUTPUT_PATH = "/home/altai/hiennm/data/remove_"

stage = LocalPipelineExecutor(
    pipeline=[
        INPUT_READER,
        # each filter writes its rejected documents to a separate folder
        GopherRepetitionFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/GopherRepetitionFilter")),
        C4QualityFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/C4QualityFilter")),
        # LanguageFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/LanguageFilter")),
        GopherQualityFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/GopherQualityFilter")),
        C4BadWordsFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/C4BadWordsFilter")),
        URLFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/URLFilter")),
        # FineWebQualityFilter(exclusion_writer=JsonlWriter(f"{FILTERING_OUTPUT_PATH}/removed/FineWebQualityFilter")),
        JsonlWriter(output_folder="filter_/output"),  # documents that survive all filters
    ],
    tasks=TOTAL_TASKS,
    workers=NUM_WORKERS,
    logging_dir="filter_/log",
)

if __name__ == '__main__':
    stage.run()


@justHungryMan (Contributor)

I think you are using an English tokenizer to handle Japanese.
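Something like this, as a minimal sketch (assumes your datatrove version exposes Languages.japanese and that these filters take a language argument):

from datatrove.pipeline.filters import GopherQualityFilter, GopherRepetitionFilter
from datatrove.utils.typeshelper import Languages

# Pass the language explicitly so word/sentence tokenization uses a
# Japanese tokenizer instead of the English default.
GopherRepetitionFilter(language=Languages.japanese)
GopherQualityFilter(language=Languages.japanese)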


hiennm15 commented Aug 20, 2024

@justHungryMan No, I definitely switched the tokenizer to Japanese. I also adjusted the code accordingly for processing Japanese (for example, Japanese doesn't use spaces to separate words, etc.). I even set the default from en to ja to make sure it applies correctly to Japanese.

@pengwenzhi

How many CPU cores do you have in a single machine? LocalPipelineExecutor only supports one machine, so the speed depends on your core count.
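As a minimal sketch of sizing the worker count to the machine (the one-worker-per-core choice here is only a hypothetical starting point):

import os

# Running more workers than available cores mostly adds scheduling
# overhead; start from one worker per core and tune from there.
NUM_WORKERS = os.cpu_count() or 1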

guipenedo (Collaborator) commented Aug 28, 2024

Odd that your GPUs are at 50%: the pipeline you show shouldn't be using the GPUs at all.

To debug, you can pass limit= to the reader to process only a small amount of data, then check the stats.json file to see which block is taking the longest.
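Something like this, as a minimal sketch (the limit value is a hypothetical sample size):

from datatrove.pipeline.readers import JsonlReader

# Cap the number of documents read per task so the run finishes quickly;
# afterwards, inspect the per-block timings in logging_dir's stats.json.
DEBUG_READER = JsonlReader(
    data_folder="/home/altai/hiennm/data/pretrain",
    limit=1000,
)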

@NazimHAli

> Odd that your GPUs are at 50%: the pipeline you show shouldn't be using the GPUs at all.
>
> To debug, you can pass limit= to the reader to process only a small amount of data, then check the stats.json file to see which block is taking the longest.

Dang, I was about to ask how he used a GPU. It would be nice to incorporate CUDA libraries for filtering.

@BramVanroy (Contributor)

Since you are writing every component's exclusions to disk as well as reading your dataset from disk, you might be IO-bottlenecked. It's worth checking whether you get an improvement without the exclusion writers.
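A minimal sketch of the same pipeline with the exclusion writers dropped (rejected documents are then simply discarded instead of being written to disk):

from datatrove.pipeline.filters import (
    C4BadWordsFilter,
    C4QualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
    URLFilter,
)
from datatrove.pipeline.readers import JsonlReader

# Without exclusion_writer, each filter drops rejected documents without
# writing them out, removing five extra JSONL output streams per task.
pipeline = [
    JsonlReader(data_folder="/home/altai/hiennm/data/pretrain"),
    GopherRepetitionFilter(),
    C4QualityFilter(),
    GopherQualityFilter(),
    C4BadWordsFilter(),
    URLFilter(),
]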
