Filter very slow #270
Comments
I think you are using an English tokenizer to handle Japanese.
@justHungryMan No, I definitely switched the tokenizer to Japanese, and I adjusted the code accordingly for processing Japanese (for example, Japanese doesn't use spaces to separate words). I even set the default language from en to ja to make sure it applies correctly.
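The point about Japanese not using spaces can be seen directly: naive whitespace splitting, which an English-style tokenizer effectively relies on, treats an entire Japanese sentence as a single token, so word-count-based filters behave very differently. A minimal illustration in plain Python (not datatrove's tokenizer):

```python
# Japanese has no spaces between words, so whitespace splitting
# (the basis of a naive English-style tokenizer) returns one giant
# "token" for the whole sentence.
english = "this is a short sentence"
japanese = "これは短い文です"  # roughly: "this is a short sentence"

print(len(english.split()))   # 5 tokens
print(len(japanese.split()))  # 1 "token": whitespace splitting fails

# Filters tuned to English token counts (word frequency, repetition,
# length thresholds) therefore need a language-aware tokenizer, such
# as a morphological analyzer, to work correctly on Japanese.
```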
How many CPU cores do you have on one machine? LocalPipelineExecutor only supports a single machine, so the speed depends on your core count.
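The single-machine constraint above means throughput is bounded by how many local worker processes you run. A minimal sketch of that parallelism model using only the standard library (the filter function here is a hypothetical stand-in, not datatrove's API):

```python
from multiprocessing import Pool
import os

def filter_shard(docs):
    """Hypothetical stand-in for one filter task: keep docs longer than 10 chars."""
    return [d for d in docs if len(d) > 10]

if __name__ == "__main__":
    # Eight shards of input data, as LocalPipelineExecutor would split tasks.
    shards = [["short", "a long enough document"] for _ in range(8)]
    # Like the executor's workers, a Pool fans shards out over local
    # processes; more cores means more shards processed in parallel.
    with Pool(processes=min(4, os.cpu_count() or 1)) as pool:
        results = pool.map(filter_shard, shards)
    kept = sum(len(r) for r in results)
    print(kept)  # 8: one surviving doc per shard
```

If the CPUs are already pegged at 100%, adding workers won't help; the per-document work (e.g. tokenization) itself has to get cheaper.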
Odd that your GPUs are at 50%; the pipeline you show shouldn't be using the GPUs at all. To debug you can introduce a
Dang, I was about to ask how he used a GPU. It would be nice to incorporate CUDA libraries into the filters.
Since you are writing all exclusions of every component to disk, as well as reading your dataset from disk, you might be IO-bottlenecked. It's worth checking whether you get improvements without the exclusion writers.
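One way to test that suggestion is to run the same filter once while persisting rejected documents to disk and once while simply dropping them; if the second run is much faster, disk IO from the exclusion writers is the culprit. A self-contained sketch in plain Python (the function below is an illustrative stand-in, not datatrove's filter interface):

```python
import json
import os
import tempfile

def run_filter(docs, exclusions_path=None):
    """Keep docs containing 'keep'; optionally write each rejected doc to disk,
    mimicking the extra IO an exclusion writer adds per rejected document."""
    kept, written = [], 0
    excl = open(exclusions_path, "w", encoding="utf-8") if exclusions_path else None
    try:
        for doc in docs:
            if "keep" in doc:
                kept.append(doc)
            elif excl is not None:
                excl.write(json.dumps({"text": doc}) + "\n")  # per-doc disk write
                written += 1
    finally:
        if excl is not None:
            excl.close()
    return kept, written

docs = ["keep me", "drop me", "keep this too", "drop this"]
path = os.path.join(tempfile.mkdtemp(), "excluded.jsonl")
kept_with_io, n_written = run_filter(docs, exclusions_path=path)
kept_no_io, _ = run_filter(docs)  # identical survivors, zero writes
print(len(kept_with_io), n_written)  # 2 2
```

Timing the two variants on a realistic shard (e.g. with `time.perf_counter()`) makes the IO cost visible without changing the surviving data.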
I am using 4x H100 GPUs, 100 CPU cores, and 1000 RAM to filter 1 TB of Japanese data. Although the GPUs are at 50% utilization and the CPUs are running at 100%, only 3 MB of data is processed per minute. I suspect that the tokenizer might be the bottleneck.
What is actually causing the bottleneck, and is there a way to improve the filter speed?
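To confirm whether the tokenizer (rather than IO or anything else) dominates, profiling a single worker's run is a quick check. A minimal sketch with the standard-library profiler, using a dummy workload in place of the real pipeline:

```python
import cProfile
import io
import pstats

def tokenize(text):
    # Dummy stand-in for the suspected-slow tokenizer.
    return list(text)

def pipeline(docs):
    # Dummy stand-in for one filter task over a shard of documents.
    return [tokenize(d) for d in docs]

profiler = cProfile.Profile()
profiler.enable()
pipeline(["日本語テキスト"] * 1000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
# If tokenization is the bottleneck, it dominates cumulative time here;
# if file reads/writes dominate instead, the issue is IO, not the tokenizer.
print("tokenize" in report)  # True
```

Running this against one real shard of the actual pipeline would show directly whether tokenizer calls or disk operations account for the slow 3 MB/min throughput.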