Unexpected performance degradation behavior in minhash deduplication stage 2 #298
Comments
Hi, thank you for the benchmarks.
For T tasks in stage 1, each processing N/T documents, processing in stage 2 would be
In your example you kept T constant, so we should expect linear scaling in K. I'm not sure why this was not the case, but one possible reason is filesystem issues: each task in stage 2 opens T files, and each of the T files in each bucket is opened by all the tasks assigned to that bucket.
Are you able to run some sort of filesystem/file-access benchmark for the three 8TB configurations?
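As a rough starting point for such a file-access check, here is a minimal sketch in plain Python. It does not use DataTrove at all. It only times how long it takes to open each file in a list and read a small prefix, which is roughly the access pattern described above (one task opening T files). The helper names (`time_file_opens`, `summarize`, `make_dummy_files`) and the 4096-byte prefix are illustrative assumptions. On a real cluster you would point it at the stage 1 output files on the shared filesystem instead of dummy files in a temporary directory.

```python
import os
import tempfile
import time


def time_file_opens(paths, read_bytes=4096):
    """Open each path, read a small prefix, and return per-file latencies (seconds)."""
    latencies = []
    for path in paths:
        start = time.perf_counter()
        with open(path, "rb") as f:
            f.read(read_bytes)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return count, total, mean, and max of the measured latencies."""
    return {
        "count": len(latencies),
        "total_s": sum(latencies),
        "mean_s": sum(latencies) / len(latencies),
        "max_s": max(latencies),
    }


def make_dummy_files(directory, num_files, size_bytes=1024):
    """Create small random files so the sketch can run without real data (hypothetical helper)."""
    paths = []
    for i in range(num_files):
        path = os.path.join(directory, f"{i:05d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size_bytes))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Local dry run on dummy files; replace `paths` with real files to test a shared filesystem.
    with tempfile.TemporaryDirectory() as tmp:
        paths = make_dummy_files(tmp, num_files=100)
        print(summarize(time_file_opens(paths)))
```

Running this from several nodes at once against the same set of files, and comparing the mean and maximum latencies across the three 8TB configurations, would give a first indication of whether concurrent opens are a bottleneck.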
Thanks for your responses here and in the other non-public channels.
Assuming that
I mentioned offline that it's possible, but we'd rather review our usage of DataTrove and rule out any user errors before spending the budget on redoing the benchmarks.
No. The reported values were averages.
Here are the maximum runtime values instead (full logs shared with you offline).
I've been running some large-scale benchmarks of minhash deduplication on SLURM clusters, loosely following this example.
The benchmarks consist of running stages 1 and 2 with the following configurations:
What I'm observing is that stage 1 seems to scale fairly linearly between these configs. I have the following timing values in the final stats file (all values are in minutes):
However, for stage 2, the scaling becomes quite different, especially when running the 8TB configuration:
As reported above, some of the 8TB configs for stage 2 (boldfaced) are taking an unexpectedly long time to run. I repeated these experiments several times, and the results appear consistent.
I was wondering whether this behavior is expected. If so, what could be a possible explanation?
Let me know if I can provide further information.