Unexpected performance degradation behavior in minhash deduplication stage 2 #298

Maghoumi · 2024-10-17T00:44:57Z

I've been running some large-scale benchmarking with minhash deduplication on SLURM clusters, loosely following this example

The benchmarks consist of running stages 1 and 2 with the following configurations:

Two input datasets: 4TB and 8TB
Number of tasks for stage 1: 1200 and 2400
Number of tasks for stage 2: 300 and 600 (also tried 1200 once, see below)

What I'm observing is that stage 1 seems to scale fairly linearly between these configs. I have the following timing values in the final stats file (all values are in minutes):

Stage 1 tasks vs. Dataset size	4TB	8TB
1200 tasks	71 minutes	123 minutes
2400 tasks	36 minutes	61 minutes

However, for stage 2, the scaling becomes quite different, especially when running the 8TB configuration:

Stage 2 tasks vs. Dataset size	4TB (stage 1 done with 1200 tasks)	8TB (stage 1 done with 2400 tasks)
300 tasks	25 minutes	38 minutes
600 tasks	17 minutes	> 54 minutes <
1200 tasks	-	> 50 minutes <

As reported above, the some 8TB configs for stage 2 (boldfaced) is taking an unexpectedly long time to run. I repeated these experiments several times, and the results appear consistent.

I was wondering if this behavior is expected? If so, what could be a possible explanation?
Let me know if I can provide further information.

The text was updated successfully, but these errors were encountered:

guipenedo · 2024-10-28T16:15:19Z

Hi, thank you for the benchmarks.
Stage 1 is indeed fully parallelizeable hence the linear scaling.
For stage2, performance also depends on the number of files/tasks from step1:

we use a priority queue onto which we load the smallest hash+document combo from each file
as we process each document, we add the next hash from the file where this document was in

for T tasks in stage 1, each processing N/T documents, processing in stage 2 would be O(N*log(T)). As you parallelize stage 2 into K tasks you should in theory get O(N*log(T)/K), as each task should process approximately the same number of hashes. The main factor not captured by the complexity here on T is that with an increase of T you will also have to go through a lot more files, and we actually only read a few lines at a time for each one to not have memory explode.

On your example you have kept T constant so we should expect linear scaling on K. Not sure why this was not the case but one possible reason could be filesystem issues, each task in step2 will open T files. And each of the T files in each bucket will be opened by all the tasks assigned to this bucket (K/nb of buckets tasks).

Are you able to run some sort of filesystem/file access benchmark for the 3 8TB configurations?
Also, I assume the time in your benchmarks is the max out of each individual task, is this the case? Could you also provide the average per task (or the max if the values on your table are actually the average)

Maghoumi · 2024-11-20T19:08:38Z

Thanks for your responses here and in the other non-public channels.

The main factor not captured by the complexity here on T is that with an increase of T you will also have to go through a lot more files, and we actually only read a few lines at a time for each one to not have memory explode.

Assuming that T goes up, wouldn't that imply each intermediate file would be smaller, and therefore reading "a few lines" wouldn't be much of an issue, since we can load a bunch into the memory? In our case, we had enough memory to ensure it wouldn't be an issue. What I'm trying to understand is, what would be the hypothetical threshold of running into I/O thrashing in this case, since that seems to be your primary hypothesis here.
Also, I was under the impression that each "chunk" of stage 1 would be processed independently and there is no shared work between different chunks. Can you confirm my understanding is correct?

Are you able to run some sort of filesystem/file access benchmark for the 3 8TB configurations?

I mentioned offline that it's possible, but we'd rather review our usage of DataTrove and rule out any user errors before spending the budget on redoing the benchmarks.

Also, I assume the time in your benchmarks is the max out of each individual task, is this the case?

No. The reported values were averages.

Could you also provide the average per task (or the max if the values on your table are actually the average)

Here are the maximum runtime values instead (full logs shared with you offline).

Stage 2 tasks vs. Dataset size	8TB (stage 1 done with 2400 tasks)
300 tasks	45 minutes
600 tasks	93 minutes
1200 tasks	125 minutes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected performance degradation behavior in minhash deduplication stage 2 #298

Unexpected performance degradation behavior in minhash deduplication stage 2 #298

Maghoumi commented Oct 17, 2024 •

edited

Loading

guipenedo commented Oct 28, 2024 •

edited

Loading

Maghoumi commented Nov 20, 2024

Unexpected performance degradation behavior in minhash deduplication stage 2 #298

Unexpected performance degradation behavior in minhash deduplication stage 2 #298

Comments

Maghoumi commented Oct 17, 2024 • edited Loading

guipenedo commented Oct 28, 2024 • edited Loading

Maghoumi commented Nov 20, 2024

Maghoumi commented Oct 17, 2024 •

edited

Loading

guipenedo commented Oct 28, 2024 •

edited

Loading