Feat: Multi-GPU Evaluation #3611
base: master
Conversation
Force-pushed from da4e8f0 to 7734bdd
Looks like the checks are hitting an unrelated type error.
@MattGPT-ai this is due to a new mypy version that affected a deprecated class. I just fixed it in #3613. If you update this branch to current master, the error should disappear.
…gather functions to distributed utils. This works by using a DistributedSampler to allocate samples across GPU processes, then aggregating the predictions and losses from all processes before running the evaluation. Broadcast is used to ensure all processes return the same valid result.
Force-pushed from 7734bdd to c0f7e7d
Awesome, that worked, checks passed!
We have been using this change successfully in our fork for about a month now; it's been a major speed improvement, especially when evaluation sets are large!
Flair now supports multi-GPU training, but not multi-GPU evaluation. This means that the work of n-1 GPUs is wasted during evaluation, which can dramatically reduce the benefit of multi-GPU training if your eval set is considerable in size. Even worse, I believe it can be slower than single-GPU evaluation, since the CPU portions of the evaluation code are repeated n times while sharing the same CPU and memory resources.
This PR implements multi-GPU acceleration for `evaluate` in the `Classifier`, `TextRegressor`, and `TextPairRegressor` model types. It uses the `DistributedSampler` to split the eval set between the GPUs; predictions are run on each GPU, and the results of inference are aggregated across processes before the metrics are calculated in the main process and returned.
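As a rough sketch of that flow (not the PR's actual code; `model.predict` and `compute_metrics` are placeholders standing in for Flair's prediction and metric logic, not real Flair APIs):

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def evaluate_distributed(model, dataset, compute_metrics, batch_size=32):
    # Give each process a disjoint shard of the eval set. Note that
    # DistributedSampler pads by repeating samples when the dataset size
    # is not divisible by the world size, so exact metrics require
    # trimming or deduplicating those extras.
    sampler = DistributedSampler(dataset, shuffle=False)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    # Run inference only on this process's shard.
    local_predictions = []
    for batch in loader:
        local_predictions.extend(model.predict(batch))

    # Aggregate every process's predictions before scoring.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_predictions)
    all_predictions = [p for shard in gathered for p in shard]

    # Calculate metrics once, in the main process, then broadcast the
    # result so every process returns the same value.
    metrics = compute_metrics(all_predictions) if dist.get_rank() == 0 else None
    container = [metrics]
    dist.broadcast_object_list(container, src=0)
    return container[0]
```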