
Errors when training with multiple gpus #38

Open
mmuguang opened this issue Apr 7, 2022 · 17 comments

mmuguang commented Apr 7, 2022

It worked well when training with a single GPU,
but an error about the batch sampler occurs when using multiple GPUs.
(screenshots of the traceback)

mmuguang commented Apr 7, 2022

AttributeError: 'ConcatDatasetBatchSampler' object has no attribute 'batch_size'
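This AttributeError is raised while Lightning rebuilds the DataLoader for DDP and looks for a batch_size attribute on the custom batch sampler. A hedged, untested sketch of one possible workaround on the pytorch-lightning 1.x releases current at the time: turn off automatic sampler replacement and handle the per-rank data sharding inside the custom sampler yourself.

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,
    strategy="ddp",              # or accelerator="ddp" on older 1.x releases
    replace_sampler_ddp=False,   # leave ConcatDatasetBatchSampler untouched;
                                 # each rank must then shard the data itself
)
```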

turpaultn (Collaborator) commented

Hi there,
What happens if you don't use the batch_sampler? (Of course the results won't be satisfactory; this is just to get more information about the issue.)

turpaultn assigned popcornell and unassigned himself on Apr 8, 2022
popcornell (Collaborator) commented

This is due to Lightning defaulting to DistributedDataParallel (DDP); some workarounds are needed to make the custom sampler work with DDP.
Can you try plain DataParallel? It works on my side with DataParallel.

Set backend: dp in the YAML file.
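For context, a minimal sketch of what the dp backend selects in the Lightning Trainer (argument names follow the pytorch-lightning 1.x releases current at the time and are an assumption, not the baseline's exact code):

```python
import pytorch_lightning as pl

# "dp" keeps data loading in a single process, so the custom
# ConcatDatasetBatchSampler is used as-is and only each batch is split
# across the GPUs. Newer Lightning releases spell this strategy="dp".
trainer = pl.Trainer(gpus=2, accelerator="dp")
```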

mmuguang commented Apr 8, 2022

Thanks @popcornell!
It works after setting backend: dp.

mmuguang commented Apr 8, 2022

There is another problem during validation when using 2 GPUs.
(screenshots of the traceback)

popcornell (Collaborator) commented

It seems that Lightning does not split the filenames list (it does, however, split the torch tensors between the GPUs).
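A minimal illustration (not taken from the baseline) of the underlying torch DataParallel scatter behaviour: tensors are split along dim 0, while a plain Python list of strings is replicated to every replica, so the audio chunk and the filename list no longer have matching lengths. Running it requires two visible CUDA devices.

```python
import torch
from torch.nn.parallel.scatter_gather import scatter

# a toy batch: a waveform tensor plus the corresponding filenames
batch = (torch.randn(4, 16000), ["a.wav", "b.wav", "c.wav", "d.wav"])
chunks = scatter(batch, target_gpus=[0, 1])  # needs 2 CUDA devices

for gpu_id, (audio, filenames) in enumerate(chunks):
    # audio is split to shape (2, 16000); the filename list still has 4 entries
    print(gpu_id, audio.shape, len(filenames))
```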

popcornell (Collaborator) commented

It might be a bug in Lightning; I don't know how to fix it easily.

popcornell (Collaborator) commented

Lightning-AI/pytorch-lightning#1508: it seems we have to rewrite the collate_fn, then.
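One hedged sketch of what rewriting the collate_fn could look like (the per-item structure below is an assumption, not the baseline's): encode each path as a padded uint8 tensor so DataParallel can scatter it along dim 0, then decode the rows afterwards.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_with_encoded_paths(batch):
    # assumes each item is (audio_tensor, labels_tensor, path_string)
    audio, labels, paths = zip(*batch)
    encoded_paths = pad_sequence(
        [torch.tensor(list(p.encode("utf-8")), dtype=torch.uint8) for p in paths],
        batch_first=True,  # zero-padded to the longest path in the batch
    )
    return torch.stack(audio), torch.stack(labels), encoded_paths

def decode_paths(encoded_paths):
    # strip the zero padding and turn the bytes back into strings
    return [bytes(row[row != 0].tolist()).decode("utf-8") for row in encoded_paths]
```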

popcornell (Collaborator) commented

@mmuguang an easy fix is to use batch_size = 1 for validation, but then you would probably want to run evaluation only every X epochs.

popcornell (Collaborator) commented

It is not easy to fix this with Lightning. I tried using the SpeechBrain dataio.batch.PaddedBatch collate_fn, but it did not work with DP and Lightning.
Also, we can't encode the paths as strings because they have different lengths; using a tokenizer and then decoding them seems too complicated IMO.
I'd say we go with the batch-size-1 approach for DP and issue a warning.

popcornell (Collaborator) commented

@turpaultn, opinions on this?

mmuguang (Author) commented

@popcornell I have given up on Lightning and rewritten the baseline in plain PyTorch. It can split the filenames and works well with DP. The old version of Lightning is very difficult to use.

mmuguang (Author) commented

Lightning also has problems during training with DP. When using 2 GPUs, the loss on the second GPU becomes NaN. Perhaps that GPU receives only unlabeled audio, so the supervised loss is undefined. The final loss is then also NaN and the model cannot be trained normally.
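A hedged sketch (illustrative names, not the baseline's code) of how the supervised term could be guarded when one DP shard happens to contain no labelled clips:

```python
import torch

def supervised_loss(criterion, preds, targets, labelled_mask):
    """Compute the supervised loss only over labelled examples in this shard."""
    if labelled_mask.any():
        return criterion(preds[labelled_mask], targets[labelled_mask])
    # no labelled examples on this GPU: contribute zero instead of NaN,
    # but stay connected to the graph so backward() is still well defined
    return preds.sum() * 0.0
```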

popcornell (Collaborator) commented

I think that is expected, since the batch is divided among the GPUs.
Yeah, unfortunately multi-GPU is broken at the moment.
We probably do need to drop Lightning from the baseline code, but maybe we can do that for next year's baseline.

Moadab-AI commented

Rather than ditching Lightning altogether, I think a much simpler solution would be to index all audio files in the dataset and return the corresponding ID/index in all __getitem__() methods of the datasets instead of the filepath strings. This way everything in the batch would be a tensor and there would be no problem with multi-GPU runs.
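A minimal sketch of that suggestion (class and attribute names are illustrative, not taken from the baseline): __getitem__ returns the index as a tensor, and the path list stays on the dataset so indices can be mapped back after gathering.

```python
import torch
import torchaudio
from torch.utils.data import Dataset

class IndexedAudioDataset(Dataset):
    """Return (waveform, index) so every field of the batch is a tensor."""

    def __init__(self, filepaths):
        self.filepaths = list(filepaths)

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        waveform, _ = torchaudio.load(self.filepaths[idx])
        # the index is a tensor, so DataParallel splits it together with the audio
        return waveform, torch.tensor(idx, dtype=torch.long)

# after gathering outputs, indices map back to paths:
# filenames = [dataset.filepaths[i] for i in batch_indices.tolist()]
```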

popcornell (Collaborator) commented

Thanks @Moadab-AI for the help.

I've thought about that, and there could be a problem because we use ConcatDataset. I am sure it can be made to work, but I think it will be hacky: we would have to propagate the original individual datasets to the pl.LightningModule.
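For reference, a hedged sketch of the index-to-path mapping this would need (cumulative_sizes and datasets are standard torch.utils.data.ConcatDataset attributes; filepaths refers to the illustrative dataset sketched above):

```python
import bisect
from torch.utils.data import ConcatDataset

def global_index_to_path(concat: ConcatDataset, global_idx: int) -> str:
    # same bisect logic ConcatDataset itself uses in __getitem__
    dataset_idx = bisect.bisect_right(concat.cumulative_sizes, global_idx)
    if dataset_idx == 0:
        local_idx = global_idx
    else:
        local_idx = global_idx - concat.cumulative_sizes[dataset_idx - 1]
    return concat.datasets[dataset_idx].filepaths[local_idx]
```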

Also, IMO the code is maybe already quite hacky and not very readable for newcomers, especially those who have never had hands-on experience with Lightning.

What are your thoughts on this? I would like to hear your feedback.

turpaultn (Collaborator) commented

My opinion on this:

  • We have small models; running them on multiple GPUs does not seem to make much sense anyway when you know the training time of one model 🤔
    If you have multiple GPUs, you can run multiple experiments in parallel.

If there is a need to run on multiple GPUs for whatever reason:

  • A new, bigger dataset
  • A more complex approach
  • Something else...

Then I would say let's create a new recipe, even with "hacks", for more advanced users 😁

What do you think?
