
Errors when training with multiple gpus #38

Open
mmuguang opened this issue Apr 7, 2022 · 17 comments

mmuguang commented Apr 7, 2022

It worked well when training with a single GPU,
but an error about the batch sampler occurs when using multiple GPUs.
(screenshots of the traceback)

mmuguang commented Apr 7, 2022

AttributeError: 'ConcatDatasetBatchSampler' object has no attribute 'batch_size'
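This AttributeError is raised while Lightning rebuilds the DataLoader for DDP and looks for a batch_size attribute on the custom batch sampler. A hedged, untested sketch of one possible workaround on the pytorch-lightning 1.x releases current at the time: turn off automatic sampler replacement and handle the per-rank data sharding inside the custom sampler yourself.

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,
    strategy="ddp",              # or accelerator="ddp" on older 1.x releases
    replace_sampler_ddp=False,   # leave ConcatDatasetBatchSampler untouched;
                                 # each rank must then shard the data itself
)
```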

turpaultn (Collaborator) commented

Hi there,
What happens if you don't use the batch_sampler? (Of course the results won't be satisfactory; this is just to get more information about the issue.)

turpaultn assigned popcornell and unassigned himself on Apr 8, 2022
popcornell (Collaborator) commented

This is due to Lightning defaulting to DistributedDataParallel (DDP); some workarounds are needed to make the custom sampler work with DDP.
Can you try plain DataParallel? It works on my side with DataParallel.

Set backend: dp in the YAML file.
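For context, a minimal sketch of what the dp backend selects in the Lightning Trainer (argument names follow the pytorch-lightning 1.x releases current at the time and are an assumption, not the baseline's exact code):

```python
import pytorch_lightning as pl

# "dp" keeps data loading in a single process, so the custom
# ConcatDatasetBatchSampler is used as-is and only each batch is split
# across the GPUs. Newer Lightning releases spell this strategy="dp".
trainer = pl.Trainer(gpus=2, accelerator="dp")
```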

mmuguang commented Apr 8, 2022

Thanks @popcornell!
It works after setting backend: dp.

mmuguang commented Apr 8, 2022

There is another problem during validation when using 2 GPUs.
(screenshots of the traceback)

popcornell (Collaborator) commented

It seems that Lightning does not split the filenames list (it does, however, split the torch tensors between the GPUs).
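A minimal illustration (not taken from the baseline) of the underlying torch DataParallel scatter behaviour: tensors are split along dim 0, while a plain Python list of strings is replicated to every replica, so the audio chunk and the filename list no longer have matching lengths. Running it requires two visible CUDA devices.

```python
import torch
from torch.nn.parallel.scatter_gather import scatter

# a toy batch: a waveform tensor plus the corresponding filenames
batch = (torch.randn(4, 16000), ["a.wav", "b.wav", "c.wav", "d.wav"])
chunks = scatter(batch, target_gpus=[0, 1])  # needs 2 CUDA devices

for gpu_id, (audio, filenames) in enumerate(chunks):
    # audio is split to shape (2, 16000); the filename list still has 4 entries
    print(gpu_id, audio.shape, len(filenames))
```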

popcornell (Collaborator) commented

It might be a bug in Lightning; I don't know how to fix it easily.

popcornell (Collaborator) commented

Lightning-AI/pytorch-lightning#1508: it seems we have to rewrite the collate_fn, then.
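One hedged sketch of what rewriting the collate_fn could look like (the per-item structure below is an assumption, not the baseline's): encode each path as a padded uint8 tensor so DataParallel can scatter it along dim 0, then decode the rows afterwards.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_with_encoded_paths(batch):
    # assumes each item is (audio_tensor, labels_tensor, path_string)
    audio, labels, paths = zip(*batch)
    encoded_paths = pad_sequence(
        [torch.tensor(list(p.encode("utf-8")), dtype=torch.uint8) for p in paths],
        batch_first=True,  # zero-padded to the longest path in the batch
    )
    return torch.stack(audio), torch.stack(labels), encoded_paths

def decode_paths(encoded_paths):
    # strip the zero padding and turn the bytes back into strings
    return [bytes(row[row != 0].tolist()).decode("utf-8") for row in encoded_paths]
```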

popcornell (Collaborator) commented

@mmuguang an easy fix is to use batch_size = 1 for validation, but then you would probably want to run evaluation only every X epochs.

popcornell (Collaborator) commented

It is not easy to fix this with Lightning. I tried using the SpeechBrain dataio.batch.PaddedBatch collate_fn, but it did not work with DP and Lightning.
Also, we can't encode the paths as strings because they have different lengths; using a tokenizer and then decoding them seems too complicated IMO.
I'd say we go with the batch-size-1 approach for DP and issue a warning.

popcornell (Collaborator) commented

@turpaultn, opinions on this?

mmuguang (Author) commented

@popcornell I have given up on Lightning and rewritten the baseline in plain PyTorch. It can split the filenames and works well with DP. The old version of Lightning is very difficult to use.

mmuguang (Author) commented

Lightning also has problems during training with DP. When using 2 GPUs, the loss on the second GPU becomes NaN. Perhaps that GPU receives only unlabeled audio, so the supervised loss is undefined. The final loss is then also NaN and the model cannot be trained normally.
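A hedged sketch (illustrative names, not the baseline's code) of how the supervised term could be guarded when one DP shard happens to contain no labelled clips:

```python
import torch

def supervised_loss(criterion, preds, targets, labelled_mask):
    """Compute the supervised loss only over labelled examples in this shard."""
    if labelled_mask.any():
        return criterion(preds[labelled_mask], targets[labelled_mask])
    # no labelled examples on this GPU: contribute zero instead of NaN,
    # but stay connected to the graph so backward() is still well defined
    return preds.sum() * 0.0
```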

popcornell (Collaborator) commented

I think that is expected, since the batch is divided among the GPUs.
Yeah, unfortunately multi-GPU is broken at the moment.
We probably do need to drop Lightning from the baseline code, but maybe we can do that for next year's baseline.

Moadab-AI commented

Rather than ditching Lightning altogether, I think a much simpler solution would be to index all audio files in the dataset and return the corresponding ID/index in all __getitem__() methods of the datasets instead of the filepath strings. This way everything in the batch would be a tensor and there would be no problem with multi-GPU runs.
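A minimal sketch of that suggestion (class and attribute names are illustrative, not taken from the baseline): __getitem__ returns the index as a tensor, and the path list stays on the dataset so indices can be mapped back after gathering.

```python
import torch
import torchaudio
from torch.utils.data import Dataset

class IndexedAudioDataset(Dataset):
    """Return (waveform, index) so every field of the batch is a tensor."""

    def __init__(self, filepaths):
        self.filepaths = list(filepaths)

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        waveform, _ = torchaudio.load(self.filepaths[idx])
        # the index is a tensor, so DataParallel splits it together with the audio
        return waveform, torch.tensor(idx, dtype=torch.long)

# after gathering outputs, indices map back to paths:
# filenames = [dataset.filepaths[i] for i in batch_indices.tolist()]
```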

popcornell (Collaborator) commented

Thanks @Moadab-AI for the help.

I've thought about that, and there could be a problem because we use ConcatDataset. I am sure it can be made to work, but I think it will be hacky: we would have to propagate the original individual datasets to the pl.LightningModule.
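For reference, a hedged sketch of the index-to-path mapping this would need (cumulative_sizes and datasets are standard torch.utils.data.ConcatDataset attributes; filepaths refers to the illustrative dataset sketched above):

```python
import bisect
from torch.utils.data import ConcatDataset

def global_index_to_path(concat: ConcatDataset, global_idx: int) -> str:
    # same bisect logic ConcatDataset itself uses in __getitem__
    dataset_idx = bisect.bisect_right(concat.cumulative_sizes, global_idx)
    if dataset_idx == 0:
        local_idx = global_idx
    else:
        local_idx = global_idx - concat.cumulative_sizes[dataset_idx - 1]
    return concat.datasets[dataset_idx].filepaths[local_idx]
```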

Also, IMO the code is maybe already quite hacky and not very readable for newcomers, especially those who have never had hands-on experience with Lightning.

What are your thoughts on this? I would like to hear your feedback.

turpaultn (Collaborator) commented

My opinion on this:

  • We have small models; running them on multiple GPUs does not seem to make much sense anyway when you know the training time of one model 🤔
    If you have multiple GPUs, you can run multiple experiments in parallel.

If there is a need to run on multiple GPUs for whatever reason:

  • A new, bigger dataset
  • A more complex approach
  • Something else...

Then I would say let's create a new recipe, even with "hacks", for more advanced users 😁

What do you think?
