Errors when training with multiple GPUs #38
It worked well when training with a single GPU, but an error about the batch sampler occurs when using multiple GPUs:

AttributeError: 'ConcatDatasetBatchSampler' object has no attribute 'batch_size'
Hi there,
It is due to Lightning defaulting to Distributed Data Parallel (DDP). You have to do some workarounds to make the custom sampler work with DDP. Set the trainer to use DataParallel (DP) instead in the YAML file.
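For reference, a minimal sketch of what that setting maps to at the Trainer level (the exact YAML key in the baseline is not shown here, and the argument name changed across Lightning versions, so treat the details as assumptions):

```python
import pytorch_lightning as pl

# DataParallel instead of the default DDP; in Lightning releases of this
# era the option was accelerator="dp" (very old versions used
# distributed_backend="dp", later ones strategy="dp"). Two GPUs assumed.
trainer = pl.Trainer(gpus=2, accelerator="dp")
```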
Thanks! @popcornell
Seems that Lightning does not split the filenames in the batch across the GPUs.
Might be a bug in Lightning; I don't know how to fix that easily.
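To make the failure mode concrete: DP's scatter splits tensors along the batch dimension but copies non-tensor items such as filename strings to every replica. A simplified mimic of those semantics (an illustration only, not Lightning's or PyTorch's actual code):

```python
import torch

def scatter_like_dp(obj, num_replicas):
    # tensors are split along the batch dimension
    if isinstance(obj, torch.Tensor):
        return list(obj.chunk(num_replicas, dim=0))
    # containers are scattered element-wise, then regrouped per replica
    if isinstance(obj, (list, tuple)):
        scattered = [scatter_like_dp(o, num_replicas) for o in obj]
        return [list(parts) for parts in zip(*scattered)]
    # anything else (e.g. a filename string) is copied to every replica
    return [obj] * num_replicas

feats = torch.randn(4, 10)
fnames = ["a.wav", "b.wav", "c.wav", "d.wav"]

for i, (f, n) in enumerate(scatter_like_dp((feats, fnames), 2)):
    print(f"replica {i}: feats {tuple(f.shape)}, filenames {n}")
# replica 0: feats (2, 10), filenames ['a.wav', 'b.wav', 'c.wav', 'd.wav']
# replica 1: feats (2, 10), filenames ['a.wav', 'b.wav', 'c.wav', 'd.wav']
```

Each replica ends up with half the feature tensors but the full filename list, so positions no longer line up.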
Per Lightning-AI/pytorch-lightning#1508, it seems we have to rewrite the collate function.
@mmuguang an easy fix is to use batch_size = 1 for validation, but then you would probably want to run evaluation only every X epochs.
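A sketch of that workaround, with a stand-in dataset (check_val_every_n_epoch is a regular Lightning Trainer argument):

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# stand-in for the baseline's validation set
val_dataset = TensorDataset(torch.randn(8, 10))

# batch_size=1 leaves DP nothing to mis-split in validation batches
val_loader = DataLoader(val_dataset, batch_size=1)

# run the (now slower) validation only every 5 epochs
trainer = pl.Trainer(gpus=2, accelerator="dp", check_val_every_n_epoch=5)
```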
It is not very easy to fix this with Lightning. I tried to use SpeechBrain's dataio.batch.PaddedBatch collate_fn, but it did not work with DP and Lightning.
@turpaultn, opinions on this?
@popcornell I have given up on using Lightning and rewritten the baseline in plain PyTorch. It can split the filenames and works well with DP. The old version of Lightning is so difficult to use.
Lightning also has bugs when training with DP. When using 2 GPUs, the loss on the second GPU becomes NaN. Perhaps there is only unlabeled audio on it, so the supervised loss is NaN; the final loss is then also NaN and the model can't be trained normally.
I think that is expected, since the batch is divided among the GPUs.
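A common guard for that situation, sketched here with hypothetical names (supervised_loss and labeled_mask are not from the baseline), is to return zero instead of taking a mean over an empty set when a replica receives no labeled clips:

```python
import torch.nn.functional as F

def supervised_loss(logits, targets, labeled_mask):
    # labeled_mask marks which clips in this replica's sub-batch are labeled;
    # a mean over zero elements is NaN, so short-circuit to 0 instead
    if not labeled_mask.any():
        return logits.new_zeros(())
    return F.binary_cross_entropy_with_logits(
        logits[labeled_mask], targets[labeled_mask]
    )
```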
I think rather than ditching Lightning altogether, a much simpler solution would be to index "all" audio files in the dataset and return the corresponding ID/index in all __getitem__() methods of the datasets instead of the filepath strings. This way they would be treated as tensors and there won't be any problem with multi-GPU runs.
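A sketch of that idea (load_features and the dataset layout are placeholders, not the baseline's actual code):

```python
import torch
from torch.utils.data import Dataset

class IndexedAudioDataset(Dataset):
    """Store one global list of paths; return the index, not the string."""

    def __init__(self, filepaths):
        self.filepaths = list(filepaths)  # global index -> filepath

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        feats = load_features(self.filepaths[idx])  # hypothetical loader
        # the index is a tensor, so DP scatters it together with the batch
        return feats, torch.tensor(idx)

# after the forward pass, indices map back to names on the host:
# filenames = [dataset.filepaths[i] for i in idx_batch.tolist()]
```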
Thanks @Moadab-AI for the help. I've thought about that, and there could be a problem because we use ConcatDataset. Also, IDK, but the code is maybe already very hacky IMO and not very readable for newcomers, especially ones who have never had hands-on experience with Lightning. What are your thoughts on this? I would like to hear your feedback.
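On the ConcatDataset concern: torch.utils.data.ConcatDataset keeps a cumulative_sizes list, so a single global index can still be mapped back to a per-dataset one. A sketch mirroring ConcatDataset's own lookup logic:

```python
import bisect
from torch.utils.data import ConcatDataset

def global_to_local(concat: ConcatDataset, idx: int):
    # which constituent dataset does this global index fall into?
    ds_idx = bisect.bisect_right(concat.cumulative_sizes, idx)
    # offset within that dataset
    local = idx if ds_idx == 0 else idx - concat.cumulative_sizes[ds_idx - 1]
    return ds_idx, local
```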
My opinion on this:
If there is a need to run on multiple GPUs for whatever reason:
What do you think?