I have been trying to do DP-based fine-tuning on a dataset using the Pythia 1B model. I receive the following error at epoch 5 when I increase the dataset size to around 1000.
TypeError: zeros() received an invalid combination of arguments - got (tuple, dtype=type), but expected one of:
(tuple of ints size, *, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
This arises from lines 60-61 of opacus/data_loader.py, which check `if len(batch) > 0` and try to collate. Where am I going wrong, or what could be a workaround?
Please help!
P.S. The configuration I use is: number of epochs = 5, training set size = 1000, batch size = 8, and I am using BatchMemoryManager with max_physical_batch_size = 8.
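For what it's worth, the TypeError itself can be reproduced outside Opacus: `torch.zeros()` accepts only a `torch.dtype` (the Python built-ins `int`, `float`, and `bool` are also mapped to tensor dtypes), and passing any other Python class fails with exactly this message. My guess, which is an assumption rather than a confirmed diagnosis, is that an empty Poisson-sampled batch makes the wrapped collate function infer a plain Python class as the dtype when the dataset yields non-tensor samples:

```python
import torch

# torch.zeros() requires a torch.dtype; int/float/bool are mapped to
# tensor dtypes, but any other Python class is rejected.
torch.zeros((0, 128), dtype=torch.long)  # OK: empty placeholder batch

# Passing a class such as str (e.g., inferred from a non-tensor sample)
# raises: TypeError: zeros() received an invalid combination of
# arguments - got (tuple, dtype=type), ...
torch.zeros((0, 128), dtype=str)
```

If that is the cause, making the dataset's `__getitem__` return tensors (for example, converting tokenizer output with `return_tensors="pt"` or `torch.tensor`) might be a workaround.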
@SoumiDas It looks like a problem with the batch size, although it seems very strange. I was facing the exact same issue with batch size 8. Later, I changed the batch size to 12 and the problem was resolved.
@kanchanchy, thanks for sharing this! I tried to reproduce the issue with the settings mentioned in the post (number of epochs = 5, training set size = 1000, batch size = max_physical_batch_size = 8, with BatchMemoryManager, noise_multiplier = 0.1, and max_grad_norm = 1.0) on a toy dataset and model, but I wasn't able to.
It's possible that I am missing something, so it would be great if one of you (@kanchanchy or @SoumiDas) could reproduce this in the bug report Colab. Thanks!
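For reference, here is a sketch of the kind of toy setup I mean; the model and data below are placeholders matching the reported hyperparameters, not the actual code from the report:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from opacus.utils.batch_memory_manager import BatchMemoryManager

# Toy data: 1000 samples, as in the report; features/labels are stand-ins.
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=8)

model = nn.Linear(16, 2)  # placeholder model, not Pythia 1B
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Wrap model/optimizer/loader for DP-SGD; the returned loader uses
# Poisson sampling, so individual batches can be empty.
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=0.1,
    max_grad_norm=1.0,
)

for epoch in range(5):
    with BatchMemoryManager(
        data_loader=loader,
        max_physical_batch_size=8,
        optimizer=optimizer,
    ) as memory_safe_loader:
        for x, y in memory_safe_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
```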