Is it necessary to train from 5-bit? #1

Open
talenz opened this issue Mar 18, 2021 · 7 comments

Comments

@talenz

talenz commented Mar 18, 2021

What's the accuracy drop if I train 4-bit mobilenet_v2 from full precision, compared to initializing from the 5-bit model?

@ChaofanTao
Owner

Hi, there is a negligible accuracy drop between the two kinds of initialization in our experiments, probably because the training runs for enough epochs. If you train with fewer epochs, e.g. fewer than 40, initializing from the 5-bit model should give better results.
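For reference, a minimal sketch of what warm-starting the 4-bit model from a 5-bit checkpoint could look like. The torchvision model and the checkpoint path below are placeholders, not this repo's actual training code:

```python
import torch
from torchvision.models import mobilenet_v2

# Placeholder network standing in for the repo's quantized MobileNetV2;
# the real model would be built with the 4-bit quantization modules.
model_4bit = mobilenet_v2()

# Hypothetical path to a trained 5-bit checkpoint (it may also be nested
# under a 'state_dict' key, depending on how it was saved).
ckpt = torch.load('checkpoints/mobilenet_v2_5bit.pth', map_location='cpu')

# strict=False tolerates keys that only exist at one bit-width
# (e.g. per-layer clipping parameters).
missing, unexpected = model_4bit.load_state_dict(ckpt, strict=False)
print(f'missing keys: {len(missing)}, unexpected keys: {len(unexpected)}')
```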

@talenz
Author

talenz commented Mar 19, 2021


Thanks for the rapid reply. I'm training 4-bit mobilenet_v2 now. The top-1 score on the validation set fluctuates dramatically (about 2% up or down) even though the learning rate is 1e-4. Is that normal?

@ChaofanTao
Owner

Does the fluctuation happen when the learning rate is decayed, e.g. at epochs 30 and 60 in the default setting? I train with SGD and step-wise decay of the learning rate. A cosine scheduler can make the training process smoother.
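For illustration, a rough sketch of the two schedule choices in PyTorch. The placeholder model, the 90-epoch horizon, and the hyperparameters are assumptions, not the repo's exact training script:

```python
import torch
from torch import nn, optim

# Placeholder model; substitute the quantized mobilenet_v2 from this repo.
net = nn.Linear(10, 10)
optimizer = optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)

epochs = 90       # assumed training length
use_cosine = True

if use_cosine:
    # Smooth decay over the whole run; tends to reduce validation jitter.
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
else:
    # Step-wise decay: lr drops 10x at epochs 30 and 60, which is exactly
    # where the validation top-1 can jump around.
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(epochs):
    # ... run one training epoch and evaluate ...
    scheduler.step()
```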

@talenz
Author

talenz commented Apr 20, 2021


Thanks for the reply~ Is it possible (and how) to use per-channel weight quantization with your method to boost performance?

@ChaofanTao
Owner

Yes, you can.

  1. Write the channel-wise quantization strategy in def weight_quantization(b, grids, power=False): in models/fat_quantization.py.
  2. Then set a channel-wise alpha in class weight_quantize_fn(nn.Module):, e.g. self.register_parameter('wgt_alpha', Parameter(torch.Tensor(num_of_channels))). A rough sketch follows below.
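For illustration only, a minimal, self-contained sketch of where a per-channel alpha could sit in a weight quantizer. This is not the repo's weight_quantize_fn; it is a plain uniform quantizer with a straight-through estimator, just to show the per-channel clipping parameter:

```python
import torch
import torch.nn as nn
from torch.nn import Parameter

class ChannelwiseWeightQuantizer(nn.Module):
    """Illustrative per-channel weight quantizer with a learnable clipping
    value (alpha) per output channel. Not the repo's implementation."""

    def __init__(self, num_channels, w_bit=4):
        super().__init__()
        self.w_bit = w_bit
        # One clipping value per output channel, broadcastable over
        # (out_channels, in_channels, kH, kW). The scalar version would
        # register a single value instead.
        self.register_parameter(
            'wgt_alpha', Parameter(3.0 * torch.ones(num_channels, 1, 1, 1)))

    def forward(self, weight):
        n_levels = 2 ** (self.w_bit - 1) - 1  # e.g. 7 positive levels for 4-bit
        alpha = self.wgt_alpha
        # Clip each channel's weights to [-alpha_c, alpha_c].
        w = torch.max(torch.min(weight, alpha), -alpha)
        # Uniform quantization of the clipped weights.
        w_q = torch.round(w / alpha * n_levels) * alpha / n_levels
        # Straight-through estimator: forward uses w_q, backward sees w.
        return w + (w_q - w).detach()

# Usage on a conv weight of shape (out_channels, in_channels, kH, kW):
quantizer = ChannelwiseWeightQuantizer(num_channels=64, w_bit=4)
w_q = quantizer(torch.randn(64, 32, 3, 3))
```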

@talenz
Author

talenz commented Apr 20, 2021


Have you tried it? Did it improve the performance?

@ChaofanTao
Copy link
Owner

It boosts performance, at the expense of a lower compression ratio.
