Why is the model size with 32-bit equal to the model size with 4-bit? #2

xuzhiyuan1022 · 2021-03-30T07:54:22Z

No description provided.

ChaofanTao · 2021-03-30T08:06:05Z

Hi, during training, the parameters of both the full-precision model and the quantized model are stored in float32 format. The difference is that the parameters of the quantized model can only take a limited set of discrete values.

The model size shrinks only at deployment, when the quantized parameters are converted to an INT data format. The parameters of the proposed transform can also be removed at deployment.
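To illustrate the point about storage, here is a minimal sketch in plain Python. It is not the repository's code: the uniform signed quantizer, the helper names `quantize_to_levels` and `pack_int4`, and the example weights are all assumptions made for illustration. It compares the byte count of quantized values kept as float32 (the training-time checkpoint) with the same values stored as packed 4-bit integer codes plus one float32 scale.

```python
import struct

def quantize_to_levels(weights, bits=4):
    """Uniformly quantize floats to signed integer codes plus one float scale.

    Hypothetical helper: a plain symmetric quantizer, not the scheme from the repo.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def pack_int4(codes):
    """Pack two signed 4-bit codes into each byte (two's-complement nibbles)."""
    nibbles = [c & 0xF for c in codes]
    if len(nibbles) % 2:
        nibbles.append(0)                           # pad to a whole byte
    return bytes((nibbles[k] << 4) | nibbles[k + 1]
                 for k in range(0, len(nibbles), 2))

weights = [0.5, -0.25, 0.125, -1.0, 0.75, 0.0, -0.5, 1.0]
codes, scale = quantize_to_levels(weights)

# Training-time view: the quantized values are still written out as float32.
fake_quant = [c * scale for c in codes]
train_bytes = len(struct.pack(f"{len(fake_quant)}f", *fake_quant))

# Deployment-time view: 4-bit codes packed two per byte, plus one float32 scale.
packed = pack_int4(codes)
deploy_bytes = len(packed) + 4

print(train_bytes, deploy_bytes)
```

With eight weights, the float32 form takes 32 bytes, while the packed form takes 4 bytes of codes plus 4 bytes for the scale. On a real model the per-tensor scale overhead becomes negligible, so the ratio approaches 8x for 4-bit weights.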

xuzhiyuan1022 · 2021-03-30T09:17:19Z

Thank you for your reply.

grad_alpha = (grad_output * (sign * i + (input_q - input) * (1 - i))).sum()

When updating alpha, why is (input_q - input) used?

ChaofanTao · 2021-03-30T09:57:16Z

The (input_q - input) term accounts for the quantization error, i.e. the difference between input_q and input, when updating alpha. It is an optional setting.
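For readers following along, here is a rough scalar sketch of how such a gradient could be computed. It rests on assumptions not confirmed in this thread: that `input` has already been divided by alpha, that `i` is 1 where the normalized input falls outside [-1, 1] and 0 otherwise, that `sign` is the sign of the input, and that the quantizer is a uniform one. The function name `alpha_grad` and its parameters are hypothetical. Under those assumptions, applying the product rule to `alpha * round(x / alpha)` with a straight-through estimator for the rounding gives `round(x / alpha) - x / alpha` for in-range elements, which is where an (input_q - input) term would come from.

```python
def alpha_grad(xs, grad_out, alpha, bits=4):
    """Scalar sketch of the alpha gradient from the quoted line.

    Assumes a uniform quantizer on the normalized input u = x / alpha.
    """
    levels = 2 ** (bits - 1) - 1
    total = 0.0
    for x, g in zip(xs, grad_out):
        u = x / alpha                         # input normalized by alpha
        if abs(u) > 1.0:
            # Clipped region (i = 1): the output is sign * alpha,
            # so its derivative with respect to alpha is the sign.
            total += g * (1.0 if u > 0 else -1.0)
        else:
            # In-range region (i = 0): quantization error u_q - u.
            u_q = round(u * levels) / levels
            total += g * (u_q - u)
    return total

# One clipped element and one in-range element, each with upstream gradient 1.0.
print(alpha_grad([2.0, 0.3], [1.0, 1.0], alpha=1.0))
```

Dropping the in-range branch (treating it as zero) would leave only the clipped elements contributing to the alpha update, which may be the alternative that "optional" refers to.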
