[3]
Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, LLM-QAT: Data-free quantization aware training for large language models, arXiv preprint arXiv: 2305.17888, 2023.
[4]
M. Nagel, R. A. Amjad, M. van Baalen, C. Louizos, and T. Blankevoort, Up or down? Adaptive rounding for post-training quantization, in Proc. 37th Int. Conf. Machine Learning, virtual, 2020, pp. 7197–7206.
[5]
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, OPTQ: Accurate quantization for generative pre-trained transformers, in Proc. 11th Int. Conf. Learning Representations (ICLR), Kigali, Rwanda, 2023.
[6]
Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization, in Proc. 17th European Conf. Computer Vision (ECCV), Tel Aviv, Israel, 2022, pp. 191–207.
[7]
R. Banner, Y. Nahshan, and D. Soudry, Post training 4-bit quantization of convolutional networks for rapid-deployment, in Proc. 33rd Conf. Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 2019, pp. 7950–7958.
[8]
R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang, Improving neural network quantization without retraining using outlier channel splitting, in Proc. 36th Int. Conf. Machine Learning, Long Beach, CA, USA, 2019, pp. 7543–7552.
[9]
Y. Choukroun, E. Kravchik, and P. Kisilev, Low-bit quantization of neural networks for efficient inference, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 2019, pp. 3009–3018.
[10]
Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu, BRECQ: Pushing the limit of post-training quantization by block reconstruction, in Proc. 9th Int. Conf. Learning Representations (ICLR), virtual, 2021.
[11]
P. Wang, Q. Chen, X. He, and J. Cheng, Towards accurate post-training network quantization via bit-split and stitching, in Proc. 37th Int. Conf. Machine Learning, virtual, 2020, pp. 9847–9856.
[12]
I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry, Accurate post training quantization with small calibration sets, in Proc. 38th Int. Conf. Machine Learning, virtual, 2021, pp. 4466–4475.
[13]
E. Frantar and D. Alistarh, Optimal brain compression: A framework for accurate post-training quantization and pruning, in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 2022, pp. 4475–4488.
[14]
T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, in Proc. 37th Conf. Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 2023, pp. 10088–10115.
[15]
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, LoRA: Low-rank adaptation of large language models, in Proc. 10th Int. Conf. Learning Representations (ICLR), virtual, 2022.
[16]
T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, 8-bit optimizers via block-wise quantization, in Proc. 10th Int. Conf. Learning Representations (ICLR), virtual, 2022.
[17]
J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization, in Proc. 37th Conf. Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 2023, pp. 36187–36207.
[18]
T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, LLM.int8(): 8-bit matrix multiplication for transformers at scale, in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 2022, pp. 30318–30332.
[19]
G. Xiao, J. Lin, M. Seznec, J. Demouth, and S. Han, SmoothQuant: Accurate and efficient post-training quantization for large language models, in Proc. 40th Int. Conf. Machine Learning, Honolulu, HI, USA, 2023, pp. 38087–38099.
[20]
D. Wu, Q. Tang, Y. Zhao, M. Zhang, Y. Fu, and D. Zhang, EasyQuant: Post-training quantization via scale optimization, arXiv preprint arXiv: 2006.16669, 2020.
[21]
Y. Ding, H. Qin, Q. Yan, Z. Chai, J. Liu, X. Wei, and X. Liu, Towards accurate post-training quantization for vision transformer, in Proc. 30th ACM Int. Conf. Multimedia, Lisboa, Portugal, 2022, pp. 5380–5388.
[22]
Z. Li, J. Xiao, L. Yang, and Q. Gu, RepQ-ViT: Scale reparameterization for post-training quantization of vision transformers, in Proc. 2023 IEEE/CVF Int. Conf. Computer Vision (ICCV), Paris, France, 2023, pp. 17181–17190.
[23]
X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, and X. Liu, Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling, in Proc. 2023 Conf. Empirical Methods in Natural Language Processing, Singapore, 2023, pp. 1648–1665.
[24]
J. Lin, J. Tang, H. Tang, S. Yang, W. M. Chen, W. C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, AWQ: Activation-aware weight quantization for LLM compression and acceleration, in Proc. 7th Annu. Conf. Machine Learning and Systems (MLSys 2024), Santa Clara, CA, USA, 2024, pp. 87–100.
[25]
Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers, in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 2022, pp. 27168–27183.
[26]
J. H. Lee, J. Kim, S. J. Kwon, and D. Lee, FlexRound: Learnable rounding based on element-wise division for post-training quantization, in Proc. 40th Int. Conf. Machine Learning, Honolulu, HI, USA, 2023, pp. 18913–18939.
[27]
B. Hassibi, D. G. Stork, and G. J. Wolff, Optimal Brain Surgeon and general network pruning, in Proc. IEEE Int. Conf. Neural Networks, San Francisco, CA, USA, 1993, pp. 293–299.
[28]
E. Frantar, E. Kurtic, and D. Alistarh, M-FAC: Efficient matrix-free approximations of second-order information, in Proc. 35th Conf. Neural Information Processing Systems (NeurIPS 2021), virtual, 2021, pp. 14873–14886.
[29]
Z. Dong, Z. Yao, A. Gholami, M. Mahoney, and K. Keutzer, HAWQ: Hessian aware quantization of neural networks with mixed-precision, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 293–302.
[30]
Y. Bhalgat, J. Lee, M. Nagel, T. Blankevoort, and N. Kwak, LSQ+: Improving low-bit quantization through learnable offsets and better initialization, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshop (CVPRW), Seattle, WA, USA, 2020, pp. 2978–2985.
[31]
M. Nagel, M. van Baalen, T. Blankevoort, and M. Welling, Data-free quantization through weight equalization and bias correction, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 1325–1334.
[32]
Y. Liu, H. Yang, Z. Dong, K. Keutzer, L. Du, and S. Zhang, NoisyQuant: Noisy bias-enhanced post-training activation quantization for vision transformers, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023, pp. 20321–20330.
[33]
J. Shin, J. So, S. Park, S. Kang, S. Yoo, and E. Park, NIPQ: Noise proxy-based integrated pseudo-quantization, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023, pp. 3852–3861.
[34]
P. H. P. Savarese, X. Yuan, Y. Li, and M. Maire, Not all bits have equal value: Heterogeneous precisions via trainable noise, in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 2022, pp. 35769–35782.
[36]
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F. F. Li, ImageNet: A large-scale hierarchical image database, in Proc. 2009 IEEE Conf. Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248–255.
[37]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in Proc. 9th Int. Conf. Learning Representations (ICLR), virtual, 2021.
[38]
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, Training data-efficient image transformers & distillation through attention, in Proc. 38th Int. Conf. Machine Learning, virtual, 2021, pp. 10347–10357.
[39]
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 9992–10002.
[41]
W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi, BiLLM: Pushing the limit of post-training quantization for LLMs, arXiv preprint arXiv: 2402.04291, 2024.
[42]
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., OPT: Open pre-trained transformer language models, arXiv preprint arXiv: 2205.01068, 2022.
[43]
BigScience Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, et al., BLOOM: A 176B-parameter open-access multilingual language model, arXiv preprint arXiv: 2211.05100, 2022.
[45]
S. Merity, C. Xiong, J. Bradbury, and R. Socher, Pointer sentinel mixture models, in Proc. 5th Int. Conf. Learning Representations (ICLR), Toulon, France, 2017.
[46]
M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger, The Penn Treebank: Annotating predicate argument structure, in Proc. 2nd ARPA Human Language Technology Workshop, Plainsboro, NJ, USA, 1994, pp. 114–119.
[47]
T. Dettmers and L. Zettlemoyer, The case for 4-bit precision: k-bit inference scaling laws, in Proc. 40th Int. Conf. Machine Learning, Honolulu, HI, USA, 2023, pp. 7750–7774.