Article | Open Access | Online First

Minimize Quantization Output Error with Bias Compensation

Cheng Gong¹, Haoshuai Zheng², Mengting Hu¹, Zheng Lin³, Deng-Ping Fan²,⁴, Yuzhi Zhang¹,⁵, Tao Li²,⁵
College of Software, Nankai University, Tianjin 300350, China
College of Computer Science, Nankai University, Tianjin 300350, China
Haihe Lab of ITAI, Tianjin 300450, China
Nankai International Advanced Research Institute (Shenzhen Futian), Nankai University, Shenzhen 518045, China
BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

Quantization is a promising method for reducing the memory usage and computational intensity of Deep Neural Networks (DNNs), but it often leads to significant output errors that hinder model deployment. In this paper, we propose Bias Compensation (BC) to minimize the output error, thus enabling ultra-low-precision quantization without model fine-tuning. Instead of optimizing the non-convex quantization process, as most previous methods do, BC bypasses that step and directly minimizes the quantization output error by identifying a bias vector for compensation. We show that minimizing the output error through BC is a convex problem, and we provide an efficient strategy for obtaining the optimal solution with minimal output error, without any training or fine-tuning. We conduct extensive experiments on Vision Transformers (ViTs) and Large Language Models (LLMs), and the results show that our method notably reduces the quantization output error, thereby permitting ultra-low-precision post-training quantization and improving the task performance of the models. In particular, BC improves the accuracy of ViT-B* with 4-bit PTQ4ViT by 36.89% on the ImageNet-1K task, and decreases the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on WikiText-2. Our code is publicly available at https://github.com/GongCheng1919/bias-compensation.
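To make the idea of compensating with a bias vector concrete, the following is a minimal sketch for a single linear layer and is not taken from the authors' repository. It assumes a squared-error objective over a small calibration set, for which the minimizing bias is simply the mean of the residual between the full-precision and quantized outputs. The round-to-nearest quantizer and the names `quantize`, `bias_compensation`, and `output_error` are illustrative assumptions.

```python
import random


def quantize(w, bits=3, w_max=1.0):
    """Uniform symmetric round-to-nearest quantizer (illustrative stand-in)."""
    levels = 2 ** (bits - 1) - 1
    step = w_max / levels
    q = max(-levels, min(levels, round(w / step)))
    return q * step


def linear(weights, x):
    """Compute y = W x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def bias_compensation(weights, q_weights, calib):
    """Bias b minimizing sum ||W x - (W_q x + b)||^2 over calib.

    Under this assumed squared-error objective the problem is convex in b
    and its minimizer is the mean residual over the calibration inputs.
    """
    n_out = len(weights)
    b = [0.0] * n_out
    for x in calib:
        y, y_q = linear(weights, x), linear(q_weights, x)
        for i in range(n_out):
            b[i] += (y[i] - y_q[i]) / len(calib)
    return b


def output_error(weights, q_weights, calib, bias):
    """Mean squared output error of the quantized layer plus a bias."""
    total = 0.0
    for x in calib:
        y, y_q = linear(weights, x), linear(q_weights, x)
        total += sum((a - (c + d)) ** 2 for a, c, d in zip(y, y_q, bias))
    return total / len(calib)


rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
W_q = [[quantize(w) for w in row] for row in W]
# Calibration inputs with a non-zero mean, so the mean residual is non-zero.
calib = [[rng.gauss(0.5, 1.0) for _ in range(8)] for _ in range(64)]

b = bias_compensation(W, W_q, calib)
err_before = output_error(W, W_q, calib, [0.0] * len(W))
err_after = output_error(W, W_q, calib, b)
print(f"output error without bias: {err_before:.6f}")
print(f"output error with bias:    {err_after:.6f}")
```

Under these assumptions the compensated error can never exceed the uncompensated one, because the zero vector is itself a candidate bias. The quantized weights are left untouched, so the only extra cost at inference is one vector addition per layer.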

CAAI Artificial Intelligence Research
Cite this article:
Gong C, Zheng H, Hu M, et al. Minimize Quantization Output Error with Bias Compensation. CAAI Artificial Intelligence Research, 2024, https://doi.org/10.26599/AIR.2024.9150036


Received: 12 May 2024
Revised: 17 June 2024
Accepted: 21 June 2024
Published: 11 September 2024
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
