[3]
Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, LLM-QAT: Data-free quantization aware training for large language models, arXiv preprint arXiv: 2305.17888, 2023.
[4]
M. Nagel, R. A. Amjad, M. van Baalen, C. Louizos, and T. Blankevoort, Up or down? Adaptive rounding for post-training quantization, in Proc. 37th Int. Conf. Machine Learning, virtual, 2020, pp. 7197–7206.
[5]
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, OPTQ: Accurate quantization for generative pre-trained transformers, in Proc. 11th Int. Conf. Learning Representations (ICLR), Kigali, Rwanda, 2023.
[6]
Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization, in Proc. 17th European Conf. Computer Vision (ECCV), Tel Aviv, Israel, 2022, pp. 191–207.
[7]
R. Banner, Y. Nahshan, and D. Soudry, Post training 4-bit quantization of convolutional networks for rapid-deployment, in Proc. 33rd Conf. Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 2019, pp. 7950–7958.
[8]
R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang, Improving neural network quantization without retraining using outlier channel splitting, in Proc. 36th Int. Conf. Machine Learning, Long Beach, CA, USA, 2019, pp. 7543–7552.
[9]
Y. Choukroun, E. Kravchik, and P. Kisilev, Low-bit quantization of neural networks for efficient inference, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 2019, pp. 3009–3018.
[10]
Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu, BRECQ: Pushing the limit of post-training quantization by block reconstruction, in Proc. 9th Int. Conf. Learning Representations (ICLR), virtual, 2021.
[11]
P. Wang, Q. Chen, X. He, and J. Cheng, Towards accurate post-training network quantization via bit-split and stitching, in Proc. 37th Int. Conf. Machine Learning, virtual, 2020, pp. 9847–9856.
[12]
I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry, Accurate post training quantization with small calibration sets, in Proc. 38th Int. Conf. Machine Learning, virtual, 2021, pp. 4466–4475.
[13]
E. Frantar and D. Alistarh, Optimal brain compression: A framework for accurate post-training quantization and pruning, in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 2022, pp. 4475–4488.
[14]
T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, in Proc. 37th Conf. Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 2023, pp. 10088–10115.
[15]
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, LoRA: Low-rank adaptation of large language models, in Proc. 10th Int. Conf. Learning Representations (ICLR), virtual, 2022.
[16]
T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, 8-bit optimizers via block-wise quantization, in Proc. 10th Int. Conf. Learning Representations (ICLR), virtual, 2022.
[17]
J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization, in Proc. 37th Conf. Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 2023, pp. 36187–36207.
[18]
T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, LLM.int8(): 8-bit matrix multiplication for transformers at scale, in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 2022, pp. 30318–30332.
[19]
G. Xiao, J. Lin, M. Seznec, J. Demouth, and S. Han, SmoothQuant: Accurate and efficient post-training quantization for large language models, in Proc. 40th Int. Conf. Machine Learning, Honolulu, HI, USA, 2023, pp. 38087–38099.
[20]
D. Wu, Q. Tang, Y. Zhao, M. Zhang, Y. Fu, and D. Zhang, EasyQuant: Post-training quantization via scale optimization, arXiv preprint arXiv: 2006.16669, 2020.
[21]
Y. Ding, H. Qin, Q. Yan, Z. Chai, J. Liu, X. Wei, and X. Liu, Towards accurate post-training quantization for vision transformer, in Proc. 30th ACM Int. Conf. Multimedia, Lisboa, Portugal, 2022, pp. 5380–5388.
[22]
Z. Li, J. Xiao, L. Yang, and Q. Gu, RepQ-ViT: Scale reparameterization for post-training quantization of vision transformers, in Proc. 2023 IEEE/CVF Int. Conf. Computer Vision (ICCV), Paris, France, 2023, pp. 17181–17190.
[23]
X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, and X. Liu, Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling, in Proc. 2023 Conf. Empirical Methods in Natural Language Processing, Singapore, 2023, pp. 1648–1665.
[24]
J. Lin, J. Tang, H. Tang, S. Yang, W. M. Chen, W. C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, AWQ: Activation-aware weight quantization for LLM compression and acceleration, in Proc. 7th Annu. Conf. Machine Learning and Systems (MLSys 2024), Santa Clara, CA, USA, 2024, pp. 87–100.
[25]
Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers, in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 2022, pp. 27168–27183.
[26]
J. H. Lee, J. Kim, S. J. Kwon, and D. Lee, FlexRound: Learnable rounding based on element-wise division for post-training quantization, in Proc. 40th Int. Conf. Machine Learning, Honolulu, HI, USA, 2023, pp. 18913–18939.
[27]
B. Hassibi, D. G. Stork, and G. J. Wolff, Optimal Brain Surgeon and general network pruning, in Proc. IEEE Int. Conf. Neural Networks, San Francisco, CA, USA, 1993, pp. 293–299.
[28]
E. Frantar, E. Kurtic, and D. Alistarh, M-FAC: Efficient matrix-free approximations of second-order information, in Proc. 35th Conf. Neural Information Processing Systems (NeurIPS 2021), virtual, 2021, pp. 14873–14886.
[29]
Z. Dong, Z. Yao, A. Gholami, M. Mahoney, and K. Keutzer, HAWQ: Hessian aware quantization of neural networks with mixed-precision, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 293–302.
[30]
Y. Bhalgat, J. Lee, M. Nagel, T. Blankevoort, and N. Kwak, LSQ+: Improving low-bit quantization through learnable offsets and better initialization, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshop (CVPRW), Seattle, WA, USA, 2020, pp. 2978–2985.
[31]
M. Nagel, M. van Baalen, T. Blankevoort, and M. Welling, Data-free quantization through weight equalization and bias correction, in Proc. 2019 IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 1325–1334.
[32]
Y. Liu, H. Yang, Z. Dong, K. Keutzer, L. Du, and S. Zhang, NoisyQuant: Noisy bias-enhanced post-training activation quantization for vision transformers, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023, pp. 20321–20330.
[33]
J. Shin, J. So, S. Park, S. Kang, S. Yoo, and E. Park, NIPQ: Noise proxy-based integrated pseudo-quantization, in Proc. 2023 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023, pp. 3852–3861.
[34]
P. H. P. Savarese, X. Yuan, Y. Li, and M. Maire, Not all bits have equal value: Heterogeneous precisions via trainable noise, in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 2022, pp. 35769–35782.
[36]
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F. F. Li, ImageNet: A large-scale hierarchical image database, in Proc. 2009 IEEE Conf. Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248–255.
[37]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in Proc. 9th Int. Conf. Learning Representations (ICLR), virtual, 2021.
[38]
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, Training data-efficient image transformers & distillation through attention, in Proc. 38th Int. Conf. Machine Learning, virtual, 2021, pp. 10347–10357.
[39]
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 9992–10002.
[41]
W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi, BiLLM: Pushing the limit of post-training quantization for LLMs, arXiv preprint arXiv: 2402.04291, 2024.
[42]
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., OPT: Open pre-trained transformer language models, arXiv preprint arXiv: 2205.01068, 2022.
[43]
BigScience Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, et al., BLOOM: A 176B-parameter open-access multilingual language model, arXiv preprint arXiv: 2211.05100, 2022.
[45]
S. Merity, C. Xiong, J. Bradbury, and R. Socher, Pointer sentinel mixture models, in Proc. 5th Int. Conf. Learning Representations (ICLR), Toulon, France, 2017.
[46]
M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger, The Penn Treebank: Annotating predicate argument structure, in Proc. 2nd ARPA Human Language Technology Workshop, Plainsboro, NJ, USA, 1994, pp. 114–119.
[47]
T. Dettmers and L. Zettlemoyer, The case for 4-bit precision: k-bit inference scaling laws, in Proc. 40th Int. Conf. Machine Learning, Honolulu, HI, USA, 2023, pp. 7750–7774.