Open Access

Increasing Momentum-Like Factors: A Method for Reducing Training Errors on Multiple GPUs

Yu Tang, Zhigang Kan, Lujia Yin, Zhiquan Lai, Zhaoning Zhang, Linbo Qiao, and Dongsheng Li
Science and Technology on Parallel and Distributed Processing Laboratory, and College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Abstract

In distributed training, increasing the batch size can improve parallelism, but it also complicates the training process and can cause training errors. In this work, we investigate the occurrence of training errors in theory and train ResNet-50 on CIFAR-10 using Stochastic Gradient Descent (SGD) and Adaptive moment estimation (Adam), keeping the total batch size in the parameter server constant while lowering the batch size on each Graphics Processing Unit (GPU). A new method that considers momentum to eliminate training errors in distributed training is proposed. We define a Momentum-like Factor (MF) to represent the influence of former gradients on the parameter update in each iteration. We then modify the MF values and conduct experiments to explore how different MF values influence training performance based on SGD, Adam, and Nesterov accelerated gradient. Experimental results reveal that increasing MFs is a reliable method for reducing training errors in distributed training. This paper also presents an analysis of the convergence conditions in distributed training when a large batch size and multiple GPUs are used.
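For intuition, the sketch below shows how a momentum-like factor can weight former gradients in a single SGD update. This is a minimal Python illustration, not the paper's exact MF formulation; the function name sgd_momentum_step, the toy objective, and the assumption that the MF acts like the classical momentum coefficient are illustrative only.

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.1, mf=0.9):
    """One SGD-with-momentum update; `mf` weights the accumulated
    past gradients, i.e., it plays the momentum-like role."""
    velocity = mf * velocity + grad      # former gradients scaled by mf
    theta = theta - lr * velocity        # parameter update
    return theta, velocity

# Toy usage: minimize f(theta) = 0.5 * theta^2, whose gradient is theta.
theta, velocity = np.array([5.0]), np.zeros(1)
for _ in range(100):
    grad = theta                         # gradient of the toy objective
    theta, velocity = sgd_momentum_step(theta, velocity, grad, lr=0.1, mf=0.9)
print(theta)  # close to 0; a larger mf gives former gradients more weight
```

In this toy setting, raising mf toward 1 makes the update depend more heavily on the history of gradients and less on the current stochastic gradient alone, which is the qualitative effect the MF is meant to capture.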

Tsinghua Science and Technology
Pages 114-126
Cite this article:
Tang Y, Kan Z, Yin L, et al. Increasing Momentum-Like Factors: A Method for Reducing Training Errors on Multiple GPUs. Tsinghua Science and Technology, 2022, 27(1): 114-126. https://doi.org/10.26599/TST.2020.9010023


Received: 19 June 2020
Revised: 10 July 2020
Accepted: 13 July 2020
Published: 17 August 2021
© The author(s) 2022

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).