Science and Technology on Parallel and Distributed Processing Laboratory, and College of Computer Science and Technology, National University of Defense Technology, Changsha 473000, China
Abstract
In distributed training, increasing the batch size can improve parallelism, but it can also introduce many difficulties into the training process and cause training errors. In this work, we investigate the occurrence of training errors in theory and train ResNet-50 on CIFAR-10 by using Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) while keeping the total batch size in the parameter server constant and lowering the batch size on each Graphics Processing Unit (GPU). A new method that considers momentum is proposed to eliminate training errors in distributed training. We define a Momentum-like Factor (MF) to represent the influence of former gradients on parameter updates in each iteration. We then modify the MF values and conduct experiments to explore how different MF values influence the training performance based on SGD, Adam, and Nesterov Accelerated Gradient (NAG). Experimental results reveal that increasing the MF is a reliable method for reducing training errors in distributed training. An analysis of the convergence conditions in distributed training, with consideration of a large batch size and multiple GPUs, is also presented in this paper.
References
[1]
Y.Tang, L. J.Yin, Z. N.Zhang, and D. S.Li, Rise the momentum: A method for reducing the training error on multiple GPUs, in Algorithms and Architectures for Parallel Processing, S.Wen, A.Zomaya, and L. T.Yang, eds. Cham, Switzerland: Springer, 2020.
[2]
F.Chollet, Xception: Deep learning with depthwise separable convolutions, in Proc. 2017 IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 1800-1807.
[3]
G.Huang, Z.Liu, L.Van Der Maaten, and K. Q.Weinberger, Densely connected convolutional networks, in Proc. 2017 IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 2261-2269.
[4]
W.Liu, D.Anguelov, D.Erhan, C.Szegedy, S.Reed, C. Y.Fu, and A. C.Berg, SSD: Single shot multibox detector, in Proc. 14th European Conf. on Computer Vision, Amsterdam, the Netherlands, 2016, pp. 21-37.
[5]
J. F.Dai, Y.Li, K. M.He, and J.Sun, R-FCN: Object detection via region-based fully convolutional networks, in Proc. 30th Conf. on Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 379-387.
[6]
R.Girshick, J.Donahue, T.Darrell, and J.Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proc. 2014 IEEE Conf. on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 580-587.
[7]
J.Long, E.Shelhamer, and T.Darrell, Fully convolutional networks for semantic segmentation, in Proc. 2015 IEEE Conf. on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 3431-3440.
[8]
J. F.Dai, K. M.He, and J.Sun, Instance-aware semantic segmentation via multi-task network cascades, in Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 3150-3158.
[9]
C.Szegedy, S.Ioffe, V.Vanhoucke, and A. A.Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in Proc. 31st AAAI Conf. on Artificial Intelligence, San Francisco, CA, USA, 2017.
[10]
O.Russakovsky, J.Deng, H.Su, J.Krause, S.Satheesh, S. A.Ma, Z. H.Huang, A.Karpathy, A.Khosla, M.Bernstein, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, 2015.
[11]
Z.Qin, Z. N.Zhang, X. T.Chen, C. J.Wang, and Y. X.Peng, FD-MobileNet: Improved MobileNet with a fast downsampling strategy, in Proc. 2018 25th IEEE Int. Conf. on Image Processing, Athens, Greece, 2018, pp. 1363-1367.
[12]
D. S.Li, Z. Q.Lai, K. S.Ge, Y. M.Zhang, Z. N.Zhang, Q. L.Wang, and H. M.Wang, HPDL: Towards a general framework for high-performance distributed deep learning, in Proc. 2019 IEEE 39th Int. Conf. on Distributed Computing Systems, Dallas, TX, USA, 2019.
[13]
F.Tong and X. L.Liu, Samples selection for artificial neural network training in preliminary structural design, Tsinghua Science and Technology, vol. 10, no. 2, pp. 233-239, 2005.
[14]
Z. Y.Hu, D. S.Li, and D. K.Guo, Balance resource allocation for spark jobs based on prediction of the optimal resource, Tsinghua Science and Technology, vol. 25, no. 4, pp. 487-497, 2020.
[15]
L.Guan, T.Sun, L. B.Qiao, Z. H.Yang, D. S.Li, K. S.Ge, and X. C.Lu, An efficient parallel and distributed solution to nonconvex penalized linear SVMs, Front. Inf. Technol. Electron. Eng., vol. 21, no. 4, pp. 587-603, 2020.
[16]
K. S.Ge, H. Y.Su, D. S.Li, and X. C.Lu, Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit, Front. Inf. Technol. Electron. Eng., vol. 18, no. 7, pp. 915-927, 2017.
[17]
M.Li, D. G.Andersen, J. W.Park, A. J.Smola, A.Ahmed, V.Josifovski, J.Long, E. J.Shekita, and B. Y.Su, Scaling distributed machine learning with the parameter server, in Proc. 11th USENIX Symposium on Operating Systems Design and Implementation, Broomfield, CO, USA, 2014.
[18]
P.Goyal, P.Dollár, R.Girshick, P.Noordhuis, L.Wesolowski, A.Kyrola, A.Tulloch, Y. Q.Jia, and K. M.He, Accurate, large minibatch SGD: Training ImageNet in 1 hour, arXiv preprint arXiv: 1706.02677, 2017.
[19]
L.Shen, P.Sun, Y. T.Wang, W.Liu, and T.Zhang, An algorithmic framework of variable metric over-relaxed hybrid proximal extra-gradient method, in Proc. 35th Int. Conf. on Machine Learning, Stockholm, Sweden, 2018.
[20]
L.Shen, W.Liu, G. Z.Yuan, and S. Q.Ma, GSOS: Gauss-Seidel operator splitting algorithm for multi-term nonsmooth convex composite optimization, in Proc. 34th Int. Conf. on Machine Learning, Sydney, Australia, 2017, pp. 3125-3134.
[21]
X.Wang, S. Q.Ma, D.Goldfarb, and W.Liu, Stochastic quasi-Newton methods for nonconvex stochastic optimization, SIAM J. Optim., vol. 27, no. 2, pp. 927-956, 2017.
[22]
Y.You, Z.Zhang, C. J.Hsieh, J.Demmel, and K.Keutzer, ImageNet training in minutes, in Proc. 47th Int. Conf. on Parallel Processing, Eugene, OR, USA, 2018, pp. 1-10.
[23]
Y.You, I.Gitman, and B.Ginsburg, Large batch training of convolutional networks, arXiv preprint arXiv: 1708.03888, 2017.
N. S.Keskar, D.Mudigere, J.Nocedal, M.Smelyanskiy, and P. T. P.Tang, On large-batch training for deep learning: generalization gap and sharp minima, in Proc. 5th Int. Conf. on Learning Representations, Toulon, France, 2017.
[27]
K. M.He, X. Y.Zhang, S. Q.Ren, and J.Sun, Deep residual learning for image recognition, in Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778.
[28]
A.Krizhevsky and G.Hinton, Learning Multiple Layers of Features from Tiny Images, Toronto, Canada: University of Toronto, 2009.
[29]
S.Ghadimi, G.Lan, and H. C.Zhang, Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Math. Program., vol. 155, nos. 1&2, pp. 267-305, 2016.
[30]
S. L.Smith and Q. V.Le, A Bayesian perspective on generalization and stochastic gradient descent, in Proc. 6th Int. Conf. on Learning Representations, Vancouver, Canada, 2018.
[31]
D.Masters and C.Luschi, Revisiting small batch training for deep neural networks, arXiv preprint arXiv: 1804.07612, 2018.
P.Chaudhari, A.Choromanska, S.Soatto, Y.LeCun, C.Baldassi, C.Borgs, J.Chayes, L.Sagun, and R.Zecchina, Entropy-SGD: Biasing gradient descent into wide valleys, in Proc. 5th Int. Conf. on Learning Representations, Toulon, France, 2017.
[34]
Q. X.Li, C.Tai, and W.E, Stochastic modified equations and adaptive stochastic gradient algorithms, in Proc. 34th Int. Conf. on Machine Learning, Sydney, Australia, 2017, pp. 2101-2110.
[35]
F. Y.Zou, L.Shen, Z. Q.Jie, J.Sun, and W.Liu, Weighted adagrad with unified momentum, arXiv preprint arXiv: 1808.03408, 2018.
[37]
D. P.Kingma and J.Ba, Adam: A method for stochastic optimization, in Proc. 3rd Int. Conf. on Learning Representations, San Diego, CA, USA, 2015.
[38]
J.Duchi, E.Hazan, and Y.Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., vol. 12, pp. 2121-2159, 2011.
[39]
T.Tieleman and G.Hinton, Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Netw. Mach. Learn., vol. 4, pp. 26-30, 2012.
[40]
Y.Nesterov, A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$, Soviet Math. Dokl., vol. 27, no. 2, pp. 372-376, 1983.
S.Jastrzebski, Z.Kenton, D.Arpit, N.Ballas, A.Fischer, Y.Bengio, and A.Storkey, Three factors influencing minima in SGD, arXiv preprint arXiv: 1711.04623, 2017.
S.Ioffe and C.Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proc. 32nd Int. Conf. on Machine Learning, Lille, France, 2015, pp. 448-456.
[45]
S. L.Smith, P. J.Kindermans, C.Ying, and Q. V.Le, Don't decay the learning rate, increase the batch size, in Proc. 6th Int. Conf. on Learning Representations, Vancouver, Canada, 2018.
[46]
T. Q.Chen, M.Li, Y. T.Li, M.Lin, N. Y.Wang, M. J.Wang, T. J.Xiao, B.Xu, C. Y.Zhang, and Z.Zhang, MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint arXiv:1512.01274, 2015.
M. P.Marcus, M. A.Marcinkiewicz, and B.Santorini, Building a large annotated corpus of English: The Penn Treebank, Comput. Linguist., vol. 19, no. 2, pp. 313-330, 1993.
I.Sutskever, J.Martens, G.Dahl, and G.Hinton, On the importance of initialization and momentum in deep learning, in Proc. 30th Int. Conf. on Machine Learning, 2013, pp. 1139-1147.
[50]
S.Ghadimi and G. H.Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., vol. 23, no. 4, pp. 2341-2368, 2013.
[51]
F. Y.Zou, L.Shen, Z. Q.Jie, W. Z.Zhang, and W.Liu, A sufficient condition for convergences of Adam and RMSProp, in Proc. 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019.
[52]
S. J.Reddi, S.Kale, and S.Kumar, On the convergence of Adam and beyond, in Proc. 6th Int. Conf. on Learning Representations, Vancouver, Canada, 2018.
[53]
S.Hochreiter and J.Schmidhuber, Long short-term memory, Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
Tang Y, Kan Z, Yin L, et al. Increasing Momentum-Like Factors: A Method for Reducing Training Errors on Multiple GPUs. Tsinghua Science and Technology, 2022, 27(1): 114-126. https://doi.org/10.26599/TST.2020.9010023
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
2.1 Stochastic gradient descent and its variants
SGD[18] is one of the simplest first-order algorithms, but it introduces noise into the gradient, which can hinder optimization during training[29]. SGD randomly selects one sample or a random sample set at a time for parameter updates, so updating the parameters does not create redundancy. When the data size is large, SGD can effectively accelerate the training process.
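For reference, the minibatch update described above can be written in a few lines. This is a generic sketch with placeholder `grad_fn` and `data` arguments rather than code from our experiments, but it fixes the two quantities, the learning rate and the batch size, that appear in the noise-scale discussion below.

```python
import numpy as np

def sgd_step(params, grad_fn, data, batch_size, lr, rng):
    """One minibatch SGD step: sample a batch of examples, average their
    gradients, and move the parameters against that gradient estimate."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    grad = np.mean([grad_fn(params, data[i]) for i in idx], axis=0)
    return params - lr * grad

# Example: rng = np.random.default_rng(0); params = sgd_step(params, grad_fn, data, 128, 0.1, rng)
```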
Smith and Le[30] stated that SGD should be interpreted as integrating a stochastic differential equation. They also presented the scale of random fluctuations in the SGD dynamics as
$$ g = \epsilon\left(\frac{N}{B}-1\right) \approx \frac{\epsilon N}{B} \tag{1} $$
where $\epsilon$ is the learning rate, $N$ is the size of the training set, and $B$ is the batch size. If we reduce the learning rate $\epsilon$, the noise scale drops, which leads to improved training performance. If we keep the learning rate constant, we can also increase the batch size $B$ to reduce the negative impact of the noise scale. By contrast, a small batch size increases the noise scale and adversely affects the training performance. According to Ref. [31], calculating the mean and variance values over a batch makes the loss computed for a particular example dependent on the other examples in the same batch. Therefore, if the batch size is large, a high dependency between samples in the batch limits the training performance. This fluctuation scale can also be considered a noise factor. When $B \ll N$, applying a linear scaling rule[32] keeps the mean SGD weight update per training sample constant. A specific description of the linear scaling rule[32] is given in Section 2.2.
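As a numerical illustration (not part of the original analysis), the snippet below evaluates the approximate noise scale of Eq. (1) for CIFAR-10-sized data and shows how the linear scaling rule keeps it roughly constant when the batch size grows; the optional momentum argument anticipates Eq. (2) below.

```python
def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale: g = lr / (1 - m) * (N / B - 1)."""
    return lr / (1.0 - momentum) * (train_size / batch_size - 1.0)

N = 50_000                           # CIFAR-10 training set size
print(noise_scale(0.1, N, 256))      # ~19.4
print(noise_scale(0.2, N, 512))      # ~19.3: lr scaled with B, noise scale preserved
print(noise_scale(0.1, N, 512))      # ~9.7:  lr fixed, larger B halves the noise
```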
SGD has several variants[33, 34, 35]. A common one is SGD with momentum[36]; specifically, Smith and Le[30] extended the traditional SGD analysis to include momentum and found that the noise factor becomes
$$ g = \frac{\epsilon}{1-m}\left(\frac{N}{B}-1\right) \approx \frac{\epsilon N}{B(1-m)} \tag{2} $$
where $m$ is the momentum. Equation (2) degenerates into Eq. (1) when $m = 0$. If a linear scaling rule is adopted, $\epsilon/B$ is held constant; then $g \approx \epsilon N/(B(1-m))$, which increases as $m$ rises and may cause a drop in generalization performance. Adam[37] combines the advantages of two optimization algorithms, namely, AdaGrad[38] and RMSProp[39]. Adam evaluates the first and second moment estimates of the gradient and then calculates the update step. This algorithm is a second-order stochastic gradient method in the sense that it relies on second-moment estimates: it adjusts the learning rate for each parameter, performing small updates for frequently updated parameters and large updates for rarely updated parameters. In the vanilla Adam algorithm[37], $\beta_1$ and $\beta_2$ are set to control the influence of the gradients and of the squared gradients on the parameter update, respectively; they play a role similar to that of the momentum in SGD with momentum. NAG[40] is an improvement on the momentum method[36] and a first-order optimizer. It updates with a gradient evaluated at a "look-ahead" position instead of the current one, thereby accounting for the variation of the gradient with respect to the previous step. By utilizing these values, NAG updates the parameters during training.
SGD and its variants, known as stochastic gradient algorithms, can be regarded as first-order or second-order stochastic gradient algorithms, as shown in Algorithms 1 and 2, respectively.
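Algorithms 1 and 2 give the pseudocode for these optimizers; the sketch below is a generic rendering of the standard update rules they build on (not the authors' implementation), with the momentum-like factor made explicit: $m$ for momentum SGD and NAG, and $\beta_1$ for Adam.

```python
import numpy as np

def momentum_sgd_update(theta, grad, v, lr, m=0.9):
    """Heavy-ball momentum SGD (first-order); m is the momentum-like factor."""
    v = m * v + grad
    return theta - lr * v, v

def nag_update(theta, grad_fn, v, lr, m=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the
    'look-ahead' point before updating."""
    lookahead_grad = grad_fn(theta - lr * m * v)
    v = m * v + lookahead_grad
    return theta - lr * v, v

def adam_update(theta, grad, s, r, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam (uses second-moment estimates); beta1 plays the momentum-like
    role and beta2 controls the influence of the squared gradients."""
    s = beta1 * s + (1 - beta1) * grad          # first moment estimate
    r = beta2 * r + (1 - beta2) * grad ** 2     # second moment estimate
    s_hat = s / (1 - beta1 ** t)                # bias corrections (t >= 1)
    r_hat = r / (1 - beta2 ** t)
    return theta - lr * s_hat / (np.sqrt(r_hat) + eps), s, r
```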
Fig. 1 Parameter update change when the MF value is increased as the gradient declines. To obtain the same parameter update despite the decline of the gradients, we ought to increase the MF from its original value (the brown line) to a larger one (the red line). This method reduces the training errors accordingly.
Fig. 2 Parameter update change when the MF value is increased while the gradients do not change. Increasing the MF results in a new parameter update direction (the red one). This new direction can correct the error caused by a large batch size or multi-GPU training.
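The intuition behind Figs. 1 and 2 can be written out compactly. For illustration only, assume the common heavy-ball form of the update for a scalar parameter (the exact formulation used in our algorithms may differ in details):
$$ v_t = m\,v_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \epsilon\,v_t $$
If the averaged gradient shrinks from $g_t$ to $g'_t$ (e.g., because a larger total batch is spread over more GPUs), the same update $v_t$ is recovered by raising the MF from $m$ to $m'$ such that $m'\,v_{t-1} + g'_t = m\,v_{t-1} + g_t$, i.e., $m' = m + (g_t - g'_t)/v_{t-1}$.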
3.3 Distributed algorithm
In this section, we present distributed algorithms. As we use the parameter server architecture in our experiments, we show the training algorithms of workers and servers in Algorithms 3 and 4, respectively.
Algorithm 3 describes the training procedure on the workers in the parameter server architecture. In Algorithm 3, we input all the hyperparameters needed by the experiments and allocate the GPUs as workers in the initialization phase. In the t-th iteration, a worker sends the servers the Pull trigger and pulls the last saved parameters. These parameters are fed into Algorithm 1 or 2 according to the chosen optimizer OP. Finally, the worker sends the Push trigger and pushes the updated parameters to the servers. Algorithm 4 shows the training procedure on the servers. It has the same initialization phase as Algorithm 3 but a different iteration process. In the t-th iteration, upon receiving the Pull trigger from a worker, the servers push the last saved parameters to that worker; upon receiving the Push trigger, the servers pull the updated parameters and save them locally.
Algorithms 1 and 2 are performed on the basis of the first- and second-order stochastic gradients. All of our experiments are based on these algorithms.
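The worker/server interaction described above can be sketched as follows. This is a simplified, hypothetical rendering of Algorithms 3 and 4: a blocking key-value store `ps` with `pull`/`push`-style methods is assumed for illustration (it is not MXNet's actual KVStore API), and only the message flow is shown.

```python
def worker_loop(ps, grad_fn, optimizer_step, num_iters):
    """Algorithm 3 (sketch): a worker pulls the last saved parameters,
    runs one step of Algorithm 1 or 2 (chosen by OP), and pushes the
    updated parameters back to the servers."""
    for t in range(1, num_iters + 1):
        params = ps.pull("params")            # Pull trigger + pull parameters
        grads = grad_fn(params)               # gradients on the local minibatch
        params = optimizer_step(params, grads, t)
        ps.push("params", params)             # Push trigger + push parameters

def server_loop(ps, init_params):
    """Algorithm 4 (sketch): the servers answer Pull requests with the
    last saved parameters and save whatever the workers push."""
    ps.store("params", init_params)
    while True:
        msg = ps.wait_for_message()
        if msg.kind == "pull":
            ps.reply(msg, ps.load("params"))  # push parameters to the worker
        elif msg.kind == "push":
            ps.store("params", msg.value)     # save the updated parameters locally
```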
Fig. 3 Validation accuracy of ResNet-50 on CIFAR-10 for different batch sizes on multiple GPUs in the parameter server, based on SGD. In these experiments, all hyperparameters, except the batch size and the number of GPUs, are set to their default values.
Fig. 4 Validation accuracy of ResNet-50 on CIFAR-10 for different batch sizes on multiple GPUs in the parameter server, based on Adam. The hyperparameters are set to be the same as those in the previous SGD experiments.
Fig. 5 Validation accuracy for different MFs on multiple GPUs utilizing SGD. In these experiments, we set the MF values to 0.9, 0.95, 0.975, and 0.99.
Fig. 6 Validation accuracy for different $\beta_1$ values on multiple GPUs utilizing Adam. In these experiments, we set the $\beta_1$ values to 0.9, 0.95, 0.975, and 0.99.
Fig. 7 Validation accuracy of ResNet-50 on CIFAR-10 given different batch sizes on multiple GPUs in the parameter server, based on NAG. The hyperparameters are set to be the same as those in the previous SGD and Adam experiments.
Fig. 8 MLP test results on MNIST for different momentum values. The MLP comprises three fully connected layers, two activation layers, and a softmax layer.