Science and Technology on Parallel and Distributed Processing Laboratory, and College of Computer Science and Technology, National University of Defense Technology, Changsha 473000, China
Abstract
In distributed training, increasing the batch size can improve parallelism, but it can also introduce many difficulties into the training process and cause training errors. In this work, we investigate the occurrence of training errors in theory and train ResNet-50 on CIFAR-10 by using Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) while keeping the total batch size in the parameter server constant and lowering the batch size on each Graphics Processing Unit (GPU). A new method that considers momentum is proposed to eliminate training errors in distributed training. We define a Momentum-like Factor (MF) to represent the influence of former gradients on parameter updates in each iteration. We then modify the MF values and conduct experiments to explore how different MF values influence the training performance based on SGD, Adam, and Nesterov Accelerated Gradient (NAG). Experimental results reveal that increasing the MF is a reliable method for reducing training errors in distributed training. An analysis of the convergence conditions in distributed training, with consideration of a large batch size and multiple GPUs, is also presented in this paper.
References
[1]
Y.Tang, L. J.Yin, Z. N.Zhang, and D. S.Li, Rise the momentum: A method for reducing the training error on multiple GPUs, in Algorithms and Architectures for Parallel Processing, S.Wen, A.Zomaya, and L. T.Yang, eds. Cham, Switzerland: Springer, 2020.
[2]
F.Chollet, Xception: Deep learning with depthwise separable convolutions, in Proc. 2017 IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 1800-1807.
[3]
G.Huang, Z.Liu, L.Van Der Maaten, and K. Q.Weinberger, Densely connected convolutional networks, in Proc. 2017 IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 2261-2269.
[4]
W.Liu, D.Anguelov, D.Erhan, C.Szegedy, S.Reed, C. Y.Fu, and A. C.Berg, SSD: Single shot multibox detector, in Proc. 14th European Conf. on Computer Vision, Amsterdam, the Netherlands, 2016, pp. 21-37.
[5]
J. F.Dai, Y.Li, K. M.He, and J.Sun, R-FCN: Object detection via region-based fully convolutional networks, in Proc. 30th Conf. on Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 379-387.
[6]
R.Girshick, J.Donahue, T.Darrell, and J.Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proc. 2014 IEEE Conf. on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 580-587.
[7]
J.Long, E.Shelhamer, and T.Darrell, Fully convolutional networks for semantic segmentation, in Proc. 2015 IEEE Conf. on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 3431-3440.
[8]
J. F.Dai, K. M.He, and J.Sun, Instance-aware semantic segmentation via multi-task network cascades, in Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 3150-3158.
[9]
C.Szegedy, S.Ioffe, V.Vanhoucke, and A. A.Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in Proc. 31st AAAI Conf. on Artificial Intelligence, San Francisco, CA, USA, 2017.
[10]
O.Russakovsky, J.Deng, H.Su, J.Krause, S.Satheesh, S. A.Ma, Z. H.Huang, A.Karpathy, A.Khosla, M.Bernstein, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, 2015.
[11]
Z.Qin, Z. N.Zhang, X. T.Chen, C. J.Wang, and Y. X.Peng, FD-MobileNet: Improved MobileNet with a fast downsampling strategy, in Proc. 2018 25th IEEE Int. Conf. on Image Processing, Athens, Greece, 2018, pp. 1363-1367.
[12]
D. S.Li, Z. Q.Lai, K. S.Ge, Y. M.Zhang, Z. N.Zhang, Q. L.Wang, and H. M.Wang, HPDL: Towards a general framework for high-performance distributed deep learning, in Proc. 2019 IEEE 39th Int. Conf. on Distributed Computing Systems, Dallas, TX, USA, 2019.
[13]
F.Tong and X. L.Liu, Samples selection for artificial neural network training in preliminary structural design, Tsinghua Science and Technology, vol. 10, no. 2, pp. 233-239, 2005.
[14]
Z. Y.Hu, D. S.Li, and D. K.Guo, Balance resource allocation for spark jobs based on prediction of the optimal resource, Tsinghua Science and Technology, vol. 25, no. 4, pp. 487-497, 2020.
[15]
L.Guan, T.Sun, L. B.Qiao, Z. H.Yang, D. S.Li, K. S.Ge, and X. C.Lu, An efficient parallel and distributed solution to nonconvex penalized linear SVMs, Front. Inf. Technol. Electron. Eng., vol. 21, no. 4, pp. 587-603, 2020.
[16]
K. S.Ge, H. Y.Su, D. S.Li, and X. C.Lu, Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit, Front. Inf. Technol. Electron. Eng., vol. 18, no. 7, pp. 915-927, 2017.
[17]
M.Li, D. G.Andersen, J. W.Park, A. J.Smola, A.Ahmed, V.Josifovski, J.Long, E. J.Shekita, and B. Y.Su, Scaling distributed machine learning with the parameter server, in Proc. 11th USENIX Symposium on Operating Systems Design and Implementation, Broomfield, CO, USA, 2014.
[18]
P.Goyal, P.Dollár, R.Girshick, P.Noordhuis, L.Wesolowski, A.Kyrola, A.Tulloch, Y. Q.Jia, and K. M.He, Accurate, large minibatch SGD: Training ImageNet in 1 hour, arXiv preprint arXiv: 1706.02677, 2017.
[19]
L.Shen, P.Sun, Y. T.Wang, W.Liu, and T.Zhang, An algorithmic framework of variable metric over-relaxed hybrid proximal extra-gradient method, in Proc. 35th Int. Conf. on Machine Learning, Stockholm, Sweden, 2018.
[20]
L.Shen, W.Liu, G. Z.Yuan, and S. Q.Ma, GSOS: Gauss-Seidel operator splitting algorithm for multi-term nonsmooth convex composite optimization, in Proc. 34th Int. Conf. on Machine Learning, Sydney, Australia, 2017, pp. 3125-3134.
[21]
X.Wang, S. Q.Ma, D.Goldfarb, and W.Liu, Stochastic quasi-Newton methods for nonconvex stochastic optimization, SIAM J. Optim., vol. 27, no. 2, pp. 927-956, 2017.
[22]
Y.You, Z.Zhang, C. J.Hsieh, J.Demmel, and K.Keutzer, ImageNet training in minutes, in Proc. 47th Int. Conf. on Parallel Processing, Eugene, OR, USA, 2018, pp. 1-10.
[23]
Y.You, I.Gitman, and B.Ginsburg, Large batch training of convolutional networks, arXiv preprint arXiv: 1708.03888, 2017.
N. S.Keskar, D.Mudigere, J.Nocedal, M.Smelyanskiy, and P. T. P.Tang, On large-batch training for deep learning: generalization gap and sharp minima, in Proc. 5th Int. Conf. on Learning Representations, Toulon, France, 2017.
[27]
K. M.He, X. Y.Zhang, S. Q.Ren, and J.Sun, Deep residual learning for image recognition, in Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778.
[28]
A.Krizhevsky and G.Hinton, Learning Multiple Layers of Features from Tiny Images, Toronto, Canada: University of Toronto, 2009.
[29]
S.Ghadimi, G.Lan, and H. C.Zhang, Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Math. Program., vol. 155, nos. 1&2, pp. 267-305, 2016.
[30]
S. L.Smith and Q. V.Le, A Bayesian perspective on generalization and stochastic gradient descent, in Proc. 6th Int. Conf. on Learning Representations, Vancouver, Canada, 2018.
[31]
D.Masters and C.Luschi, Revisiting small batch training for deep neural networks, arXiv preprint arXiv: 1804.07612, 2018.
P.Chaudhari, A.Choromanska, S.Soatto, Y.LeCun, C.Baldassi, C.Borgs, J.Chayes, L.Sagun, and R.Zecchina, Entropy-SGD: Biasing gradient descent into wide valleys, in Proc. 5th Int. Conf. on Learning Representations, Toulon, France, 2017.
[34]
Q. X.Li, C.Tai, and W.E, Stochastic modified equations and adaptive stochastic gradient algorithms, in Proc. 34th Int. Conf. on Machine Learning, Sydney, Australia, 2017, pp. 2101-2110.
[35]
F. Y.Zou, L.Shen, Z. Q.Jie, J.Sun, and W.Liu, Weighted adagrad with unified momentum, arXiv preprint arXiv: 1808.03408, 2018.
[37]
D. P.Kingma and J.Ba, Adam: A method for stochastic optimization, in Proc. 3rd Int. Conf. on Learning Representations, San Diego, CA, USA, 2015.
[38]
J.Duchi, E.Hazan, and Y.Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., vol. 12, pp. 2121-2159, 2011.
[39]
T.Tieleman and G.Hinton, Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Netw. Mach. Learn., vol. 4, pp. 26-30, 2012.
[40]
Y.Nesterov, A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$, Soviet Math. Dokl., vol. 27, no. 2, pp. 372-376, 1983.
S.Jastrzebski, Z.Kenton, D.Arpit, N.Ballas, A.Fischer, Y.Bengio, and A.Storkey, Three factors influencing minima in SGD, arXiv preprint arXiv: 1711.04623, 2017.
S.Ioffe and C.Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proc. 32nd Int. Conf. on Machine Learning, Lille, France, 2015, pp. 448-456.
[45]
S. L.Smith, P. J.Kindermans, C.Ying, and Q. V.Le, Don't decay the learning rate, increase the batch size, in Proc. 6th Int. Conf. on Learning Representations, Vancouver, Canada, 2018.
[46]
T. Q.Chen, M.Li, Y. T.Li, M.Lin, N. Y.Wang, M. J.Wang, T. J.Xiao, B.Xu, C. Y.Zhang, and Z.Zhang, MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint arXiv:1512.01274, 2015.
M. P.Marcus, M. A.Marcinkiewicz, and B.Santorini, Building a large annotated corpus of English: The Penn Treebank, Comput. Linguist., vol. 19, no. 2, pp. 313-330, 1993.
I.Sutskever, J.Martens, G.Dahl, and G.Hinton, On the importance of initialization and momentum in deep learning, in Proc. 30th Int. Conf. on Machine Learning, 2013, pp. 1139-1147.
[50]
S.Ghadimi and G. H.Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., vol. 23, no. 4, pp. 2341-2368, 2013.
[51]
F. Y.Zou, L.Shen, Z. Q.Jie, W. Z.Zhang, and W.Liu, A sufficient condition for convergences of Adam and RMSProp, in Proc. 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019.
[52]
S. J.Reddi, S.Kale, and S.Kumar, On the convergence of Adam and beyond, in Proc. 6th Int. Conf. on Learning Representations, Vancouver, Canada, 2018.
[53]
S.Hochreiter and J.Schmidhuber, Long short-term memory, Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
Tang Y, Kan Z, Yin L, et al. Increasing Momentum-Like Factors: A Method for Reducing Training Errors on Multiple GPUs. Tsinghua Science and Technology, 2022, 27(1): 114-126. https://doi.org/10.26599/TST.2020.9010023
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
2.1 Stochastic gradient descent and its variants
SGD[18] is one of the simplest first-order algorithms, but it introduces noise into the gradient, which can hinder optimization during training[29]. SGD randomly selects one sample or a random sample set at a time for parameter updates, so updating the parameters does not create redundancy. When the data size is large, SGD can effectively accelerate the training process.
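For reference, the minibatch update described above can be written in a few lines. This is a generic sketch with placeholder `grad_fn` and `data` arguments rather than code from our experiments, but it fixes the two quantities, the learning rate and the batch size, that appear in the noise-scale discussion below.

```python
import numpy as np

def sgd_step(params, grad_fn, data, batch_size, lr, rng):
    """One minibatch SGD step: sample a batch of examples, average their
    gradients, and move the parameters against that gradient estimate."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    grad = np.mean([grad_fn(params, data[i]) for i in idx], axis=0)
    return params - lr * grad

# Example: rng = np.random.default_rng(0); params = sgd_step(params, grad_fn, data, 128, 0.1, rng)
```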
Smith and Le[30] stated that SGD should be interpreted as integrating a stochastic differential equation. They also presented the scale of random fluctuations in the SGD dynamics as
$$ g = \epsilon\left(\frac{N}{B}-1\right) \approx \frac{\epsilon N}{B} \tag{1} $$
where $\epsilon$ is the learning rate, $N$ is the size of the training set, and $B$ is the batch size. If we reduce the learning rate $\epsilon$, the noise scale drops, which leads to improved training performance. If we keep the learning rate constant, we can also increase the batch size $B$ to reduce the negative impact of the noise scale. By contrast, a small batch size increases the noise scale and adversely affects the training performance. According to Ref. [31], calculating the mean and variance values over a batch makes the loss computed for a particular example dependent on the other examples in the same batch. Therefore, if the batch size is large, a high dependency between samples in the batch limits the training performance. This fluctuation scale can also be considered a noise factor. When $B \ll N$, applying a linear scaling rule[32] keeps the mean SGD weight update per training sample constant. A specific description of the linear scaling rule[32] is given in Section 2.2.
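As a numerical illustration (not part of the original analysis), the snippet below evaluates the approximate noise scale of Eq. (1) for CIFAR-10-sized data and shows how the linear scaling rule keeps it roughly constant when the batch size grows; the optional momentum argument anticipates Eq. (2) below.

```python
def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale: g = lr / (1 - m) * (N / B - 1)."""
    return lr / (1.0 - momentum) * (train_size / batch_size - 1.0)

N = 50_000                           # CIFAR-10 training set size
print(noise_scale(0.1, N, 256))      # ~19.4
print(noise_scale(0.2, N, 512))      # ~19.3: lr scaled with B, noise scale preserved
print(noise_scale(0.1, N, 512))      # ~9.7:  lr fixed, larger B halves the noise
```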
SGD has several variants[33, 34, 35]. A common one is SGD with momentum[36]; specifically, Smith and Le[30] extended the traditional SGD analysis to include momentum and found that the noise factor becomes
$$ g = \frac{\epsilon}{1-m}\left(\frac{N}{B}-1\right) \approx \frac{\epsilon N}{B(1-m)} \tag{2} $$
where $m$ is the momentum. Equation (2) degenerates into Eq. (1) when $m = 0$. If a linear scaling rule is adopted, $\epsilon/B$ is held constant; then $g \approx \epsilon N/(B(1-m))$, which increases as $m$ rises and may cause a drop in generalization performance. Adam[37] combines the advantages of two optimization algorithms, namely, AdaGrad[38] and RMSProp[39]. Adam evaluates the first and second moment estimates of the gradient and then calculates the update step. This algorithm is a second-order stochastic gradient method in the sense that it relies on second-moment estimates: it adjusts the learning rate for each parameter, performing small updates for frequently updated parameters and large updates for rarely updated parameters. In the vanilla Adam algorithm[37], $\beta_1$ and $\beta_2$ are set to control the influence of the gradients and of the squared gradients on the parameter update, respectively; they play a role similar to that of the momentum in SGD with momentum. NAG[40] is an improvement on the momentum method[36] and a first-order optimizer. It updates with a gradient evaluated at a "look-ahead" position instead of the current one, thereby accounting for the variation of the gradient with respect to the previous step. By utilizing these values, NAG updates the parameters during training.
SGD and its variants, known as stochastic gradient algorithms, can be regarded as first-order or second-order stochastic gradient algorithms, as shown in Algorithms 1 and 2, respectively.
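Algorithms 1 and 2 give the pseudocode for these optimizers; the sketch below is a generic rendering of the standard update rules they build on (not the authors' implementation), with the momentum-like factor made explicit: $m$ for momentum SGD and NAG, and $\beta_1$ for Adam.

```python
import numpy as np

def momentum_sgd_update(theta, grad, v, lr, m=0.9):
    """Heavy-ball momentum SGD (first-order); m is the momentum-like factor."""
    v = m * v + grad
    return theta - lr * v, v

def nag_update(theta, grad_fn, v, lr, m=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the
    'look-ahead' point before updating."""
    lookahead_grad = grad_fn(theta - lr * m * v)
    v = m * v + lookahead_grad
    return theta - lr * v, v

def adam_update(theta, grad, s, r, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam (uses second-moment estimates); beta1 plays the momentum-like
    role and beta2 controls the influence of the squared gradients."""
    s = beta1 * s + (1 - beta1) * grad          # first moment estimate
    r = beta2 * r + (1 - beta2) * grad ** 2     # second moment estimate
    s_hat = s / (1 - beta1 ** t)                # bias corrections (t >= 1)
    r_hat = r / (1 - beta2 ** t)
    return theta - lr * s_hat / (np.sqrt(r_hat) + eps), s, r
```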
Fig. 1 Parameter update change when the MF value is increased as the gradient declines. To obtain the same parameter update despite the decline of the gradients, we ought to increase the MF from its original value (the brown line) to a larger one (the red line). This method reduces the training errors accordingly.
Fig. 2 Parameter update change when the MF value is increased while the gradients do not change. Increasing the MF results in a new parameter update direction (the red one). This new direction can correct the error caused by a large batch size or multi-GPU training.
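The intuition behind Figs. 1 and 2 can be written out compactly. For illustration only, assume the common heavy-ball form of the update for a scalar parameter (the exact formulation used in our algorithms may differ in details):
$$ v_t = m\,v_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \epsilon\,v_t $$
If the averaged gradient shrinks from $g_t$ to $g'_t$ (e.g., because a larger total batch is spread over more GPUs), the same update $v_t$ is recovered by raising the MF from $m$ to $m'$ such that $m'\,v_{t-1} + g'_t = m\,v_{t-1} + g_t$, i.e., $m' = m + (g_t - g'_t)/v_{t-1}$.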
3.3 Distributed algorithm
In this section, we present distributed algorithms. As we use the parameter server architecture in our experiments, we show the training algorithms of workers and servers in Algorithms 3 and 4, respectively.
Algorithm 3 describes the training procedure on the workers in the parameter server architecture. In Algorithm 3, we input all the hyperparameters needed by the experiments and allocate the GPUs as workers in the initialization phase. In the t-th iteration, a worker sends the servers the Pull trigger and pulls the last saved parameters. These parameters are fed into Algorithm 1 or 2 according to the chosen optimizer OP. Finally, the worker sends the Push trigger and pushes the updated parameters to the servers. Algorithm 4 shows the training procedure on the servers. It has the same initialization phase as Algorithm 3 but a different iteration process. In the t-th iteration, upon receiving the Pull trigger from a worker, the servers push the last saved parameters to that worker; upon receiving the Push trigger, the servers pull the updated parameters and save them locally.
Algorithms 1 and 2 are performed on the basis of the first- and second-order stochastic gradients. All of our experiments are based on these algorithms.
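The worker/server interaction described above can be sketched as follows. This is a simplified, hypothetical rendering of Algorithms 3 and 4: a blocking key-value store `ps` with `pull`/`push`-style methods is assumed for illustration (it is not MXNet's actual KVStore API), and only the message flow is shown.

```python
def worker_loop(ps, grad_fn, optimizer_step, num_iters):
    """Algorithm 3 (sketch): a worker pulls the last saved parameters,
    runs one step of Algorithm 1 or 2 (chosen by OP), and pushes the
    updated parameters back to the servers."""
    for t in range(1, num_iters + 1):
        params = ps.pull("params")            # Pull trigger + pull parameters
        grads = grad_fn(params)               # gradients on the local minibatch
        params = optimizer_step(params, grads, t)
        ps.push("params", params)             # Push trigger + push parameters

def server_loop(ps, init_params):
    """Algorithm 4 (sketch): the servers answer Pull requests with the
    last saved parameters and save whatever the workers push."""
    ps.store("params", init_params)
    while True:
        msg = ps.wait_for_message()
        if msg.kind == "pull":
            ps.reply(msg, ps.load("params"))  # push parameters to the worker
        elif msg.kind == "push":
            ps.store("params", msg.value)     # save the updated parameters locally
```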
Fig. 3 Validation accuracy of ResNet-50 on CIFAR-10 for different batch sizes on multiple GPUs in the parameter server, based on SGD. In these experiments, all hyperparameters, except the batch size and the number of GPUs, are set to their default values.
Fig. 4 Validation accuracy of ResNet-50 on CIFAR-10 for different batch sizes on multiple GPUs in the parameter server, based on Adam. The hyperparameters are set to be the same as those in the previous SGD experiments.
Fig. 5 Validation accuracy for different MFs on multiple GPUs utilizing SGD. In these experiments, we set the MF values to 0.9, 0.95, 0.975, and 0.99.
Fig. 6 Validation accuracy for different $\beta_1$ values on multiple GPUs utilizing Adam. In these experiments, we set the $\beta_1$ values to 0.9, 0.95, 0.975, and 0.99.
Fig. 7 Validation accuracy of ResNet-50 on CIFAR-10 given different batch sizes on multiple GPUs in the parameter server, based on NAG. The hyperparameters are set to be the same as those in the previous SGD and Adam experiments.
Fig. 8 MLP test results on MNIST for different momentum values. The MLP comprises three fully connected layers, two activation layers, and a softmax layer.