National Laboratory for Parallel and Distributed Processing (PDL), College of Computer, National University of Defense Technology, Changsha 410073, China
Abstract
The proliferation of massive datasets has led to significant interest in distributed algorithms for solving large-scale machine learning problems. However, communication overhead is a major bottleneck that hampers the scalability of distributed machine learning systems. In this paper, we design two communication-efficient algorithms for distributed learning tasks. The first is named EF-SIGNGD, in which we use 1-bit (sign-based) gradient quantization to save communication bits. Moreover, the error feedback technique, i.e., incorporating the error made by the compression operator into the next step, is employed to guarantee convergence. The second algorithm is called LE-SIGNGD, in which we introduce a well-designed lazy gradient aggregation rule to EF-SIGNGD that can detect gradients with small changes and reuse the outdated information. LE-SIGNGD saves communication costs in both transmitted bits and communication rounds. Furthermore, we show that LE-SIGNGD is convergent under mild assumptions. The effectiveness of the two proposed algorithms is demonstrated by experiments on both real and synthetic data.
References
[1]
A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola, Distributed large-scale natural graph factorization, in Proc. 22nd Int. Conf. World Wide Web, Rio de Janeiro, Brazil, 2013, pp. 37-48.
[2]
J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, et al., Large scale distributed deep networks, in Proc. 25th Int. Conf. Neural Information Processing Systems, Red Hook, NY, USA, 2012, pp. 1223-1231.
[3]
M. Li, D. G. Andersen, A. Smola, and K. Yu, Communication efficient distributed machine learning with the parameter server, in Proc. 27th Int. Conf. Neural Information Processing Systems, Cambridge, MA, USA, 2014, pp. 19-27.
[4]
D. S. Li, Z. Q. Lai, K. S. Ge, Y. M. Zhang, Z. N. Zhang, Q. L. Wang, and H. M. Wang, HPDL: Towards a general framework for high-performance distributed deep learning, presented at 2019 IEEE 39th Int. Conf. Distributed Computing Systems (ICDCS), Dallas, TX, USA, 2019, pp. 1742-1753.
[5]
K. M. Nan, S. C. Liu, J. Z. Du, and H. Liu, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology, vol. 24, no. 6, pp. 677-693, 2019.
J. Q. Huang, W. T. Han, X. Y. Wang, and W. G. Chen, Heterogeneous parallel algorithm design and performance optimization for WENO on the Sunway TaihuLight supercomputer, Tsinghua Science and Technology, vol. 25, no. 1, pp. 56-67, 2020.
L. Guan, T. Sun, L. B. Qiao, Z. H. Yang, D. S. Li, K. S. Ge, and X. C. Lu, An efficient parallel and distributed solution to nonconvex penalized linear SVMs, Frontiers of Information Technology & Electronic Engineering, vol. 21, no. 4, pp. 587-603, 2020.
G. B. Giannakis, Q. Ling, G. Mateos, I. D. Schizas, and H. Zhu, Decentralized learning for wireless communications and networking, in Splitting Methods in Communication, Imaging, Science, and Engineering, R. Glowinski, S. Osher, and W. Yin, eds. Cham, Switzerland: Springer, 2016, pp. 461-497.
[10]
M. I. Jordan, J. D. Lee, and Y. Yang, Communication-efficient distributed statistical inference, Journal of the American Statistical Association, vol. 114, no. 526, pp. 668-681, 2019.
A. Nedić, A. Olshevsky, and M. G. Rabbat, Network topology and communication-computation tradeoffs in decentralized optimization, Proceedings of the IEEE, vol. 106, no. 5, pp. 953-976, 2018.
S. Zheng, Z. Y. Huang, and J. T. Kwok, Communication-efficient distributed blockwise momentum SGD with error-feedback, arXiv preprint arXiv: 1905.10936, 2019.
F. Ablayev, M. Ablayev, J. Z. Huang, K. Khadiev, N. Salikhova, and D. M. Wu, On quantum methods for machine learning problems part II: Quantum classification algorithms, Big Data Mining and Analytics, vol. 3, no. 1, pp. 56-67, 2020.
D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, QSGD: Communication-efficient SGD via gradient quantization and encoding, presented at Advances in Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 1709-1720.
[16]
J. Sun, T. Y. Chen, G. B. Giannakis, and Z. Y. Yang, Communication-efficient distributed learning via lazily aggregated quantized gradients, presented at Advances in Neural Information Processing Systems, Vancouver, Canada, 2019, pp. 3365-3375.
[17]
F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, in Proc. 15th Annu. Conf. Int. Speech Communication Association, Singapore, 2014, pp. 1058-1062.
[18]
J. Bernstein, Y. X. Wang, K. Azizzadenesheli, and A. Anandkumar, signSGD: Compressed optimisation for non-convex problems, in Proc. 35th Int. Conf. Machine Learning, Stockholm, Sweden, 2018, pp. 560-569.
[19]
J. Bernstein, J. W. Zhao, K. Azizzadenesheli, and A. Anandkumar, signSGD with majority vote is communication efficient and fault tolerant, in Proc. 7th Int. Conf. Learning Representations, New Orleans, LA, USA, 2019.
[20]
S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi, Error feedback fixes signSGD and other gradient compression schemes, in Proc. 36th Int. Conf. Machine Learning, Long Beach, CA, USA, 2019, pp. 3252-3261.
[21]
O. Shamir, N. Srebro, and T. Zhang, Communication-efficient distributed optimization using an approximate Newton-type method, in Proc. 31st Int. Conf. Machine Learning, Beijing, China, 2014, pp. 1000-1008.
[22]
Y. C. Zhang and X. Lin, DiSCO: Distributed optimization for self-concordant empirical loss, in Proc. 32nd Int. Conf. Machine Learning, Lille, France, 2015, pp. 362-370.
[23]
A. Mokhtari, Q. Ling, and A. Ribeiro, Network Newton distributed optimization methods, IEEE Transactions on Signal Processing, vol. 65, no. 1, pp. 146-161, 2017.
[24]
B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, Communication-efficient learning of deep networks from decentralized data, in Proc. 20th Int. Conf. Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2017, pp. 1273-1282.
[25]
S. X. Zhang, A. E. Choromanska, and Y. LeCun, Deep learning with elastic averaging SGD, in Proc. 28th Int. Conf. Neural Information Processing Systems, Cambridge, MA, USA, 2015, pp. 685-693.
[26]
T. Y. Chen, G. Giannakis, T. Sun, and W. T. Yin, LAG: Lazily aggregated gradient for communication-efficient distributed learning, in Proc. 32nd Int. Conf. Neural Information Processing Systems, Red Hook, NY, USA, 2018, pp. 5050-5065.
[27]
J. Y. Wang and G. Joshi, Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms, arXiv preprint arXiv: 1808.07576, 2018.
[28]
T. Y. Chen, Y. J. Sun, and W. T. Yin, LASG: Lazily aggregated stochastic gradients for communication-efficient distributed learning, arXiv preprint arXiv: 2002.11360, 2020.
[29]
Y. Dong, J. Chen, Y. H. Tang, J. J. Wu, H. Q. Wang, and E. Q. Zhou, Lazy scheduling based disk energy optimization method, Tsinghua Science and Technology, vol. 25, no. 2, pp. 203-216, 2020.
[30]
Y. Arjevani and O. Shamir, Communication complexity of distributed convex learning and optimization, in Proc. 28th Int. Conf. Neural Information Processing Systems, Cambridge, MA, USA, 2015, pp. 1756-1764.
[31]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[32]
D. Davis and W. T. Yin, Convergence rate analysis of several splitting schemes, in Splitting Methods in Communication, Imaging, Science, and Engineering, R. Glowinski, S. Osher, and W. Yin, eds. Cham, Switzerland: Springer, 2016, pp. 115-163.
Deng X, Sun T, Liu F, et al. SIGNGD with Error Feedback Meets Lazily Aggregated Technique: Communication-Efficient Algorithms for Distributed Learning. Tsinghua Science and Technology, 2022, 27(1): 174-185. https://doi.org/10.26599/TST.2021.9010045
2.1 Algorithm development
In the 1-bit gradient compression methodology, the $m$-th worker only uploads the sign of the gradient computed on its portion of the data, which suggests an update of the form
$$x^{k+1} = x^k - \frac{\alpha}{M}\sum_{m=1}^{M}\operatorname{sign}\big(\nabla f_m(x^k)\big)$$
where $x^k$ is the model parameter at iteration $k$, $\alpha$ is the step size, $M$ is the number of workers, and $f_m$ is the local loss of the $m$-th worker.
However, Ref. [20] presented counterexamples showing that naively using such a sign-based algorithm may fail to generalize or even to converge: the $\operatorname{sign}(\cdot)$ operator discards a large amount of information about the local gradient's magnitude and direction.
Thus, we employ an elegant error feedback technique to fix the abovementioned problems. The error feedback technique works as follows: the signed vector is scaled by the $\ell_1$-norm of the gradient so that the magnitude of the gradient is not forgotten, and the difference between the actual and compressed gradients is stored locally and added back in the next step so that the correct direction is not forgotten. We name this sign-based GD method with error feedback EF-SIGNGD (Algorithm 1) and apply it to the distributed system.
More specifically, in Algorithm 1, $e_m^k$ represents the accumulated error from all compression steps in the previous iterations of worker $m$. This residual error is added to the local gradient to obtain the corrected direction $p_m^k$, i.e.,
$$p_m^k = \nabla f_m(x^k) + e_m^k$$
All workers upload $\operatorname{sign}(p_m^k)$ and $\|p_m^k\|_1$ to the server, and the compressed gradient is defined as the signed vector scaled by $\|p_m^k\|_1/d$, i.e.,
$$\delta_m^k = \frac{\|p_m^k\|_1}{d}\operatorname{sign}(p_m^k)$$
where $\|p_m^k\|_1/d$ stores the information about the magnitude and $d$ represents the parameter dimension. Therefore, the EF-SIGNGD algorithm is updated by
$$x^{k+1} = x^k - \frac{\alpha}{M}\sum_{m=1}^{M}\delta_m^k, \qquad e_m^{k+1} = p_m^k - \delta_m^k \tag{6}$$
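To make the flow of Algorithm 1 concrete, the following minimal NumPy sketch mimics one EF-SIGNGD round under the notation reconstructed above; the function names (`worker_compress`, `server_step`) and the use of a plain list of per-worker error vectors are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def worker_compress(grad, error):
    """One worker-side EF-SIGNGD step: error feedback + 1-bit compression.

    grad  : local gradient, shape (d,)
    error : residual e_m^k accumulated from earlier compression steps, shape (d,)
    Returns the compressed gradient delta_m^k, the payload actually transmitted
    (a sign vector plus one scalar norm), and the updated residual e_m^{k+1}.
    """
    p = grad + error                      # corrected direction p_m^k
    scale = np.abs(p).sum() / p.size      # ||p_m^k||_1 / d keeps magnitude information
    signs = np.sign(p)
    delta = scale * signs                 # compressed gradient delta_m^k
    new_error = p - delta                 # e_m^{k+1}: what the compression lost
    return delta, (signs.astype(np.int8), scale), new_error

def server_step(x, deltas, lr):
    """Server update: average the compressed gradients and take a descent step."""
    return x - lr * np.mean(deltas, axis=0)
```

A driver loop would call `worker_compress` for every worker, collect the resulting deltas, and pass them to `server_step`; each worker keeps its own error vector across iterations.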
Our focus here is to reduce the number of worker-to-server uplink communications, which we also refer to as uploads (the same as Ref. [16]).
Table 1 shows the communication bits per upload of various algorithms. Comparing it with GD and SIGNGD, we find a trade-off between the number of communication bits and the convergence guarantee. In large-scale machine learning tasks, the dimension of the parameters is usually very large, so the cost of the extra bits is negligible.
Table 1 Communication bits of different algorithms when training a $d$-dimensional parameter with $M$ workers.

Algorithm      Number of bits per upload
GD             $32d$
SIGNGD         $d$
EF-SIGNGD      $d + 32$
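As a quick sanity check of the per-upload costs in Table 1, the short snippet below tabulates them for a given dimension; the assumption of 32-bit floating-point entries is mine, made for the sake of the sketch.

```python
def bits_per_upload(d, float_bits=32):
    """Uplink cost per worker per iteration for the three schemes."""
    return {
        "GD": float_bits * d,          # full-precision gradient
        "SIGNGD": d,                   # one sign bit per coordinate
        "EF-SIGNGD": d + float_bits,   # sign vector plus one scaling scalar
    }

print(bits_per_upload(d=10**6))  # the extra 32 bits are negligible at this scale
```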
3.1 Lazily aggregated algorithm
The basic idea of the lazily aggregated technique is that if the difference between two consecutive locally compressed gradients is small, then the redundant upload may be skipped and the previous copy kept by the server can be reused. This idea comes from a simple rewriting of the EF-SIGNGD iteration Eq. (6) as
$$x^{k+1} = x^k - \frac{\alpha}{M}\sum_{m=1}^{M}\delta_m^{k-1} - \frac{\alpha}{M}\sum_{m=1}^{M}\big(\delta_m^k - \delta_m^{k-1}\big)$$
The difference between two consecutive compressed gradients on worker $m$, i.e., $\delta_m^k - \delta_m^{k-1}$, can be viewed as a refinement to $\delta_m^{k-1}$. Obtaining this refinement requires a round of communication between the server and worker $m$. If this refinement is small enough, i.e.,
$$\delta_m^k - \delta_m^{k-1} \approx \mathbf{0}$$
then we can skip the communication between the server and worker $m$ to reduce the number of communication rounds.
Generalizing this intuition, the lazily aggregated EF-SIGNGD algorithm, named LE-SIGNGD, is updated by
$$x^{k+1} = x^k - \frac{\alpha}{M}\Big(\sum_{m\in\mathcal{M}^k}\delta_m^k + \sum_{m\in\mathcal{M}_c^k}\hat{\delta}_m^{k-1}\Big)$$
where $\mathcal{M}^k$ and $\mathcal{M}_c^k$ are the sets of workers that do and do not communicate with the server in iteration $k$, respectively. We only use the fresh compressed gradients from the selected workers in $\mathcal{M}^k$ and reuse the outdated compressed gradients from the rest of the workers, which means
$$\hat{\delta}_m^k = \begin{cases}\delta_m^k, & m\in\mathcal{M}^k\\ \hat{\delta}_m^{k-1}, & m\in\mathcal{M}_c^k\end{cases}$$
where $\hat{\delta}_m^k$ denotes the copy of worker $m$'s compressed gradient held by the server at iteration $k$.
Therefore, the iteration process of LE-SIGNGD can also be expressed as
$$x^{k+1} = x^k - \frac{\alpha}{M}\sum_{m=1}^{M}\hat{\delta}_m^k \tag{18}$$
The difference between the compressed gradient of worker $m$ at the current iterate and the old copy is defined as
$$\tilde{\delta}_m^k := \delta_m^k - \hat{\delta}_m^{k-1}$$
We find that
$$\frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m^k = \frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m^{k-1} + \frac{1}{M}\sum_{m\in\mathcal{M}^k}\tilde{\delta}_m^k \tag{20}$$
Combining Eqs. (18) and (20), we can observe that, instead of requesting all fresh compressed gradients as in EF-SIGNGD, the lazily aggregated trick obtains the new aggregate $\frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m^k$ by refining the previously aggregated gradient $\frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m^{k-1}$. If this previous aggregate is stored in the server, then we can scale down the per-iteration communication rounds from $M$ to $|\mathcal{M}^k|$.
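The bookkeeping implied by Eqs. (18) and (20) can be sketched as follows: the server keeps the last aggregate and one stale compressed gradient per worker, and only the workers in $\mathcal{M}^k$ (the hypothetical `uploading` set below) send their innovations. This is an illustrative sketch under the notation assumed above, not the authors' code.

```python
import numpy as np

def lazy_server_step(x, agg, stale, fresh, uploading, lr, M):
    """Refine the stored aggregate with the innovations of the uploading workers.

    agg      : previously aggregated compressed gradient, (1/M) * sum_m stale[m]
    stale    : dict worker id -> last compressed gradient stored at the server
    fresh    : dict worker id -> new compressed gradient (only for uploaders)
    uploading: iterable of worker ids in M^k
    """
    for m in uploading:
        agg = agg + (fresh[m] - stale[m]) / M   # add the innovation of worker m
        stale[m] = fresh[m]                     # refresh the stored copy
    x_new = x - lr * agg                        # LE-SIGNGD parameter update
    return x_new, agg, stale
```

Only the innovations of the uploading workers touch the aggregate, which is exactly why the per-iteration communication drops from M rounds to the size of the uploading set.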
Designing a principled criterion for selecting the subset of workers that do not communicate with the server at each iteration is critical. Our focus is the trade-off between the communication cost and the convergence guarantee of LE-SIGNGD. Therefore, we compare the one-step descent of EF-SIGNGD with that of LE-SIGNGD. The one-step descent of EF-SIGNGD is given in Lemma 3; the one-step descent of LE-SIGNGD is specified in the following lemma.
Lemma 4 (LE-SIGNGD descent) Under Assumption 1, $x^{k+1}$ is generated by running the one-step LE-SIGNGD iteration (Eq. (18)) given $x^k$. The objective values then satisfy a descent bound analogous to that of Lemma 3, with an additional term induced by the outdated compressed gradients of the workers in $\mathcal{M}_c^k$.
Remark 2 Different from Eq. (14) in EF-SIGNGD, the sequence has the following property:
Lemmas 3 and 4 estimate the objective value descent achieved by performing one iteration of EF-SIGNGD and LE-SIGNGD, respectively, conditioned on a common iterate $x^k$. EF-SIGNGD finds $x^{k+1}$ by performing $M$ rounds of communication with all the workers, whereas LE-SIGNGD yields $x^{k+1}$ by performing only $|\mathcal{M}^k|$ rounds of communication with a selected subset of workers. Our goal is to select the subset so that LE-SIGNGD enjoys a larger per-communication descent than EF-SIGNGD; that is
For simplicity, we fix the step size, ignore the common residual error term, and obtain
If we can further show that the condition in Formula (26) holds, then Formula (25) follows. However, directly checking Formula (26) on a local worker is impossible, because obtaining the fully aggregated gradient requires information from all the workers, and it makes no sense to reduce uploads if the fully aggregated gradient has already been obtained. Instead, we approximate the relevant quantity in Formula (26) as follows:
where $\xi_{d'}$ are constant weights. The fundamental reason is that, as the loss function is smooth, the change of the gradient can be approximated by weighted previous gradients or parameter differences.
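One way to read Formula (27) is as a LAG-style surrogate: the unavailable gradient change is replaced by a weighted sum of recent parameter differences, which every worker can evaluate locally. The sketch below uses hypothetical weights `xi` over a window of `len(xi)` past steps; the concrete constants of the paper's Formulas (27) and (28) are not reproduced here.

```python
import numpy as np

def change_surrogate(x_history, xi):
    """Weighted sum of squared recent parameter differences,
    sum_{d'} xi_{d'} * ||x^{k+1-d'} - x^{k-d'}||^2.

    x_history : list of the most recent iterates, newest last (length >= len(xi)+1)
    xi        : list of nonnegative weights, one per lag
    """
    total = 0.0
    for lag, weight in enumerate(xi, start=1):
        diff = x_history[-lag] - x_history[-lag - 1]
        total += weight * float(diff @ diff)
    return total
```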
Building upon Formulas (26) and (27), we include worker $m$ in $\mathcal{M}^k$ if it satisfies
Although the intuition behind Formula (28) is not mathematically rigorous, we will show that the convergence of the algorithm is guaranteed under this selection rule. In summary, LE-SIGNGD is implemented as follows: at iteration $k$, the server broadcasts the learning parameter $x^k$ to all workers; each worker calculates its local gradient, adds the residual error, and compresses the local information; the workers in $\mathcal{M}^k$ selected by Formula (28) upload their local information to the server; the server aggregates the fresh compressed gradients from the selected workers and the outdated gradient information (stored in the server) from $\mathcal{M}_c^k$ to update the parameter. The LE-SIGNGD algorithm is summarized in Algorithm 2.
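Putting the pieces together, a minimal driver for the procedure just described might look as follows. It reuses `worker_compress`, `change_surrogate`, and `lazy_server_step` from the earlier sketches and substitutes a hypothetical threshold test for the paper's exact Formula (28); it is an assumption-laden illustration of Algorithm 2's control flow, not a faithful reproduction of it.

```python
import numpy as np

def le_signgd(grad_fns, x0, lr, iters, xi):
    """Sketch of the LE-SIGNGD loop: compress locally, upload lazily.

    grad_fns : list of callables, grad_fns[m](x) -> local gradient of worker m
    xi       : weights of the parameter-difference surrogate (window D = len(xi))
    """
    M, d = len(grad_fns), x0.size
    errors = [np.zeros(d) for _ in range(M)]     # residuals e_m, kept worker-side
    stale = {m: np.zeros(d) for m in range(M)}   # server copies of delta_m
    agg = np.zeros(d)                            # previously aggregated gradient
    x, history = x0.copy(), [x0.copy()]

    for k in range(iters):
        fresh, uploading = {}, []
        for m in range(M):
            delta, _payload, errors[m] = worker_compress(grad_fns[m](x), errors[m])
            innovation = float(np.sum((delta - stale[m]) ** 2))
            # Hypothetical trigger standing in for Formula (28): upload only when
            # the innovation exceeds the weighted parameter-difference surrogate.
            if k < len(xi) or innovation >= change_surrogate(history, xi) / (lr**2 * M**2):
                fresh[m] = delta
                uploading.append(m)
        x, agg, stale = lazy_server_step(x, agg, stale, fresh, uploading, lr, M)
        history.append(x.copy())
    return x
```

During the first `len(xi)` iterations every worker uploads, simply because the surrogate window is not yet filled; afterwards, workers whose compressed gradients barely change stay silent and the server keeps reusing their stored copies.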
Fig. 1 Objective function value vs. iteration and communication cost on a synthetic dataset.
Fig. 2 Objective function value vs. iteration and communication cost on a real dataset.
Fig. 3 Gradient norm vs. communication cost on two datasets.