National Laboratory for Parallel and Distributed Processing (PDL), College of Computer, National University of Defense Technology, Changsha 410073, China
Abstract
The proliferation of massive datasets has led to significant interest in distributed algorithms for solving large-scale machine learning problems. However, communication overhead is a major bottleneck that hampers the scalability of distributed machine learning systems. In this paper, we design two communication-efficient algorithms for distributed learning tasks. The first is named EF-SIGNGD, in which we use 1-bit (sign-based) gradient quantization to save communication bits. Moreover, the error feedback technique, i.e., incorporating the error made by the compression operator into the next step, is employed to guarantee convergence. The second algorithm is called LE-SIGNGD, in which we introduce a well-designed lazy gradient aggregation rule to EF-SIGNGD that can detect gradients with small changes and reuse the outdated information. LE-SIGNGD saves communication costs in both transmitted bits and communication rounds. Furthermore, we show that LE-SIGNGD is convergent under mild assumptions. The effectiveness of the two proposed algorithms is demonstrated by experiments on both real and synthetic data.
References
[1]
A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola, Distributed large-scale natural graph factorization, in Proc. 22nd Int. Conf. World Wide Web, Rio de Janeiro, Brazil, 2013, pp. 37-48.
[2]
J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, et al., Large scale distributed deep networks, in Proc. 25th Int. Conf. Neural Information Processing Systems, Red Hook, NY, USA, 2012, pp. 1223-1231.
[3]
M. Li, D. G. Andersen, A. Smola, and K. Yu, Communication efficient distributed machine learning with the parameter server, in Proc. 27th Int. Conf. Neural Information Processing Systems, Cambridge, MA, USA, 2014, pp. 19-27.
[4]
D. S. Li, Z. Q. Lai, K. S. Ge, Y. M. Zhang, Z. N. Zhang, Q. L. Wang, and H. M. Wang, HPDL: Towards a general framework for high-performance distributed deep learning, presented at 2019 IEEE 39th Int. Conf. Distributed Computing Systems (ICDCS), Dallas, TX, USA, 2019, pp. 1742-1753.
[5]
K. M. Nan, S. C. Liu, J. Z. Du, and H. Liu, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology, vol. 24, no. 6, pp. 677-693, 2019.
J. Q. Huang, W. T. Han, X. Y. Wang, and W. G. Chen, Heterogeneous parallel algorithm design and performance optimization for WENO on the Sunway TaihuLight supercomputer, Tsinghua Science and Technology, vol. 25, no. 1, pp. 56-67, 2020.
L. Guan, T. Sun, L. B. Qiao, Z. H. Yang, D. S. Li, K. S. Ge, and X. C. Lu, An efficient parallel and distributed solution to nonconvex penalized linear SVMs, Frontiers of Information Technology & Electronic Engineering, vol. 21, no. 4, pp. 587-603, 2020.
G. B. Giannakis, Q. Ling, G. Mateos, I. D. Schizas, and H. Zhu, Decentralized learning for wireless communications and networking, in Splitting Methods in Communication, Imaging, Science, and Engineering, R. Glowinski, S. Osher, and W. Yin, eds. Cham, Switzerland: Springer, 2016, pp. 461-497.
[10]
M. I. Jordan, J. D. Lee, and Y. Yang, Communication-efficient distributed statistical inference, Journal of the American Statistical Association, vol. 114, no. 526, pp. 668-681, 2019.
A. Nedić, A. Olshevsky, and M. G. Rabbat, Network topology and communication-computation tradeoffs in decentralized optimization, Proceedings of the IEEE, vol. 106, no. 5, pp. 953-976, 2018.
S. Zheng, Z. Y. Huang, and J. T. Kwok, Communication-efficient distributed blockwise momentum SGD with error-feedback, arXiv preprint arXiv: 1905.10936, 2019.
F. Ablayev, M. Ablayev, J. Z. Huang, K. Khadiev, N. Salikhova, and D. M. Wu, On quantum methods for machine learning problems part II: Quantum classification algorithms, Big Data Mining and Analytics, vol. 3, no. 1, pp. 56-67, 2020.
D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, QSGD: Communication-efficient SGD via gradient quantization and encoding, presented at Advances in Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 1709-1720.
[16]
J. Sun, T. Y. Chen, G. B. Giannakis, and Z. Y. Yang, Communication-efficient distributed learning via lazily aggregated quantized gradients, presented at Advances in Neural Information Processing Systems, Vancouver, Canada, 2019, pp. 3365-3375.
[17]
F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, in Proc. 15th Annu. Conf. Int. Speech Communication Association, Singapore, 2014, pp. 1058-1062.
[18]
J. Bernstein, Y. X. Wang, K. Azizzadenesheli, and A. Anandkumar, signSGD: Compressed optimisation for non-convex problems, in Proc. 35th Int. Conf. Machine Learning, Stockholm, Sweden, 2018, pp. 560-569.
[19]
J. Bernstein, J. W. Zhao, K. Azizzadenesheli, and A. Anandkumar, signSGD with majority vote is communication efficient and fault tolerant, in Proc. 7th Int. Conf. Learning Representations, New Orleans, LA, USA, 2019.
[20]
S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi, Error feedback fixes signSGD and other gradient compression schemes, in Proc. 36th Int. Conf. Machine Learning, Long Beach, CA, USA, 2019, pp. 3252-3261.
[21]
O. Shamir, N. Srebro, and T. Zhang, Communication-efficient distributed optimization using an approximate Newton-type method, in Proc. 31st Int. Conf. Machine Learning, Beijing, China, 2014, pp. 1000-1008.
[22]
Y. C. Zhang and X. Lin, DiSCO: Distributed optimization for self-concordant empirical loss, in Proc. 32nd Int. Conf. Machine Learning, Lille, France, 2015, pp. 362-370.
[23]
A. Mokhtari, Q. Ling, and A. Ribeiro, Network Newton distributed optimization methods, IEEE Transactions on Signal Processing, vol. 65, no. 1, pp. 146-161, 2017.
[24]
B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, Communication-efficient learning of deep networks from decentralized data, in Proc. 20th Int. Conf. Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2017, pp. 1273-1282.
[25]
S. X. Zhang, A. E. Choromanska, and Y. LeCun, Deep learning with elastic averaging SGD, in Proc. 28th Int. Conf. Neural Information Processing Systems, Cambridge, MA, USA, 2015, pp. 685-693.
[26]
T. Y. Chen, G. Giannakis, T. Sun, and W. T. Yin, LAG: Lazily aggregated gradient for communication-efficient distributed learning, in Proc. 32nd Int. Conf. Neural Information Processing Systems, Red Hook, NY, USA, 2018, pp. 5050-5065.
[27]
J. Y. Wang and G. Joshi, Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms, arXiv preprint arXiv: 1808.07576, 2018.
[28]
T. Y. Chen, Y. J. Sun, and W. T. Yin, LASG: Lazily aggregated stochastic gradients for communication-efficient distributed learning, arXiv preprint arXiv: 2002.11360, 2020.
[29]
Y. Dong, J. Chen, Y. H. Tang, J. J. Wu, H. Q. Wang, and E. Q. Zhou, Lazy scheduling based disk energy optimization method, Tsinghua Science and Technology, vol. 25, no. 2, pp. 203-216, 2020.
[30]
Y. Arjevani and O. Shamir, Communication complexity of distributed convex learning and optimization, in Proc. 28th Int. Conf. Neural Information Processing Systems, Cambridge, MA, USA, 2015, pp. 1756-1764.
[31]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[32]
D. Davis and W. T. Yin, Convergence rate analysis of several splitting schemes, in Splitting Methods in Communication, Imaging, Science, and Engineering, R. Glowinski, S. Osher, and W. Yin, eds. Cham, Switzerland: Springer, 2016, pp. 115-163.
Deng X, Sun T, Liu F, et al. SIGNGD with Error Feedback Meets Lazily Aggregated Technique: Communication-Efficient Algorithms for Distributed Learning. Tsinghua Science and Technology, 2022, 27(1): 174-185. https://doi.org/10.26599/TST.2021.9010045
2.1 Algorithm development
In the 1-bit gradient compression methodology, the $m$-th worker only uploads the sign of the gradient computed on its portion of the data, which suggests an update of the form
$$x^{k+1} = x^k - \frac{\alpha}{M}\sum_{m=1}^{M}\operatorname{sign}\big(\nabla f_m(x^k)\big)$$
where $x^k$ is the model parameter at iteration $k$, $\alpha$ is the step size, $M$ is the number of workers, and $f_m$ is the local loss of the $m$-th worker.
However, Ref. [20] presented counterexamples showing that naively using such a sign-based algorithm may fail to generalize or even to converge: the $\operatorname{sign}(\cdot)$ operator discards a large amount of information about the local gradient's magnitude and direction.
Thus, we employ an elegant error feedback technique to fix the abovementioned problems. The error feedback technique works as follows: the signed vector is scaled by the $\ell_1$-norm of the gradient so that the magnitude of the gradient is not forgotten, and the difference between the actual and compressed gradients is stored locally and added back in the next step so that the correct direction is not forgotten. We name this sign-based GD method with error feedback EF-SIGNGD (Algorithm 1) and apply it to the distributed system.
More specifically, in Algorithm 1, $e_m^k$ represents the accumulated error from all compression steps in the previous iterations of worker $m$. This residual error is added to the local gradient to obtain the corrected direction $p_m^k$, i.e.,
$$p_m^k = \nabla f_m(x^k) + e_m^k$$
All workers upload $\operatorname{sign}(p_m^k)$ and $\|p_m^k\|_1$ to the server, and the compressed gradient is defined as the signed vector scaled by $\|p_m^k\|_1/d$, i.e.,
$$\delta_m^k = \frac{\|p_m^k\|_1}{d}\operatorname{sign}(p_m^k)$$
where $\|p_m^k\|_1/d$ stores the information about the magnitude and $d$ represents the parameter dimension. Therefore, the EF-SIGNGD algorithm is updated by
$$x^{k+1} = x^k - \frac{\alpha}{M}\sum_{m=1}^{M}\delta_m^k, \qquad e_m^{k+1} = p_m^k - \delta_m^k \tag{6}$$
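To make the flow of Algorithm 1 concrete, the following minimal NumPy sketch mimics one EF-SIGNGD round under the notation reconstructed above; the function names (`worker_compress`, `server_step`) and the use of a plain list of per-worker error vectors are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def worker_compress(grad, error):
    """One worker-side EF-SIGNGD step: error feedback + 1-bit compression.

    grad  : local gradient, shape (d,)
    error : residual e_m^k accumulated from earlier compression steps, shape (d,)
    Returns the compressed gradient delta_m^k, the payload actually transmitted
    (a sign vector plus one scalar norm), and the updated residual e_m^{k+1}.
    """
    p = grad + error                      # corrected direction p_m^k
    scale = np.abs(p).sum() / p.size      # ||p_m^k||_1 / d keeps magnitude information
    signs = np.sign(p)
    delta = scale * signs                 # compressed gradient delta_m^k
    new_error = p - delta                 # e_m^{k+1}: what the compression lost
    return delta, (signs.astype(np.int8), scale), new_error

def server_step(x, deltas, lr):
    """Server update: average the compressed gradients and take a descent step."""
    return x - lr * np.mean(deltas, axis=0)
```

A driver loop would call `worker_compress` for every worker, collect the resulting deltas, and pass them to `server_step`; each worker keeps its own error vector across iterations.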
Our focus here is to reduce the number of worker-to-server uplink communications, which we also refer to as uploads (the same as Ref. [16]).
Table 1 shows the communication bits per upload of various algorithms. Comparing it with GD and SIGNGD, we find a trade-off between the number of communication bits and the convergence guarantee. In large-scale machine learning tasks, the dimension of the parameters is usually very large, so the cost of the extra bits is negligible.
Table 1 Communication bits of different algorithms when training a $d$-dimensional parameter with $M$ workers.

Algorithm      Number of bits per upload
GD             $32d$
SIGNGD         $d$
EF-SIGNGD      $d + 32$
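As a quick sanity check of the per-upload costs in Table 1, the short snippet below tabulates them for a given dimension; the assumption of 32-bit floating-point entries is mine, made for the sake of the sketch.

```python
def bits_per_upload(d, float_bits=32):
    """Uplink cost per worker per iteration for the three schemes."""
    return {
        "GD": float_bits * d,          # full-precision gradient
        "SIGNGD": d,                   # one sign bit per coordinate
        "EF-SIGNGD": d + float_bits,   # sign vector plus one scaling scalar
    }

print(bits_per_upload(d=10**6))  # the extra 32 bits are negligible at this scale
```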
3.1 Lazily aggregated algorithm
The basic idea of the lazily aggregated technique is that if the difference between two consecutive locally compressed gradients is small, then the redundant upload may be skipped and the previous copy kept by the server can be reused. This idea comes from a simple rewriting of the EF-SIGNGD iteration Eq. (6) as
$$x^{k+1} = x^k - \frac{\alpha}{M}\sum_{m=1}^{M}\delta_m^{k-1} - \frac{\alpha}{M}\sum_{m=1}^{M}\big(\delta_m^k - \delta_m^{k-1}\big)$$
The difference between two consecutive compressed gradients on worker $m$, i.e., $\delta_m^k - \delta_m^{k-1}$, can be viewed as a refinement to $\delta_m^{k-1}$. Obtaining this refinement requires a round of communication between the server and worker $m$. If this refinement is small enough, i.e.,
$$\delta_m^k - \delta_m^{k-1} \approx \mathbf{0}$$
then we can skip the communication between the server and worker $m$ to reduce the number of communication rounds.
Generalizing this intuition, the lazily aggregated EF-SIGNGD algorithm, named LE-SIGNGD, is updated by
$$x^{k+1} = x^k - \frac{\alpha}{M}\Big(\sum_{m\in\mathcal{M}^k}\delta_m^k + \sum_{m\in\mathcal{M}_c^k}\hat{\delta}_m^{k-1}\Big)$$
where $\mathcal{M}^k$ and $\mathcal{M}_c^k$ are the sets of workers that do and do not communicate with the server in iteration $k$, respectively. We only use the fresh compressed gradients from the selected workers in $\mathcal{M}^k$ and reuse the outdated compressed gradients from the rest of the workers, which means
$$\hat{\delta}_m^k = \begin{cases}\delta_m^k, & m\in\mathcal{M}^k\\ \hat{\delta}_m^{k-1}, & m\in\mathcal{M}_c^k\end{cases}$$
where $\hat{\delta}_m^k$ denotes the copy of worker $m$'s compressed gradient held by the server at iteration $k$.
Therefore, the iteration process of LE-SIGNGD can also be expressed as
$$x^{k+1} = x^k - \frac{\alpha}{M}\sum_{m=1}^{M}\hat{\delta}_m^k \tag{18}$$
The difference between the compressed gradient of worker $m$ at the current iterate and the old copy is defined as
$$\tilde{\delta}_m^k := \delta_m^k - \hat{\delta}_m^{k-1}$$
We find that
$$\frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m^k = \frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m^{k-1} + \frac{1}{M}\sum_{m\in\mathcal{M}^k}\tilde{\delta}_m^k \tag{20}$$
Combining Eqs. (18) and (20), we can observe that, instead of requesting all fresh compressed gradients as in EF-SIGNGD, the lazily aggregated trick obtains the new aggregate $\frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m^k$ by refining the previously aggregated gradient $\frac{1}{M}\sum_{m=1}^{M}\hat{\delta}_m^{k-1}$. If this previous aggregate is stored in the server, then we can scale down the per-iteration communication rounds from $M$ to $|\mathcal{M}^k|$.
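The bookkeeping implied by Eqs. (18) and (20) can be sketched as follows: the server keeps the last aggregate and one stale compressed gradient per worker, and only the workers in $\mathcal{M}^k$ (the hypothetical `uploading` set below) send their innovations. This is an illustrative sketch under the notation assumed above, not the authors' code.

```python
import numpy as np

def lazy_server_step(x, agg, stale, fresh, uploading, lr, M):
    """Refine the stored aggregate with the innovations of the uploading workers.

    agg      : previously aggregated compressed gradient, (1/M) * sum_m stale[m]
    stale    : dict worker id -> last compressed gradient stored at the server
    fresh    : dict worker id -> new compressed gradient (only for uploaders)
    uploading: iterable of worker ids in M^k
    """
    for m in uploading:
        agg = agg + (fresh[m] - stale[m]) / M   # add the innovation of worker m
        stale[m] = fresh[m]                     # refresh the stored copy
    x_new = x - lr * agg                        # LE-SIGNGD parameter update
    return x_new, agg, stale
```

Only the innovations of the uploading workers touch the aggregate, which is exactly why the per-iteration communication drops from M rounds to the size of the uploading set.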
Designing a principled criterion for selecting the subset of workers that do not communicate with the server at each iteration is critical. Our focus is the trade-off between the communication cost and the convergence guarantee of LE-SIGNGD. Therefore, we compare the one-step descent of EF-SIGNGD with that of LE-SIGNGD. The one-step descent of EF-SIGNGD is given in Lemma 3; the one-step descent of LE-SIGNGD is specified in the following lemma.
Lemma 4 (LE-SIGNGD descent) Under Assumption 1, $x^{k+1}$ is generated by running the one-step LE-SIGNGD iteration (Eq. (18)) given $x^k$. The objective values then satisfy a descent bound analogous to that of Lemma 3, with an additional term induced by the outdated compressed gradients of the workers in $\mathcal{M}_c^k$.
Remark 2 Different from Eq. (14) in EF-SIGNGD, the sequence has the following property:
Lemmas 3 and 4 estimate the objective value descent achieved by performing one iteration of EF-SIGNGD and LE-SIGNGD, respectively, conditioned on a common iterate $x^k$. EF-SIGNGD finds $x^{k+1}$ by performing $M$ rounds of communication with all the workers, whereas LE-SIGNGD yields $x^{k+1}$ by performing only $|\mathcal{M}^k|$ rounds of communication with a selected subset of workers. Our goal is to select the subset so that LE-SIGNGD enjoys a larger per-communication descent than EF-SIGNGD; that is
For simplicity, we fix the step size, ignore the common residual error term, and obtain
If we can further show that the condition in Formula (26) holds, then Formula (25) follows. However, directly checking Formula (26) on a local worker is impossible, because obtaining the fully aggregated gradient requires information from all the workers, and it makes no sense to reduce uploads if the fully aggregated gradient has already been obtained. Instead, we approximate the relevant quantity in Formula (26) as follows:
where $\xi_{d'}$ are constant weights. The fundamental reason is that, as the loss function is smooth, the change of the gradient can be approximated by weighted previous gradients or parameter differences.
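One way to read Formula (27) is as a LAG-style surrogate: the unavailable gradient change is replaced by a weighted sum of recent parameter differences, which every worker can evaluate locally. The sketch below uses hypothetical weights `xi` over a window of `len(xi)` past steps; the concrete constants of the paper's Formulas (27) and (28) are not reproduced here.

```python
import numpy as np

def change_surrogate(x_history, xi):
    """Weighted sum of squared recent parameter differences,
    sum_{d'} xi_{d'} * ||x^{k+1-d'} - x^{k-d'}||^2.

    x_history : list of the most recent iterates, newest last (length >= len(xi)+1)
    xi        : list of nonnegative weights, one per lag
    """
    total = 0.0
    for lag, weight in enumerate(xi, start=1):
        diff = x_history[-lag] - x_history[-lag - 1]
        total += weight * float(diff @ diff)
    return total
```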
Building upon Formulas (26) and (27), we include worker $m$ in $\mathcal{M}^k$ if it satisfies
Although the intuition behind Formula (28) is not mathematically rigorous, we will show that the convergence of the algorithm is guaranteed under this selection rule. In summary, LE-SIGNGD is implemented as follows: at iteration $k$, the server broadcasts the learning parameter $x^k$ to all workers; each worker calculates its local gradient, adds the residual error, and compresses the local information; the workers in $\mathcal{M}^k$ selected by Formula (28) upload their local information to the server; the server aggregates the fresh compressed gradients from the selected workers and the outdated gradient information (stored in the server) from $\mathcal{M}_c^k$ to update the parameter. The LE-SIGNGD algorithm is summarized in Algorithm 2.
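Putting the pieces together, a minimal driver for the procedure just described might look as follows. It reuses `worker_compress`, `change_surrogate`, and `lazy_server_step` from the earlier sketches and substitutes a hypothetical threshold test for the paper's exact Formula (28); it is an assumption-laden illustration of Algorithm 2's control flow, not a faithful reproduction of it.

```python
import numpy as np

def le_signgd(grad_fns, x0, lr, iters, xi):
    """Sketch of the LE-SIGNGD loop: compress locally, upload lazily.

    grad_fns : list of callables, grad_fns[m](x) -> local gradient of worker m
    xi       : weights of the parameter-difference surrogate (window D = len(xi))
    """
    M, d = len(grad_fns), x0.size
    errors = [np.zeros(d) for _ in range(M)]     # residuals e_m, kept worker-side
    stale = {m: np.zeros(d) for m in range(M)}   # server copies of delta_m
    agg = np.zeros(d)                            # previously aggregated gradient
    x, history = x0.copy(), [x0.copy()]

    for k in range(iters):
        fresh, uploading = {}, []
        for m in range(M):
            delta, _payload, errors[m] = worker_compress(grad_fns[m](x), errors[m])
            innovation = float(np.sum((delta - stale[m]) ** 2))
            # Hypothetical trigger standing in for Formula (28): upload only when
            # the innovation exceeds the weighted parameter-difference surrogate.
            if k < len(xi) or innovation >= change_surrogate(history, xi) / (lr**2 * M**2):
                fresh[m] = delta
                uploading.append(m)
        x, agg, stale = lazy_server_step(x, agg, stale, fresh, uploading, lr, M)
        history.append(x.copy())
    return x
```

During the first `len(xi)` iterations every worker uploads, simply because the surrogate window is not yet filled; afterwards, workers whose compressed gradients barely change stay silent and the server keeps reusing their stored copies.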
Fig. 1 Objective function value vs. iteration and communication cost on a synthetic dataset.
Fig. 2 Objective function value vs. iteration and communication cost on a real dataset.
Fig. 3 Gradient norm vs. communication cost on two datasets.