Open Access

Optimizing Data Distributions Based on Jensen-Shannon Divergence for Federated Learning

Academy of Military Sciences, Beijing 100081, China
College of Computer, National University of Defense Technology, Changsha 410073, China

Abstract

In current federated learning frameworks, a central server randomly selects a small number of clients to train local models at the beginning of each global iteration. Because clients' local data are not independent and identically distributed (non-IID), some local models are inconsistent with the global model. Existing studies employ model cleaning methods to identify inconsistent local models by measuring the cosine similarity between each local model and the global model; inconsistent local models are cleaned out and excluded from aggregation into the next global model. However, model cleaning methods incur negative effects such as large computation overhead and limited updates. In this paper, we propose a data distribution optimization method, called federated distribution optimization (FedDO), to overcome the shortcomings of model cleaning methods. FedDO computes the gradient of the Jensen-Shannon divergence to decrease the discrepancy between the selected clients' data distribution and the overall data distribution. We evaluate our method with a multi-class regression model, a multi-layer perceptron, and a convolutional neural network on a handwritten digit image dataset. Compared with model cleaning methods, FedDO improves the training accuracy by 1.8%, 2.6%, and 5.6%, respectively.
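The following is a minimal sketch of the distribution-matching idea described in the abstract: selection weights over candidate clients are adjusted along a gradient of the Jensen-Shannon divergence between the selected clients' mixed label distribution and the overall label distribution. All names and settings here (js_divergence, optimize_selection, per-client label histograms, the finite-difference gradient, learning rate, and step count) are illustrative assumptions, not the authors' actual FedDO implementation.

```python
# Sketch of JS-divergence-based data distribution optimization (assumed setup).
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def optimize_selection(client_hists, global_hist, steps=200, lr=0.1):
    """Adjust per-client selection weights so that the weighted mixture of the
    selected clients' label histograms approaches the overall distribution,
    following a finite-difference gradient of the JS divergence."""
    n = len(client_hists)
    w = np.full(n, 1.0 / n)                      # start from uniform selection weights
    for _ in range(steps):
        base = js_divergence(w @ client_hists, global_hist)
        grad = np.zeros(n)
        for i in range(n):                       # numerical gradient w.r.t. each weight
            w_eps = w.copy()
            w_eps[i] += 1e-5
            grad[i] = (js_divergence(w_eps @ client_hists, global_hist) - base) / 1e-5
        w = np.clip(w - lr * grad, 0.0, None)    # gradient step, keep weights non-negative
        w /= w.sum()                             # renormalize to a probability vector
    return w

# Toy usage: 5 candidate clients, 10 classes (e.g., handwritten digits).
rng = np.random.default_rng(0)
client_hists = rng.dirichlet(np.ones(10) * 0.3, size=5)  # skewed (non-IID) label histograms
global_hist = client_hists.mean(axis=0)                   # overall data distribution
weights = optimize_selection(client_hists, global_hist)
print(weights, js_divergence(weights @ client_hists, global_hist))
```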

Tsinghua Science and Technology
Pages 670-681
Cite this article:
Hu Z, Li D, Yang K, et al. Optimizing Data Distributions Based on Jensen-Shannon Divergence for Federated Learning. Tsinghua Science and Technology, 2025, 30(2): 670-681. https://doi.org/10.26599/TST.2023.9010091

Received: 04 January 2023
Revised: 22 February 2023
Accepted: 27 August 2023
Published: 09 December 2024
© The Author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
