Regular Paper

Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training Under Rényi Differential Privacy

Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education, Beijing 100872, China
School of Information, Renmin University of China, Beijing 100872, China

Abstract

Latent Dirichlet allocation (LDA) is a topic model widely used for discovering hidden semantics in massive text corpora. Collapsed Gibbs sampling (CGS), a widely used algorithm for learning the parameters of LDA, carries a risk of privacy leakage. Specifically, the word count statistics and the updates of latent topics in CGS, which are essential for parameter estimation, could be exploited by adversaries to conduct effective membership inference attacks (MIAs). To date, two kinds of methods have been used in CGS to defend against MIAs: adding noise to word count statistics and utilizing inherent privacy. Each has its limitations: noise sampled from the Laplacian distribution sometimes produces negative word count statistics, which severely degrade parameter estimation in CGS, while utilizing inherent privacy provides only a weak privacy guarantee against MIAs. An effective framework that obtains accurate parameter estimates with guaranteed differential privacy is therefore desirable. The key to accurate parameter estimation when introducing differential privacy into CGS is making good use of the privacy budget so that a precise noise scale can be derived. To the best of our knowledge, this is the first time Rényi differential privacy (RDP) has been introduced into CGS; we propose RDP-LDA, an effective framework for analyzing the privacy loss of any differentially private CGS. RDP-LDA derives a tighter upper bound on the privacy loss than the overestimated results that existing differentially private CGS obtains under ε-DP. Within RDP-LDA, we propose a novel truncated-Gaussian mechanism that keeps word count statistics non-negative, and a distribution perturbation method that provides a more rigorous privacy guarantee than utilizing inherent privacy. Experiments validate that our proposed methods produce more accurate parameter estimates under the JS-divergence metric and yield lower attack precision and recall when defending against MIAs.
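
The following minimal Python sketch (not taken from the paper) illustrates the two ingredients described in the abstract: word count statistics are perturbed with Gaussian noise that is resampled until the result is non-negative, which is one simple way to realize truncated-Gaussian noise, and per-iteration privacy is accounted with the standard Rényi bound eps(alpha) = alpha * sensitivity^2 / (2 * sigma^2) of the untruncated Gaussian mechanism (Mironov, 2017). The function names, the resampling rule, and the parameter values are illustrative assumptions; the paper's actual truncated-Gaussian mechanism and its RDP analysis may differ.

import numpy as np

def perturb_word_counts(counts, sigma, rng=None):
    # Add Gaussian noise to non-negative word count statistics, resampling
    # the entries that turn negative so that CGS never sees a negative count
    # (a simplified stand-in for a truncated-Gaussian mechanism).
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.normal(0.0, sigma, size=counts.shape)
    while np.any(noisy < 0.0):
        neg = noisy < 0.0
        noisy[neg] = counts[neg] + rng.normal(0.0, sigma, size=int(neg.sum()))
    return noisy

def gaussian_rdp_epsilon(alpha, sensitivity, sigma):
    # RDP of order alpha for the (untruncated) Gaussian mechanism:
    # eps(alpha) = alpha * sensitivity^2 / (2 * sigma^2)  [Mironov, 2017].
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)

if __name__ == "__main__":
    # Hypothetical setting: one CGS iteration perturbs the topic-word counts
    # with sigma = 2.0; under RDP, composing T iterations simply sums the
    # per-iteration eps(alpha) values.
    noisy_counts = perturb_word_counts([3, 0, 7, 1], sigma=2.0)
    eps_per_iteration = gaussian_rdp_epsilon(alpha=10.0, sensitivity=1.0, sigma=2.0)
    print(noisy_counts, eps_per_iteration)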

Electronic Supplementary Material

jcst-37-6-1382-Highlights.pdf (158.1 KB)

Journal of Computer Science and Technology
Pages 1382-1397
Cite this article:
Huang T, Zhao S-Y, Chen H, et al. Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training Under Rényi Differential Privacy. Journal of Computer Science and Technology, 2022, 37(6): 1382-1397. https://doi.org/10.1007/s11390-022-2425-x


Received: 15 April 2022
Accepted: 06 September 2022
Published: 30 November 2022
©Institute of Computing Technology, Chinese Academy of Sciences 2022