Regular Paper

Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training Under Rényi Differential Privacy

Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education, Beijing 100872, China
School of Information, Renmin University of China, Beijing 100872, China

Abstract

Latent Dirichlet allocation (LDA) is a topic model widely used for discovering hidden semantics in massive text corpora. Collapsed Gibbs sampling (CGS), a widely used algorithm for learning the parameters of LDA, carries a risk of privacy leakage. Specifically, the word count statistics and the updates of latent topics in CGS, which are essential for parameter estimation, could be exploited by adversaries to conduct effective membership inference attacks (MIAs). To date, two kinds of methods have been used in CGS to defend against MIAs: adding noise to word count statistics and utilizing inherent privacy. Each has its limitations: noise sampled from the Laplacian distribution sometimes produces negative word count statistics, which severely degrade parameter estimation in CGS, while utilizing inherent privacy provides only a weak privacy guarantee against MIAs. An effective framework that obtains accurate parameter estimates with guaranteed differential privacy is therefore desirable. The key to accurate parameter estimation when introducing differential privacy into CGS is making good use of the privacy budget so that a precise noise scale can be derived. To the best of our knowledge, this is the first time Rényi differential privacy (RDP) has been introduced into CGS; we propose RDP-LDA, an effective framework for analyzing the privacy loss of any differentially private CGS. RDP-LDA derives a tighter upper bound on the privacy loss than the overestimated results that existing differentially private CGS obtains under ε-DP. Within RDP-LDA, we propose a novel truncated-Gaussian mechanism that keeps word count statistics non-negative, and a distribution perturbation method that provides a more rigorous privacy guarantee than utilizing inherent privacy. Experiments validate that our proposed methods produce more accurate parameter estimates under the JS-divergence metric and yield lower attack precision and recall when defending against MIAs.
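
The following minimal Python sketch (not taken from the paper) illustrates the two ingredients described in the abstract: word count statistics are perturbed with Gaussian noise that is resampled until the result is non-negative, which is one simple way to realize truncated-Gaussian noise, and per-iteration privacy is accounted with the standard Rényi bound eps(alpha) = alpha * sensitivity^2 / (2 * sigma^2) of the untruncated Gaussian mechanism (Mironov, 2017). The function names, the resampling rule, and the parameter values are illustrative assumptions; the paper's actual truncated-Gaussian mechanism and its RDP analysis may differ.

import numpy as np

def perturb_word_counts(counts, sigma, rng=None):
    # Add Gaussian noise to non-negative word count statistics, resampling
    # the entries that turn negative so that CGS never sees a negative count
    # (a simplified stand-in for a truncated-Gaussian mechanism).
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.normal(0.0, sigma, size=counts.shape)
    while np.any(noisy < 0.0):
        neg = noisy < 0.0
        noisy[neg] = counts[neg] + rng.normal(0.0, sigma, size=int(neg.sum()))
    return noisy

def gaussian_rdp_epsilon(alpha, sensitivity, sigma):
    # RDP of order alpha for the (untruncated) Gaussian mechanism:
    # eps(alpha) = alpha * sensitivity^2 / (2 * sigma^2)  [Mironov, 2017].
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)

if __name__ == "__main__":
    # Hypothetical setting: one CGS iteration perturbs the topic-word counts
    # with sigma = 2.0; under RDP, composing T iterations simply sums the
    # per-iteration eps(alpha) values.
    noisy_counts = perturb_word_counts([3, 0, 7, 1], sigma=2.0)
    eps_per_iteration = gaussian_rdp_epsilon(alpha=10.0, sensitivity=1.0, sigma=2.0)
    print(noisy_counts, eps_per_iteration)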

Electronic Supplementary Material

jcst-37-6-1382-Highlights.pdf (158.1 KB)

Journal of Computer Science and Technology
Pages 1382-1397
Cite this article:
Huang T, Zhao S-Y, Chen H, et al. Improving Parameter Estimation and Defensive Ability of Latent Dirichlet Allocation Model Training Under Rényi Differential Privacy. Journal of Computer Science and Technology, 2022, 37(6): 1382-1397. https://doi.org/10.1007/s11390-022-2425-x


Received: 15 April 2022
Accepted: 06 September 2022
Published: 30 November 2022
©Institute of Computing Technology, Chinese Academy of Sciences 2022