Open Access

ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, and Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China.
Microsoft Research, Beijing 100080, China.
Huawei Technologies Co., Ltd., Shenzhen 518129, China.

Abstract

Topic models such as Latent Dirichlet Allocation (LDA) have recently been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, most of which adopt a customized top-to-bottom design with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which instead follows a generalized design for distributed data-parallel platforms. The novelty of ZenLDA lies in three aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm into a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently, reducing computational complexity; and (3) it proposes a distributed LDA training framework that represents the corpus as a directed graph with the model parameters annotated on the corresponding vertices, and implements ZenLDA and other well-known inference methods on top of Spark. Experimental results indicate that MCCB converges to an accuracy similar to that of CGS while running much faster. Built on MCCB, the ZenLDA formula decomposition achieves the fastest speed among the well-known inference methods compared. ZenLDA also shows good scalability when training large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves computing performance comparable to, and in some cases better than, that of state-of-the-art dedicated systems.
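
To make point (2) above concrete, the sketch below writes out the standard CGS conditional for LDA and one well-known way of splitting it into sparsity-aware terms (in the spirit of SparseLDA). The exact decomposition used by ZenLDA is given in the full paper, so treat this purely as an illustrative assumption.

% Standard CGS conditional for token i of document d, with document-topic counts
% n_{dk}, word-topic counts n_{kw}, topic totals n_k, vocabulary size V, and
% symmetric priors \alpha, \beta; all counts exclude the token being resampled.
\[
  p(z_{di}=k \mid \mathbf{z}_{\neg di}, \mathbf{w})
  \;\propto\;
  \bigl(n_{dk}^{\neg di} + \alpha\bigr)\,
  \frac{n_{kw_{di}}^{\neg di} + \beta}{n_{k}^{\neg di} + V\beta}.
\]
% One common decomposition into a dense prior term, a document-sparse term, and a
% word-sparse term, so the two sparse terms need only be evaluated over nonzero counts:
\[
  \bigl(n_{dk} + \alpha\bigr)\frac{n_{kw} + \beta}{n_{k} + V\beta}
  \;=\;
  \frac{\alpha\beta}{n_{k} + V\beta}
  \;+\;
  \frac{n_{dk}\,\beta}{n_{k} + V\beta}
  \;+\;
  \frac{n_{kw}\bigl(n_{dk} + \alpha\bigr)}{n_{k} + V\beta}.
\]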

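For point (3), the short Scala sketch below illustrates one way a corpus could be laid out as a bipartite GraphX graph on Spark: documents and terms become vertices, each directed edge carries the topic assignments of one (document, term) occurrence, and per-topic counts are aggregated onto the vertices. The tokens RDD, the vertex-id offset, and the random initialization are hypothetical placeholders for illustration; this is a minimal sketch, not the ZenLDA implementation itself.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object CorpusGraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("corpus-graph-sketch").setMaster("local[*]"))
    val numTopics = 100

    // Hypothetical (docId, termId, count) triples; a real run would tokenize a corpus.
    val tokens: RDD[(Long, Long, Int)] =
      sc.parallelize(Seq((0L, 10L, 2), (0L, 11L, 1), (1L, 10L, 3)))

    // Documents keep their own ids; term ids are offset so the two vertex sets do not collide.
    val termOffset = 1000000L

    // Each directed edge (document -> term) carries randomly initialized topic assignments
    // for the occurrences of that term in that document.
    val edges: RDD[Edge[Array[Int]]] = tokens.map { case (d, w, c) =>
      Edge(d, termOffset + w, Array.fill(c)(scala.util.Random.nextInt(numTopics)))
    }

    // Vertex attribute: a per-topic count vector, filled by aggregating incident edge assignments.
    val graph = Graph.fromEdges(edges, Array.fill(numTopics)(0))
    val topicCounts = graph.aggregateMessages[Array[Int]](
      ctx => {
        val counts = Array.fill(numTopics)(0)
        ctx.attr.foreach(k => counts(k) += 1)
        ctx.sendToSrc(counts) // document-topic counts
        ctx.sendToDst(counts) // word-topic counts
      },
      (a, b) => a.zip(b).map { case (x, y) => x + y }
    )

    topicCounts.take(5).foreach { case (id, cnt) => println(s"vertex $id holds ${cnt.sum} assignments") }
    sc.stop()
  }
}

In such a layout, one sampling pass corresponds to recomputing edge topic assignments from the counts on their two endpoint vertices and then re-aggregating, which maps naturally onto Spark's data-parallel primitives.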
Big Data Mining and Analytics
Pages 57-74
Cite this article:
Zhao B, Zhou H, Li G, et al. ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform. Big Data Mining and Analytics, 2018, 1(1): 57-74. https://doi.org/10.26599/BDMA.2018.9020006

Received: 22 August 2017
Accepted: 30 November 2017
Published: 25 January 2018
© The author(s) 2018