Regular Paper

Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

Yang Li (1, 2, 3), Wen-Zhuo Song (1, 2), Bo Yang (1, 2) (corresponding author)
(1) College of Computer Science and Technology, Jilin University, Changchun 130012, China
(2) Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun 130012, China
(3) Aviation University of Air Force, Changchun 130062, China

Abstract

Topic modeling is a mainstream and effective technology for dealing with text data, with wide applications in text analysis, natural language processing, personalized recommendation, computer vision, etc. Among known topic models, supervised Latent Dirichlet Allocation (sLDA) is acknowledged as a popular and competitive supervised topic model. However, as datasets grow in scale, sLDA becomes increasingly inefficient and time-consuming, which restricts its applicability. To address this, a parallel and online sLDA, named PO-sLDA (Parallel and Online sLDA), is proposed in this study. It uses stochastic variational inference as the learning method to make the training procedure faster and more efficient, and a parallel computing mechanism implemented via the MapReduce framework is proposed to exploit cloud computing and support big-data processing. The online training capability of PO-sLDA broadens the approach's application scope, making it suitable for real-life applications with strict real-time demands. Validation on two datasets of different sizes shows that the proposed approach achieves accuracy comparable to that of sLDA while substantially accelerating training. Moreover, its good convergence and online training capability make it well suited to large-scale text analysis and processing.
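The abstract names stochastic variational inference (SVI) as the learning method: global topic parameters are updated with noisy natural-gradient steps computed from sampled documents, using a Robbins-Monro decaying step size. Below is a minimal Python sketch of that generic SVI update pattern for topic-word parameters. The function names, the fixed toy responsibilities `phi`, and all constants are illustrative assumptions, not the paper's actual PO-sLDA implementation (which additionally distributes the work via MapReduce, not shown here).

```python
import numpy as np

def svi_step_size(t, tau=1.0, kappa=0.7):
    """Robbins-Monro step size rho_t = (t + tau)^(-kappa), with
    0.5 < kappa <= 1 so the stochastic updates converge."""
    return (t + tau) ** (-kappa)

def svi_update_topics(lam, doc_counts, phi, num_docs, eta, rho):
    """One SVI update of the global topic-word variational
    parameters `lam` (K topics x V words).

    The sampled document is treated as if replicated across the whole
    corpus (num_docs), giving a noisy intermediate estimate `lam_hat`
    of the batch update; `lam` then steps toward it with weight `rho`.
    """
    lam_hat = eta + num_docs * (phi * doc_counts)   # intermediate global estimate
    return (1.0 - rho) * lam + rho * lam_hat        # noisy natural-gradient step

# Toy demo: K = 2 topics, V = 4 words, corpus of D = 100 documents.
K, V, D, eta = 2, 4, 100, 0.1
lam = np.ones((K, V))                    # initial variational parameters
doc = np.array([5.0, 3.0, 2.0, 4.0])    # word counts of one sampled document
phi = np.array([[0.9, 0.9, 0.1, 0.1],   # fixed toy responsibilities:
                [0.1, 0.1, 0.9, 0.9]])  # topic 0 -> words 0,1; topic 1 -> words 2,3
for t in range(1, 51):
    lam = svi_update_topics(lam, doc, phi, D, eta, svi_step_size(t))
```

In the full algorithm the responsibilities `phi` are re-estimated per document in a local variational step; those per-document local steps are the part that PO-sLDA parallelizes across MapReduce workers.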

Electronic Supplementary Material

Download File(s)
jcst-33-5-1007-Highlights.pdf (230.4 KB)


Journal of Computer Science and Technology
Pages 1007-1022
Cite this article:
Li Y, Song W-Z, Yang B. Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing. Journal of Computer Science and Technology, 2018, 33(5): 1007-1022. https://doi.org/10.1007/s11390-018-1871-y


Received: 19 September 2017
Revised: 09 July 2018
Published: 12 September 2018
©2018 LLC & Science Press, China