Regular Paper

Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

Yang Li (1, 2, 3), Wen-Zhuo Song (1, 2), Bo Yang (1, 2) (corresponding author)
(1) College of Computer Science and Technology, Jilin University, Changchun 130012, China
(2) Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun 130012, China
(3) Aviation University of Air Force, Changchun 130062, China

Abstract

Topic modeling is a mainstream and effective technology for dealing with text data, with wide applications in text analysis, natural language processing, personalized recommendation, computer vision, etc. Among known topic models, supervised Latent Dirichlet Allocation (sLDA) is acknowledged as a popular and competitive supervised topic model. However, as datasets grow in scale, sLDA becomes increasingly inefficient and time-consuming, which restricts its applicability. To address this, a parallel and online sLDA, named PO-sLDA (Parallel and Online sLDA), is proposed in this study. It uses stochastic variational inference as the learning method to make the training procedure faster and more efficient, and a parallel computing mechanism implemented via the MapReduce framework is proposed to exploit cloud computing and support big-data processing. The online training capability of PO-sLDA broadens the approach's application scope, making it suitable for real-life applications with strict real-time demands. Validation on two datasets of different sizes shows that the proposed approach achieves accuracy comparable to that of sLDA while substantially accelerating training. Moreover, its good convergence and online training capability make it well suited to large-scale text analysis and processing.
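The abstract names stochastic variational inference (SVI) as the learning method: global topic parameters are updated with noisy natural-gradient steps computed from sampled documents, using a Robbins-Monro decaying step size. Below is a minimal Python sketch of that generic SVI update pattern for topic-word parameters. The function names, the fixed toy responsibilities `phi`, and all constants are illustrative assumptions, not the paper's actual PO-sLDA implementation (which additionally distributes the work via MapReduce, not shown here).

```python
import numpy as np

def svi_step_size(t, tau=1.0, kappa=0.7):
    """Robbins-Monro step size rho_t = (t + tau)^(-kappa), with
    0.5 < kappa <= 1 so the stochastic updates converge."""
    return (t + tau) ** (-kappa)

def svi_update_topics(lam, doc_counts, phi, num_docs, eta, rho):
    """One SVI update of the global topic-word variational
    parameters `lam` (K topics x V words).

    The sampled document is treated as if replicated across the whole
    corpus (num_docs), giving a noisy intermediate estimate `lam_hat`
    of the batch update; `lam` then steps toward it with weight `rho`.
    """
    lam_hat = eta + num_docs * (phi * doc_counts)   # intermediate global estimate
    return (1.0 - rho) * lam + rho * lam_hat        # noisy natural-gradient step

# Toy demo: K = 2 topics, V = 4 words, corpus of D = 100 documents.
K, V, D, eta = 2, 4, 100, 0.1
lam = np.ones((K, V))                    # initial variational parameters
doc = np.array([5.0, 3.0, 2.0, 4.0])    # word counts of one sampled document
phi = np.array([[0.9, 0.9, 0.1, 0.1],   # fixed toy responsibilities:
                [0.1, 0.1, 0.9, 0.9]])  # topic 0 -> words 0,1; topic 1 -> words 2,3
for t in range(1, 51):
    lam = svi_update_topics(lam, doc, phi, D, eta, svi_step_size(t))
```

In the full algorithm the responsibilities `phi` are re-estimated per document in a local variational step; those per-document local steps are the part that PO-sLDA parallelizes across MapReduce workers.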

Electronic Supplementary Material

Download File(s)
jcst-33-5-1007-Highlights.pdf (230.4 KB)


Journal of Computer Science and Technology
Pages 1007-1022
Cite this article:
Li Y, Song W-Z, Yang B. Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing. Journal of Computer Science and Technology, 2018, 33(5): 1007-1022. https://doi.org/10.1007/s11390-018-1871-y


Received: 19 September 2017
Revised: 09 July 2018
Published: 12 September 2018
©2018 LLC & Science Press, China