Measuring Similarity of Academic Articles with Semantic Profile and Joint Word Embedding

Ming Liu; Bo Lang; Zepeng Gu; Ahmed Zeeshan

doi:10.23919/TST.2017.8195345


Open Access


State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China.


Abstract

Long-document semantic measurement has great significance in many applications such as semantic search, plagiarism detection, and automatic technical surveys. However, research efforts have mainly focused on the semantic similarity of short texts. Document-level semantic measurement remains an open issue due to problems such as the omission of background knowledge and topic transition. In this paper, we propose a novel semantic matching method for long documents in the academic domain. To accurately represent the general meaning of an academic article, we construct a semantic profile in which key semantic elements such as the research purpose, methodology, and domain are included and enriched. As such, we can obtain the overall semantic similarity of two papers by computing the distance between their profiles. The distances between the concepts of two different semantic profiles are measured by word vectors. To improve the semantic representation quality of word vectors, we propose a joint word-embedding model that incorporates a domain-specific semantic relation constraint into the traditional context constraint. Our experimental results demonstrate that, in the measurement of document semantic similarity, our approach achieves substantial improvement over state-of-the-art methods, and our joint word-embedding model produces significantly better word representations than traditional word-embedding models.
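As a rough illustration of comparing two semantic profiles through word vectors, the following minimal sketch may help. It is not the authors' algorithm: the abstract does not say how concept distances are aggregated, so the best-match averaging, the equal element weights, the function names, and the three-dimensional toy vectors are all assumptions made for this example.

```python
import math

# Hypothetical toy word vectors; a real system would load trained embeddings.
TOY_VECTORS = {
    "parsing": (1.0, 0.0, 0.0),
    "syntax": (0.9, 0.1, 0.0),
    "translation": (0.0, 1.0, 0.0),
    "alignment": (0.1, 0.9, 0.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def element_similarity(concepts_a, concepts_b, vectors):
    """Symmetric best-match average between two concept lists (an assumption)."""
    a = [vectors[c] for c in concepts_a if c in vectors]
    b = [vectors[c] for c in concepts_b if c in vectors]
    if not a or not b:
        return 0.0
    a_to_b = sum(max(cosine(x, y) for y in b) for x in a) / len(a)
    b_to_a = sum(max(cosine(y, x) for x in a) for y in b) / len(b)
    return (a_to_b + b_to_a) / 2


def profile_similarity(profile_a, profile_b, vectors):
    """Average element similarity over the elements both profiles share."""
    shared = [name for name in profile_a if name in profile_b]
    if not shared:
        return 0.0
    scores = [
        element_similarity(profile_a[name], profile_b[name], vectors)
        for name in shared
    ]
    return sum(scores) / len(scores)


# Hypothetical profiles keyed by the elements named in the abstract.
paper_a = {"purpose": ["parsing"], "methodology": ["syntax"],
           "domain": ["parsing", "syntax"]}
paper_b = {"purpose": ["syntax"], "methodology": ["parsing"],
           "domain": ["syntax"]}
paper_c = {"purpose": ["translation"], "methodology": ["alignment"],
           "domain": ["translation"]}

sim_ab = profile_similarity(paper_a, paper_b, TOY_VECTORS)
sim_ac = profile_similarity(paper_a, paper_c, TOY_VECTORS)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

With these toy vectors, the two parsing-oriented profiles score higher than the parsing and translation pair. Weighting the profile elements unequally, or replacing best-match averaging with another aggregation, would be equally plausible under the abstract's description.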

Keywords

document semantic similarity; text understanding; semantic enrichment; word embedding; scientific literature analysis


Tsinghua Science and Technology

Volume 22, Issue 6, December 2017

Pages 619-632

DOI: 10.23919/TST.2017.8195345

Cite this article:

Liu M, Lang B, Gu Z, et al. Measuring Similarity of Academic Articles with Semantic Profile and Joint Word Embedding. Tsinghua Science and Technology, 2017, 22(6): 619-632. https://doi.org/10.23919/TST.2017.8195345