Authorship identification method based on the embedding of the syntax tree node

Yang ZHANG; Minghu JIANG

doi:10.16511/j.cnki.qhdxxb.2023.21.013

Publishing Language: Chinese

Computational Linguistics Laboratory, Department of Chinese, School of Humanities, Tsinghua University, Beijing 100084, China

Abstract

Objective

Authorship identification infers the author of an unknown text by analyzing its stylometry, or writing style. Traditional research on authorship identification generally relies on empirical knowledge from literature or linguistics, whereas modern research mostly uses mathematical methods to quantify an author's writing style. Researchers have proposed various feature combinations and neural network models. Some feature combinations achieve better results with traditional machine learning classifiers, while some neural network models learn the relationship between the input text and its author autonomously and thus extract text features implicitly. However, current research focuses mostly on character and lexical features, and the exploration of syntactic features is limited. How to use the dependency relationships between words in a sentence, and how to combine syntactic features with neural networks, remain unclear. This paper proposes an authorship identification method based on syntax tree node embedding, which introduces syntactic features into a deep learning model.

Methods

We believe that an author's writing style is mainly reflected in how they choose words and construct sentences. Therefore, this paper develops the authorship identification model from the perspectives of words and sentences, and uses the attention mechanism to construct sentence-level features. First, an embedding representation of the syntax tree node is proposed: each node is expressed as the sum of the embeddings of all its dependency arcs. This introduces information on sentence structure and the associations between words into the neural network model. Then, a syntactic attention network is constructed that uses different embedding methods to vectorize text features such as dependencies, part-of-speech tags, and words; this network produces a syntax-aware vector. Next, a sentence attention network extracts from the syntax-aware vector the features that distinguish between authors, generating the sentence representation. Finally, a classifier produces the prediction, which is evaluated by accuracy.
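The node embedding and the attention step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding size, the random lookup table, the use of separate vectors for the head and dependent ends of an arc, the dot-product scoring, and the names `make_arc_table`, `node_embeddings`, and `attention_pool` are all assumptions made for illustration. A trained model would learn these vectors rather than sample them.

```python
import math
import random

DIM = 8  # embedding size, chosen arbitrarily for this sketch


def make_arc_table(labels, dim=DIM, seed=0):
    """Random lookup table with one vector per (dependency label, arc end).

    Keeping separate vectors for the head end and the dependent end
    of an arc is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return {
        (label, end): [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        for label in labels
        for end in ("head", "dep")
    }


def node_embeddings(n_tokens, arcs, table, dim=DIM):
    """Embed each token as the sum of the embeddings of its dependency arcs.

    arcs is a list of (head_index, dependent_index, label) triples.
    """
    nodes = [[0.0] * dim for _ in range(n_tokens)]
    for head, dep, label in arcs:
        for idx, end in ((head, "head"), (dep, "dep")):
            vec = table[(label, end)]
            nodes[idx] = [a + b for a, b in zip(nodes[idx], vec)]
    return nodes


def attention_pool(vectors, query):
    """Softmax-weighted sum of vectors, scored by dot product with query."""
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * vec[k] for w, vec in zip(weights, vectors))
        for k in range(len(query))
    ]
    return pooled, weights


# Toy parse of "She reads books": reads -> She (nsubj), reads -> books (obj).
arcs = [(1, 0, "nsubj"), (1, 2, "obj")]
table = make_arc_table(["nsubj", "obj"])
nodes = node_embeddings(3, arcs, table)

query = [1.0] * DIM  # stands in for a learned attention query vector
sentence_vec, weights = attention_pool(nodes, query)
```

In this toy parse the verb "reads" is the head of both arcs, so its node vector is the sum of two arc embeddings, while "She" and "books" each receive one. The pooled `sentence_vec` plays the role of a sentence representation that a classifier could consume.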

Results

Experiments on the CCAT10, CCAT50, IMDb62, and Chinese novel data sets show that the accuracy of the proposed model trends downward as the number of authors increases. At some data points, however, increasing the number of authors raised the accuracy rather than lowering it. This shows that the model's ability to capture writing style differs considerably from author to author. Furthermore, when the number of authors on the IMDb data set is varied, the accuracy of the proposed model is slightly lower than that of the BertAA model with 5 authors but higher with 10, 25, and 50 authors. Additionally, when compared with other models on the CCAT10, CCAT50, and IMDb62 data sets, the proposed model ranks second or third.

Conclusions

The attention mechanism proved effective in text feature mining and can fully capture an author's style as it is reflected in different parts of a document. Integrating lexical and syntactic features through the attention mechanism enhances the overall performance of the model. Our model performs well on different Chinese and English data sets. Notably, the introduction of dependency syntactic combinations leaves more room for interpreting the model, which can explain the text styles of different authors at the levels of word selection and sentence construction.

Keywords

attention mechanism; dependency; authorship identification; node of the syntax tree

CLC number: TP391 Document code: A Article ID: 1000-0054(2023)09-1390-09

Journal of Tsinghua University (Science and Technology)

Volume 63, Issue 9, September 2023

Pages 1390-1398

Cite this article:

ZHANG Y, JIANG M. Authorship identification method based on the embedding of the syntax tree node. Journal of Tsinghua University (Science and Technology), 2023, 63(9): 1390-1398. https://doi.org/10.16511/j.cnki.qhdxxb.2023.21.013

Received: 25 April 2022

Published: 15 September 2023