Publications
Authorship identification method based on the embedding of the syntax tree node
Journal of Tsinghua University (Science and Technology) 2023, 63 (9): 1390-1398
Published: 15 September 2023
Objective

Authorship identification is the task of inferring the author of an unknown text by analyzing its stylometry, or writing style. Traditional research on authorship identification is generally based on empirical knowledge from literature or linguistics, whereas modern research mostly relies on mathematical methods to quantify an author's writing style. Researchers have proposed various feature combinations and neural network models: some feature combinations achieve good results with traditional machine learning classifiers, while some neural network models autonomously learn the relationship between the input text and the corresponding author, extracting text features implicitly. However, current research mostly focuses on character and lexical features, and the exploration of syntactic features is limited. How to use the dependency relationships between words in a sentence, and how to combine syntactic features with neural networks, remain open questions. This paper proposes an authorship identification method based on syntax tree node embedding, which introduces syntactic features into a deep learning model.

Methods

We believe that an author's writing style is mainly reflected in the way they choose words and construct sentences. Therefore, this paper develops the authorship identification model from the perspectives of words and sentences, using the attention mechanism to construct sentence-level features. First, an embedding representation of the syntax tree node is proposed: each node is expressed as the sum of the embeddings corresponding to all of its dependency arcs. This introduces information about sentence structure and the associations between words into the neural network model. Then, a syntactic attention network is constructed that uses different embedding methods to vectorize text features such as dependencies, part-of-speech tags, and words, and a syntax-aware vector is obtained through this network. Furthermore, a sentence attention network extracts from the syntax-aware vector the features that distinguish between different authors, thereby generating the sentence representation. Finally, the result is obtained by a classifier, and accuracy is used to evaluate it.
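The core node-embedding idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dependency-label inventory, embedding dimension, and random initialization are all assumptions (in the actual model the arc embeddings would be learned parameters).

```python
import random

# Assumed dependency-relation labels (Universal Dependencies names are used
# here for illustration; the paper's exact label set is not specified).
DEP_LABELS = ["nsubj", "obj", "amod", "advmod", "root"]
DIM = 8  # illustrative embedding dimension

random.seed(0)
# One embedding vector per dependency-arc label. Randomly initialized here;
# in the model these would be trained jointly with the network.
arc_embedding = {label: [random.gauss(0, 1) for _ in range(DIM)]
                 for label in DEP_LABELS}

def node_embedding(arc_labels):
    """Represent a syntax-tree node as the element-wise sum of the
    embeddings of all dependency arcs incident to that node."""
    vec = [0.0] * DIM
    for label in arc_labels:
        for i, x in enumerate(arc_embedding[label]):
            vec[i] += x
    return vec

# Example: a node linked to its head by an nsubj arc and to a child by amod.
vec = node_embedding(["nsubj", "amod"])
```

Summing arc embeddings lets every node carry information about how it is connected in the sentence, which is what allows the downstream attention networks to be syntax-aware.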

Results

Experiments on the CCAT10, CCAT50, IMDb62, and Chinese novel data sets show that, overall, the accuracy of the proposed model trends downward as the number of authors increases. At some data points, however, an increase in the number of authors resulted in an increase rather than a decrease in accuracy, which shows that the model's ability to capture the writing styles of different authors varies considerably. Furthermore, when the number of authors on the IMDb dataset is varied, the accuracy of our model is slightly lower than that of the BertAA model with 5 authors but higher than BertAA with 10, 25, and 50 authors. Additionally, when compared with other models on the CCAT10, CCAT50, and IMDb62 data sets, our model ranks second or third.

Conclusions

The attention mechanism demonstrated its effectiveness in text feature mining, fully capturing an author's style as reflected in different parts of a document. Integrating lexical and syntactic features through the attention mechanism enhances the overall performance of the model, which performs well on both Chinese and English datasets. Notably, the introduction of dependency syntax makes the model more interpretable: it can explain the text styles of different authors at the word-selection and sentence-construction levels.

Semantic and syntactic processing of Chinese [S+V+O] simple sentence structures—ERPs evidence
Journal of Tsinghua University (Science and Technology) 2022, 62 (12): 2053-2060
Published: 15 December 2022

Syntax-first models and semantic-priority models are two opposing views in sentence-processing theory. This study took the N400 and P600 effects as the main objects of analysis to explore the brain's cognitive processing of Chinese sentences with semantic violations, syntactic violations, or both, in the "subject (noun) + predicate (verb) + object (noun)" structure without modifiers (referred to as the Chinese [S+V+O] simple sentence structure). The ERP results showed that semantic-violation, syntactic-violation, and combined-violation sentences all triggered an N400 effect between 300 and 400 ms. The N400 amplitudes of the semantic-violation and syntactic-violation sentences were similar, but the N400 amplitude for sentences with both semantic and syntactic violations was more negative than that for sentences with only one type of violation. Only the semantic violations produced a P600 tendency. These results indicate that Chinese sentences with the [S+V+O] simple structure might not fit the syntax-first model. They also show that the brain's response to this sentence structure differs from the EEG amplitudes elicited by "ba" and "bei" sentences. Thus, this research concludes that sentence processing in the brain might differ across language types and language structures.
