Chinese-oriented entity recognition method of character vocabulary combination sequence

Qingren WANG; Yinzi WANG; Hong ZHONG; Yiwen ZHANG

doi:10.16511/j.cnki.qhdxxb.2023.21.009

| Sign up

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

Show Outline

Outline

Abstract

Keywords

References

Show full outline

Hide outline

Publishing Language: Chinese

Chinese-oriented entity recognition method of character vocabulary combination sequence

Qingren WANG, Yinzi WANG, Hong ZHONG(), Yiwen ZHANG

College of Computer Science and Technology, Anhui University, Hefei 230093, China

Show Author Information

Abstract

Objective

As the core task of information extraction, named entity recognition recognizes various types of named entities from the text. The task of Chinese-named entity recognition has benefited from the application of deep learning in character vocabulary representation, feature extraction, and other aspects, achieving rich results. However, this task still faces the challenge of a lack of vocabulary information, which has been regarded as one of the primary impediments to the development of a high-performance Chinese-named entity recognition (NER) system. While the automatically constructed dictionary contains rich lexical boundary information and lexical semantic information, the integration of word knowledge in the Chinese NER task still faces challenges, such as the effective integration of the semantic information of self-matching words and their context information into Chinese characters. Furthermore, although graph neural networks can be used to extract feature information from various Chinese character-vocabulary interaction diagrams in feature extraction, the challenge of how to fuse features based on the importance of the information from the respective interaction diagrams into the original input sequence is yet to be solved.

Methods

This paper proposes a Chinese-oriented entity recognition method of Chinese-vocabulary combination sequence. (1) First, this method proposes a Chinese-vocabulary combination sequence embedding structure that primarily uses self-matching words to replace the Chinese characters in the Chinese character sequence under consideration. To make complete use of the self-matching vocabulary information, we also constructed a sequence for the self-matching vocabulary and vectorized the vocabulary and Chinese characters. At the coding level, we obtained the context information of the Chinese character sequence, the vocabulary sequence, and the Chinese-word combination sequence using the BiLSTM model and then fused the information from the words in the Chinese word combination sequence into the corresponding words in the vocabulary sequence. Furthermore, the graph neural network was used to extract the features of different Chinese-vocabulary interaction diagrams so that the enhanced vocabulary information can be integrated into Chinese characters, which can not only make complete use of the vocabulary boundary information but also integrate the context information of the self-matching vocabulary sequence into characters while capturing the semantic information between the Chinese characters and words, further enriching the character features. Finally, the conditional random field was used to decode and label the entities. (2) Considering the importance of different Chinese character-word interaction diagram information to the original input Chinese character sequence is not the same, this method proposes a multigraph attention fusion structure. It assigns a score to the correlation of the Chinese character sequence based on different Chinese character-word interaction diagram information, differentiates between structural features based on their importance, and fuses different Chinese character-word interaction diagram information into the Chinese character sequence based on their proportions.

Results

The F1 value of the new method was higher than that of the original method on Weibo, Resume, OntoNotes4.0, and MSRA data by 3.17% (Weibo_all), 1.21%, 1.33%, and 0.43%, respectively, thus verifying the feasibility of the new method on Chinese NER tasks.

Conclusions

The experiment revealed that the proposed method is more effective than the original method.

Keywords

natural language processing named entity recognition graph attention neural network character-word combination embedding multigraph attention

CLC number: TP393.1 Document code: A Article ID: 1000-0054(2023)09-1326-13

References

[1]

JU S G, LI T N, SUN J P. Chinese fine-grained name entity recognition based on associated memory networks[J]. Journal of Software, 2021, 32(8): 2545-2556. (in Chinese)

Google Scholar

[2]

SUN J, GAO J F, ZHANG L, et al. Chinese named entity identification using class-based language model[C]// COLING 2002: The 19th International Conference on Computational Linguistics. Taipei, China: Association for Computational Linguistics, 2002: 1-7.