A Tibetan Sentence Boundary Disambiguation Model Considering the Components on Information on Both Sides of Shad

Fenfang Li; Hui Lv; Yiming Gao; Dolha; Yan Li; Qingguo Zhou

doi:10.26599/TST.2022.9010055

Open Access

Fenfang Li¹, Hui Lv¹, Yiming Gao¹, Dolha², Yan Li¹, Qingguo Zhou¹

¹ School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China

² Key Laboratory of China’s National Linguistic Information Technology, Northwest Minzu University, Lanzhou 730030, China

Abstract

Sentence Boundary Disambiguation (SBD) is a preprocessing step for natural language processing. Segmenting text into sentences is essential for Deep Learning (DL) and for pretraining language models. Tibetan punctuation marks can be ambiguous about where sentences begin and end. Hence, ambiguous punctuation marks must be distinguished so that sentence structure is correctly encoded in language models. This study proposes a component-level Tibetan SBD approach based on DL models. Working at the component level reduces the error amplification caused by word segmentation and part-of-speech tagging. Whereas most SBD methods consider only the text on the left side of a punctuation mark, this study considers the text on both sides. A corpus of 465 669 Tibetan sentences is adopted, and a Bidirectional Long Short-Term Memory (Bi-LSTM) model is used to perform SBD. The experimental results show that the Bi-LSTM model reaches an F1-score of 96%, the best of the six models evaluated. Experiments are also performed on low-resource languages such as Turkish and Romanian, and on high-resource languages such as English and German, to verify the models’ generalization.
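The idea of using text on both sides of a shad can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `shad_contexts`, the `window` parameter, and the choice to treat each Unicode code point as one "component" are assumptions made here for illustration only. The sketch collects, for every shad (།, U+0F0D), the characters immediately to its left and right, which is the kind of two-sided input a Bi-LSTM classifier could consume.

```python
SHAD = "\u0f0d"  # TIBETAN MARK SHAD


def shad_contexts(text, window=5):
    """Return (index, left, right) for every shad in `text`.

    `left` and `right` hold up to `window` code points on each side of the
    shad. Treating one code point as one component is an assumption of this
    sketch, not a detail taken from the paper.
    """
    contexts = []
    for i, ch in enumerate(text):
        if ch == SHAD:
            left = text[max(0, i - window):i]
            right = text[i + 1:i + 1 + window]
            contexts.append((i, left, right))
    return contexts
```

Each returned triple is one disambiguation candidate. A classifier would then label it as a true sentence boundary or not; a left-only method would discard the `right` field.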

Keywords

Sentence Boundary Disambiguation (SBD); punctuation marks; ambiguity; Bidirectional Long Short-Term Memory (Bi-LSTM) model

Tsinghua Science and Technology

Volume 28, Issue 6, December 2023

Pages 1085-1100

DOI: 10.26599/TST.2022.9010055

Cite this article:

Li F, Lv H, Gao Y, et al. A Tibetan Sentence Boundary Disambiguation Model Considering the Components on Information on Both Sides of Shad. Tsinghua Science and Technology, 2023, 28(6): 1085-1100. https://doi.org/10.26599/TST.2022.9010055