Open Access

RecBERT: Semantic recommendation engine with large language model enhanced query segmentation for k-nearest neighbors ranking retrieval

R. Wu, Dublin Unified School District and the SF Artificial Intelligence Club, Dublin, CA 94568, USA

Abstract

The increasing volume of user traffic on Internet discussion forums has produced a huge amount of unstructured natural language data in the form of user comments. Most modern recommendation systems rely on manual tagging, requiring administrators to label the features of the class, or story, to which a user comment corresponds. Another common approach is to use pre-trained word embeddings to compare class descriptions for textual similarity, then apply a distance metric such as cosine similarity or Euclidean distance to find the top k neighbors. However, neither approach fully utilizes this user-generated unstructured natural language data, limiting the scope of these recommendation systems. This paper studies the application of domain adaptation to a transformer over the set of user comments to be indexed, and the use of simple contrastive learning during sentence-transformer fine-tuning to generate meaningful semantic embeddings for the user comments that apply to each class. To match a query containing content from multiple user comments belonging to the same class, the construction of a subquery channel for computing class-level similarity is proposed. This channel segments the aggregate query into subqueries and performs a k-nearest neighbors (KNN) search on each individual subquery. RecBERT achieves state-of-the-art performance, outperforming other state-of-the-art models in accuracy, precision, recall, and F1 score when classifying comments across both four-class and eight-class settings. RecBERT outperforms the most precise state-of-the-art model (distilRoBERTa) by 6.97% in precision when matching comments across eight classes.
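
To make the retrieval pipeline concrete, the sketch below illustrates the subquery channel described above: an aggregate query is segmented into subqueries, each subquery is embedded and run through a cosine-similarity KNN search over the indexed comment embeddings, and the neighbors' class labels are aggregated into a class-level ranking. This is a minimal sketch under stated assumptions, not the paper's implementation: the encoder, toy data, and helper names (segment_query, class_ranking) are hypothetical stand-ins, and the naive sentence split stands in for the paper's LLM-based query segmentation, just as the off-the-shelf model stands in for the domain-adapted, contrastively fine-tuned RecBERT encoder.

```python
# Minimal, illustrative sketch of the subquery channel (not the paper's code).
# Assumes the sentence-transformers and numpy packages are installed.
from collections import Counter

import numpy as np
from sentence_transformers import SentenceTransformer

# Off-the-shelf encoder standing in for the paper's domain-adapted,
# contrastively fine-tuned sentence transformer.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy index of user comments and their class labels (hypothetical data).
comments = [
    "The trail was muddy but the summit views were worth it.",
    "Battery easily lasts two days of heavy use.",
    "The plot twist in the final chapter caught me off guard.",
    "The screen is too dim to read outdoors.",
]
labels = ["hiking", "gadgets", "books", "gadgets"]
index = model.encode(comments, normalize_embeddings=True)  # (n, d), unit rows

def segment_query(query: str) -> list[str]:
    # Naive sentence split standing in for LLM-based query segmentation.
    return [s.strip() for s in query.split(".") if s.strip()]

def class_ranking(query: str, k: int = 2) -> list[tuple[str, int]]:
    # KNN search per subquery, then a simple vote over the neighbors'
    # class labels to produce a class-level ranking for the whole query.
    votes = Counter()
    for sub in segment_query(query):
        q = model.encode(sub, normalize_embeddings=True)  # (d,), unit norm
        sims = index @ q  # dot product = cosine similarity on unit vectors
        for i in np.argsort(-sims)[:k]:
            votes[labels[i]] += 1
    return votes.most_common()

print(class_ranking("My phone dies by noon. The display washes out in sunlight."))
```

Because the embeddings are unit-normalized, the exhaustive dot-product scan is an exact cosine-similarity KNN; a production system indexing large comment sets would typically substitute an approximate nearest-neighbor index for the brute-force scan.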

Intelligent and Converged Networks
Pages 42-52
Cite this article:
Wu R. RecBERT: Semantic recommendation engine with large language model enhanced query segmentation for k-nearest neighbors ranking retrieval. Intelligent and Converged Networks, 2024, 5(1): 42-52. https://doi.org/10.23919/ICN.2024.0004


Received: 18 September 2023
Revised: 30 September 2023
Accepted: 10 October 2023
Published: 09 January 2024
© All articles included in the journal are copyrighted to the ITU and TUP.

This work is available under the CC BY-NC-ND 3.0 IGO license: https://creativecommons.org/licenses/by-nc-nd/3.0/igo/
