DEVELOPMENT OF THE RETRIEVAL-AUGMENTED GENERATION (RAG) SYSTEM FOR THE KAZAKH LANGUAGE USING HYBRID INFORMATION METHODS

Authors

DOI:

https://doi.org/10.26577/jpcsit4120265

Keywords:

Retrieval-Augmented Generation, Kazakh language, hybrid search, BM25, natural language processing

Abstract

This study presents the development and experimental evaluation of a Retrieval-Augmented Generation (RAG) system for the Kazakh language, with an emphasis on a comparative analysis of information retrieval methods. The main aim was to test the hypothesis that a hybrid approach combining statistical (BM25) and semantic (vector search) retrieval outperforms either approach used alone. Using a corpus of legal documents of the Republic of Kazakhstan, 270 experiments were conducted with three retrieval methods combined with six modern large language models (LLMs). The results show that the hybrid method achieves the highest accuracy (82.2%), statistically significantly surpassing vector search by 3.3% (p < 0.01) and BM25 by 5.5% (p < 0.001). All three methods achieved accuracy above 75%, indicating that RAG systems are effective for the Kazakh language. The analysis also showed that hybrid search gives the most stable results across different language models. The study contributes to the development of RAG systems for low-resource languages by offering an empirically grounded methodology for improving the accuracy and reliability of response generation.
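As an illustration of the hybrid idea described in the abstract, the following is a minimal sketch of how BM25 scores and vector-search scores might be combined into one ranking. The abstract does not state which fusion rule the authors used, so the weighted sum of min-max-normalized scores, the weight `alpha`, the BM25 parameters `k1` and `b`, and all function names here are assumptions made for illustration only. The dense (vector) scores are taken as given, for example cosine similarities from any embedding model.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query.

    Uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)).
    k1 and b are common default values, not values taken from the paper.
    """
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(sparse, dense, alpha=0.5):
    """Weighted sum of normalized sparse (BM25) and dense (vector) scores.

    alpha is a hypothetical mixing weight: 1.0 is pure BM25, 0.0 pure vector.
    """
    s, d = min_max(sparse), min_max(dense)
    return [alpha * x + (1 - alpha) * y for x, y in zip(s, d)]


if __name__ == "__main__":
    # Toy tokenized documents and made-up dense similarities (illustrative only).
    docs = [["a", "b"], ["c", "d"], ["a", "c"]]
    sparse = bm25_scores(["a"], docs)
    dense = [0.1, 0.9, 0.5]
    fused = hybrid_scores(sparse, dense)
    ranking = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    print(ranking)
```

Normalizing both score lists before mixing matters because BM25 scores are unbounded while cosine similarities lie in a fixed range. Rank-based fusion (for example reciprocal rank fusion) is a common alternative that avoids score normalization altogether.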


Author Biographies

Nurlykhan Kalzhanov, Al-Farabi Kazakh National University, Almaty, Kazakhstan

Nurlykhan Kalzhanov is a Master’s student in Computer Engineering at al-Farabi Kazakh National University (Almaty, Kazakhstan, nurkal022@gmail.com). His research interests include machine learning, information retrieval, and natural language processing, with a particular focus on Retrieval-Augmented Generation (RAG) systems for low-resource languages. He has participated in research projects related to artificial intelligence applications and computational linguistics.

Sauirbek Artykbay, Al-Farabi Kazakh National University, Almaty, Kazakhstan

Sauirbek Artykbay is a Master’s student in Computer Engineering at al-Farabi Kazakh National University (Almaty, Kazakhstan, artikbaisauirbek@gmail.com). His research interests include machine learning, information retrieval, and natural language processing, with a particular focus on semantic search systems for low-resource languages.

Akniyet Kalzhan, Al-Farabi Kazakh National University, Almaty, Kazakhstan

Akniyet Kalzhan is a Bachelor's student in Data Science at al-Farabi Kazakh National University (Almaty, Kazakhstan, aknietkalzhan@gmail.com). Her research interests include computer vision, large language models (LLMs), and data science. She is currently working on her undergraduate thesis focused on knowledge distillation in large language models. Akniyet has completed internships in two research laboratories at al-Farabi Kazakh National University, where she gained practical experience in artificial intelligence and machine learning applications.


How to Cite

Kalzhanov, N., Artykbay, S., & Kalzhan, A. (2026). DEVELOPMENT OF THE RETRIEVAL-AUGMENTED GENERATION (RAG) SYSTEM FOR THE KAZAKH LANGUAGE USING HYBRID INFORMATION METHODS. Journal of Problems in Computer Science and Information Technologies, 4(1), 48–65. https://doi.org/10.26577/jpcsit4120265