DEVELOPMENT OF THE RETRIEVAL-AUGMENTED GENERATION (RAG) SYSTEM FOR THE KAZAKH LANGUAGE USING HYBRID INFORMATION METHODS
DOI:
https://doi.org/10.26577/jpcsit4120265
Keywords:
Retrieval-Augmented Generation, Kazakh language, hybrid search, BM25, natural language processing
Abstract
This study presents the development and experimental evaluation of a Retrieval-Augmented Generation (RAG) system for the Kazakh language, with an emphasis on comparative analysis of information retrieval methods. The main aim was to test the hypothesis that a hybrid approach combining statistical (BM25) and semantic (vector search) retrieval outperforms either method used alone. Using a corpus of legal documents of the Republic of Kazakhstan, 270 experiments were conducted with three retrieval methods combined with six modern large language models (LLMs). The results show that the hybrid method achieves the highest accuracy (82.2%), a statistically significant improvement over vector search by 3.3% (p < 0.01) and over BM25 by 5.5% (p < 0.001). All three methods achieved accuracy above 75%, which confirms that RAG systems are effective for the Kazakh language. The analysis also showed that hybrid search yields the most stable results across different language models. The study contributes to the development of RAG systems for low-resource languages by offering an empirically grounded methodology for improving the accuracy and reliability of response generation.
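To make the idea of hybrid retrieval concrete, the following is a minimal sketch of how BM25 scores and vector-similarity scores could be combined into a single ranking. The abstract does not state how the two signals are fused, so reciprocal rank fusion (RRF) is used here purely as an assumption. The function names (`bm25_scores`, `cosine_scores`, `hybrid_rrf`), the toy documents, the two-dimensional "embeddings", and the parameter values (`k1`, `b`, `k`) are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+1 inside the log" form).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between the query vector and each document vector."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(y * y for y in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return [cos(query_vec, v) for v in doc_vecs]


def ranks(scores):
    """Map document index to its 1-based rank (highest score gets rank 1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {doc: r for r, doc in enumerate(order, start=1)}


def hybrid_rrf(sparse_scores, dense_scores, k=60):
    """Fuse two score lists with reciprocal rank fusion (an assumed scheme)."""
    r_sparse, r_dense = ranks(sparse_scores), ranks(dense_scores)
    fused = {
        i: 1.0 / (k + r_sparse[i]) + 1.0 / (k + r_dense[i])
        for i in range(len(sparse_scores))
    }
    return sorted(fused, key=fused.get, reverse=True)


# Toy, hand-made data (not from the study's corpus).
docs = [
    ["еңбек", "шарты", "тараптар"],
    ["салық", "кодексі", "төлем"],
    ["еңбек", "демалысы", "ұзақтығы"],
]
query = ["еңбек", "шарты"]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]]  # stand-ins for embeddings
query_vec = [1.0, 0.0]

sparse = bm25_scores(query, docs)
dense = cosine_scores(query_vec, doc_vecs)
ranking = hybrid_rrf(sparse, dense)
print(ranking)
```

RRF works on ranks rather than raw scores, so the BM25 and cosine values do not need to be normalized to a common scale. A weighted sum of normalized scores would be an equally plausible alternative to what is sketched here.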
