APPLICATION OF RULE-BASED METHOD FOR AUTOMATIC EXTRACTION OF TAGS FROM COLUMN-STYLE PDF-DOCUMENTS

Authors

DOI:

https://doi.org/10.26577/jpcsit2025336

Keywords:

PDF parsing, Rule-based extraction, Metadata extraction, Document structure recognition, Text mining, Low-resource language processing, Knowledge base for LLMs

Abstract

This study presents a rule-based hybrid pipeline for the automated extraction of structured metadata from PDF versions of Kazakh-language newspaper articles, focusing on the national newspaper Egemen Qazaqstan. The primary goal is to support the development of a machine-readable knowledge base for future use in training large language models (LLMs) and building an AI-powered assistant for data journalism in Kazakhstan. The pipeline integrates three open-source parsers – pdfminer.six, PyMuPDF, and pdfplumber – to extract key elements such as title, author, date, abstract, text, journal name, and category. To evaluate extraction quality, we compared the results of the automated parser against manually annotated reference files across three real-world issues of the newspaper. The evaluation employed three complementary metrics: Precision, Textual Semantic Similarity (TSS), and Holistic Precision, which jointly assess both exact and semantic matches. The experimental results show that most tags – especially structured fields like date, journal, and category – achieved perfect Holistic Precision (1.00), while more variable fields like title still scored above 0.85. The validated pipeline was then applied to the full corpus of 2,140 newspaper PDFs published between 2017 and March 2025, successfully converting 159,135 articles into structured JSON format. This enriched corpus serves as a foundational knowledge base for Kazakh-language AI systems in journalism and media analysis.
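
The abstract names three metrics but does not define them. The sketch below is only one plausible reading, not the authors' implementation. It assumes that Precision is the share of exact matches after whitespace and case normalization, that a character-level ratio from Python's `difflib` can stand in for Textual Semantic Similarity, and that Holistic Precision counts a value as correct when that ratio reaches a threshold. The function names and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Collapse whitespace and ignore case before comparing (assumption)."""
    return " ".join(str(value).split()).casefold()


def similarity(predicted, reference):
    """Character-level ratio in [0, 1]; a stand-in for TSS, not the paper's metric."""
    return SequenceMatcher(None, normalize(predicted), normalize(reference)).ratio()


def score_tag(pairs, threshold=0.9):
    """Score one tag over (predicted, reference) pairs.

    The threshold is a hypothetical cut-off for counting a near match
    as correct in the holistic score.
    """
    if not pairs:
        raise ValueError("at least one (predicted, reference) pair is required")
    sims = [similarity(p, r) for p, r in pairs]
    exact = sum(normalize(p) == normalize(r) for p, r in pairs)
    near = sum(s >= threshold for s in sims)
    n = len(pairs)
    return {
        "precision": exact / n,        # exact matches only
        "similarity": sum(sims) / n,   # mean similarity
        "holistic": near / n,          # exact or near-exact matches
    }


if __name__ == "__main__":
    # Illustrative values only; these are not data from the study.
    title_pairs = [
        ("New law adopted", "New law adopted."),
        ("Harvest begins", "Harvest begins"),
    ]
    print(score_tag(title_pairs))
```

Under this reading, a title that differs from the reference only by a trailing period lowers Precision but still counts toward the holistic score, which is consistent with variable fields such as title scoring lower than fixed fields such as date.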


Author Biographies

Assel Ospan, Al Farabi Kazakh National University, Almaty, Kazakhstan

Assel Ospan is a senior lecturer at the Department of Artificial Intelligence and Big Data, al-Farabi Kazakh National University (Almaty, Kazakhstan, assel.ospan@kaznu.edu.kz). Her research focuses on the development of large language models for the Kazakh language, intelligent information extraction, and knowledge base construction. She actively participates in national AI research initiatives and has authored several publications on NLP and data journalism.
ORCID iD: 0000-0002-1860-6997.

Madina Mansurova, Al Farabi Kazakh National University, Almaty, Kazakhstan

Madina Mansurova is a Professor and the head of the Department of Artificial Intelligence and Big Data, al-Farabi Kazakh National University (Almaty, Kazakhstan, madina.mansurova@kaznu.edu.kz). She works in higher education and contributes to the advancement of new technologies. Prof. Mansurova is the author of more than 100 scientific articles, 10 monographs, 2 textbooks approved by the Ministry of Education and Science of the Republic of Kazakhstan, 5 utility model patents in the field of automation and control, and over 40 registered copyrights. Since 2012, she has been the scientific supervisor of grant and program-targeted funding projects of the Ministry of Education and Science of the Republic of Kazakhstan. She has published 108 articles indexed in Scopus and Web of Science, with a Hirsch index of 7 in Scopus and 205 citations.
ORCID iD: 0000-0002-9680-2758

Kanat Auyesbay, Al Farabi Kazakh National University, Almaty, Kazakhstan

Kanat Auyesbay is the Dean of the Faculty of Journalism at Al-Farabi Kazakh National University (Almaty, Kazakhstan, kanat.auyesbay@kaznu.edu.kz). He is a journalist-educator who bridges the fields of media and higher education. Dr. Auyesbay holds a Candidate of Philological Sciences degree (equivalent to PhD) and has extensive experience in both media production and academic leadership. As a recipient of the Bolashak International Scholarship, he completed a research and teaching internship at the University of East Anglia, UK (Norwich, 2013–2014). He served as Chairman of the State Attestation Commission at the Faculty of Journalism and Political Science of L.N. Gumilyov Eurasian National University (2023–2024). Since 2018, he has been a corresponding member of the Kazakhstan Academy of Pedagogical Sciences, and in 2019–2021 he was a member of the Educational-Methodical Association under the Republican Educational-Methodical Council (ROƏK) for Journalism and Information. He has also served on the expert commission for training specialists abroad under the Bolashak program and has supervised and reviewed numerous theses and doctoral dissertations in media studies. ORCID iD: 0009-0001-3529-9888.

Talshyn Sarsembayeva, Al Farabi Kazakh National University, Almaty, Kazakhstan

Talshyn Sarsembayeva is a senior lecturer at the Department of Artificial Intelligence and Big Data, al-Farabi Kazakh National University (Almaty, Kazakhstan, talshyn.sagdatbek@kaznu.edu.kz). Her work focuses on the integration of artificial intelligence and data processing tools in journalistic practice. She has contributed to projects involving the structuring of large-scale media archives and the development of AI-assisted systems for Kazakh-language content. ORCID iD: 0000-0001-7668-2640.

Aman Mussa, Al Farabi Kazakh National University, Almaty, Kazakhstan

Aman Mussa is a research assistant at the Department of Artificial Intelligence and Big Data, al-Farabi Kazakh National University (Almaty, Kazakhstan, mussa.aman0519@gmail.com). He is engaged in the development of rule-based and hybrid NLP pipelines, with a focus on Kazakh-language PDF processing. His work supports large-scale knowledge base generation for intelligent assistants in data journalism. ORCID iD: 0009-0001-9972-7677.


How to Cite

Ospan, A., Mansurova, M., Auyesbay, K., Sarsembayeva, T., & Mussa, A. (2025). APPLICATION OF RULE-BASED METHOD FOR AUTOMATIC EXTRACTION OF TAGS FROM COLUMN-STYLE PDF-DOCUMENTS. Journal of Problems in Computer Science and Information Technologies, 3(3), 52–67. https://doi.org/10.26577/jpcsit2025336