APPLICATION OF RULE-BASED METHOD FOR AUTOMATIC EXTRACTION OF TAGS FROM COLUMN-STYLE PDF-DOCUMENTS
DOI:
https://doi.org/10.26577/jpcsit2025336
Keywords:
PDF parsing, Rule-based extraction, Metadata extraction, Document structure recognition, Text mining, Low-resource language processing, Knowledge base for LLMs
Abstract
This study presents a rule-based hybrid pipeline for the automated extraction of structured metadata from PDF versions of Kazakh-language newspaper articles, focusing on the national newspaper Egemen Qazaqstan. The primary goal is to support the development of a machine-readable knowledge base for future use in training large language models (LLMs) and building an AI-powered assistant for data journalism in Kazakhstan. The pipeline integrates three open-source parsers – pdfminer.six, PyMuPDF, and pdfplumber – to extract key elements such as title, author, date, abstract, text, journal name, and category. To evaluate extraction quality, we compared the results of the automated parser against manually annotated reference files across three real-world issues of the newspaper. The evaluation employed three complementary metrics: Precision, Textual Semantic Similarity (TSS), and Holistic Precision, which jointly assess both exact and semantic matches. The experimental results show that most tags – especially structured fields like date, journal, and category – achieved perfect Holistic Precision (1.00), while more variable fields like title still scored above 0.85. The validated pipeline was then applied to the full corpus of 2,140 newspaper PDFs published between 2017 and March 2025, successfully converting 159,135 articles into structured JSON format. This enriched corpus serves as a foundational knowledge base for Kazakh-language AI systems in journalism and media analysis.
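
The comparison of parser output against manually annotated reference files can be illustrated with a minimal sketch. It is not the authors' evaluation code: the abstract does not define TSS or Holistic Precision, so the sketch computes only a per-tag exact-match precision and uses a character-level `difflib` ratio as a rough stand-in for a similarity score. The function names, the dictionary layout, and the sample articles are all hypothetical.

```python
from difflib import SequenceMatcher


def exact_precision(predicted, reference, tag):
    """Share of extracted values for `tag` that exactly match the reference.

    Articles are paired by position; articles with no extracted value
    for the tag are skipped (an assumption of this sketch).
    """
    pairs = [(p[tag], r.get(tag, "")) for p, r in zip(predicted, reference) if p.get(tag)]
    if not pairs:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in pairs)
    return matches / len(pairs)


def mean_similarity(predicted, reference, tag):
    """Average character-level similarity for `tag`.

    This is a stand-in for a semantic score, NOT the paper's TSS metric.
    """
    scores = [
        SequenceMatcher(None, p.get(tag, ""), r.get(tag, "")).ratio()
        for p, r in zip(predicted, reference)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical sample data: two articles, one title extracted with a missing letter.
reference = [
    {"title": "Жаңа жоба басталды", "date": "2024-01-05"},
    {"title": "Ауыл жаңалықтары", "date": "2024-01-05"},
]
predicted = [
    {"title": "Жаңа жоба басталды", "date": "2024-01-05"},
    {"title": "Ауыл жаңалықтар", "date": "2024-01-05"},
]

for tag in ("title", "date"):
    print(tag, exact_precision(predicted, reference, tag), mean_similarity(predicted, reference, tag))
```

In this toy case the truncated title fails the exact match but still scores high on similarity, which shows why an exact metric and a softer similarity metric are useful to report side by side.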
