UTAR Institutional Repository

Large language model application integration for extracting unstructured data

Ho, Joe Ee (2025) Large language model application integration for extracting unstructured data. Final Year Project, UTAR.

    Abstract

    Large Language Models (LLMs) have unlocked new opportunities in processing unstructured information, which is increasingly prevalent in industries such as logistics, healthcare, and finance. To address this growing need despite the inherent inconsistency of LLM outputs, this project presents a data extraction framework that can reliably integrate LLMs into data processing pipelines to extract structured information from unstructured data sources, particularly images containing handwriting. The methodology involved developing a data processing pipeline that incorporates LLMs to automate data extraction while addressing the unpredictability of their outputs. A novel "Sieve Methodology" was introduced to enhance output reliability through multiple iterations of validation, comparison, and refinement of the data extracted by the LLMs. This approach was instrumental in ensuring that the outputs were suitable for downstream processing while optimizing batch processing efficiency. The system also supports concurrency, confidence scoring, and automated template matching with cropping to improve responsiveness and scalability in real-world deployment scenarios. The framework was tested on handwritten parcel data that also contains printed text, checkboxes, and signatures, a common challenge in the logistics industry. It achieved an accuracy of up to 96.93%, a reduction in processing time of up to 89.46%, and a 37.70% reduction in token usage compared with a baseline method in which images are fed directly into an LLM for processing. These outcomes demonstrate the effectiveness of the proposed framework in enhancing the efficiency and reliability of data extraction processes.
    The project delivered a dual-function system that can be used either as a standalone tool with a GUI or as an intermediate module (web API) within automated workflows, highlighting its practical applicability and its contribution to the advancement of intelligent data extraction systems.
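
    The abstract names the stages of the Sieve Methodology (validation, comparison, refinement) but does not give its internals. The following Python sketch is one plausible reading, not the author's implementation: each field is extracted several times, malformed values are discarded, the remaining values are compared by majority vote, and only the fields that fail are queried again. The function `sieve_extract`, the field names, the agreement threshold, and the fake extractor standing in for an LLM call are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical types: an extractor maps (image id, fields wanted) to field values.
Extractor = Callable[[str, List[str]], Dict[str, str]]
Validator = Callable[[str], bool]


def sieve_extract(
    image_id: str,
    validators: Dict[str, Validator],
    extractor: Extractor,
    runs_per_round: int = 3,
    max_rounds: int = 3,
    min_agreement: float = 0.66,
) -> Dict[str, Optional[dict]]:
    """Accept a field once enough valid runs agree; re-query only the rest."""
    accepted: Dict[str, Optional[dict]] = {}
    pending = list(validators)
    for _ in range(max_rounds):
        if not pending:
            break
        # Several independent extractions of the still-unresolved fields.
        runs = [extractor(image_id, pending) for _ in range(runs_per_round)]
        still_pending = []
        for field in pending:
            # Validation: drop missing or malformed candidates.
            values = [
                r[field] for r in runs
                if field in r and validators[field](r[field])
            ]
            if values:
                # Comparison: majority vote; agreement doubles as a confidence score.
                value, votes = Counter(values).most_common(1)[0]
                confidence = votes / runs_per_round
                if confidence >= min_agreement:
                    accepted[field] = {"value": value, "confidence": confidence}
                    continue
            # Refinement: this field goes back through the sieve next round.
            still_pending.append(field)
        pending = still_pending
    # Fields that never passed are marked None, e.g. for manual review.
    for field in pending:
        accepted[field] = None
    return accepted


# Assumed field formats, chosen only for this example.
VALIDATORS: Dict[str, Validator] = {
    "tracking_no": lambda v: len(v) == 8 and v.isalnum(),
    "postcode": lambda v: len(v) == 5 and v.isdigit(),
}


def make_fake_extractor() -> Extractor:
    """Deterministic stand-in for an LLM call; the first three postcode reads disagree."""
    postcode_reads = iter(["31900", "3190O", "31980"])

    def extract(image_id: str, fields: List[str]) -> Dict[str, str]:
        out = {}
        if "tracking_no" in fields:
            out["tracking_no"] = "AB123456"
        if "postcode" in fields:
            out["postcode"] = next(postcode_reads, "31900")
        return out

    return extract


if __name__ == "__main__":
    print(sieve_extract("parcel-001", VALIDATORS, make_fake_extractor()))
```

    In this sketch the second round asks only for the unresolved field, which is one way a design of this kind could limit repeated work; whether the reported time and token savings come from such a mechanism is not stated in the abstract.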

    Item Type: Final Year Project / Dissertation / Thesis (Final Year Project)
    Subjects: T Technology > T Technology (General)
    Divisions: Faculty of Information and Communication Technology > Bachelor of Computer Science (Honours)
    Depositing User: ML Main Library
    Date Deposited: 29 Aug 2025 11:22
    Last Modified: 29 Aug 2025 11:22
    URI: http://eprints.utar.edu.my/id/eprint/7312
