Robustness of Structured Data Extraction from In-plane Rotated Documents using Multi-Modal Large Language Models (LLM)
Keywords:
multi-modal large language models, data extraction, document skew, skew correction, document understanding, OCR, computer vision
Abstract
Multi-modal large language models (LLMs) have shown remarkable performance on a variety of natural language processing tasks, including data extraction from documents. However, the accuracy of these models can be significantly degraded by in-plane document rotation, also known as skew, a common issue in real-world scanned documents. This study investigates the impact of document skew on the data extraction accuracy of three state-of-the-art multi-modal LLMs: Anthropic Claude V3 Sonnet, GPT-4-Turbo, and Llava:v1.6. We focus on extracting specific entities from synthetically generated sample documents with varying degrees of skew. The results demonstrate that document skew adversely affects the data extraction accuracy of all tested LLMs, with the severity of the impact varying across models. We identify the safe in-plane rotation angles (SIPRA) for each model and investigate the effects of skew on model hallucinations. Furthermore, we explore existing skew detection and correction mechanisms and discuss their potential limitations. We propose alternative approaches, including developing new multi-modal architectures that are inherently more robust to document skew and incorporating skewing techniques during the pre-training phase of the models. Finally, we highlight the need for more comprehensive testing across a wider range of document qualities and conditions to fully understand the challenges and opportunities of using multi-modal LLMs for information extraction in real-world scenarios.
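To illustrate the kind of classical skew detection mechanism discussed above, the sketch below implements a simple projection-profile estimator: it de-rotates a page by candidate angles and picks the angle at which black pixels concentrate most sharply into rows. This is a minimal, self-contained illustration, not the evaluation code used in this study; the synthetic page generator and all function names are hypothetical.

```python
import math
from collections import Counter

def make_skewed_page(angle_deg, n_lines=5, line_len=200, spacing=20):
    """Synthesize black-pixel coordinates for n_lines horizontal text
    lines, then rotate the whole page by angle_deg (counter-clockwise)."""
    a = math.radians(angle_deg)
    pixels = []
    for row in range(n_lines):
        y0 = row * spacing
        for x in range(line_len):
            # standard 2-D rotation of the point (x, y0) about the origin
            pixels.append((x * math.cos(a) - y0 * math.sin(a),
                           x * math.sin(a) + y0 * math.cos(a)))
    return pixels

def profile_energy(pixels, angle_deg):
    """Sum of squared row counts of the horizontal projection profile
    after de-rotating by angle_deg; this is maximal when the text lines
    align exactly with pixel rows."""
    a = math.radians(-angle_deg)
    rows = Counter(round(x * math.sin(a) + y * math.cos(a))
                   for x, y in pixels)
    return sum(c * c for c in rows.values())

def estimate_skew(pixels, search_deg=15.0, step_deg=0.5):
    """Brute-force search over candidate angles for the sharpest profile."""
    n = int(2 * search_deg / step_deg) + 1
    candidates = [i * step_deg - search_deg for i in range(n)]
    return max(candidates, key=lambda ang: profile_energy(pixels, ang))

page = make_skewed_page(7.0)
print(estimate_skew(page))  # prints 7.0
```

A production pipeline would binarize a real scan and rotate it back by the estimated angle before passing it to the model; as noted above, such correction steps have their own limitations (interpolation artifacts, sensitivity to noise and layout), which motivates the architectural alternatives proposed in this study.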
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
License Terms
Ownership and Licensing:
Authors of research papers submitted to this journal, which is owned and operated by The Science Brigade Group, retain the copyright of their work while granting the journal a right of first publication. Simultaneously, authors agree to license their research papers under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.
License Permissions:
Under the CC BY-NC-SA 4.0 License, others are permitted to share and adapt the work, as long as proper attribution is given to the authors and acknowledgement is made of the initial publication in the Journal. This license allows for the broad dissemination and utilization of research papers.
Additional Distribution Arrangements:
Authors are free to enter into separate contractual arrangements for the non-exclusive distribution of the journal's published version of the work. This may include posting the work to institutional repositories, publishing it in journals or books, or other forms of dissemination. In such cases, authors are requested to acknowledge the initial publication of the work in this Journal.
Online Posting:
Authors are encouraged to share their work online, including in institutional repositories, disciplinary repositories, or on their personal websites. This permission applies both prior to and during the submission process to the Journal. Online sharing enhances the visibility and accessibility of the research papers.
Responsibility and Liability:
Authors are responsible for ensuring that their research papers do not infringe upon the copyright, privacy, or other rights of any third party. The Science Brigade Publishers disclaim any liability or responsibility for any copyright infringement or violation of third-party rights in the research papers.