(2018)
Abstract.
Multilingual OCR and information extraction from receipts remain challenging, particularly for complex scripts like Arabic. We introduce ReceiptSense, a comprehensive dataset designed for Arabic-English receipt understanding comprising 20,000 annotated receipts from diverse retail settings, 30,000 OCR-annotated images, and 10,000 item-level annotations, and a new Receipt QA subset with 1,265 receipt images paired with 40 question-answer pairs each to support LLM evaluation for receipt understanding. The dataset captures merchant names, item descriptions, prices, receipt numbers, and dates to support object detection, OCR, and information extraction tasks. We establish baseline performance using traditional methods (Tesseract OCR) and advanced neural networks, demonstrating the dataset’s effectiveness for processing complex, noisy real-world receipt layouts. Our publicly accessible dataset advances automated multilingual document processing research (https://github.com/Update-For-Integrated-Business-AI/CORU).
Receipt Understanding, Multilingual OCR, Information Extraction
CCS Concepts: Computing methodologies → Computer vision tasks; Computing methodologies → Computer vision; Computing methodologies → Information extraction
1. Introduction
Optical Character Recognition (OCR) (Nguyen et al., 2021a; Hwang et al., 2019; Nguyen et al., 2021b) converts images of characters into digital text. While deep learning has improved OCR performance, challenges remain in post-OCR parsing, that is, predicting semantic labels from noisy, unstructured OCR output, especially for receipts with diverse layouts, fonts, and degraded quality. Existing datasets fall short: standard OCR datasets lack parsing labels, while parsing datasets contain clean text that does not reflect real OCR errors. Recent efforts like SROIE (Huang et al., 2019) and CORD (Park et al., 2019) offer annotated receipts but do not comprehensively address multilingual understanding. We introduce ReceiptSense, the first public dataset for advanced multilingual post-OCR parsing, featuring diverse Arabic-English receipts with detailed OCR, key information, and item-level annotations. Its four components (text-level annotations, key information labels, item-level data, and receipt question-answer pairs) support OCR enhancement, information extraction, QA, and multilingual text analysis. Figure 1 shows examples highlighting the variety of formats and complex layouts.
The primary contributions of the ReceiptSense dataset are as follows: (1) Key Information Detection: 20,000 human-annotated receipts extracting merchant names, dates, receipt numbers, item lists, and totals for key-value information extraction from noisy text. (2) Large-Scale OCR Dataset: 30,000 receipt images with box-level text annotations addressing challenges from diverse layouts, fonts, and noisy appearances. (3) Detailed Item Analysis: 10,000 individual items annotated for item-level analysis in expense management, inventory tracking, and retail analytics. (4) Receipt QA: includes 1,265 receipt images, where each is paired with 40 question-answer pairs covering merchant, date, totals, items, taxes, and payment methods. (5) Baseline Performance Metrics: Comprehensive benchmarks using traditional and modern deep learning approaches across object detection, OCR, and information extraction tasks.
ReceiptSense introduces several novel contributions compared to existing datasets (SROIE, CORD, UIT, MC-OCR). It is significantly larger, with 20,000 annotated receipts, 50,600 QA pairs, 30,000 OCR-annotated images, and 10,000 item-level annotations. Unlike existing English-only datasets with clean text, ReceiptSense incorporates multilingual Arabic-English receipts with diverse layouts, fonts, and mixed-language content reflecting global retail scenarios. The dataset includes real-world noise (blurred, rotated, or damaged text), which challenges models to perform robustly. Crucially, ReceiptSense is the first to provide item-level annotations, enabling fine-grained extraction for expense management and inventory analysis. These characteristics advance multilingual OCR, post-OCR parsing, and practical applications in finance and retail.
Figure 1. Examples of annotated receipt images from the ReceiptSense dataset.
2. Related Work
While several datasets cater to various OCR tasks (Nguyen et al., 2021a; abdallah2024transformers), there is a notable shortage of receipt datasets, particularly for Arabic receipts. To highlight the unique features and comprehensive nature of our dataset, Table 1 compares it against existing datasets like SROIE, MC-OCR, UIT, and CORD.
One of the earliest and most popular datasets in the scanned receipt domain is the ICDAR 2019 Challenge on Scanned Receipts OCR and Key Information Extraction (SROIE) dataset (Huang et al., 2019). This dataset marked a significant advancement in the automated analysis of scanned receipts by introducing a standardized collection of 1,000 annotated receipt images. It focuses on three tasks: text localization, OCR, and key information extraction. These tasks are essential for document analysis systems with substantial commercial potential. The challenge emphasized the unique difficulties inherent in receipt OCR, such as poor scan quality and complex layouts. SROIE has since become a cornerstone for research in receipt document understanding, serving as a reliable benchmark for evaluating the performance of novel models in various OCR and information extraction tasks.
The CORD (Consolidated Receipt Dataset) dataset (Park et al., 2019) addresses the challenge of integrating OCR with NLP tasks like semantic parsing by providing a comprehensive resource for post-OCR parsing. It includes thousands of Indonesian receipt images with box-level text annotations for OCR and multi-level semantic labels for parsing. Unlike traditional datasets, CORD bridges the gap between OCR and parsing, enabling the development of robust models that can handle OCR errors. It also introduces line annotations for converting two-dimensional OCR text into a well-ordered sequence, thereby enhancing parsing performance. CORD’s hierarchical labeling structure allows researchers to investigate both low-level text extraction and high-level semantic interpretation, making it a versatile resource for document understanding tasks.
The MC-OCR (Mobile Captured Receipt Recognition Challenge) dataset (vu2021mc), featured at the RIVF 2021 conference, includes 2,436 images of Vietnamese receipts captured via mobile devices. This dataset supports two tasks: predicting receipt quality and recognizing key information fields. The images were gathered by 50 data collectors and annotated in two phases: image quality assessment (IQA) and key information extraction (KIE). IQA involved evaluating text line readability, while KIE required annotators to identify and transcribe key fields. This dataset serves as a benchmark for improving document digitalization and automated financial document processing, with evaluation metrics including RMSE for IQA and CER for KIE. Its mobile-captured nature makes it particularly relevant to real-world applications where varying lighting conditions and image distortions are common challenges.
The UIT-MLReceipts dataset (nguyen2024uit) addresses the need for a more extensive and carefully labeled dataset for extracting receipt information, overcoming the limitations of existing datasets like SROIE and CORD. With a focus on enriching the available data, UIT-MLReceipts has been compiled by sourcing receipts from various establishments such as restaurants, cafes, bookstores, and supermarkets, ensuring diversity in structure, color, font, and format. Additionally, images from social media groups were incorporated to further enhance dataset variability. Following meticulous curation, a total of 2,147 receipt images were obtained and annotated, covering key information like store names, addresses, timestamps, and total costs. These annotations were refined using a Faster R-CNN model trained on the MC-OCR challenge dataset, ensuring accuracy and completeness. The resulting UIT-MLReceipts dataset offers a comprehensive representation of receipt characteristics, languages, status, and image attributes, making it a valuable asset for research in receipt information extraction and Visual Document Understanding (VDU).
Despite the existence of these datasets, there remains a substantial gap in resources tailored for Arabic receipt information extraction. Arabic receipts often present unique challenges such as right-to-left text orientation, diverse numeral systems, and variations in layout structures. Our proposed ReceiptSense dataset aims to address these challenges by offering a sizable, high-quality collection of Arabic and English receipts annotated with key information fields. This dataset not only facilitates OCR tasks but also supports advanced question-answering models by including context-aware questions related to receipt content. By bridging this gap, ReceiptSense contributes to the broader effort of developing multilingual and multimodal OCR systems capable of handling diverse document structures across languages and regions.
Table 1. Comparative overview of the ReceiptSense dataset, highlighting differences in the number of images, categories, and supported tasks. OB refers to Object detection and IE refers to Information Extraction.
| Dataset Name | # Images | # Categories | OB | OCR | IE | Item IE | ReceiptSense | Language |
| SROIE | 1,000 | 4 | ✓ | ✓ | ✓ | ✓ | ✗ | English |
| MC-OCR | 2,436 | 4 | ✗ | ✓ | ✓ | ✓ | ✗ | English & Vietnamese |
| UIT | 2,147 | 4 | ✓ | ✓ | ✓ | ✓ | ✗ | English & Vietnamese |
| CORD | 1,000 | 8 | ✓ | ✓ | ✓ | ✓ | ✗ | English |
| ReceiptSense (ours) | 20,000 | 5 | ✓ | ✓ | ✓ | ✓ | ✓ | English & Arabic |
3. Dataset Creation and Analysis
3.1. Data Collection
We developed ReceiptSense to advance multilingual receipt understanding, particularly for Arabic and English texts. Our methodology integrates rigorous data collection, annotation, and quality control processes to ensure diversity and real-world applicability.

Ethical Considerations and Data Collection: All receipts were collected with explicit user consent through the DISCO application (https://discoapp.ai/), following strict privacy protocols. We gathered over 100k receipts from restaurants, supermarkets, and retail stores across different geographical regions. To protect privacy, our annotation team applied a four-step PII redaction process: line-by-line review, sensitive data obscuring, verification, and independent cross-checking.

Annotation Process: We established detailed guidelines for annotating merchant names, items, prices, and dates, with specialized protocols for bi-directional Arabic-English text. Using MakeSense (https://www.makesense.ai/), annotators created bounding boxes in YOLO and COCO formats. For OCR tasks, we developed a custom system maintaining positional integrity of mixed-language text. Item-specific annotations covered names, classifications, quantities, prices, packaging, and brands, validated through iterative feedback loops with domain experts. For the Receipt QA subset, we used 1,265 real receipt images and manually authored 40 diverse QA pairs per receipt. Questions cover receipt metadata, item-level details, and transaction summaries. The QA pairs were validated for consistency with the receipt content.
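Since the bounding boxes are provided in both YOLO and COCO formats, a small conversion helper can be useful when mixing tools. The following is a minimal sketch, assuming the standard conventions of each format (YOLO: box center and size normalized to [0, 1]; COCO: top-left corner and size in pixels). The function names are hypothetical and are not part of the released dataset tooling.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def yolo_to_coco(box: Box, img_w: int, img_h: int) -> Box:
    """Convert a YOLO box (cx, cy, w, h), normalized to [0, 1],
    into a COCO box (x_min, y_min, width, height) in pixels."""
    cx, cy, w, h = box
    width = w * img_w
    height = h * img_h
    x_min = cx * img_w - width / 2
    y_min = cy * img_h - height / 2
    return (x_min, y_min, width, height)


def coco_to_yolo(box: Box, img_w: int, img_h: int) -> Box:
    """Inverse conversion: COCO pixel box to normalized YOLO box."""
    x_min, y_min, width, height = box
    return (
        (x_min + width / 2) / img_w,
        (y_min + height / 2) / img_h,
        width / img_w,
        height / img_h,
    )


if __name__ == "__main__":
    # A box centered in an 800x1200 receipt image (illustrative values only).
    print(yolo_to_coco((0.5, 0.5, 0.5, 0.25), 800, 1200))
```

Keeping both directions in one place makes it easy to check that a round trip returns the original annotation.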
3.2. Dataset Statistical Analysis
ReceiptSense demonstrates significant diversity across multiple dimensions. The language distribution shows Arabic predominance (53.6%), followed by English (26.2%) and mixed-language content (20.3%), reflecting real-world multilingual scenarios in international retail environments. Object class analysis reveals Item entries as most frequent, supporting detailed transaction analysis for applications like itemized billing and inventory management. The item class distribution spans from prevalent categories like ‘Soft drinks’ and ‘Rice, pasta, and noodles’ to specialized items like ‘Hair & body care’, ensuring model versatility across commercial domains. The Receipt QA subset comprises 50,600 QA pairs (1,265 receipts × 40 questions). The QA coverage includes 30% merchant/payment/date metadata, 50% item-level information, and 20% tax/total/payment details, ensuring comprehensive evaluation of receipt understanding.
4. Models
Visual QA Models
To evaluate receipt-level question answering, we applied state-of-the-art large language models (LLMs) including GPT-4o (Achiam et al., 2023), Llama 3.2 (Touvron et al., 2023a), Phi3 (Abdin et al., 2024), Phi3.5 (Abdin et al., 2024), Llava (Liu et al., 2023), Internvl2 4B (Chen et al., 2024), and Internvl2 8B (Chen et al., 2024). These models were tasked with predicting answers to 40 distinct question types per receipt, covering receipt metadata, item details, VAT, totals, and payment methods.
Object Detection Models
Weakly supervised models localize objects using only image-level labels, reducing annotation effort while maintaining accuracy (Zhang et al., 2021; Choe et al., 2020). Techniques include Class Activation Mapping (CAM) (Zhou et al., 2016), Hide-and-Seek (HAS) (Singh and Lee, 2017), Adversarial Complementary Learning (ACoL) (Zhang et al., 2018a), Self-produced Guidance (SPG) (Zhang et al., 2018b), Attention-based Dropout Layer (ADL) (Choe and Shim, 2019), and CutMix (Yun et al., 2019), each enhancing feature learning through masking, attention, or augmentation. Advanced architectures like DINO (Zhang et al., 2022) use Transformer-based deformable attention for precise localization, while YOLO models (YOLOv7 (Wang et al., 2023), YOLOv8 (Jocher et al., 2023)) offer real-time detection with multi-scale and attention mechanisms.
OCR Models
Our OCR model combines CNNs and bidirectional LSTMs for text recognition. The convolutional layer performs feature extraction through $f(x,y)=\sum_{i=-a}^{a}\sum_{j=-b}^{b}k(i,j)\cdot g(x-i,y-j)$, where $f(x,y)$ is the output feature map, $g(x,y)$ the input image, and $k(i,j)$ the convolutional filter. LSTM units capture sequential dependencies for accurate text decoding, with bidirectional processing ensuring comprehensive context understanding.
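To make the convolution formula concrete, the sketch below evaluates it directly in pure Python for a single-channel image. It assumes zero padding outside the image and a kernel of odd size (2a+1) × (2b+1); the function name `conv2d` is illustrative and is not taken from the authors' implementation. Note that many deep learning frameworks implement cross-correlation (no kernel flip) under the name "convolution", so a learned filter may be stored flipped relative to this definition.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(g: Matrix, k: Matrix) -> Matrix:
    """Evaluate f(x, y) = sum_{i=-a..a} sum_{j=-b..b} k(i, j) * g(x - i, y - j).

    g is the input image, k the filter of size (2a+1) x (2b+1).
    Pixels outside the image are treated as zero (an assumption).
    """
    rows, cols = len(g), len(g[0])
    a, b = len(k) // 2, len(k[0]) // 2
    f = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            total = 0.0
            for i in range(-a, a + 1):
                for j in range(-b, b + 1):
                    xi, yj = x - i, y - j
                    if 0 <= xi < rows and 0 <= yj < cols:
                        # k(i, j) is stored at list index [i + a][j + b].
                        total += k[i + a][j + b] * g[xi][yj]
            f[x][y] = total
    return f


if __name__ == "__main__":
    image = [[1, 2], [3, 4]]
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    print(conv2d(image, identity))
```

With the identity kernel (only k(0, 0) = 1) the output equals the input, which is a quick sanity check on the index arithmetic.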
Large Language Models
We evaluate several LLMs for information extraction tasks: Llama (Touvron et al., 2023b), Mistral (Jiang et al., 2023), Mixtral (Jiang et al., 2024), Falcon (Almazrouei et al., 2023), and Zephyr (Tunstall et al., 2023).
Table 2. Comparative Analysis of Object Detection Models Across Different Backbone Architectures.
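| Method | Backbone | Avg | IoU 10 | IoU 20 | IoU 30 | IoU 40 | IoU 50 | IoU 60 | IoU 70 | IoU 80 | IoU 90 |
| CAM | ResNet50 | 6.17 | 41.08 | 11.93 | 4.48 | 2.24 | 1.07 | 0.59 | 0.21 | 0.05 | 0.05 |
| CAM | VGG16 | 5.86 | 45.23 | 10.12 | 2.24 | 0.64 | 0.16 | 0.11 | 0.05 | 0.00 | 0.00 |
| CAM | InceptionV2 | 5.26 | 36.81 | 9.06 | 3.52 | 1.76 | 0.80 | 0.37 | 0.21 | 0.05 | 0.00 |
| HAS | ResNet50 | 7.14 | 43.26 | 16.84 | 6.55 | 2.72 | 1.17 | 0.59 | 0.21 | 0.11 | 0.00 |
| HAS | VGG16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| HAS | InceptionV2 | 4.65 | 34.79 | 8.20 | 2.40 | 0.75 | 0.21 | 0.11 | 0.05 | 0.00 | 0.00 |
| ADL | ResNet50 | 6.57 | 39.74 | 15.13 | 6.34 | 2.56 | 1.17 | 0.48 | 0.16 | 0.11 | 0.05 |
| ADL | VGG16 | 6.97 | 47.52 | 15.34 | 4.69 | 1.60 | 0.37 | 0.16 | 0.05 | 0.00 | 0.00 |
| ADL | InceptionV2 | 7.04 | 52.10 | 11.61 | 4.00 | 1.60 | 0.64 | 0.27 | 0.11 | 0.05 | 0.00 |
| ACOL | ResNet50 | 3.94 | 27.49 | 7.19 | 2.72 | 1.01 | 0.48 | 0.32 | 0.21 | 0.00 | 0.00 |
| ACOL | VGG16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| ACOL | InceptionV2 | 7.04 | 52.10 | 11.61 | 4.00 | 1.60 | 0.64 | 0.27 | 0.11 | 0.05 | 0.00 |
| SPG | ResNet50 | 6.53 | 42.14 | 13.00 | 5.06 | 2.66 | 1.33 | 0.64 | 0.27 | 0.11 | 0.05 |
| SPG | VGG16 | - | - | - | - | - | - | - | - | - | - |
| SPG | InceptionV2 | 4.74 | 35.11 | 7.73 | 2.66 | 1.07 | 0.48 | 0.27 | 0.05 | 0.00 | 0.00 |
| Cutmix | ResNet50 | 6.32 | 41.61 | 13.37 | 5.01 | 1.97 | 0.69 | 0.37 | 0.16 | 0.05 | 0.00 |
| Cutmix | VGG16 | 5.54 | 38.47 | 11.45 | 3.25 | 1.28 | 0.59 | 0.27 | 0.11 | 0.00 | 0.00 |
| Cutmix | InceptionV2 | 4.64 | 31.91 | 8.42 | 3.57 | 1.55 | 0.59 | 0.27 | 0.05 | 0.00 | 0.00 |
| DINO | Swin 4-scale | 32.2 | 45.4 | 44.6 | 43.3 | 41.9 | 39.9 | 35.9 | 27.5 | 10.6 | 0.7 |
| DINO | ResNet50 4-scale | 31.9 | 45.9 | 45.0 | 43.6 | 41.9 | 39.4 | 35.2 | 25.6 | 10.2 | 0.5 |
| DINO | ResNet50 5-scale | 29.4 | 44.1 | 43.2 | 41.7 | 39.8 | 37.1 | 32.5 | 25.0 | 0.89 | 0.00 |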
Figure 7. Performance of different models on the Receipt QA subset across key metrics.
Figure 8. OCR Performance Comparison
Table 3. Performance Evaluation of YOLO Models
| Model | P | R | mAP50 | mAP50-95 |
| YoloV7 | 76.00 | 85.60 | 79.20 | 43.70 |
| YoloV8 | 74.60 | 81.00 | 76.10 | 45.30 |
| YoloV9 | 75.70 | 83.40 | 77.90 | 46.70 |
5. Experimental Results
5.1. QA Results
Figure 7 provides a comparative analysis of various large language models on the Receipt QA subset of ReceiptSense across four key metrics: precision, recall, exact match, and contains. GPT-4o consistently outperforms other models, achieving the highest scores across all metrics — notably 37.7% precision, 36.4% recall, 35.0% exact match, and 29.1% contains. Llama3.2 (11B) follows, with precision at 32.6%, recall at 31.3%, exact match at 31.6%, and contains at 25.9%. Phi3 (4.15B) and Phi3.5 show solid but slightly lower performance, clustering in the 28–30% range for precision, recall, and exact match. Llava (7.06B) and Internvl2 models (4B, 8B) exhibit notably lower results, especially in the exact match and contains metrics, reflecting challenges in handling complex receipt QA tasks. The figure illustrates the clear advantage of larger and more advanced models like GPT-4o and Llama3.2 in comprehending and extracting accurate information from multilingual receipts.
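The four QA metrics are not formally defined in the text. The sketch below shows one common way such scores are computed per question, assuming token-level precision and recall, string equality after light normalization for exact match, and substring containment for "contains". These definitions, and the names `normalize` and `qa_scores`, are assumptions for illustration; the authors' exact scoring may differ.

```python
import re
from collections import Counter
from typing import Dict


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text.lower().strip())


def qa_scores(prediction: str, reference: str) -> Dict[str, float]:
    """Score one predicted answer against one reference answer."""
    pred, ref = normalize(prediction), normalize(reference)
    pred_tokens, ref_tokens = pred.split(), ref.split()
    # Multiset intersection counts tokens shared by prediction and reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    return {
        "precision": overlap / len(pred_tokens) if pred_tokens else 0.0,
        "recall": overlap / len(ref_tokens) if ref_tokens else 0.0,
        "exact_match": float(pred == ref),
        "contains": float(bool(ref) and ref in pred),
    }


if __name__ == "__main__":
    # Made-up answer strings, not taken from the dataset.
    print(qa_scores("The total is 45.50 SAR", "45.50 SAR"))
```

Dataset-level figures would then be averages of these per-question scores over all receipts and questions.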
Table 4. Performance of Language Models in Information Extraction Across Zero-Shot, One-Shot, Two-Shot, and Three-Shot Settings for Various Receipt Information Categories.
| Model | Parameters | #Shots | Brand F1 | Brand Acc | Weight F1 | Weight Acc | # Units F1 | # Units Acc | S.Units F1 | S.Units Acc | T.Price F1 | T.Price Acc | Price F1 | Price Acc | Pack F1 | Pack Acc | Units F1 | Units Acc | Overall F1 | Overall Acc |
| LLaMA V1 | 7B | 0 | 3.70 | 0.93 | 4.29 | 1.25 | 8.14 | 3.34 | 0.04 | 0.02 | 0.08 | 0.04 | 0.08 | 0.04 | 0.08 | 0.04 | 0.00 | 0.00 | 0.88 | 0.46 |
| LLaMA V2 | 7B | 0 | 29.16 | 16.46 | 28.32 | 15.87 | 74.94 | 60.52 | 0.04 | 0.02 | 0.75 | 0.38 | 1.50 | 0.75 | 0.33 | 0.17 | 2.39 | 0.24 | 20.70 | 10.80 |
| LLaMA V2 | 13B | 0 | 14.46 | 6.97 | 11.70 | 5.35 | 54.00 | 36.89 | 0.00 | 0.00 | 0.79 | 0.40 | 0.67 | 0.34 | 0.17 | 0.08 | 7.19 | 2.82 | 12.14 | 5.61 |
| Mistral | 7B | 0 | 32.07 | 18.53 | 25.66 | 14.05 | 82.64 | 71.37 | 0.21 | 0.10 | 1.50 | 0.75 | 1.95 | 0.98 | 0.67 | 0.34 | 16.52 | 8.20 | 24.53 | 13.29 |
| Mixtral | 8x7B | 0 | 28.98 | 16.33 | 20.93 | 10.95 | 69.35 | 53.46 | 0.08 | 0.04 | 1.08 | 0.54 | 2.07 | 1.05 | 0.08 | 0.04 | 0.65 | 0.34 | 18.18 | 9.22 |
| Falcon | 7B | 0 | 29.88 | 16.96 | 18.60 | 9.48 | 48.80 | 32.05 | 0.25 | 0.13 | 0.88 | 0.44 | 0.96 | 0.48 | 0.88 | 0.44 | 0.00 | 0.00 | 13.24 | 6.25 |
| Zephyr | 7B | 0 | 42.98 | 27.02 | 33.89 | 19.87 | 77.88 | 64.50 | 0.17 | 0.08 | 0.96 | 0.48 | 1.54 | 0.78 | 0.88 | 0.44 | 24.85 | 13.50 | 26.82 | 14.83 |
| LLaMA V1 | 7B | 1 | 47.27 | 30.68 | 57.56 | 40.41 | 42.02 | 26.22 | 76.32 | 62.36 | 84.92 | 74.87 | 84.92 | 74.87 | 6.51 | 2.44 | 0.08 | 0.04 | 55.89 | 38.74 |
| LLaMA V2 | 7B | 1 | 38.90 | 23.71 | 41.35 | 25.68 | 61.82 | 44.87 | 43.48 | 27.44 | 67.27 | 50.99 | 65.18 | 48.58 | 2.63 | 0.37 | 13.29 | 6.28 | 44.73 | 28.49 |
| LLaMA V2 | 13B | 1 | 41.51 | 25.80 | 51.86 | 34.85 | 75.58 | 61.38 | 66.42 | 50.00 | 80.06 | 67.58 | 79.34 | 66.55 | 2.83 | 0.47 | 36.26 | 21.65 | 58.18 | 41.04 |
| Mistral | 7B | 1 | 39.32 | 24.04 | 55.80 | 38.65 | 75.44 | 61.19 | 68.70 | 52.68 | 87.53 | 79.04 | 81.22 | 69.25 | 5.62 | 1.96 | 36.29 | 21.67 | 60.60 | 43.56 |
| Mixtral | 8x7B | 1 | 46.86 | 30.33 | 58.07 | 40.93 | 34.54 | 20.35 | 77.36 | 63.79 | 92.33 | 87.27 | 92.39 | 87.38 | 5.31 | 1.79 | 0.77 | 0.40 | 58.54 | 41.41 |
| LLaMA V1 | 7B | 2 | 42.75 | 26.82 | 71.73 | 56.39 | 79.40 | 66.63 | 82.29 | 70.84 | 87.50 | 78.99 | 87.42 | 78.87 | 41.58 | 25.86 | 0.0 | 0.0 | 66.68 | 50.30 |
| LLaMA V2 | 7B | 2 | 52.17 | 35.14 | 72.19 | 56.97 | 93.42 | 89.24 | 80.64 | 68.41 | 94.05 | 90.41 | 94.16 | 90.60 | 53.40 | 36.31 | 50.95 | 34.01 | 76.52 | 62.64 |
| LLaMA V2 | 13B | 2 | 51.22 | 34.26 | 74.33 | 59.72 | 93.06 | 88.59 | 94.01 | 90.33 | 95.16 | 92.48 | 95.07 | 92.32 | 13.53 | 6.42 | 45.55 | 29.19 | 75.80 | 61.66 |
| Mistral | 7B | 2 | 49.33 | 32.52 | 73.29 | 58.38 | 91.48 | 85.76 | 89.79 | 82.83 | 94.58 | 91.40 | 94.74 | 91.69 | 41.66 | 25.92 | 47.71 | 31.08 | 76.38 | 62.45 |
| Mixtral | 8x7B | 2 | 43.35 | 27.33 | 63.07 | 46.22 | 78.68 | 65.63 | 69.60 | 53.77 | 84.00 | 73.44 | 83.86 | 73.23 | 10.12 | 4.45 | 27.41 | 15.24 | 61.86 | 44.91 |
| Falcon | 7B | 2 | 47.95 | 31.29 | 65.01 | 48.38 | 81.85 | 70.19 | 89.78 | 82.81 | 85.98 | 76.54 | 84.93 | 74.89 | 13.78 | 6.56 | 0.12 | 0.06 | 65.19 | 48.59 |
| Zephyr | 7B | 2 | 52.43 | 35.39 | 71.31 | 55.86 | 92.57 | 87.71 | 90.87 | 84.69 | 94.95 | 92.09 | 94.90 | 91.98 | 57.36 | 40.21 | 63.35 | 46.54 | 79.52 | 66.81 |
| LLaMA V1 | 7B | 3 | 44.45 | 28.25 | 64.32 | 47.61 | 83.55 | 72.75 | 90.72 | 84.44 | 89.96 | 83.12 | 89.76 | 82.78 | 37.07 | 22.28 | 0 | 0 | 68.47 | 52.40 |
| LLaMA V2 | 7B | 3 | 52.43 | 35.39 | 71.31 | 55.86 | 92.57 | 87.71 | 90.87 | 84.69 | 94.95 | 92.09 | 94.90 | 93.98 | 57.36 | 40.21 | 63.35 | 46.54 | 79.52 | 66.81 |
| LLaMA V2 | 13B | 3 | 49.49 | 32.67 | 66.53 | 50.12 | 93.71 | 89.78 | 96.19 | 94.45 | 95.17 | 92.51 | 95.12 | 92.40 | 41.07 | 25.44 | 53.69 | 36.59 | 77.69 | 64.25 |
| Mistral | 7B | 3 | 51.88 | 34.87 | 76.11 | 62.08 | 92.05 | 86.77 | 94.59 | 91.42 | 95.08 | 92.34 | 94.88 | 91.96 | 49.90 | 33.05 | 66.85 | 50.50 | 80.26 | 67.87 |
| Mixtral | 8x7B | 3 | 51.65 | 34.66 | 69.71 | 53.89 | 87.24 | 78.57 | 80.13 | 67.68 | 93.28 | 88.99 | 93.15 | 88.76 | 20.40 | 10.61 | 29.61 | 16.77 | 70.60 | 54.99 |
| Falcon | 7B | 3 | 62.30 | 45.39 | 63.68 | 46.90 | 91.73 | 86.20 | 95.79 | 93.68 | 93.34 | 89.09 | 93.31 | 89.05 | 14.88 | 7.21 | 0.0 | 0.0 | 72.16 | 56.94 |
| Zephyr | 7B | 3 | 56.36 | 39.21 | 72.24 | 57.04 | 84.20 | 73.75 | 87.64 | 79.22 | 93.93 | 90.18 | 93.86 | 90.05 | 57.22 | 40.06 | 51.67 | 34.68 | 76.80 | 63.02 |
Table 5. Performance Comparison of Object Classification Methods Across Different Backbone Architectures.
| Method | ResNet50 | VGG16 | InceptionV2 |
| CAM | 43.10 | 43.74 | 38.46 |
| HAS | 34.15 | 16.36 | 40.65 |
| ADL | 41.72 | 41.56 | 42.09 |
| ACOL | 43.74 | 16.36 | 42.09 |
| SPG | 42.19 | - | 40.92 |
| Cutmix | 41.45 | 41.93 | 40.60 |
5.2. Object Detection Results
We evaluated multiple object detection models including Class Activation Mapping (CAM), Hide-and-Seek (HAS), Attention-based Dropout Layer (ADL), Adversarial Complementary Learning (ACOL), Self-produced Guidance (SPG), CutMix, and DINO across different backbone architectures (ResNet50, VGG16, InceptionV2, Swin Transformer). DINO with Swin 4-scale backbone achieved the best performance with an average score of 32.2 and notable robustness across quality thresholds (45.4 at 10%). Traditional models like CAM and HAS showed limitations, with ResNet50 scoring averages of 6.17 and 7.14 respectively, with significant performance drops at higher quality thresholds.
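The threshold columns in Table 2 can be read as the share of predicted boxes whose overlap with the ground truth meets a given IoU level. The sketch below shows that computation under this reading; it assumes one predicted box per ground-truth box and corner-format boxes (x_min, y_min, x_max, y_max). The helper names are hypothetical, and whether the reported numbers were produced exactly this way is an assumption.

```python
from typing import Dict, Iterable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_thresholds(
    pairs: Sequence[Tuple[Box, Box]],
    thresholds_pct: Iterable[int] = range(10, 100, 10),
) -> Dict[int, float]:
    """Percentage of (prediction, ground truth) pairs with IoU >= threshold."""
    scores = [iou(pred, gt) for pred, gt in pairs]
    if not scores:
        return {t: 0.0 for t in thresholds_pct}
    return {
        t: 100.0 * sum(s >= t / 100 for s in scores) / len(scores)
        for t in thresholds_pct
    }


if __name__ == "__main__":
    a = (0.0, 0.0, 10.0, 10.0)
    pairs = [(a, a), (a, (5.0, 0.0, 15.0, 10.0)), (a, (20.0, 20.0, 30.0, 30.0))]
    print(accuracy_at_thresholds(pairs))
```

Scores that fall steeply as the threshold rises, as for the weakly supervised methods, indicate boxes that overlap the target only loosely.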
Object classification results revealed strong performance from CAM with consistent scores (43.10 for ResNet50, 43.74 for VGG16). HAS showed significant variability, performing best with InceptionV2 (40.65), lower with ResNet50 (34.15), and poorly with VGG16 (16.36), indicating dependency on backbone feature extraction capabilities. Advanced methods like ADL and ACOL performed well with InceptionV2 (both achieving 42.09), suggesting that attention diversification across images improves classification in complex datasets.
YOLO model evaluation provided insights into real-time detection capabilities. YOLOv7 achieved the highest precision (76.00%), recall (85.60%), and mAP50 (79.20%), showing robustness in detecting objects on multilingual receipts with diverse layouts. YOLOv8 scored lower on these three metrics but higher on mAP50-95 (45.30% versus 43.70%), suggesting better generalization across object sizes and shapes. YOLOv9 raised mAP50-95 further to 46.70%, highlighting continuous YOLO architecture advancements, while its precision, recall, and mAP50 fell between those of YOLOv7 and YOLOv8.
5.3. OCR Results
As shown in Figure 8, OCR evaluation focused on Character Error Rate (CER) and Word Error Rate (WER). The Tesseract OCR baseline achieved 15.56% CER and 30.78% WER. The Attention-Gated CNN-BiGRU model improved on this with 14.85% CER and 27.22% WER by combining gated CNNs with bidirectional GRU layers and attention mechanisms to better handle spatial dependencies and contextual information. Our specialized OCR model reduced CER to 7.83% but kept a similar WER (27.24%), indicating better recognition of individual characters while still struggling at the word level. Azura OCR achieved the best performance with 6.39% CER and 25.97% WER.
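CER and WER are conventionally computed as the Levenshtein edit distance between the recognized text and the reference, divided by the reference length in characters or words. The sketch below follows that conventional definition; the function names are illustrative, and any text normalization the authors applied before scoring is not specified in the source.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / reference characters."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits / reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


if __name__ == "__main__":
    # Made-up strings: two character slips spoil both words.
    print(cer("total 45.50", "tota1 45.60"), wer("total 45.50", "tota1 45.60"))
```

The example shows why a model can lower CER substantially while WER barely moves: a single wrong character is enough to count the whole word as an error.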
5.4. Information Extraction Results
In zero-shot settings, models performed poorly in most categories. For Brand, Zephyr achieved the highest scores (42.98 F1, 27.02% accuracy), followed by Mistral (32.07 F1, 18.53% accuracy). The #Units category showed better results, with Mistral obtaining 82.64 F1 and 71.37% accuracy, indicating that numerical fields are easier to recognize. One-shot learning showed marked improvement, especially for LLaMA V1 in the Brand category (jumping from 3.70 to 47.27 F1). Mixtral led Weight extraction (58.07 F1, 40.93% accuracy), while Mistral achieved the best overall performance (60.60 F1, 43.56% accuracy). Few-shot learning further enhanced performance: in the two-shot setting, Zephyr recorded 52.43 F1 and 35.39% accuracy for the Brand category, while LLaMA V2 (7B) excelled in #Units extraction (93.42 F1, 89.24% accuracy). In the three-shot setting, Falcon led the Brand category (62.30 F1, 45.39% accuracy) and Mistral led the Weight category (76.11 F1, 62.08% accuracy), with Mistral also achieving the best overall performance (80.26 F1, 67.87% accuracy).
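The zero-, one-, two-, and three-shot settings differ only in how many labeled examples precede the query in the prompt. The sketch below illustrates that construction. The prompt wording, the JSON answer format, the function name `build_prompt`, and the sample item are all hypothetical; only the field names are taken from Table 4, and the prompts actually used in the experiments are not given in the source.

```python
import json
from typing import Dict, List, Tuple

# Field names as they appear in the Table 4 header.
FIELDS = ["Brand", "Weight", "# Units", "S.Units", "T.Price", "Price", "Pack", "Units"]


def build_prompt(item_text: str,
                 examples: List[Tuple[str, Dict[str, str]]],
                 k: int) -> str:
    """Build a k-shot prompt: instruction, k solved examples, then the query."""
    lines = ["Extract these fields from the receipt item as JSON: "
             + ", ".join(FIELDS) + "."]
    for text, labels in examples[:k]:
        lines.append(f"Item: {text}")
        # ensure_ascii=False keeps Arabic text readable in the prompt.
        lines.append("Answer: " + json.dumps(labels, ensure_ascii=False))
    lines.append(f"Item: {item_text}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented demonstration example, not drawn from ReceiptSense.
    demos = [("Cola 330ml x6 12.00", {"Brand": "Cola", "Weight": "330ml",
                                      "# Units": "6", "T.Price": "12.00"})]
    print(build_prompt("Rice 5kg 1 35.00", demos, k=1))
```

With k = 0 the model sees only the instruction and the query, which is consistent with the near-zero scores for several fields in the zero-shot rows: without an example, the expected output format is left implicit.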
5.5. Model Performance Analysis
Figure 9 presents a comprehensive comparison of three distinct methodological approaches across six critical performance dimensions for multilingual receipt understanding. The figure reveals complementary strengths and trade-offs between traditional methods, advanced neural networks, and large language models. Traditional approaches (Tesseract + CAM) excel in real-time performance and show reasonable multilingual support but demonstrate significant limitations in object detection, OCR accuracy, and information extraction capabilities. Advanced neural networks (DINO + Azura) achieve superior performance in object detection and OCR accuracy with strong overall robustness, though at the cost of reduced real-time processing speed. LLM-based approaches demonstrate the highest information extraction capabilities and excellent multilingual support but suffer from poor real-time performance due to computational requirements.
Figure 9. Comprehensive performance comparison of traditional, advanced neural, and LLM-based approaches across six key dimensions for multilingual receipt understanding.
6. Conclusion
In this work, we presented ReceiptSense, a comprehensive dataset designed to advance multilingual OCR and post-OCR parsing, particularly for Arabic and English receipts. The dataset consists of over 20,000 annotated receipts, 30,000 OCR-annotated images, and 10,000 item-level annotations, supporting tasks such as object detection, OCR, and information extraction. We evaluated various models, ranging from traditional methods like Tesseract OCR to advanced neural network architectures like YOLO and DINO, as well as large language models like LLaMA and Mistral. Our results demonstrate the dataset’s effectiveness in capturing real-world receipt variations and highlight the challenges posed by noisy, multilingual text. By publicly releasing ReceiptSense, we aim to foster research into document understanding across diverse domains and applications.
7. GenAI Usage Disclosure
We used OpenAI’s ChatGPT for minor language editing, specifically to rephrase sentences and correct grammatical errors.
References
- Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024).
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. 2023. The falcon series of open language models. arXiv preprint arXiv:2311.16867 (2023).
- Chen et al. (2024) Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024).
- Choe et al. (2020) Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata, and Hyunjung Shim. 2020. Evaluating weakly supervised object localization methods right. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3133–3142.
- Choe and Shim (2019) Junsuk Choe and Hyunjung Shim. 2019. Attention-based dropout layer for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2219–2228.
- Huang et al. (2019) Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. 2019. Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1516–1520.
- Hwang et al. (2019) Wonseok Hwang, Seonghyeon Kim, Minjoon Seo, Jinyeong Yim, Seunghyun Park, Sungrae Park, Junyeop Lee, Bado Lee, and Hwalsuk Lee. 2019. Post-ocr parsing: building simple and robust parser via bio tagging. In Workshop on Document Intelligence at NeurIPS 2019.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
- Jiang et al. (2024) Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
- Jocher et al. (2023) Glenn Jocher, Ayush Chaurasia, and Jiahui Qiu. 2023. YOLO by Ultralytics. https://github.com/ultralytics/ultralytics.
- Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems 36 (2023), 34892–34916.
- Nguyen et al. (2021b) Duy-Cuong Nguyen, Tuan-Anh Nguyen, and Xuan-Chung Nguyen. 2021b. MC-OCR challenge 2021: End-to-end system to extract key information from vietnamese receipts. In 2021 RIVF International Conference on Computing and Communication Technologies (RIVF). IEEE, 1–5.
- Nguyen et al. (2021a) Thi Tuyet Hai Nguyen, Adam Jatowt, Mickael Coustaty, and Antoine Doucet. 2021a. Survey of Post-OCR Processing Approaches. ACM Comput. Surv. 54, 6, Article 124 (jul 2021), 37 pages. doi:10.1145/3453476
- Park et al. (2019) Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: a consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019.
- Singh and Lee (2017) Krishna Kumar Singh and Yong Jae Lee. 2017. Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization. In International Conference on Computer Vision (ICCV).
- Touvron et al. (2023a) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023a. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- Tunstall et al. (2023) Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. 2023. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944 (2023).
- Wang et al. (2023) Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. 2023. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Yun et al. (2019) Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. 2019. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In International Conference on Computer Vision (ICCV).
- Zhang et al. (2021) Dingwen Zhang, Junwei Han, Gong Cheng, and Ming-Hsuan Yang. 2021. Weakly supervised object localization and detection: A survey. IEEE transactions on pattern analysis and machine intelligence 44, 9 (2021), 5866–5885.
- Zhang et al. (2022) Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. 2022. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022).
- Zhang et al. (2018a) Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas Huang. 2018a. Adversarial complementary learning for weakly supervised object localization. In IEEE CVPR.
- Zhang et al. (2018b) Xiaolin Zhang, Yunchao Wei, Guoliang Kang, Yi Yang, and Thomas Huang. 2018b. Self-produced Guidance for Weakly-supervised Object Localization. In European Conference on Computer Vision. Springer.
- Zhou et al. (2016) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2921–2929.