Receipt Processor

A Python application that processes scanned receipts and meal vouchers supplied as PDFs or image files. It splits each scan into a receipt (left column) and a meal voucher (right column), combines the split images into a single PDF, extracts data with Tesseract OCR, and exports the results to Excel.

Features

📄 PDF Support: Upload full PDFs and automatically extract each page as an image
🖼️ Image Support: Process individual image files (JPG, PNG, etc.)
📸 Dual Document Processing: Automatically splits images into receipts and vouchers
🔄 Smart Classification: Identifies document type (receipt vs voucher) automatically
📑 PDF Generation: Combines all images into a single organized PDF document
🔍 OCR Extraction: Extract text from both receipts and vouchers using Tesseract OCR
🧾 Receipt Parsing: Extracts merchant, date, items, prices, and totals
🎫 Voucher Parsing: Extracts voucher numbers, amounts, employee names, and company info
📊 Excel Export: Comprehensive spreadsheet with all extracted data and confidence scores
🧹 Auto Cleanup: Temporary files (PDF pages, split images) automatically removed after processing
📈 Quality Metrics: Confidence scores and performance analysis
🎯 Easy Setup: Automatic Tesseract OCR detection and configuration

How It Works

Input: Place PDF files in input/pdfs/ OR image files in input/images/
- PDFs are automatically split into separate page images
- Each image is then split into left (receipt) and right (voucher) columns
Process: OCR extracts appropriate data based on document type
Output:
- Combined PDF with all split images
- Excel file with comprehensive data for receipts and vouchers
Cleanup: Temporary files (PDF pages, split images) automatically deleted (configurable)
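
The left/right split described above can be sketched with Pillow. This is a minimal sketch, assuming a plain split at the horizontal midpoint; the function names split_boxes and split_image are hypothetical, and the actual logic in src/image_processor.py may differ.

```python
from pathlib import Path


def split_boxes(width, height):
    """Return (left, right) crop boxes that halve an image at its midpoint.

    Boxes use Pillow's (left, upper, right, lower) convention.
    """
    mid = width // 2
    return (0, 0, mid, height), (mid, 0, width, height)


def split_image(path):
    """Save the left (receipt) and right (voucher) halves next to the source.

    Assumption: a midpoint split and the *_left / *_right naming scheme.
    """
    from PIL import Image  # Pillow, imported lazily so split_boxes works without it

    path = Path(path)
    outputs = []
    with Image.open(path) as img:
        for suffix, box in zip(("left", "right"), split_boxes(*img.size)):
            out = path.with_name(f"{path.stem}_{suffix}{path.suffix}")
            img.crop(box).save(out)
            outputs.append(out)
    return outputs
```

Scans where the receipt and voucher are not evenly spaced would need a different split point than the midpoint assumed here.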

Project Structure

ReceiptProcessor/
├── input/                 # Input folder for source files
│   ├── pdfs/             # Place PDF files here (auto-converted to images)
│   └── images/           # Place image files here OR auto-generated from PDFs
├── outputs/               # Output folder for generated PDFs and Excel files
│   ├── pdfs/             # Generated PDF files
│   └── excel/            # Generated Excel files
├── src/                   # Source code
│   ├── __init__.py
│   ├── pdf_splitter.py    # PDF to image conversion
│   ├── image_processor.py # Image processing and metadata extraction
│   ├── pdf_generator.py   # PDF generation functionality
│   ├── ocr_processor.py   # OCR text extraction and parsing
│   └── excel_exporter.py  # Excel export functionality
├── tests/                 # Test scripts
│   ├── test_metadata.py   # Metadata extraction tests
│   ├── test_tesseract.py  # OCR installation test utility
│   ├── test_ocr_quality.py # OCR quality tests
│   ├── test_extraction.py # Data extraction tests
│   ├── analyze_results.py # Results analysis and quality metrics
│   └── demo_pdf_split.py  # PDF splitting demo
├── docs/                  # Documentation
│   ├── CONFIGURATION.md          # Configuration guide
│   ├── CLEANUP_GUIDE.md          # Cleanup behavior and settings
│   ├── PERFORMANCE_GUIDE.md      # Performance optimization tips
│   ├── METADATA_TRACKING.md      # Metadata implementation guide
│   ├── PDF_INPUT_SUMMARY.md      # PDF support documentation
│   ├── QUICK_START_PDF.md        # PDF quick start guide
│   ├── IMPLEMENTATION_VERIFICATION.md # Verification checklist
│   ├── CHANGELOG.md              # Change history
│   ├── FIELD_GUIDE.md            # Field extraction guide
│   ├── EXTRACTION_STATUS.md      # Extraction status
│   ├── IMPLEMENTATION_COMPLETE.md # Implementation details
│   └── UPDATE_SUMMARY.md         # Update summary
├── .github/               # GitHub configuration
│   ├── workflows/        # CI/CD workflows
│   └── copilot-instructions.md
├── main.py                # Main entry point
├── dev.py                 # Development workflow helper
├── requirements.txt       # Python dependencies
├── .gitignore            # Git ignore file
└── README.md             # This file

Prerequisites

Python 3.8 or higher
Tesseract OCR engine
Poppler (for PDF to image conversion)

Installing Tesseract

Windows:

Download the installer from: https://github.com/UB-Mannheim/tesseract/wiki
Install it and add the install folder to PATH (default: C:\Program Files\Tesseract-OCR)

macOS:

brew install tesseract

Linux:

sudo apt-get install tesseract-ocr

Installing Poppler (for PDF support)

Windows:

Download from: https://github.com/oschwartz10612/poppler-windows/releases
Extract and add bin folder to PATH

macOS:

brew install poppler

Linux:

sudo apt-get install poppler-utils
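
Once both tools are installed, a quick way to confirm they are reachable is to look for their executables on PATH. The sketch below is not part of the project; missing_tools is a hypothetical helper, and it assumes that pdftoppm (one of the Poppler utilities used by pdf2image) is a good indicator that Poppler is installed.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler were found.")
```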

Installation

Clone the repository:

git clone https://github.com/maxhightower/ReceiptProcessor.git
cd ReceiptProcessor

Create a virtual environment:

python -m venv venv

Activate the virtual environment:

Windows:

.\venv\Scripts\Activate.ps1

macOS/Linux:

source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Usage

Option 1: Process PDF Files

Place your PDF files in the input/pdfs/ folder
- Each page will be automatically extracted as a separate image
- Each image is then split into left (receipt) and right (voucher) columns
Run the processor:

python main.py
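
For readers curious about the page extraction step, here is a rough sketch using pdf2image. It assumes the page images follow the NAME_pageN.jpg pattern seen in the cleanup examples (e.g., Apr 2-8_page1.jpg); the function names are hypothetical and src/pdf_splitter.py may work differently.

```python
from pathlib import Path


def page_image_name(pdf_path, page_number):
    """Build a page image name such as 'Apr 2-8_page1.jpg' for a PDF page."""
    return f"{Path(pdf_path).stem}_page{page_number}.jpg"


def pdf_to_images(pdf_path, out_dir="input/images", dpi=300):
    """Render each PDF page to a JPEG in out_dir and return the saved paths."""
    from pdf2image import convert_from_path  # requires Poppler to be installed

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of Pillow images
    for number, page in enumerate(pages, start=1):
        target = out_dir / page_image_name(pdf_path, number)
        page.save(target, "JPEG")
        saved.append(target)
    return saved
```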

Option 2: Process Image Files

Place your image files in the input/images/ folder
- Images should have the receipt on the left side and meal voucher on the right side
- Supported formats: JPG, PNG, JPEG, BMP, TIFF
Run the processor:

python main.py
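
As an illustration of how the supported formats might be filtered, the sketch below selects files by extension. The names SUPPORTED_EXTENSIONS, is_supported_image, and find_images are hypothetical, and treating .tif as an alias for TIFF is an assumption.

```python
from pathlib import Path

# Extensions for the formats listed above; ".tif" is an assumed alias for TIFF.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"}


def is_supported_image(path):
    """Return True when the file extension is a supported image format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def find_images(folder="input/images"):
    """Return the supported image files in folder, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and is_supported_image(p)
    )
```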

Processing Steps

The system will:

Convert any PDFs to individual page images (stored in input/images/)
Split each image into left (receipt) and right (voucher) columns
Extract appropriate data based on document type
Generate a combined PDF and Excel file
Automatically clean up temporary files (PDF page images and split images), unless cleanup is disabled

Output

Find your outputs in the outputs/ folder:

outputs/pdfs/receipts_TIMESTAMP.pdf - Combined PDF of all split images
outputs/excel/receipts_TIMESTAMP.xlsx - Excel file with extracted data
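
The TIMESTAMP part keeps each run's outputs separate. The sketch below shows one way such names could be built; the YYYYMMDD_HHMMSS format and the output_paths helper are assumptions, not the project's confirmed behavior.

```python
from datetime import datetime
from pathlib import Path


def output_paths(now=None, root="outputs"):
    """Return (pdf_path, excel_path) that share a single timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")  # assumed format
    root = Path(root)
    return (
        root / "pdfs" / f"receipts_{stamp}.pdf",
        root / "excel" / f"receipts_{stamp}.xlsx",
    )
```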

Dependencies

Pillow - Image processing and splitting
img2pdf - Image to PDF conversion
pytesseract - OCR text extraction
openpyxl - Excel file generation
python-dateutil - Date parsing
pdf2image - PDF to image conversion

Configuration

The application can be configured using environment variables or a .env file. Copy .env.example to .env and customize:

# Input source control (default: pdfs)
RECEIPT_INPUT_SOURCE=pdfs      # Options: pdfs, images, both

# PDF processing (default: 300)
RECEIPT_PDF_DPI=300            # Resolution: 150 (fast), 300 (balanced), 600 (high quality)

# Cleanup settings (default: true)
RECEIPT_CLEANUP_TEMP_FILES=true  # Auto-remove temporary files after processing

# OCR settings (default: eng)
RECEIPT_OCR_LANGUAGE=eng       # Language: eng, fra, deu, spa, etc.

# Output quality (default: 95)
RECEIPT_PDF_QUALITY=95         # JPEG quality: 1-100

Key Settings

Input Source: Control whether to process PDFs, images, or both
PDF DPI: Higher = better quality but slower processing
Cleanup: Enable/disable automatic removal of temporary files
OCR Language: Change for non-English receipts

For detailed configuration options, see docs/CONFIGURATION.md
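
To show how these variables and their defaults fit together, here is a minimal sketch of a settings loader. The Settings class and load_settings function are hypothetical (the project's own config.py may look quite different), and the sketch reads environment variables only; it does not parse the .env file itself.

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    input_source: str = "pdfs"
    pdf_dpi: int = 300
    cleanup_temp_files: bool = True
    ocr_language: str = "eng"
    pdf_quality: int = 95


def load_settings(env=None):
    """Build Settings from RECEIPT_* variables, falling back to the defaults."""
    env = os.environ if env is None else env
    defaults = Settings()
    source = env.get("RECEIPT_INPUT_SOURCE", defaults.input_source).strip().lower()
    if source not in {"pdfs", "images", "both"}:
        raise ValueError(f"Unsupported RECEIPT_INPUT_SOURCE: {source!r}")
    # Assumption: common truthy spellings enable cleanup, anything else disables it.
    cleanup = env.get("RECEIPT_CLEANUP_TEMP_FILES", "true").strip().lower()
    return Settings(
        input_source=source,
        pdf_dpi=int(env.get("RECEIPT_PDF_DPI", defaults.pdf_dpi)),
        cleanup_temp_files=cleanup in {"1", "true", "yes", "on"},
        ocr_language=env.get("RECEIPT_OCR_LANGUAGE", defaults.ocr_language),
        pdf_quality=int(env.get("RECEIPT_PDF_QUALITY", defaults.pdf_quality)),
    )
```

Passing a plain dictionary instead of os.environ makes the loader easy to exercise in tests.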

Cleanup Behavior

By default, temporary files are automatically cleaned up after processing:

✅ PDF-extracted page images (e.g., Apr 2-8_page1.jpg)
✅ Split images (e.g., *_left.jpg, *_right.jpg)

To keep temporary files for debugging:

export RECEIPT_CLEANUP_TEMP_FILES=false    # macOS/Linux
# Windows PowerShell: $env:RECEIPT_CLEANUP_TEMP_FILES = "false"
python main.py

See docs/CLEANUP_GUIDE.md for details.
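
The cleanup step could look roughly like the sketch below. The glob patterns are taken from the example file names above, while TEMP_PATTERNS and cleanup_temp_files are hypothetical names; the project's real cleanup logic may track temporary files differently.

```python
from pathlib import Path

# Patterns based on the examples above; the exact names the project uses may differ.
TEMP_PATTERNS = ("*_page*.jpg", "*_left.jpg", "*_right.jpg")


def cleanup_temp_files(folder, enabled=True):
    """Delete temporary images in folder and return the removed paths."""
    if not enabled:
        return []
    folder = Path(folder)
    # Collect matches first so a file matching two patterns is removed only once.
    targets = {
        path
        for pattern in TEMP_PATTERNS
        for path in folder.glob(pattern)
        if path.is_file()
    }
    for path in targets:
        path.unlink()
    return sorted(targets)
```

Note that pattern-based deletion like this would also remove a user-supplied image whose name happens to end in _left.jpg or _right.jpg.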

Excel Output Format

The Excel file contains two sheets:

Sheet 1: Summary

Overview of processing results

Sheet 2: Receipt Data

Streamlined data with 11 focused columns:

For ALL Documents:

Filename - Original image filename
Document Type - receipt, voucher, or unknown
Extraction Confidence - Success rate of extracting specific fields (0-100%)
Space Confidence - Confidence that document space contains expected type (0-100%)
Text Length - Number of characters extracted via OCR
Raw Text - Full OCR output for manual review

Receipt-Specific Columns (Left Column Documents):

Total (Eat In) - Eat-in total amount from receipt
Meal Voucher - Meal voucher amount used on receipt
Served By - Name of person who served/processed order
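
As an illustration of how these three fields might be pulled out of OCR text, here is a regex-based sketch. The patterns are hypothetical guesses at how the labels appear on a receipt, and parse_receipt is not the project's actual parser.

```python
import re

# Hypothetical patterns: real receipts may label these fields differently.
RECEIPT_PATTERNS = {
    "Total (Eat In)": re.compile(
        r"total\s*\(?eat\s*in\)?\s*[:\-]?\s*[£$€]?\s*(\d+[.,]\d{2})", re.I
    ),
    "Meal Voucher": re.compile(
        r"meal\s*voucher\s*[:\-]?\s*[£$€]?\s*(\d+[.,]\d{2})", re.I
    ),
    "Served By": re.compile(
        r"served\s*by\s*[:\-]?\s*([A-Za-z][A-Za-z .'-]*)", re.I
    ),
}


def parse_receipt(text):
    """Return each receipt field, or 'Not found' when no pattern matches."""
    fields = {}
    for name, pattern in RECEIPT_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else "Not found"
    return fields
```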

Voucher-Specific Columns (Right Column Documents):

Valid - Validity/expiration date of voucher
Amount - Monetary value of the meal voucher

Note:

Receipt rows will have empty voucher-specific columns (Valid, Amount)

Voucher rows will have empty receipt-specific columns (Total Eat In, Meal Voucher, Served By)

Fields showing "Not found" indicate OCR couldn't locate that specific data
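
The rules in this note can be expressed as a small row builder. This is a sketch only: the column order and the build_row helper are assumptions, and the real export in src/excel_exporter.py may be organized differently.

```python
COMMON = ["Filename", "Document Type", "Extraction Confidence",
          "Space Confidence", "Text Length", "Raw Text"]
RECEIPT_ONLY = ["Total (Eat In)", "Meal Voucher", "Served By"]
VOUCHER_ONLY = ["Valid", "Amount"]
COLUMNS = COMMON + RECEIPT_ONLY + VOUCHER_ONLY  # 11 columns in total


def build_row(doc):
    """Return one spreadsheet row (a list of 11 cells) for a parsed document."""
    own = {"receipt": RECEIPT_ONLY, "voucher": VOUCHER_ONLY}.get(
        doc.get("Document Type"), []
    )
    row = []
    for column in COLUMNS:
        if column in COMMON:
            row.append(doc.get(column, ""))
        elif column in own:
            row.append(doc.get(column, "Not found"))  # expected but missing
        else:
            row.append("")  # column belongs to the other document type
    return row
```

With openpyxl, rows built this way could be written by calling the worksheet's append method once with COLUMNS and then once per document.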

Confidence Scores Explained

Extraction Confidence

Measures how successfully the targeted fields were extracted:

Receipt: Based on finding Total (Eat In), Meal Voucher, and Served By
Voucher: Based on finding Valid date and Amount
Higher percentage = more fields successfully extracted
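
One simple way to compute such a score is the share of expected fields that were found. The sketch below assumes exactly that ratio; the project's actual formula is not documented here and may weight fields differently.

```python
EXPECTED_FIELDS = {
    "receipt": ("Total (Eat In)", "Meal Voucher", "Served By"),
    "voucher": ("Valid", "Amount"),
}


def extraction_confidence(doc_type, fields):
    """Return the percentage of expected fields that were actually found."""
    expected = EXPECTED_FIELDS.get(doc_type, ())
    if not expected:
        return 0.0  # unknown document types have no expected fields
    found = 0
    for name in expected:
        value = fields.get(name)
        if value and value != "Not found":
            found += 1
    return round(100.0 * found / len(expected), 1)
```

Under this assumption, a receipt with two of its three fields found would score 66.7%.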

Space Confidence

Measures confidence that the document space contains the expected type:

Analyzes keywords, text patterns, and document structure
Helps identify potential classification errors
Higher percentage = stronger indicators of correct document type
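
To make the idea concrete, here is a keyword-only sketch of such a score. The keyword lists and the space_confidence function are hypothetical, and the sketch ignores the text-pattern and document-structure signals mentioned above.

```python
# Hypothetical keyword lists; tune them to the documents being scanned.
KEYWORDS = {
    "receipt": ("total", "served by", "eat in", "vat"),
    "voucher": ("voucher", "valid", "amount", "employee"),
}


def space_confidence(expected_type, text):
    """Return the percentage of expected-type keywords present in the OCR text."""
    keywords = KEYWORDS.get(expected_type, ())
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for word in keywords if word in lowered)
    return round(100.0 * hits / len(keywords), 1)
```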

See docs/FIELD_GUIDE.md for detailed extraction patterns and troubleshooting.

Performance Metrics

Current parsing performance (based on testing with 201 receipts):

Text Extraction: 96% success rate (193/201 receipts)
Merchant Names: 98% success rate
Total Amounts: 83% success rate
Items: 80% success rate
Dates: 20% success rate (area for improvement)
Average Confidence: 80.7%

Development Tools

Quick Test & Analysis

# Test Tesseract installation
python tests/test_tesseract.py

# Analyze parsing results
python tests/analyze_results.py

# Development workflow helper
python dev.py [test|process|analyze|status|commit|push|all]

Git Workflow

# Useful aliases are pre-configured:
git st          # git status
git co          # git checkout  
git br          # git branch
git lg          # git log --oneline --graph --decorate

Troubleshooting

Issue: Tesseract not found

Ensure Tesseract is installed and added to your system PATH
On Windows, you may need to set the path in the code: pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
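
Rather than hard-coding the path, the lookup can fall back to the default Windows location only when Tesseract is not on PATH. The helper names below are hypothetical and this is only a sketch of one approach; pytesseract.pytesseract.tesseract_cmd is the same setting shown above.

```python
import os
import shutil

# Default Windows install location mentioned above.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(name="tesseract", candidates=(WINDOWS_DEFAULT,)):
    """Return a path to the Tesseract executable, or None if none is found."""
    on_path = shutil.which(name)
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the detected executable; raise if none exists."""
    import pytesseract  # imported lazily so find_tesseract works without it

    path = find_tesseract()
    if path is None:
        raise RuntimeError("Tesseract not found; install it or add it to PATH")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```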

Issue: Poor OCR accuracy

Ensure images are clear and high resolution
Images should be well-lit with minimal glare
Try preprocessing images (contrast adjustment, noise reduction)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - feel free to use this project for personal or commercial purposes.

Future Enhancements

Web interface for uploading receipts
Machine learning for better data extraction
Support for multiple languages
Database integration
Duplicate receipt detection
Category classification
Cloud storage integration

Author

Created for processing receipts efficiently and maintaining organized expense records.

maxhightower/ReceiptProcessor