Receipt Processor

A Python application that processes receipt images and meal vouchers from PDFs or image files, converts them to PDF, extracts data using OCR, and exports to Excel. The system automatically splits scanned images to separately process receipts (left column) and meal vouchers (right column).

Features

  • PDF Support: Upload full PDFs and automatically extract each page as an image
  • Image Support: Process individual image files (JPG, PNG, etc.)
  • Dual Document Processing: Automatically splits images into receipts and vouchers
  • Smart Classification: Identifies document type (receipt vs voucher) automatically
  • PDF Generation: Combines all images into a single organized PDF document
  • OCR Extraction: Extract text from both receipts and vouchers using Tesseract OCR
  • Receipt Parsing: Extracts merchant, date, items, prices, and totals
  • Voucher Parsing: Extracts voucher numbers, amounts, employee names, and company info
  • Excel Export: Comprehensive spreadsheet with all extracted data and confidence scores
  • Auto Cleanup: Temporary files (PDF pages, split images) automatically removed after processing
  • Quality Metrics: Confidence scores and performance analysis
  • Easy Setup: Automatic Tesseract OCR detection and configuration

How It Works

  1. Input: Place PDF files in input/pdfs/ OR image files in input/images/
    • PDFs are automatically split into separate page images
    • Each image is then split into left (receipt) and right (voucher) columns
  2. Process: OCR extracts appropriate data based on document type
  3. Output:
    • Combined PDF with all split images
    • Excel file with comprehensive data for receipts and vouchers
  4. Cleanup: Temporary files (PDF pages, split images) automatically deleted (configurable)
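
The left/right split described in step 1 can be sketched as follows. This is a minimal illustration, not the code in src/image_processor.py: it assumes the page is divided at the horizontal midpoint, and the names `column_boxes` and `split_image` are hypothetical.

```python
from typing import Tuple

# (left, upper, right, lower), the box format used by Pillow's Image.crop
Box = Tuple[int, int, int, int]


def column_boxes(width: int, height: int) -> Tuple[Box, Box]:
    """Return crop boxes for the left (receipt) and right (voucher) halves.

    Assumption: the split is at the horizontal midpoint of the page.
    """
    middle = width // 2
    return (0, 0, middle, height), (middle, 0, width, height)


def split_image(path: str):
    """Open an image and return (receipt_half, voucher_half) as Pillow images."""
    from PIL import Image  # imported here so column_boxes works without Pillow

    with Image.open(path) as image:
        left_box, right_box = column_boxes(*image.size)
        return image.crop(left_box), image.crop(right_box)
```

If scans are not centered, the midpoint would need to be replaced by a detected gutter position or a configurable ratio.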

Project Structure

ReceiptProcessor/
├── input/                 # Input folder for source files
│   ├── pdfs/              # Place PDF files here (auto-converted to images)
│   └── images/            # Place image files here OR auto-generated from PDFs
├── outputs/               # Output folder for generated PDFs and Excel files
│   ├── pdfs/              # Generated PDF files
│   └── excel/             # Generated Excel files
├── src/                   # Source code
│   ├── __init__.py
│   ├── pdf_splitter.py    # PDF to image conversion
│   ├── image_processor.py # Image processing and metadata extraction
│   ├── pdf_generator.py   # PDF generation functionality
│   ├── ocr_processor.py   # OCR text extraction and parsing
│   └── excel_exporter.py  # Excel export functionality
├── tests/                 # Test scripts
│   ├── test_metadata.py   # Metadata extraction tests
│   ├── test_tesseract.py  # OCR installation test utility
│   ├── test_ocr_quality.py # OCR quality tests
│   ├── test_extraction.py # Data extraction tests
│   ├── analyze_results.py # Results analysis and quality metrics
│   └── demo_pdf_split.py  # PDF splitting demo
├── docs/                  # Documentation
│   ├── CONFIGURATION.md          # Configuration guide
│   ├── CLEANUP_GUIDE.md          # Cleanup behavior and settings
│   ├── PERFORMANCE_GUIDE.md      # Performance optimization tips
│   ├── METADATA_TRACKING.md      # Metadata implementation guide
│   ├── PDF_INPUT_SUMMARY.md      # PDF support documentation
│   ├── QUICK_START_PDF.md        # PDF quick start guide
│   ├── IMPLEMENTATION_VERIFICATION.md # Verification checklist
│   ├── CHANGELOG.md              # Change history
│   ├── FIELD_GUIDE.md            # Field extraction guide
│   ├── EXTRACTION_STATUS.md      # Extraction status
│   ├── IMPLEMENTATION_COMPLETE.md # Implementation details
│   └── UPDATE_SUMMARY.md         # Update summary
├── .github/               # GitHub configuration
│   ├── workflows/         # CI/CD workflows
│   └── copilot-instructions.md
├── main.py                # Main entry point
├── dev.py                 # Development workflow helper
├── requirements.txt       # Python dependencies
├── .gitignore             # Git ignore file
└── README.md              # This file

Prerequisites

  • Python 3.8 or higher
  • Tesseract OCR engine
  • Poppler (for PDF to image conversion)

Installing Tesseract

Windows:

  1. Download the installer from: https://github.com/UB-Mannheim/tesseract/wiki
  2. Install and add to PATH (default: C:\Program Files\Tesseract-OCR)

macOS:

brew install tesseract

Linux:

sudo apt-get install tesseract-ocr

Installing Poppler (for PDF support)

Windows:

  1. Download from: https://github.com/oschwartz10612/poppler-windows/releases
  2. Extract and add bin folder to PATH

macOS:

brew install poppler

Linux:

sudo apt-get install poppler-utils

Installation

  1. Clone the repository:
git clone https://github.com/maxhightower/ReceiptProcessor.git
cd ReceiptProcessor
  2. Create a virtual environment:
python -m venv venv
  3. Activate the virtual environment:

Windows:

.\venv\Scripts\Activate.ps1

macOS/Linux:

source venv/bin/activate
  4. Install dependencies:
pip install -r requirements.txt

Usage

Option 1: Process PDF Files

  1. Place your PDF files in the input/pdfs/ folder

    • Each page will be automatically extracted as a separate image
    • Each image is then split into left (receipt) and right (voucher) columns
  2. Run the processor:

python main.py

Option 2: Process Image Files

  1. Place your image files in the input/images/ folder

    • Images should have the receipt on the left side and meal voucher on the right side
    • Supported formats: JPG, PNG, JPEG, BMP, TIFF
  2. Run the processor:

python main.py
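
To check which files in input/images/ would be picked up, the supported-format list above can be applied as a simple extension filter. This is a sketch under assumptions: `find_images` is a hypothetical helper, and case-insensitive matching is assumed rather than confirmed by the project.

```python
from pathlib import Path
from typing import List

# Extensions taken from the "Supported formats" list; matching is case-insensitive.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff"}


def find_images(folder: str = "input/images") -> List[Path]:
    """Return the supported image files directly inside `folder`, sorted by path."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```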

Processing Steps

The system will:

  • Convert any PDFs to individual page images (stored in input/images/)
  • Split each image into left (receipt) and right (voucher) columns
  • Extract appropriate data based on document type
  • Generate a combined PDF and Excel file
  • Automatically clean up temporary split files
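
The first step, converting PDF pages to images, might look like the sketch below. It uses `convert_from_path` from the pdf2image dependency (which needs Poppler installed); the function names are hypothetical, and the `<pdf name>_page<N>.jpg` naming is inferred from the cleanup example `Apr 2-8_page1.jpg` rather than taken from src/pdf_splitter.py.

```python
from pathlib import Path
from typing import List


def page_image_name(pdf_path: str, page_number: int) -> str:
    """Build a page image filename such as 'Apr 2-8_page1.jpg' (pages count from 1)."""
    return f"{Path(pdf_path).stem}_page{page_number}.jpg"


def pdf_to_images(pdf_path: str, out_dir: str = "input/images", dpi: int = 300) -> List[Path]:
    """Render each PDF page to a JPEG in out_dir and return the written paths."""
    from pdf2image import convert_from_path  # requires Poppler on PATH

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = out / page_image_name(pdf_path, number)
        page.convert("RGB").save(target, "JPEG")  # JPEG cannot store an alpha channel
        written.append(target)
    return written
```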

Output

Find your outputs in the outputs/ folder:

  • outputs/pdfs/receipts_TIMESTAMP.pdf - Combined PDF of all split images
  • outputs/excel/receipts_TIMESTAMP.xlsx - Excel file with extracted data

Dependencies

  • Pillow - Image processing and splitting
  • img2pdf - Image to PDF conversion
  • pytesseract - OCR text extraction
  • openpyxl - Excel file generation
  • python-dateutil - Date parsing
  • pdf2image - PDF to image conversion

Configuration

The application can be configured using environment variables or a .env file. Copy .env.example to .env and customize:

# Input source control (default: pdfs)
RECEIPT_INPUT_SOURCE=pdfs      # Options: pdfs, images, both

# PDF processing (default: 300)
RECEIPT_PDF_DPI=300            # Resolution: 150 (fast), 300 (balanced), 600 (high quality)

# Cleanup settings (default: true)
RECEIPT_CLEANUP_TEMP_FILES=true  # Auto-remove temporary files after processing

# OCR settings (default: eng)
RECEIPT_OCR_LANGUAGE=eng       # Language: eng, fra, deu, spa, etc.

# Output quality (default: 95)
RECEIPT_PDF_QUALITY=95         # JPEG quality: 1-100

Key Settings

  • Input Source: Control whether to process PDFs, images, or both
  • PDF DPI: Higher = better quality but slower processing
  • Cleanup: Enable/disable automatic removal of temporary files
  • OCR Language: Change for non-English receipts

For detailed configuration options, see docs/CONFIGURATION.md
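
For readers who want to see how these variables could be consumed, here is a minimal sketch that reads them with the documented defaults. `Settings` and `load_settings` are hypothetical names, the accepted boolean spellings are an assumption, and the sketch reads environment variables only (it does not parse a .env file).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    input_source: str = "pdfs"
    pdf_dpi: int = 300
    cleanup_temp_files: bool = True
    ocr_language: str = "eng"
    pdf_quality: int = 95


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from RECEIPT_* variables, falling back to the documented defaults."""
    source = env.get("RECEIPT_INPUT_SOURCE", "pdfs").strip().lower()
    if source not in {"pdfs", "images", "both"}:
        raise ValueError(f"RECEIPT_INPUT_SOURCE must be pdfs, images, or both, got {source!r}")
    quality = int(env.get("RECEIPT_PDF_QUALITY", "95"))
    if not 1 <= quality <= 100:
        raise ValueError(f"RECEIPT_PDF_QUALITY must be between 1 and 100, got {quality}")
    # Assumption: "true", "1", and "yes" enable cleanup; anything else disables it.
    cleanup = env.get("RECEIPT_CLEANUP_TEMP_FILES", "true").strip().lower() in {"1", "true", "yes"}
    return Settings(
        input_source=source,
        pdf_dpi=int(env.get("RECEIPT_PDF_DPI", "300")),
        cleanup_temp_files=cleanup,
        ocr_language=env.get("RECEIPT_OCR_LANGUAGE", "eng"),
        pdf_quality=quality,
    )
```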

Cleanup Behavior

By default, temporary files are automatically cleaned up after processing:

  • ✅ PDF-extracted page images (e.g., Apr 2-8_page1.jpg)
  • ✅ Split images (e.g., *_left.jpg, *_right.jpg)

To keep temporary files for debugging:

export RECEIPT_CLEANUP_TEMP_FILES=false
python main.py

See docs/CLEANUP_GUIDE.md for details.
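
The cleanup step can be pictured as a pattern-based delete. The sketch below is an assumption-laden illustration: the glob patterns are inferred from the example filenames above, `cleanup_temp_files` is a hypothetical function, and the project's real logic may track generated files explicitly instead of matching names.

```python
from pathlib import Path
from typing import List

# Patterns inferred from the examples above; the project's real patterns may differ.
TEMP_PATTERNS = ("*_left.jpg", "*_right.jpg", "*_page[0-9]*.jpg")


def cleanup_temp_files(folder: str, enabled: bool = True) -> List[Path]:
    """Delete split and page images in `folder`; return the paths that were removed."""
    if not enabled:
        return []
    removed = []
    for pattern in TEMP_PATTERNS:
        # Materialize the matches first so deleting does not disturb the directory scan.
        for path in sorted(Path(folder).glob(pattern)):
            if path.is_file():
                path.unlink()
                removed.append(path)
    return removed
```

Name-based matching would also delete a user-supplied image whose name happens to match a pattern, which is one reason to keep originals outside input/images/ when in doubt.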

Excel Output Format

The Excel file contains two sheets:

Sheet 1: Summary

Overview of processing results

Sheet 2: Receipt Data

Streamlined data with 11 focused columns:

For ALL Documents:

  • Filename - Original image filename
  • Document Type - receipt, voucher, or unknown
  • Extraction Confidence - Success rate of extracting specific fields (0-100%)
  • Space Confidence - Confidence that document space contains expected type (0-100%)
  • Text Length - Number of characters extracted via OCR
  • Raw Text - Full OCR output for manual review

Receipt-Specific Columns (Left Column Documents):

  • Total (Eat In) - Eat-in total amount from receipt
  • Meal Voucher - Meal voucher amount used on receipt
  • Served By - Name of person who served/processed order

Voucher-Specific Columns (Right Column Documents):

  • Valid - Validity/expiration date of voucher
  • Amount - Monetary value of the meal voucher

Note:

  • Receipt rows will have empty voucher-specific columns (Valid, Amount)
  • Voucher rows will have empty receipt-specific columns (Total Eat In, Meal Voucher, Served By)
  • Fields showing "Not found" indicate OCR couldn't locate that specific data
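
The row layout described above can be sketched as a fixed column list with blanks for the columns that do not apply. The column order, the `to_row` and `write_sheet` names, and the use of an empty string for blank cells are assumptions; `Workbook`, `append`, and `save` are standard openpyxl calls.

```python
from typing import Dict, Iterable, List

COLUMNS = [
    "Filename", "Document Type", "Extraction Confidence", "Space Confidence",
    "Text Length", "Raw Text",
    "Total (Eat In)", "Meal Voucher", "Served By",  # receipt-specific
    "Valid", "Amount",                              # voucher-specific
]


def to_row(record: Dict[str, object]) -> List[object]:
    """Order a record's values by COLUMNS, leaving absent columns empty."""
    return [record.get(column, "") for column in COLUMNS]


def write_sheet(records: Iterable[Dict[str, object]], path: str) -> None:
    """Write a header row plus one row per record to a 'Receipt Data' sheet."""
    from openpyxl import Workbook  # imported here so to_row works without openpyxl

    workbook = Workbook()
    sheet = workbook.active
    sheet.title = "Receipt Data"
    sheet.append(COLUMNS)
    for record in records:
        sheet.append(to_row(record))
    workbook.save(path)
```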

Confidence Scores Explained

Extraction Confidence

Measures how successfully the targeted fields were extracted:

  • Receipt: Based on finding Total (Eat In), Meal Voucher, and Served By
  • Voucher: Based on finding Valid date and Amount
  • Higher percentage = more fields successfully extracted
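
One plausible reading of this score is the share of expected fields that were found. The sketch below assumes equal weighting of fields and treats "Not found" as missing; the actual formula in src/ocr_processor.py may differ, and `extraction_confidence` is a hypothetical name.

```python
from typing import Dict, Optional

EXPECTED_FIELDS = {
    "receipt": ("Total (Eat In)", "Meal Voucher", "Served By"),
    "voucher": ("Valid", "Amount"),
}


def extraction_confidence(document_type: str, fields: Dict[str, Optional[str]]) -> float:
    """Return the percentage of expected fields that hold a real value."""
    expected = EXPECTED_FIELDS.get(document_type, ())
    if not expected:
        return 0.0  # unknown document types have no expected fields
    found = sum(
        1 for name in expected
        if fields.get(name) not in (None, "", "Not found")
    )
    return round(100.0 * found / len(expected), 1)
```

Under this reading, a voucher with a Valid date but no Amount scores 50%, and a receipt missing only Served By scores about 66.7%.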

Space Confidence

Measures confidence that the document space contains the expected type:

  • Analyzes keywords, text patterns, and document structure
  • Helps identify potential classification errors
  • Higher percentage = stronger indicators of correct document type
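
The keyword part of this score could be approximated as a hit ratio over type-specific keywords. This sketch is deliberately simplified: the keyword lists are illustrative guesses based on the column names, text patterns and document structure are not modeled, and `space_confidence` is a hypothetical name.

```python
# Illustrative keywords only; the project's real keyword lists are not documented here.
KEYWORDS = {
    "receipt": ("total", "eat in", "served by"),
    "voucher": ("voucher", "valid", "amount"),
}


def space_confidence(expected_type: str, text: str) -> float:
    """Return the percentage of the expected type's keywords present in the OCR text."""
    keywords = KEYWORDS.get(expected_type, ())
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword in lowered)
    return round(100.0 * hits / len(keywords), 1)
```

Because receipts can also mention a meal voucher, keyword overlap between the two types is expected, which is why a score like this flags possible misclassification rather than deciding it.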

See docs/FIELD_GUIDE.md for detailed extraction patterns and troubleshooting.

Performance Metrics

Current parsing performance (based on testing with 201 receipts):

  • Text Extraction: 96% success rate (193/201 receipts)
  • Merchant Names: 98% success rate
  • Total Amounts: 83% success rate
  • Items: 80% success rate
  • Dates: 20% success rate (area for improvement)
  • Average Confidence: 80.7%

Development Tools

Quick Test & Analysis

# Test Tesseract installation
python tests/test_tesseract.py

# Analyze parsing results
python tests/analyze_results.py

# Development workflow helper
python dev.py [test|process|analyze|status|commit|push|all]

Git Workflow

# Useful aliases are pre-configured:
git st          # git status
git co          # git checkout  
git br          # git branch
git lg          # git log --oneline --graph --decorate

Troubleshooting

Issue: Tesseract not found

  • Ensure Tesseract is installed and added to your system PATH
  • On Windows, you may need to set the path in the code: pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
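
Rather than hard-coding the path, the lookup can try PATH first and fall back to the default Windows install location. This is a sketch, not the project's detection code: `find_tesseract` and `configure_pytesseract` are hypothetical names, while `shutil.which` and `pytesseract.pytesseract.tesseract_cmd` are the real standard-library and pytesseract hooks.

```python
import os
import shutil
from typing import Callable, Iterable, Optional

WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(
    candidates: Iterable[str] = (WINDOWS_DEFAULT,),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the Tesseract executable from PATH, else the first existing candidate."""
    on_path = which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract() -> None:
    """Point pytesseract at the detected executable or fail with a clear message."""
    import pytesseract  # imported here so find_tesseract works without pytesseract

    path = find_tesseract()
    if path is None:
        raise RuntimeError("Tesseract not found: install it or add it to PATH")
    pytesseract.pytesseract.tesseract_cmd = path
```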

Issue: Poor OCR accuracy

  • Ensure images are clear and high resolution
  • Images should be well-lit with minimal glare
  • Try preprocessing images (contrast adjustment, noise reduction)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - feel free to use this project for personal or commercial purposes.

Future Enhancements

  • Web interface for uploading receipts
  • Machine learning for better data extraction
  • Support for multiple languages
  • Database integration
  • Duplicate receipt detection
  • Category classification
  • Cloud storage integration

Author

Created for processing receipts efficiently and maintaining organized expense records.
