A Python application that processes receipt images and meal vouchers from PDFs or image files, converts them to PDF, extracts data using OCR, and exports to Excel. The system automatically splits scanned images to separately process receipts (left column) and meal vouchers (right column).
- PDF Support: Upload full PDFs and automatically extract each page as an image
- Image Support: Process individual image files (JPG, PNG, etc.)
- Dual Document Processing: Automatically splits images into receipts and vouchers
- Smart Classification: Identifies document type (receipt vs voucher) automatically
- PDF Generation: Combines all images into a single organized PDF document
- OCR Extraction: Extracts text from both receipts and vouchers using Tesseract OCR
- Receipt Parsing: Extracts merchant, date, items, prices, and totals
- Voucher Parsing: Extracts voucher numbers, amounts, employee names, and company info
- Excel Export: Comprehensive spreadsheet with all extracted data and confidence scores
- Auto Cleanup: Temporary files (PDF pages, split images) automatically removed after processing
- Quality Metrics: Confidence scores and performance analysis
- Easy Setup: Automatic Tesseract OCR detection and configuration
- Input: Place PDF files in input/pdfs/ OR image files in input/images/
- PDFs are automatically split into separate page images
- Each image is then split into left (receipt) and right (voucher) columns
- Process: OCR extracts appropriate data based on document type
- Output:
- Combined PDF with all split images
- Excel file with comprehensive data for receipts and vouchers
- Cleanup: Temporary files (PDF pages, split images) automatically deleted (configurable)
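The left/right split in the steps above comes down to computing two crop rectangles. The following is a minimal sketch, assuming the split falls at the exact horizontal midpoint of the image; the function name `split_boxes` is hypothetical, and the project's own logic in src/image_processor.py may differ. The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts.

```python
from typing import Tuple

# (left, upper, right, lower), the box order accepted by Pillow's Image.crop
Box = Tuple[int, int, int, int]


def split_boxes(width: int, height: int) -> Tuple[Box, Box]:
    """Return crop boxes for the left (receipt) and right (voucher) halves.

    Assumption: the two documents meet at the horizontal midpoint.
    """
    if width < 2 or height < 1:
        raise ValueError("image is too small to split")
    mid = width // 2
    receipt_box = (0, 0, mid, height)
    voucher_box = (mid, 0, width, height)
    return receipt_box, voucher_box


# Example with the pixel size of a roughly A4 page scanned at 300 DPI.
receipt_box, voucher_box = split_boxes(2480, 3508)
print(receipt_box)  # (0, 0, 1240, 3508)
print(voucher_box)  # (1240, 0, 2480, 3508)
```

With Pillow installed, `Image.open(path).crop(receipt_box)` would then yield the receipt half of a page image.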
ReceiptProcessor/
├── input/                              # Input folder for source files
│   ├── pdfs/                           # Place PDF files here (auto-converted to images)
│   └── images/                         # Place image files here OR auto-generated from PDFs
├── outputs/                            # Output folder for generated PDFs and Excel files
│   ├── pdfs/                           # Generated PDF files
│   └── excel/                          # Generated Excel files
├── src/                                # Source code
│   ├── __init__.py
│   ├── pdf_splitter.py                 # PDF to image conversion
│   ├── image_processor.py              # Image processing and metadata extraction
│   ├── pdf_generator.py                # PDF generation functionality
│   ├── ocr_processor.py                # OCR text extraction and parsing
│   └── excel_exporter.py               # Excel export functionality
├── tests/                              # Test scripts
│   ├── test_metadata.py                # Metadata extraction tests
│   ├── test_tesseract.py               # OCR installation test utility
│   ├── test_ocr_quality.py             # OCR quality tests
│   ├── test_extraction.py              # Data extraction tests
│   ├── analyze_results.py              # Results analysis and quality metrics
│   └── demo_pdf_split.py               # PDF splitting demo
├── docs/                               # Documentation
│   ├── CONFIGURATION.md                # Configuration guide
│   ├── CLEANUP_GUIDE.md                # Cleanup behavior and settings
│   ├── PERFORMANCE_GUIDE.md            # Performance optimization tips
│   ├── METADATA_TRACKING.md            # Metadata implementation guide
│   ├── PDF_INPUT_SUMMARY.md            # PDF support documentation
│   ├── QUICK_START_PDF.md              # PDF quick start guide
│   ├── IMPLEMENTATION_VERIFICATION.md  # Verification checklist
│   ├── CHANGELOG.md                    # Change history
│   ├── FIELD_GUIDE.md                  # Field extraction guide
│   ├── EXTRACTION_STATUS.md            # Extraction status
│   ├── IMPLEMENTATION_COMPLETE.md      # Implementation details
│   └── UPDATE_SUMMARY.md               # Update summary
├── .github/                            # GitHub configuration
│   ├── workflows/                      # CI/CD workflows
│   └── copilot-instructions.md
├── main.py                             # Main entry point
├── dev.py                              # Development workflow helper
├── requirements.txt                    # Python dependencies
├── .gitignore                          # Git ignore file
└── README.md                           # This file
- Python 3.8 or higher
- Tesseract OCR engine
- Poppler (for PDF to image conversion)
Windows:
- Download the installer from: https://github.com/UB-Mannheim/tesseract/wiki
- Install and add to PATH (default: C:\Program Files\Tesseract-OCR)
macOS:
brew install tesseract
Linux:
sudo apt-get install tesseract-ocr
Windows:
- Download from: https://github.com/oschwartz10612/poppler-windows/releases
- Extract and add the bin folder to PATH
macOS:
brew install poppler
Linux:
sudo apt-get install poppler-utils
- Clone the repository:
git clone https://github.com/yourusername/ReceiptProcessor.git
cd ReceiptProcessor
- Create a virtual environment:
python -m venv venv
- Activate the virtual environment:
Windows:
.\venv\Scripts\Activate.ps1
macOS/Linux:
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
- Place your PDF files in the input/pdfs/ folder
  - Each page will be automatically extracted as a separate image
  - Each image is then split into left (receipt) and right (voucher) columns
- Run the processor:
python main.py
- Place your image files in the input/images/ folder
  - Images should have the receipt on the left side and the meal voucher on the right side
  - Supported formats: JPG, PNG, JPEG, BMP, TIFF
- Run the processor:
python main.py
The system will:
- Convert any PDFs to individual page images (stored in input/images/)
- Split each image into left (receipt) and right (voucher) columns
- Extract appropriate data based on document type
- Generate a combined PDF and Excel file
- Automatically clean up temporary split files
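Before any of these steps can run, the processor has to decide which files in input/images/ count as images. A minimal sketch of that discovery step, assuming a case-insensitive match on the file extension; the helper name `find_input_images` is hypothetical and not taken from the project's code. Only the extensions named under supported formats are included.

```python
from pathlib import Path
from typing import List

# Extensions from the supported formats list (JPG, PNG, JPEG, BMP, TIFF).
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff"}


def find_input_images(folder: Path) -> List[Path]:
    """Return the supported image files in `folder`, sorted by path."""
    if not folder.is_dir():
        return []
    return sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    for image_path in find_input_images(Path("input/images")):
        print(image_path.name)
```

Returning an empty list for a missing folder keeps a first run from failing before any input has been added.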
Find your outputs in the outputs/ folder:
- outputs/pdfs/receipts_TIMESTAMP.pdf - Combined PDF of all split images
- outputs/excel/receipts_TIMESTAMP.xlsx - Excel file with extracted data
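Both output files carry the same TIMESTAMP in their names, which makes it easy to pair a PDF with its spreadsheet. The sketch below shows one way such paths could be built. The timestamp format (YYYYMMDD_HHMMSS) and the function name `output_paths` are assumptions for illustration; the format the project actually writes is not stated here.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Tuple


def output_paths(base: Path = Path("outputs"),
                 now: Optional[datetime] = None) -> Tuple[Path, Path]:
    """Build PDF and Excel output paths that share one timestamp.

    Assumption: TIMESTAMP is formatted as YYYYMMDD_HHMMSS.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    pdf_path = base / "pdfs" / f"receipts_{stamp}.pdf"
    excel_path = base / "excel" / f"receipts_{stamp}.xlsx"
    return pdf_path, excel_path


pdf_path, excel_path = output_paths(now=datetime(2024, 4, 8, 9, 30, 0))
print(pdf_path.as_posix())    # outputs/pdfs/receipts_20240408_093000.pdf
print(excel_path.as_posix())  # outputs/excel/receipts_20240408_093000.xlsx
```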
- Pillow - Image processing and splitting
- img2pdf - Image to PDF conversion
- pytesseract - OCR text extraction
- openpyxl - Excel file generation
- python-dateutil - Date parsing
- pdf2image - PDF to image conversion
The application can be configured using environment variables or a .env file. Copy .env.example to .env and customize:
# Input source control (default: pdfs)
RECEIPT_INPUT_SOURCE=pdfs # Options: pdfs, images, both
# PDF processing (default: 300)
RECEIPT_PDF_DPI=300 # Resolution: 150 (fast), 300 (balanced), 600 (high quality)
# Cleanup settings (default: true)
RECEIPT_CLEANUP_TEMP_FILES=true # Auto-remove temporary files after processing
# OCR settings (default: eng)
RECEIPT_OCR_LANGUAGE=eng # Language: eng, fra, deu, spa, etc.
# Output quality (default: 95)
RECEIPT_PDF_QUALITY=95 # JPEG quality: 1-100
- Input Source: Control whether to process PDFs, images, or both
- PDF DPI: Higher = better quality but slower processing
- Cleanup: Enable/disable automatic removal of temporary files
- OCR Language: Change for non-English receipts
For detailed configuration options, see docs/CONFIGURATION.md
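To make the defaults concrete, here is a minimal sketch of reading these variables with the standard library. The variable names and default values come from the settings listed above; the `Settings` class, the `load_settings` function, and the set of strings treated as "true" are assumptions for illustration. Loading a .env file needs a separate loader and is not covered here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    input_source: str
    pdf_dpi: int
    cleanup_temp_files: bool
    ocr_language: str
    pdf_quality: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read RECEIPT_* variables from `env`, falling back to the documented defaults."""
    source = env.get("RECEIPT_INPUT_SOURCE", "pdfs").strip().lower()
    if source not in {"pdfs", "images", "both"}:
        raise ValueError(f"RECEIPT_INPUT_SOURCE must be pdfs, images, or both, got {source!r}")

    quality = int(env.get("RECEIPT_PDF_QUALITY", "95"))
    if not 1 <= quality <= 100:
        raise ValueError(f"RECEIPT_PDF_QUALITY must be within 1-100, got {quality}")

    # Assumption: "true", "1", and "yes" enable cleanup; anything else disables it.
    cleanup = env.get("RECEIPT_CLEANUP_TEMP_FILES", "true").strip().lower()

    return Settings(
        input_source=source,
        pdf_dpi=int(env.get("RECEIPT_PDF_DPI", "300")),
        cleanup_temp_files=cleanup in {"true", "1", "yes"},
        ocr_language=env.get("RECEIPT_OCR_LANGUAGE", "eng").strip(),
        pdf_quality=quality,
    )


print(load_settings({"RECEIPT_PDF_DPI": "150", "RECEIPT_INPUT_SOURCE": "both"}))
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test without touching the real environment.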
By default, temporary files are automatically cleaned up after processing:
- PDF-extracted page images (e.g., Apr 2-8_page1.jpg)
- Split images (e.g., *_left.jpg, *_right.jpg)
To keep temporary files for debugging:
export RECEIPT_CLEANUP_TEMP_FILES=false
python main.py
See docs/CLEANUP_GUIDE.md for details.
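The cleanup behavior can be pictured as a pattern-based delete that is skipped when RECEIPT_CLEANUP_TEMP_FILES is false. This is a rough sketch and not the project's implementation: the function name `cleanup_temp_files` is hypothetical, and the `*_page*.jpg` pattern is an assumed generalization of the Apr 2-8_page1.jpg example. A pattern that broad would also match user-supplied images with `_page` in their names, so real code would more likely track the files it created.

```python
import os
from pathlib import Path
from typing import List

# Assumed glob patterns for temporary split and PDF-page images.
TEMP_PATTERNS = ("*_left.jpg", "*_right.jpg", "*_page*.jpg")


def cleanup_temp_files(folder: Path) -> List[Path]:
    """Delete temporary images in `folder`; return the paths that were removed.

    Nothing is deleted when RECEIPT_CLEANUP_TEMP_FILES is set to "false".
    """
    flag = os.environ.get("RECEIPT_CLEANUP_TEMP_FILES", "true").strip().lower()
    if flag == "false":
        return []

    # Collect matches first so a file matching two patterns is handled once.
    targets = {
        path
        for pattern in TEMP_PATTERNS
        for path in folder.glob(pattern)
        if path.is_file()
    }
    for path in targets:
        path.unlink()
    return sorted(targets)
```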
The Excel file contains two sheets: an overview of processing results, and a data sheet with 11 focused columns:
- Filename - Original image filename
- Document Type - receipt, voucher, or unknown
- Extraction Confidence - Success rate of extracting specific fields (0-100%)
- Space Confidence - Confidence that document space contains expected type (0-100%)
- Text Length - Number of characters extracted via OCR
- Raw Text - Full OCR output for manual review
- Total (Eat In) - Eat-in total amount from receipt
- Meal Voucher - Meal voucher amount used on receipt
- Served By - Name of person who served/processed order
- Valid - Validity/expiration date of voucher
- Amount - Monetary value of the meal voucher
Note:
- Receipt rows will have empty voucher-specific columns (Valid, Amount)
- Voucher rows will have empty receipt-specific columns (Total (Eat In), Meal Voucher, Served By)
- Fields showing "Not found" indicate OCR couldn't locate that specific data
Extraction Confidence measures how successfully the targeted fields were extracted:
- Receipt: Based on finding Total (Eat In), Meal Voucher, and Served By
- Voucher: Based on finding Valid date and Amount
- Higher percentage = more fields successfully extracted
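One plausible reading of Extraction Confidence is the share of target fields that came back with a real value. The sketch below implements that reading. The field names are taken from the column list, but the equal weighting of fields and the function name `extraction_confidence` are assumptions; the project may weight fields differently.

```python
from typing import Mapping

# Target fields per document type, as listed above.
TARGET_FIELDS = {
    "receipt": ("Total (Eat In)", "Meal Voucher", "Served By"),
    "voucher": ("Valid", "Amount"),
}


def extraction_confidence(doc_type: str, fields: Mapping[str, str]) -> float:
    """Percentage of target fields holding a value other than "Not found".

    Assumption: every target field counts equally.
    """
    targets = TARGET_FIELDS.get(doc_type, ())
    if not targets:
        return 0.0  # unknown document types have no target fields
    found = sum(
        1 for name in targets
        if fields.get(name, "").strip() not in ("", "Not found")
    )
    return round(100.0 * found / len(targets), 1)


row = {"Total (Eat In)": "12.50", "Meal Voucher": "Not found", "Served By": "Alex"}
print(extraction_confidence("receipt", row))  # 66.7
```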
Space Confidence measures confidence that the document space contains the expected type:
- Analyzes keywords, text patterns, and document structure
- Helps identify potential classification errors
- Higher percentage = stronger indicators of correct document type
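As an illustration of the keyword part of this score only, the sketch below counts how many expected keywords appear in the OCR text. The keyword lists and the function name `space_confidence` are hypothetical, chosen to echo the column names; the text-pattern and document-structure analysis mentioned above is not modeled.

```python
# Hypothetical keyword lists, for illustration only.
KEYWORDS = {
    "receipt": ("total", "eat in", "served by"),
    "voucher": ("voucher", "valid", "amount"),
}


def space_confidence(expected_type: str, text: str) -> float:
    """Percentage of the expected type's keywords that occur in the OCR text."""
    keywords = KEYWORDS.get(expected_type, ())
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for word in keywords if word in lowered)
    return round(100.0 * hits / len(keywords), 1)


print(space_confidence("voucher", "MEAL VOUCHER  Valid until 12/2025"))  # 66.7
```

A low score on the right half of a page would suggest that the voucher is missing or that the two documents were swapped.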
See docs/FIELD_GUIDE.md for detailed extraction patterns and troubleshooting.
Current parsing performance (based on testing with 201 receipts):
- Text Extraction: 96% success rate (193/201 receipts)
- Merchant Names: 98% success rate
- Total Amounts: 83% success rate
- Items: 80% success rate
- Dates: 20% success rate (area for improvement)
- Average Confidence: 80.7%
# Test Tesseract installation
python tests/test_tesseract.py
# Analyze parsing results
python tests/analyze_results.py
# Development workflow helper
python dev.py [test|process|analyze|status|commit|push|all]
# Useful aliases are pre-configured:
git st # git status
git co # git checkout
git br # git branch
git lg # git log --oneline --graph --decorate
Issue: Tesseract not found
- Ensure Tesseract is installed and added to your system PATH
- On Windows, you may need to set the path in the code:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
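Rather than hard-coding the path, the executable can be looked up at startup. The following is a minimal sketch using only the standard library; the function name `find_tesseract` is hypothetical, and the fallback location is the default Windows install path mentioned in the installation steps. It is not necessarily how the project's automatic detection works.

```python
import shutil
from pathlib import Path
from typing import Optional

# Default Windows install location from the installation steps.
WINDOWS_DEFAULT = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")


def find_tesseract() -> Optional[str]:
    """Return a path to the tesseract executable, or None if it cannot be located."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    if WINDOWS_DEFAULT.is_file():
        return str(WINDOWS_DEFAULT)
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it or add it to PATH")
else:
    print(f"Using Tesseract at {tesseract_path}")
    # With pytesseract installed, the result can then be applied with:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```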
Issue: Poor OCR accuracy
- Ensure images are clear and high resolution
- Images should be well-lit with minimal glare
- Try preprocessing images (contrast adjustment, noise reduction)
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - feel free to use this project for personal or commercial purposes.
- Web interface for uploading receipts
- Machine learning for better data extraction
- Support for multiple languages
- Database integration
- Duplicate receipt detection
- Category classification
- Cloud storage integration
Created for processing receipts efficiently and maintaining organized expense records.