denizgolbas/ReceiptML

ReceiptML Core

A machine learning–driven engine that converts retail and supermarket receipt images into structured, deterministic transaction data.

Overview

ReceiptML Core is a hybrid ML system that:

  • Performs OCR on receipt images (Tesseract or EasyOCR)
  • Detects and decodes barcodes (EAN-13, EAN-8, CODE128, CODE39, QR codes)
  • Classifies receipt lines using supervised ML or rule-based fallback
  • Extracts structured entities (store, products, payments, etc.)
  • Generates deterministic, schema-consistent JSON output

Architecture

ReceiptML/
├── engine.py              # Main ReceiptML engine
├── ocr.py                 # OCR module
├── barcode.py             # Barcode detection
├── classifier.py          # Line classification ML model
├── extractor.py           # Entity extraction
├── schema.py              # Output schema definitions
├── models/                # Trained model weights
├── config/                # Configuration files
├── tests/                 # Unit tests
└── example_usage.py       # Example usage script

Installation

1. Install Python Dependencies

pip install -r requirements.txt

2. Install Tesseract OCR

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt-get install tesseract-ocr

Windows: Download from GitHub

3. (Optional) Install EasyOCR

If you want to use EasyOCR instead of Tesseract:

pip install easyocr

Quick Start

Basic Usage

from engine import ReceiptMLEngine

# Initialize the engine
engine = ReceiptMLEngine(version="1.0.0")

# Process a receipt image
result = engine.process_image("receipt.jpg")

# Get JSON output
print(result.to_json())

# Access individual fields
print(f"Store: {result.store.store_name}")
print(f"Total: {result.total}")
print(f"Items: {len(result.items)}")

With Custom Model

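from engine import ReceiptMLEngine

# Load a trained line classifier and keep Tesseract as the OCR backend
engine = ReceiptMLEngine(
    version="1.0.0",
    model_path="models/line_classifier.pkl",  # trained line classifier weights
    use_easyocr=False,  # set to True to use EasyOCR instead of Tesseract
)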

Training the Classifier

To train a custom line classifier model:

  1. Create training data:

python train_classifier.py --create-sample --data training_data.json

  2. Edit training_data.json with your labeled examples

  3. Train the model:

python train_classifier.py --data training_data.json --output models/line_classifier.pkl

Output Schema

The engine outputs a JSON object with the following structure:

{
  "store": {
    "store_name": "ABC MARKET",
    "branch_code": "001",
    "tax_number": "1234567890"
  },
  "timestamp": "2024-01-01T12:30:45",
  "transaction_date": "2024-01-01",
  "transaction_time": "12:30:45",
  "receipt_number": "123456",
  "receipt_hash": "a1b2c3d4e5f6g7h8",
  "cashier": {
    "cashier_id": "001"
  },
  "pos": {
    "terminal_id": "POS-001"
  },
  "items": [
    {
      "raw_text": "Süt 2x 5.50",
      "normalized_name": "Süt",
      "quantity": 2.0,
      "unit_price": 2.75,
      "total_price": 5.50,
      "associated_barcode": "8690123456789"
    }
  ],
  "discounts": [
    {
      "description": "İNDİRİM: -10.00",
      "amount": -10.0
    }
  ],
  "payments": [
    {
      "payment_method": "CASH",
      "amount": 100.50
    }
  ],
  "taxes": [
    {
      "description": "KDV %18: 15.30",
      "amount": 15.30,
      "rate": 18.0
    }
  ],
  "subtotal": 90.00,
  "total": 100.50
}
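
Downstream code can sanity-check the output against the schema above. The following is a minimal sketch, not part of ReceiptML itself: the helper name check_items and the 0.01 tolerance are assumptions made for illustration. It verifies that quantity × unit_price matches total_price for each entry in items.

```python
import json
from decimal import Decimal


def check_items(receipt_json: str) -> list:
    """Return human-readable problems found in the "items" array.

    Hypothetical helper: it only relies on the field names shown in the
    schema above (quantity, unit_price, total_price).
    """
    # Parse floats as Decimal so that 2 x 2.75 compares exactly with 5.50.
    receipt = json.loads(receipt_json, parse_float=Decimal)
    problems = []
    for index, item in enumerate(receipt.get("items", [])):
        expected = item["quantity"] * item["unit_price"]
        # Allow one cent of rounding slack (assumed tolerance).
        if abs(expected - item["total_price"]) > Decimal("0.01"):
            problems.append(
                f"item {index}: {item['quantity']} x {item['unit_price']} "
                f"!= {item['total_price']}"
            )
    return problems


sample = (
    '{"items": [{"raw_text": "Süt 2x 5.50", "normalized_name": "Süt", '
    '"quantity": 2.0, "unit_price": 2.75, "total_price": 5.50}]}'
)
print(check_items(sample))
```

An empty list means every line item is internally consistent.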

Line Classification Types

The system classifies each receipt line into one of these types:

  • PRODUCT - Product line items
  • DISCOUNT - Discounts and promotions
  • COUPON - Coupon information
  • TAX - Tax information
  • SUBTOTAL - Subtotal amounts
  • TOTAL - Total amount
  • PAYMENT - Payment methods and amounts
  • HEADER - Store header information
  • FOOTER - Footer messages
  • INFORMATIONAL - Dates, times, receipt numbers
  • BARCODE_ONLY - Barcode-only lines
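
To illustrate how a rule-based fallback over these types might look, here is a rough sketch. The LineType enum mirrors the list above, but the regular expressions, their ordering, and the classify_line function are assumptions for illustration and are not taken from classifier.py. HEADER and FOOTER depend on line position, which this sketch does not model, so unmatched lines default to INFORMATIONAL.

```python
import re
from enum import Enum


class LineType(Enum):
    PRODUCT = "PRODUCT"
    DISCOUNT = "DISCOUNT"
    COUPON = "COUPON"
    TAX = "TAX"
    SUBTOTAL = "SUBTOTAL"
    TOTAL = "TOTAL"
    PAYMENT = "PAYMENT"
    HEADER = "HEADER"
    FOOTER = "FOOTER"
    INFORMATIONAL = "INFORMATIONAL"
    BARCODE_ONLY = "BARCODE_ONLY"


# Ordered rules: the first matching pattern wins. Keywords are illustrative
# English/Turkish examples, not an exhaustive vocabulary.
_RULES = [
    (LineType.BARCODE_ONLY, re.compile(r"^(?:\d{8}|\d{13})$")),
    (LineType.SUBTOTAL, re.compile(r"\b(?:SUBTOTAL|ARA TOPLAM)\b", re.I)),
    (LineType.TOTAL, re.compile(r"\b(?:TOTAL|TOPLAM)\b", re.I)),
    (LineType.TAX, re.compile(r"\b(?:KDV|VAT|TAX)\b", re.I)),
    (LineType.DISCOUNT, re.compile(r"(?:İNDİRİM|DISCOUNT)", re.I)),
    (LineType.COUPON, re.compile(r"\b(?:COUPON|KUPON)\b", re.I)),
    (LineType.PAYMENT, re.compile(r"\b(?:CASH|CARD|KART)\b", re.I)),
    # Any remaining line that ends in a price is treated as a product.
    (LineType.PRODUCT, re.compile(r"\d+[.,]\d{2}$")),
]


def classify_line(text: str) -> LineType:
    """Classify one OCR line with keyword rules (hypothetical fallback)."""
    stripped = text.strip()
    for line_type, pattern in _RULES:
        if pattern.search(stripped):
            return line_type
    return LineType.INFORMATIONAL


for line in ["Süt 2x 5.50", "KDV %18: 15.30", "İNDİRİM: -10.00", "8690123456789"]:
    print(line, "->", classify_line(line).value)
```

Rule order matters here: keyword rules run before the generic price rule, so a tax or total line that ends in an amount is not mistaken for a product.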

Determinism

Each version of ReceiptML Core produces deterministic output:

  • Same input + same version = same output
  • Model weights, preprocessing, and parsing are versioned
  • Receipt hashes are deterministic and reproducible
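
One common way to obtain a reproducible hash is to serialize the extracted fields in a canonical form and digest the result. The sketch below assumes SHA-256 over sorted-key JSON, truncated to 16 hex characters; the actual algorithm used by ReceiptML is not documented here, and the function name receipt_hash is hypothetical.

```python
import hashlib
import json


def receipt_hash(fields: dict, engine_version: str) -> str:
    """Return a 16-character hex digest that depends only on content and version.

    Assumed scheme for illustration: canonical JSON (sorted keys, fixed
    separators, UTF-8) hashed with SHA-256.
    """
    canonical = json.dumps(
        {"version": engine_version, "fields": fields},
        sort_keys=True,          # key order in the input dict must not matter
        separators=(",", ":"),   # no whitespace differences between runs
        ensure_ascii=False,      # hash the real characters, e.g. "Süt"
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


a = receipt_hash({"store_name": "ABC MARKET", "total": 100.5}, "1.0.0")
b = receipt_hash({"total": 100.5, "store_name": "ABC MARKET"}, "1.0.0")
print(a == b)  # same content in a different key order gives the same hash
```

Including the engine version in the hashed payload ties each hash to the version that produced it, which matches the "same input + same version" guarantee.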

Supported Formats

  • Image formats: JPEG, PNG
  • Document formats: PDF (first page)
  • Barcode types: EAN-13, EAN-8, QR codes, CODE128, CODE39

Features

OCR

  • Text extraction with bounding boxes
  • Spatial relationship preservation
  • Support for multiple languages (English, Turkish)
  • Preprocessing for better accuracy

Barcode Detection

  • Automatic detection and decoding
  • Spatial association with products
  • Multiple barcode format support
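
Decoded EAN-13 values can be checked with the standard EAN-13 check-digit rule, which helps reject misreads before a barcode is associated with a product. This is a standalone sketch; the function name ean13_is_valid is an assumption and does not refer to anything in barcode.py.

```python
def ean13_is_valid(code: str) -> bool:
    """Validate an EAN-13 string using the standard check-digit rule."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    # The first 12 digits are weighted 1, 3, 1, 3, ... from the left.
    weighted = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    # The check digit brings the weighted sum up to a multiple of 10.
    return (10 - weighted % 10) % 10 == digits[12]


print(ean13_is_valid("8690123456789"))  # barcode from the sample output above
```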

Entity Extraction

  • Store information extraction
  • Temporal data parsing (date, time)
  • Receipt identification
  • Cashier and POS metadata
  • Product, discount, payment, and tax extraction
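
As an illustration of product extraction, the sketch below parses a line shaped like the sample "Süt 2x 5.50" into the item fields used in the output schema. The regular expression and the parse_item helper are assumptions for this one layout (name, quantity followed by "x", line total); real receipts vary, and extractor.py may work differently.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed layout: "<name> <quantity>x <line total>", e.g. "Süt 2x 5.50".
_ITEM = re.compile(
    r"^(?P<name>.+?)\s+(?P<qty>\d+(?:[.,]\d+)?)\s*[xX]\s+(?P<total>\d+[.,]\d{2})$"
)


def parse_item(raw_text: str) -> Optional[dict]:
    """Parse one product line into schema-style fields, or return None."""
    match = _ITEM.match(raw_text.strip())
    if match is None:
        return None
    # Accept both "." and "," as the decimal separator.
    quantity = Decimal(match["qty"].replace(",", "."))
    total = Decimal(match["total"].replace(",", "."))
    if quantity == 0:
        return None  # avoid dividing by zero on a misread quantity
    return {
        "raw_text": raw_text,
        "normalized_name": match["name"].strip(),
        "quantity": float(quantity),
        "unit_price": float(total / quantity),
        "total_price": float(total),
    }


print(parse_item("Süt 2x 5.50"))
```

Lines that do not fit the assumed layout, such as tax or total lines, return None and can be handled by other extractors.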

Machine Learning

  • Supervised line classification
  • Rule-based fallback
  • Extensible model architecture

Development

Running Tests

pytest tests/

Code Style

black receiptml/
flake8 receiptml/

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
