A machine learning–driven engine that converts retail and supermarket receipt images into structured, deterministic transaction data.
ReceiptML Core is a hybrid ML system that:
- Performs OCR on receipt images (Tesseract or EasyOCR)
- Detects and decodes barcodes (EAN-13, EAN-8, QR codes, CODE128, CODE39)
- Classifies receipt lines using supervised ML or rule-based fallback
- Extracts structured entities (store, products, payments, etc.)
- Generates deterministic, schema-consistent JSON output
ReceiptML/
├── engine.py # Main ReceiptML engine
├── ocr.py # OCR module
├── barcode.py # Barcode detection
├── classifier.py # Line classification ML model
├── extractor.py # Entity extraction
├── schema.py # Output schema definitions
├── models/ # Trained model weights
├── config/ # Configuration files
├── tests/ # Unit tests
└── example_usage.py # Example usage script
Install the Python dependencies:

    pip install -r requirements.txt

Install Tesseract for your platform:

macOS:

    brew install tesseract

Ubuntu/Debian:

    sudo apt-get install tesseract-ocr

Windows: Download the installer from GitHub.
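After installation, it can help to confirm that the `tesseract` binary is discoverable before running the engine. The snippet below is a minimal sketch using only the standard library; `find_tesseract` is a hypothetical helper and is not part of ReceiptML.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it or use EasyOCR instead")
    else:
        print(f"tesseract found at {path}")
```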
If you want to use EasyOCR instead of Tesseract:

    pip install easyocr

Basic usage:

    from engine import ReceiptMLEngine

    # Initialize the engine
    engine = ReceiptMLEngine(version="1.0.0")

    # Process a receipt image
    result = engine.process_image("receipt.jpg")

    # Get JSON output
    print(result.to_json())

    # Access individual fields
    print(f"Store: {result.store.store_name}")
    print(f"Total: {result.total}")
    print(f"Items: {len(result.items)}")

Custom configuration:

    engine = ReceiptMLEngine(
        version="1.0.0",
        model_path="models/line_classifier.pkl",
        use_easyocr=False
    )

To train a custom line classifier model:
- Create sample training data:

      python train_classifier.py --create-sample --data training_data.json

- Edit training_data.json with your labeled examples.
- Train the model:

      python train_classifier.py --data training_data.json --output models/line_classifier.pkl

The engine outputs a JSON object with the following structure:
    {
      "store": {
        "store_name": "ABC MARKET",
        "branch_code": "001",
        "tax_number": "1234567890"
      },
      "timestamp": "2024-01-01T12:30:45",
      "transaction_date": "2024-01-01",
      "transaction_time": "12:30:45",
      "receipt_number": "123456",
      "receipt_hash": "a1b2c3d4e5f6g7h8",
      "cashier": {
        "cashier_id": "001"
      },
      "pos": {
        "terminal_id": "POS-001"
      },
      "items": [
        {
          "raw_text": "Süt 2x 5.50",
          "normalized_name": "Süt",
          "quantity": 2.0,
          "unit_price": 2.75,
          "total_price": 5.50,
          "associated_barcode": "8690123456789"
        }
      ],
      "discounts": [
        {
          "description": "İNDİRİM: -10.00",
          "amount": -10.0
        }
      ],
      "payments": [
        {
          "payment_method": "CASH",
          "amount": 100.50
        }
      ],
      "taxes": [
        {
          "description": "KDV %18: 15.30",
          "amount": 15.30,
          "rate": 18.0
        }
      ],
      "subtotal": 90.00,
      "total": 100.50
    }

The system classifies each receipt line into one of these types:
- PRODUCT: Product line items
- DISCOUNT: Discounts and promotions
- COUPON: Coupon information
- TAX: Tax information
- SUBTOTAL: Subtotal amounts
- TOTAL: Total amount
- PAYMENT: Payment methods and amounts
- HEADER: Store header information
- FOOTER: Footer messages
- INFORMATIONAL: Dates, times, receipt numbers
- BARCODE_ONLY: Barcode-only lines
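
To illustrate how a rule-based fallback over these line types might look, here is a minimal sketch. The keyword lists, the rule order, and the `classify_line` function are assumptions made for illustration; they are not the rules shipped in `classifier.py`. Types that depend on position in the receipt (HEADER, FOOTER, COUPON) are left out, and unmatched lines default to INFORMATIONAL.

```python
import re

# Hypothetical rules, checked in order; the first matching rule wins.
RULES = [
    ("BARCODE_ONLY", re.compile(r"^\d{8}$|^\d{13}$")),
    ("SUBTOTAL", re.compile(r"\b(SUBTOTAL|ARA TOPLAM)\b", re.IGNORECASE)),
    ("TOTAL", re.compile(r"\b(TOTAL|TOPLAM)\b", re.IGNORECASE)),
    ("TAX", re.compile(r"\b(TAX|VAT|KDV)\b", re.IGNORECASE)),
    ("DISCOUNT", re.compile(r"\b(DISCOUNT|INDIRIM|İNDİRİM)\b", re.IGNORECASE)),
    ("PAYMENT", re.compile(r"\b(CASH|CARD|NAKIT|NAKİT)\b", re.IGNORECASE)),
    # A line ending in a price with two decimals is treated as a product.
    ("PRODUCT", re.compile(r"\d+[.,]\d{2}\s*$")),
]


def classify_line(line: str) -> str:
    """Return the first matching line type, or INFORMATIONAL if no rule matches."""
    text = line.strip()
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "INFORMATIONAL"


if __name__ == "__main__":
    for sample in ["Süt 2x 5.50", "KDV %18: 15.30", "TOTAL 100.50", "8690123456789"]:
        print(f"{sample!r} -> {classify_line(sample)}")
```

Checking SUBTOTAL before TOTAL matters for phrases such as "ARA TOPLAM", which would otherwise match the TOTAL keyword.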
Each version of ReceiptML Core produces deterministic output:
- Same input + same version = same output
- Model weights, preprocessing, and parsing are versioned
- Receipt hashes are deterministic and reproducible
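
One common way to obtain a reproducible hash is to serialize the data canonically (sorted keys, fixed separators) and hash the resulting bytes. The sketch below assumes SHA-256, a 16-character truncation, and that the engine version is mixed into the hash; none of this is confirmed by the project, and `receipt_hash` here is a hypothetical function.

```python
import hashlib
import json


def receipt_hash(receipt: dict, engine_version: str, length: int = 16) -> str:
    """Hash a receipt dict so that key order does not affect the result."""
    canonical = json.dumps(
        {"version": engine_version, "receipt": receipt},
        sort_keys=True,          # stable key order
        separators=(",", ":"),   # no whitespace variation
        ensure_ascii=False,      # keep characters such as "ü" as-is
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


if __name__ == "__main__":
    a = {"total": 100.5, "store": {"store_name": "ABC MARKET"}}
    b = {"store": {"store_name": "ABC MARKET"}, "total": 100.5}
    # Same content in a different key order yields the same hash.
    print(receipt_hash(a, "1.0.0") == receipt_hash(b, "1.0.0"))
```

Including the version in the hashed payload means a hash is only comparable between runs of the same engine version, which matches the "same input + same version" guarantee.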

Supported inputs:
- Image formats: JPEG, PNG
- Document formats: PDF (first page)
- Barcode types: EAN-13, EAN-8, QR codes, CODE128, CODE39

OCR:
- Text extraction with bounding boxes
- Spatial relationship preservation
- Support for multiple languages (English, Turkish)
- Preprocessing for better accuracy
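
To show why bounding boxes matter, the sketch below groups OCR words into receipt lines by vertical position and orders each line left to right. The `Word` type, the `group_into_lines` function, and the tolerance value are assumptions for illustration; the fields mirror the left/top/width/height values that OCR engines typically return, not ReceiptML's internal representation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels


def _center_y(word: Word) -> float:
    return word.y + word.h / 2


def group_into_lines(words: List[Word], tolerance: float = 0.5) -> List[str]:
    """Group words with similar vertical centers, then sort each group by x."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=_center_y):
        if lines:
            last = lines[-1]
            last_center = sum(_center_y(wd) for wd in last) / len(last)
            # Same line if the centers differ by at most a fraction of the word height.
            if abs(_center_y(word) - last_center) <= tolerance * word.h:
                last.append(word)
                continue
        lines.append([word])
    return [" ".join(wd.text for wd in sorted(line, key=lambda wd: wd.x)) for line in lines]


if __name__ == "__main__":
    words = [
        Word("5.50", 200, 12, 40, 20),
        Word("Süt", 10, 10, 30, 20),
        Word("2x", 60, 11, 20, 20),
        Word("TOTAL", 10, 50, 60, 20),
        Word("100.50", 200, 51, 60, 20),
    ]
    print(group_into_lines(words))
```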

Barcode detection:
- Automatic detection and decoding
- Spatial association with products
- Multiple barcode format support
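
Decoded EAN-13 values can be sanity-checked with the standard check digit rule: the first twelve digits are weighted 1, 3, 1, 3, ... and the thirteenth digit brings the weighted sum up to a multiple of ten. The sketch below implements that rule; `is_valid_ean13` is a hypothetical helper and not a function from `barcode.py`.

```python
def is_valid_ean13(code: str) -> bool:
    """Return True if code is 13 ASCII digits with a correct EAN-13 check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(ch) for ch in code]
    # Weights alternate 1, 3 starting from the leftmost digit.
    weighted = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - weighted % 10) % 10 == digits[12]


if __name__ == "__main__":
    print(is_valid_ean13("8690123456789"))
    print(is_valid_ean13("8690123456780"))
```

The sample barcode in the JSON output above, 8690123456789, passes this check: its weighted sum is 111, so the check digit is 9.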

Entity extraction:
- Store information extraction
- Temporal data parsing (date, time)
- Receipt identification
- Cashier and POS metadata
- Product, discount, payment, and tax extraction
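
As an illustration of product extraction, the sketch below parses a line shaped like the "Süt 2x 5.50" sample into the item fields used in the JSON output. The regular expression, the `parse_product_line` function, and the assumption that the trailing amount is the line total (as in the sample, where 2 x 2.75 = 5.50) are illustrative and do not describe `extractor.py`.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical pattern: "<name> <qty>x <line total>", e.g. "Süt 2x 5.50".
ITEM_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+(?P<qty>\d+(?:[.,]\d+)?)\s*[xX]\s+(?P<total>\d+[.,]\d{2})$"
)


def parse_product_line(raw_text: str) -> Optional[dict]:
    """Parse a product line into item fields, or return None if it does not match."""
    match = ITEM_PATTERN.match(raw_text.strip())
    if match is None:
        return None
    quantity = Decimal(match["qty"].replace(",", "."))
    total = Decimal(match["total"].replace(",", "."))
    if quantity == 0:
        return None
    # Decimal arithmetic avoids binary floating-point rounding in the division.
    unit_price = (total / quantity).quantize(Decimal("0.01"))
    return {
        "raw_text": raw_text,
        "normalized_name": match["name"].strip(),
        "quantity": float(quantity),
        "unit_price": float(unit_price),
        "total_price": float(total),
    }


if __name__ == "__main__":
    print(parse_product_line("Süt 2x 5.50"))
```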

Line classification:
- Supervised line classification
- Rule-based fallback
- Extensible model architecture

Run the tests:

    pytest tests/

Format and lint the code:

    black receiptml/
    flake8 receiptml/

License: MIT

Contributions are welcome! Please feel free to submit a Pull Request.