EasyOcr-ggml

A GGML/GGUF port of the EasyOCR inference pipeline. The goal is a self-contained native binary (no Python, no PyTorch, no ONNX Runtime) that loads .gguf weights and produces the same OCR results as the upstream Python library.

This repo is the GGML-backed sibling of @qvac/ocr-onnx — same pipeline shape, same pre/post-processing, different inference engine.

Current milestone scope is gen-2 recognizers only (English/Latin path).

Status

  • PyTorch checkpoint → GGUF converter (scripts/pth_to_gguf.py)
  • CRAFT detector weights converted (models/craft_mlt_25k.gguf, 80 MB, F32)
  • CRNN gen-2 English recognizer converted (models/english_g2.gguf, 15 MB, F32)
  • Build scaffolding: ggml submodule, CMake, OpenCV link, GGUF loader, smoke binary (Phase 1 of docs/PLAN.md)
  • CRAFT detector compute graph + PyTorch oracle (Phase 2 of docs/PLAN.md); end-to-end output bit-exact on a synthetic ramp and on a real examples/english.png (max abs error 5.36e-07)
  • C++ CRAFT pre/post-processing (Phase 3 of docs/PLAN.md), lifted from @qvac/ocr-onnx. ./build/detect examples/english.png returns the same 12 aligned text boxes as EasyOCR Python's Reader.detect, 11/12 within 3 px (the one 21 px outlier is an inherited ocr-onnx merge quirk; see docs/known-divergences.md)
  • CRNN gen-2 compute graph + manual BiLSTM op (Phase 4 of docs/PLAN.md); end-to-end logits on english_g2.gguf bit-exact against PyTorch within FP32 noise (max_abs ≈ 7.6e-6)
  • CRNN gen-2 pre/post + CTC decode (Phase 5 of docs/PLAN.md); lifted from @qvac/ocr-onnx. ./build/ocr-cli reads examples/english.png and produces the 12 lines of text; 9/12 exact vs EasyOCR Python (known divergences documented in docs/known-divergences.md)
  • ocr-cli end-to-end binary with EasyOCR-compatible flags (--detail, --output-format standard|json, --lang, --paragraph, --mag-ratio, --debug-png) for gen-2 recognizers (Phase 7 of docs/PLAN.md).
  • Phase 8 evaluation metrics:
      - detection: IoU@threshold, precision/recall/F1, tight/loose pixel bars
      - recognition: CER/WER, exact-match fraction, normalized edit distance
      - latency: repeatable warmup+runs with p50/p95/mean reporting
  • Phase 9 quantization path:
      - pth_to_gguf.py --quantize {Q8_0|Q4_K}
      - Q8_0 variants for craft_mlt_25k and english_g2 generated
      - Q8_0 end-to-end text parity confirmed on examples/english.png

See docs/PLAN.md for the detailed roadmap. Architecture deep-dive: docs/architecture.md.
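Phase 5's CTC decode is the standard greedy collapse: argmax per timestep, merge consecutive repeats, drop blanks. A minimal Python sketch of the idea (not the repo's C++ implementation; class 0 is assumed to be the CTC blank, matching the num_classes = vocab + 1 layout):

```python
def ctc_greedy_decode(logits, vocab):
    """Greedy CTC decode.

    logits: per-timestep score lists, shape [T][num_classes].
    vocab:  characters for classes 1..num_classes-1 (class 0 = blank, assumed).
    """
    best = [max(range(len(t)), key=t.__getitem__) for t in logits]  # argmax per step
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:      # collapse repeats, skip the blank
            out.append(vocab[idx - 1])    # shift past the blank class
        prev = idx
    return "".join(out)
```

For example, the timestep sequence `a a <blank> b` decodes to "ab"; the blank between repeats is what lets CTC emit genuine double letters.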

PoC benchmark snapshot

End-to-end CPU latency on examples/english.png (same host, warmup=1, runs=5):

| Benchmark | Mean (ms) | p50 (ms) | p95 (ms) | Notes |
|---|---|---|---|---|
| easyocr-ggml (test_ocr_pipeline) | 23162.42 | 22999.27 | 23772.36 | Full pipeline: CRAFT + box + CRNN |
| EasyOCR Python (Reader.readtext, CPU) | 5510.45 | 5468.99 | 5919.81 | Same image, mag_ratio=1.5, add_margin=0.0, paragraph=False |

Stage split estimate for easyocr-ggml (same run profile):

| Segment | Mean (ms) | Share |
|---|---|---|
| Detection side (detect ~= CRAFT + box post-proc) | 19591.69 | 80.1% |
| Recognition side (residual from full - detect) | 4870.27 | 19.9% |

Notes:

  • This is a PoC benchmark snapshot, not a release SLA.
  • Stage split is derived from separate binary runs; it is directionally useful for bottleneck targeting.

Repository layout

EasyOcr-ggml/
├── README.md            this file
├── CMakeLists.txt       top-level build (ggml submodule + OpenCV + targets)
├── docs/
│   ├── PLAN.md          detailed port plan and design notes
│   └── architecture.md  layered architecture, decisions, tech debt
├── scripts/
│   └── pth_to_gguf.py   PyTorch .pth → GGUF weight converter
├── models/
│   ├── craft_mlt_25k.gguf    detector weights + metadata
│   └── english_g2.gguf       English recognizer weights + vocab metadata
├── include/easyocr-ggml/
│   ├── gguf_loader.hpp       public GGUF loader API
│   ├── craft_weights.hpp     CRAFT weight loader + BN-fold
│   ├── craft.hpp             build_craft() and tap names
│   ├── crnn_weights.hpp      CRNN gen-2 weight loader (Phase 4)
│   ├── crnn.hpp              build_crnn_gen2() and tap names (Phase 4)
│   ├── ops.hpp               reusable conv / bilinear ops
│   └── pipeline/             Phase 3: lifted from @qvac/ocr-onnx
│       ├── steps.hpp           shared types (PipelineContext, …)
│       ├── step_detection_inference.hpp  pre-proc + GGML inference
│       └── step_bounding_box.hpp         post-proc (heatmap → polygons)
├── src/
│   ├── ggml/
│   │   ├── gguf_loader.cpp   RAII wrapper over gguf_init_from_file
│   │   ├── craft_weights.cpp 154 tensors + BN-fold
│   │   ├── ops.cpp           conv_2d_bias / _relu, bilinear_to
│   │   ├── craft.cpp         CRAFT compute graph
│   │   ├── crnn_weights.cpp  CRNN gen-2 weights (BN-fold + verbatim copy)
│   │   └── crnn.cpp          CRNN gen-2 graph + manual BiLSTM cell
│   ├── pipeline/             Phase 3 implementations
│   │   ├── steps.cpp           fourPointTransform, InferredText::toString
│   │   ├── step_detection_inference.cpp
│   │   ├── step_bounding_box.cpp           lifted verbatim (535 LOC)
│   │   └── qlog.hpp                        QLOG/ALOG_DEBUG no-op shim
│   └── cli/
│       ├── smoke.cpp         GGUF metadata smoke (Phase 1)
│       ├── craft_smoke.cpp   CRAFT graph smoke (Phase 2)
│       ├── detect.cpp        end-to-end detection (Phase 3)
│       └── crnn_smoke.cpp    CRNN gen-2 graph smoke (Phase 4)
├── examples/
│   └── english.png          canonical real-world OCR test image
├── tests/
│   ├── test_build_craft.cpp  oracle vs PyTorch references
│   └── reference/
│       ├── dump_craft_reference.py        PyTorch oracle dumper
│       ├── craft_input.npy                synthetic ramp input
│       ├── craft_output_nhwc.npy          synthetic expected output
│       ├── craft_real_english_input.npy   pre-processed english.png
│       └── craft_real_english_output_nhwc.npy   expected heatmap
├── third_party/
│   └── ggml/                 git submodule, pinned commit
└── (future: more under src/ggml/, build_crnn_gen{1,2})

Quick start (today — weight conversion only)

The conversion script depends on torch, gguf, and easyocr. The easiest setup is to reuse the venv from the upstream EasyOCR clone:

# from this directory
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
    ~/.EasyOCR/model/<model>.pth \
    models/<model>.gguf

The script auto-detects the architecture from the filename:

| Input filename pattern | general.architecture | Extra metadata |
|---|---|---|
| craft_mlt_25k.pth | craft | (none) |
| pretrained_ic15_res*.pt | dbnet | (none) |
| english_g2.pth, *_g2.pth | crnn | crnn.generation=2 + vocab |

For custom checkpoints not in easyocr.config, pass --arch explicitly.
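The detection rules in the table amount to a small filename pattern match. A hedged re-creation (illustrative rule list and helper name, not the script's actual code):

```python
import fnmatch

# Hypothetical restatement of the filename -> architecture table above.
RULES = [
    ("craft_mlt_25k.pth", "craft"),
    ("pretrained_ic15_res*.pt", "dbnet"),
    ("*_g2.pth", "crnn"),
]

def detect_arch(filename):
    """Return the GGUF general.architecture for a checkpoint filename."""
    for pattern, arch in RULES:
        if fnmatch.fnmatch(filename, pattern):
            return arch
    return None  # unknown: the caller must pass --arch explicitly
```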

Quantized conversion (Phase 9)

scripts/pth_to_gguf.py now supports:

--quantize Q8_0
--quantize Q4_K

Example:

../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
  ~/.EasyOCR/model/english_g2.pth \
  models/english_g2_q8_0.gguf \
  --quantize Q8_0

Batch conversion used for PoC benchmarking:

../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
  ~/.EasyOCR/model/craft_mlt_25k.pth models/craft_mlt_25k_q8_0.gguf --quantize Q8_0
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
  ~/.EasyOCR/model/english_g2.pth models/english_g2_q8_0.gguf --quantize Q8_0
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
  ~/.EasyOCR/model/craft_mlt_25k.pth models/craft_mlt_25k_q4_k.gguf --quantize Q4_K
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
  ~/.EasyOCR/model/english_g2.pth models/english_g2_q4_k.gguf --quantize Q4_K

Build (native, Linux x64)

The native build links ggml (vendored as a submodule) and OpenCV (system). It produces these binaries in build/:

  • smoke — opens a .gguf file and prints its architecture / tensor count. Validates the loader, the ggml + OpenCV link, and the converted weights.
  • craft_smoke — runs the CRAFT compute graph end-to-end on a synthetic input and prints the output shape + simple stats.
  • test_build_craft — compares the GGML graph output against PyTorch reference dumps (synthetic ramp by default; --image english for a real image). Pass or fail at atol=1e-4.
  • detect (Phase 3) — full pipeline: imread → resize + ImageNet normalize → build_craft → connected-components / box merge → prints aligned + unaligned text-box polygons. Optional --debug-png debug.png overlays the boxes on the source image.
  • test_detect_polygons (Phase 3) — runs detect on examples/english.png and compares against EasyOCR Python's polygons (committed at tests/reference/craft_real_english_polygons.json). Hooked into ctest.
  • crnn_smoke (Phase 4) — runs the CRNN gen-2 compute graph on a synthetic input and prints the logits' shape + min/max/mean.
  • test_build_crnn_gen2 (Phase 4) — feeds the same np.linspace-ramp through both build_crnn_gen2 and PyTorch, and compares the final [1, T, num_classes] logits at atol=1e-4. Hooked into ctest.
  • ocr-cli (Phase 5 + 7) — full end-to-end OCR: imread → detect → box → crop → recognize → print recognized text (gen-2 recognizers). Flags: --detail 0|1, --output-format standard|json, --lang en[,fr,...], --paragraph, --mag-ratio 1.5, --debug-png debug.png.
  • test_ocr_pipeline (Phase 5) — runs the pipeline on examples/english.png and compares recognized text against EasyOCR Python's readtext; now also reports CER/WER and optional latency stats. Hooked into ctest.

One-time setup

# 1. Fetch the ggml submodule (pinned commit recorded in .gitmodules)
git submodule update --init --recursive

# 2. Install OpenCV headers + libs (Ubuntu / Debian)
sudo apt update
sudo apt install -y libopencv-dev cmake build-essential

# 3. Configure & build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

Run the smoke binaries

# GGUF metadata smoke
./build/smoke models/craft_mlt_25k.gguf models/english_g2.gguf
# [ok] models/craft_mlt_25k.gguf  arch=craft  n_tensors=154  n_kv=12
# [ok] models/english_g2.gguf  arch=crnn  n_tensors=44  n_kv=18  vocab_bytes=98  num_classes=97

# CRAFT graph smoke at side=256
./build/craft_smoke models/craft_mlt_25k.gguf 256
# [ok] output ne = [2, 128, 128, 1]  n_elements=32768  ...

Run the CRAFT oracle test

# (a) Synthetic ramp at 64×64 (this is what ctest runs in CI)
./build/test_build_craft
# [ ok ] output_nhwc          n=2048    max_abs=4.68e-08  ...
# 1 passed, 0 failed, 12 skipped  (atol=1e-04)

# (b) Real image: examples/english.png
./build/test_build_craft --image english
# [input] real image english  NCHW=[1,3,736,1376]
# [ ok ] output_nhwc          n=506368  max_abs=5.36e-07  ...
# 1 passed, 0 failed, 12 skipped  (atol=1e-04)

Run Phase 8 evaluations

# One command: build + quality/latency evaluation + JSON reports.
./scripts/eval_phase8.sh
# Reports:
#   out/phase8/detect_metrics.json
#   out/phase8/ocr_metrics.json

# Or run tests directly:
./build/test_detect_polygons --report-json /tmp/detect_metrics.json
./build/test_ocr_pipeline \
  --warmup-runs 1 --bench-runs 5 \
  --report-json /tmp/ocr_metrics.json
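The CER/WER numbers in the Phase 8 reports are standard normalized edit distances. A minimal reference implementation of the metric (not the repo's code):

```python
def edit_distance(a, b):
    """Classic Levenshtein DP over two sequences (chars or words)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (free if chars match)
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character error rate, normalized by reference length."""
    return edit_distance(hyp, ref) / max(len(ref), 1)

def wer(hyp, ref):
    """Word error rate over whitespace-split tokens."""
    return edit_distance(hyp.split(), ref.split()) / max(len(ref.split()), 1)
```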

Run full OCR end-to-end (Phase 5 + 7)

# Default: detect + recognize on examples/english.png using English weights
./build/ocr-cli
# Reduce your risk of coronavirus infection:
# Clean hands with soap and water
# ...

# Detail mode: index + confidence + bounding box per line
./build/ocr-cli --detail 1

# JSON output (matches EasyOCR Python's readtext shape)
./build/ocr-cli --output-format json | jq .

# With annotated debug image
./build/ocr-cli --image examples/english.png \
                --debug-png /tmp/english_ocr.png

# Different gen-2 recognizer:
./build/ocr-cli --recognizer models/latin_g2.gguf --image my_french.jpg

# Q8_0 quantized recognizer + detector:
./build/ocr-cli \
  --detector models/craft_mlt_25k_q8_0.gguf \
  --recognizer models/english_g2_q8_0.gguf \
  --image examples/english.png

# Text-vs-EasyOCR test (9/12 exact, all within edit distance 3):
./build/test_ocr_pipeline
# 9/12 (75%) exact, worst_edit=3, CER/WER reported  PASS

# Or via ctest:
cmake --build build --target test
# 4/4 tests passed
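Downstream tooling can consume the JSON output directly. A hedged sketch that assumes --output-format json emits readtext-style [box, text, confidence] triples, which is what "matches EasyOCR Python's readtext shape" suggests (verify against your build before relying on it):

```python
import json

def load_ocr_json(raw):
    """Parse assumed [box, text, confidence] triples; keep text + confidence."""
    return [(text, conf) for _box, text, conf in json.loads(raw)]

# Illustrative payload in the assumed shape:
sample = '[[[[0,0],[10,0],[10,5],[0,5]], "Clean hands", 0.98]]'
```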

Run the end-to-end detection pipeline (Phase 3)

# Pretty-print the boxes detected on examples/english.png
./build/detect
# [load]   image examples/english.png  905x480x3
# [infer]  textMap=688x368  linkMap=688x368  imgResizeRatio=1.3333
# [boxes]  aligned=12  unaligned=0

# Save an annotated PNG with green boxes over the source image:
./build/detect --image examples/english.png \
               --debug-png /tmp/english_boxes.png

# Polygon-vs-EasyOCR test:
./build/test_detect_polygons
# 12/12 box count match, 11/12 within 3 px (PASS)

# Or via ctest:
cmake --build build --target test

Run the CRNN gen-2 recognizer graph (Phase 4)

# Smoke: synthetic input through the recognizer graph
./build/crnn_smoke models/english_g2.gguf 256
# [run]   computing graph (input 256x64, 4910 nodes)...
# [ok]    logits ne = [97, 63, 1, 1]   (== PyTorch [1, T=63, num_classes=97])

# Logits-vs-PyTorch oracle test (regenerate references first if needed):
../EasyOCR/.venv/bin/python tests/reference/dump_crnn_reference.py
./build/test_build_crnn_gen2
# [ ok ] logits  n=6111  max_abs=7.6e-06  ...
# PASS  (atol=1e-04)

The real-image test (b) needs no Python or extra setup: the pre-processed input (tests/reference/craft_real_english_input.npy, 12 MB) and expected heatmap (craft_real_english_output_nhwc.npy, 2 MB) are committed alongside the source PNG (examples/english.png). The 12 "skip" lines are intentional: only the input and the final output are committed, and the per-layer dumps are regenerated locally — see Diagnosing which layer first diverges for the bisect workflow.

Note on vocab_bytes / num_classes in the GGUF smoke output: vocab_bytes is the UTF-8 byte length of crnn.vocab; the character count is num_classes − 1 (the −1 accounts for the CTC blank token), which is 96 for the bundled English gen-2 vocab. The two disagree by 2 because one of the vocab characters takes 3 UTF-8 bytes instead of 1.
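The arithmetic is easy to verify. An illustrative stand-in vocab (the real one lives in the GGUF metadata) with 95 one-byte ASCII characters plus one three-byte character reproduces the smoke output's numbers:

```python
# Illustrative only: the euro sign is a stand-in for whichever vocab
# character is multi-byte; it encodes to 3 UTF-8 bytes.
vocab = "€" + "".join(chr(c) for c in range(32, 32 + 95))  # 96 characters

num_classes = len(vocab) + 1              # +1 for the CTC blank -> 97
vocab_bytes = len(vocab.encode("utf-8"))  # 95 * 1 byte + 3 bytes -> 98
```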

CRAFT detector graph (Phase 2)

build_craft lives in src/ggml/craft.cpp and mirrors easyocr/craft.py exactly. All BatchNorm parameters are pre-folded into the preceding Conv2d at load time inside CraftWeights, so the runtime graph contains no BN op.
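BN folding replaces each Conv+BN pair with a single Conv whose weights absorb the normalization. The algebra, sketched in NumPy with assumed tensor shapes (the repo does this in C++ inside CraftWeights):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, mean, var) into Conv2d(w, b).

    w: [out_ch, in_ch, kh, kw] conv weight (assumed layout), b: [out_ch] bias.
    Returns (w', b') such that BN(conv(x, w, b)) == conv(x, w', b').
    """
    scale = gamma / np.sqrt(var + eps)             # per-output-channel scale
    w_folded = w * scale[:, None, None, None]      # scale each output filter
    b_folded = (b - mean) * scale + beta           # shift the bias through BN
    return w_folded, b_folded
```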

The end-to-end correctness test (./build/test_build_craft) is documented under Build above. This section covers the two extra workflows: regenerating the references and bisecting a regression.

Reference dumps

| Mode | Input | Committed dumps | Regen command |
|---|---|---|---|
| Synthetic | np.linspace(-1, 1) ramp at 64×64 | craft_input.npy, craft_output_nhwc.npy (~60 KB) | dump_craft_reference.py |
| Real image | examples/english.png via EasyOCR's imgproc (NCHW [1, 3, 736, 1376]) | craft_real_english_input.npy, craft_real_english_output_nhwc.npy (~14 MB) | dump_craft_reference.py --image examples/english.png |

Both pass at atol=1e-4; observed errors are at the FP32-noise floor (synthetic max 4.7e-08, real-image max 5.36e-07). Synthetic regenerates in <1 s; real-image runs the upstream PyTorch model once (~5 s on CPU) and is rarely needed unless you change the dumper.

# Regenerate the committed (minimal) references:
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py \
    --image examples/english.png

# Run synthetic test via ctest:
cmake --build build --target test
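The comparator's pass/fail criterion is just a max-absolute-error gate over the flattened arrays. An equivalent check in Python (hypothetical helper, not the test binary's code):

```python
import numpy as np

def max_abs_error(got, expected, atol=1e-4):
    """Return (max_abs, ok), mirroring the test's atol gate on two dumps."""
    max_abs = float(np.max(np.abs(got - expected)))
    return max_abs, max_abs <= atol
```

Usage against the committed dumps would look like `max_abs_error(my_output, np.load("tests/reference/craft_output_nhwc.npy"))`.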

Diagnosing which layer first diverges

If test_build_craft reports a regression, regenerate every U-net stage locally and re-run — the comparator reports max_abs per tap and pinpoints where things drifted:

../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py --per-layer
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py \
    --image examples/english.png --per-layer

./build/test_build_craft               # 13 passed when the graph is healthy
./build/test_build_craft --image english

The per-layer dumps are sizable (~1.8 MB synthetic, ~360 MB real) and are git-ignored — they live only on the developer's machine until the next regeneration.

Try a different real image

The committed examples/english.png is the canonical test, but the workflow generalises:

# Drop a new image (PNG / JPG, RGB or grayscale)
cp /path/to/your/image.png examples/myimage.png

# Generate the pre-processed input + expected heatmap
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py \
    --image examples/myimage.png

# Run the test against it (note: --image takes the *stem*, not the path)
./build/test_build_craft --image myimage

tests/reference/craft_real_<stem>_*.npy are git-ignored by default unless you choose to commit them. EasyOCR's own pre-processing (imgproc.resize_aspect_ratio with mag_ratio=1.5, canvas_size=2560) is applied — pass --mag-ratio / --canvas-size to the dumper to vary those.

Phase 3 lifted @qvac/ocr-onnx's resizeAspectRatio + normalizeAndBuildCHW C++ pre-processing into the runtime; these real-image references serve as the bit-exact ground truth.

Inspecting a converted GGUF

../EasyOCR/.venv/bin/python -m gguf.scripts.gguf_dump models/english_g2.gguf | head -30

You should see metadata KVs including general.architecture = crnn, crnn.generation = 2, crnn.num_classes = 97, and crnn.vocab containing the 96-character set used by the gen-2 English CTC head.

Why GGML?

The two QVAC OCR siblings differ only in the inference engine:

| | @qvac/ocr-onnx | EasyOcr-ggml (this) |
|---|---|---|
| Inference backend | ONNX Runtime | GGML |
| Weight format | .onnx | .gguf |
| Pre/post-processing | C++ + OpenCV | C++ + OpenCV (same code) |
| Quantization | per-EP (limited) | block-quantized (Q8_0, Q4_K, …) out of the box |
| Binary size | ONNX Runtime ~30 MB+ | libggml ~1–3 MB |
| Mobile / edge fit | good | smaller, faster cold start, no EP plumbing |

Quantization trade-offs (PoC snapshot)

Measured on examples/english.png, CPU, warmup=1, runs=3 using test_ocr_pipeline.

| Variant | Detector GGUF | Recognizer GGUF | Total model size | OCR text parity vs F32 | Mean latency (ms) |
|---|---|---|---|---|---|
| F32 baseline | 80M | 15M | 95M | baseline | 23734.13 |
| Q8_0 | 80M | 7.8M | 87.8M | identical (12/12 lines) | 23930.10 |
| Q4_K | 80M | 15M | 95M | identical (12/12 lines) | 24074.04 |

Notes:

  • The current CRAFT tensor layout does not satisfy the block-quantization shape constraints under this converter path, so the detector file stays at ~80M.
  • Q4_K via the current gguf Python quantizer falls back to F32 for this model family, so there is no size win.
  • With the current bottleneck split (~80% on the detection side), quantizing only the recognizer weights does not yet improve end-to-end latency.
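Why Q8_0 roughly halves the recognizer: ggml's Q8_0 stores each block of 32 weights as one FP16 scale plus 32 int8 values, i.e. about 8.5 bits per quantized weight versus 32 for F32. A NumPy sketch of the scheme (illustrative, not ggml's code):

```python
import numpy as np

def q8_0_roundtrip(block):
    """Quantize one 32-float block to int8 + per-block scale, then dequantize."""
    assert block.size == 32                       # Q8_0 block size in ggml
    amax = float(np.max(np.abs(block)))
    scale = amax / 127.0 if amax else 1.0         # map amax onto the int8 range
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale           # dequantized approximation
```

The round-trip error per element is bounded by half the block scale, which is why Q8_0 typically preserves text parity (as the table above shows for this model).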

License

Apache-2.0 (matches upstream EasyOCR and @qvac/ocr-onnx).
