A GGML/GGUF port of the EasyOCR inference
pipeline. The goal is a self-contained native binary (no Python, no PyTorch,
no ONNX Runtime) that loads .gguf weights and produces the same OCR results
as the upstream Python library.
This repo is the GGML-backed sibling of
@qvac/ocr-onnx — same pipeline
shape, same pre/post-processing, different inference engine.
Current milestone scope is gen-2 recognizers only (English/Latin path).
- PyTorch checkpoint → GGUF converter (`scripts/pth_to_gguf.py`)
- CRAFT detector weights converted (`models/craft_mlt_25k.gguf`, 80 MB, F32)
- CRNN gen-2 English recognizer converted (`models/english_g2.gguf`, 15 MB, F32)
- Build scaffolding: ggml submodule, CMake, OpenCV link, GGUF loader, `smoke` binary (Phase 1 of `docs/PLAN.md`)
- CRAFT detector compute graph + PyTorch oracle (Phase 2 of `docs/PLAN.md`); end-to-end output bit-exact on a synthetic ramp and on a real `examples/english.png` (max abs error 5.36e-07)
- C++ CRAFT pre/post-processing (Phase 3 of `docs/PLAN.md`); lifted from `@qvac/ocr-onnx`. `./build/detect examples/english.png` returns the same 12 aligned text boxes EasyOCR Python's `Reader.detect` does, 11/12 within 3 px (one outlier at 21 px is an inherited ocr-onnx merge quirk — see `docs/known-divergences.md`)
- CRNN gen-2 compute graph + manual BiLSTM op (Phase 4 of `docs/PLAN.md`); end-to-end logits on `english_g2.gguf` bit-exact against PyTorch within FP32 noise (max_abs ≈ 7.6e-6)
- CRNN gen-2 pre/post + CTC decode (Phase 5 of `docs/PLAN.md`); lifted from `@qvac/ocr-onnx`. `./build/ocr-cli` reads `examples/english.png` and produces the 12 lines of text; 9/12 exact vs EasyOCR Python (known divergences documented in `docs/known-divergences.md`)
- `ocr-cli` end-to-end binary with EasyOCR-compatible flags (`--detail`, `--output-format standard|json`, `--lang`, `--paragraph`, `--mag-ratio`, `--debug-png`) for gen-2 recognizers (Phase 7 of `docs/PLAN.md`)
- Phase 8 evaluation metrics:
  - detection: IoU@threshold, precision/recall/F1, tight/loose pixel bars
  - recognition: CER/WER, exact-match fraction, normalized edit distance
  - latency: repeatable warmup+runs with p50/p95/mean reporting
- Phase 9 quantization path:
  - `pth_to_gguf.py --quantize {Q8_0|Q4_K}`
  - Q8_0 variants for `craft_mlt_25k` and `english_g2` generated
  - Q8_0 end-to-end text parity confirmed on `examples/english.png`
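The Phase 5 CTC decode is the standard greedy (best-path) collapse: take the argmax class per timestep, merge consecutive repeats, and drop the blank. The repo implements this in C++; the sketch below is an illustrative Python version, assuming (as elsewhere in this README) class index 0 is the CTC blank and character *i* of the vocab maps to class *i + 1*:

```python
# Illustrative sketch, NOT the repo's C++ implementation: greedy CTC decode.
def ctc_greedy_decode(logits, vocab):
    """logits: [T][num_classes] per-timestep scores; vocab: string of characters."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:   # collapse repeats, drop blank (class 0)
            out.append(vocab[idx - 1])
        prev = idx
    return "".join(out)

# Tiny example with vocab "ab": classes are {0: blank, 1: 'a', 2: 'b'}.
logits = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.8, 0.1],    # 'a' again → collapsed with the previous frame
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # 'b'
]
print(ctc_greedy_decode(logits, "ab"))  # → "ab"
```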
See docs/PLAN.md for the detailed roadmap.
Architecture deep-dive: docs/architecture.md.
End-to-end CPU latency on examples/english.png (same host, warmup=1, runs=5):
| Benchmark | Mean (ms) | p50 (ms) | p95 (ms) | Notes |
|---|---|---|---|---|
| easyocr-ggml (`test_ocr_pipeline`) | 23162.42 | 22999.27 | 23772.36 | Full pipeline: CRAFT + box + CRNN |
| EasyOCR Python (`Reader.readtext`, CPU) | 5510.45 | 5468.99 | 5919.81 | Same image and `mag_ratio=1.5`, `add_margin=0.0`, `paragraph=False` |
Stage split estimate for easyocr-ggml (same run profile):
| Segment | Mean (ms) | Share |
|---|---|---|
| Detection side (`detect` ~= CRAFT + box post-proc) | 19591.69 | 80.1% |
| Recognition side (residual from full - detect) | 4870.27 | 19.9% |
Notes:
- This is a PoC benchmark snapshot, not a release SLA.
- Stage split is derived from separate binary runs; it is directionally useful for bottleneck targeting.
EasyOcr-ggml/
├── README.md this file
├── CMakeLists.txt top-level build (ggml submodule + OpenCV + targets)
├── docs/
│ ├── PLAN.md detailed port plan and design notes
│ └── architecture.md layered architecture, decisions, tech debt
├── scripts/
│ └── pth_to_gguf.py PyTorch .pth → GGUF weight converter
├── models/
│ ├── craft_mlt_25k.gguf detector weights + metadata
│ └── english_g2.gguf English recognizer weights + vocab metadata
├── include/easyocr-ggml/
│ ├── gguf_loader.hpp public GGUF loader API
│ ├── craft_weights.hpp CRAFT weight loader + BN-fold
│ ├── craft.hpp build_craft() and tap names
│ ├── crnn_weights.hpp CRNN gen-2 weight loader (Phase 4)
│ ├── crnn.hpp build_crnn_gen2() and tap names (Phase 4)
│ ├── ops.hpp reusable conv / bilinear ops
│ └── pipeline/ Phase 3: lifted from @qvac/ocr-onnx
│ ├── steps.hpp shared types (PipelineContext, …)
│ ├── step_detection_inference.hpp pre-proc + GGML inference
│ └── step_bounding_box.hpp post-proc (heatmap → polygons)
├── src/
│ ├── ggml/
│ │ ├── gguf_loader.cpp RAII wrapper over gguf_init_from_file
│ │ ├── craft_weights.cpp 154 tensors + BN-fold
│ │ ├── ops.cpp conv_2d_bias / _relu, bilinear_to
│ │ ├── craft.cpp CRAFT compute graph
│ │ ├── crnn_weights.cpp CRNN gen-2 weights (BN-fold + verbatim copy)
│ │ └── crnn.cpp CRNN gen-2 graph + manual BiLSTM cell
│ ├── pipeline/ Phase 3 implementations
│ │ ├── steps.cpp fourPointTransform, InferredText::toString
│ │ ├── step_detection_inference.cpp
│ │ ├── step_bounding_box.cpp lifted verbatim (535 LOC)
│ │ └── qlog.hpp QLOG/ALOG_DEBUG no-op shim
│ └── cli/
│ ├── smoke.cpp GGUF metadata smoke (Phase 1)
│ ├── craft_smoke.cpp CRAFT graph smoke (Phase 2)
│ ├── detect.cpp end-to-end detection (Phase 3)
│ └── crnn_smoke.cpp CRNN gen-2 graph smoke (Phase 4)
├── examples/
│ └── english.png canonical real-world OCR test image
├── tests/
│ ├── test_build_craft.cpp oracle vs PyTorch references
│ └── reference/
│ ├── dump_craft_reference.py PyTorch oracle dumper
│ ├── craft_input.npy synthetic ramp input
│ ├── craft_output_nhwc.npy synthetic expected output
│ ├── craft_real_english_input.npy pre-processed english.png
│ └── craft_real_english_output_nhwc.npy expected heatmap
├── third_party/
│ └── ggml/ git submodule, pinned commit
└── (future: more under src/ggml/, build_crnn_gen{1,2})
The conversion script depends on torch, gguf, and easyocr. Easiest is to
reuse the venv from the upstream EasyOCR clone:
```shell
# from this directory
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
    ~/.EasyOCR/model/<model>.pth \
    models/<model>.gguf
```

The script auto-detects the architecture from the filename:
| Input filename pattern | `general.architecture` | Extra metadata |
|---|---|---|
| `craft_mlt_25k.pth` | `craft` | (none) |
| `pretrained_ic15_res*.pt` | `dbnet` | (none) |
| `english_g2.pth`, `*_g2.pth` | `crnn` | `crnn.generation=2` + vocab |
For custom checkpoints not in easyocr.config, pass --arch explicitly.
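The mapping in the table above can be sketched as a filename heuristic. This is a hypothetical re-implementation for illustration; the authoritative logic lives in `scripts/pth_to_gguf.py`:

```python
# Hypothetical sketch of the filename → architecture heuristic documented
# in the table above (the real logic is in scripts/pth_to_gguf.py).
import fnmatch

PATTERNS = [
    ("craft_mlt_25k.pth", "craft"),
    ("pretrained_ic15_res*.pt", "dbnet"),
    ("*_g2.pth", "crnn"),
]

def detect_arch(filename: str) -> str:
    for pattern, arch in PATTERNS:
        if fnmatch.fnmatch(filename, pattern):
            return arch
    raise ValueError(f"unknown checkpoint {filename!r}; pass --arch explicitly")

print(detect_arch("english_g2.pth"))  # → crnn
```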
`scripts/pth_to_gguf.py` now supports:

```shell
--quantize Q8_0
--quantize Q4_K
```

Example:

```shell
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
    ~/.EasyOCR/model/english_g2.pth \
    models/english_g2_q8_0.gguf \
    --quantize Q8_0
```

Batch conversion used for PoC benchmarking:

```shell
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
    ~/.EasyOCR/model/craft_mlt_25k.pth models/craft_mlt_25k_q8_0.gguf --quantize Q8_0
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
    ~/.EasyOCR/model/english_g2.pth models/english_g2_q8_0.gguf --quantize Q8_0
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
    ~/.EasyOCR/model/craft_mlt_25k.pth models/craft_mlt_25k_q4_k.gguf --quantize Q4_K
../EasyOCR/.venv/bin/python scripts/pth_to_gguf.py \
    ~/.EasyOCR/model/english_g2.pth models/english_g2_q4_k.gguf --quantize Q4_K
```

The native build links ggml (vendored as a submodule) and OpenCV (system).
It produces these binaries in build/:
- `smoke` — opens a `.gguf` file and prints its architecture / tensor count. Validates the loader, the ggml + OpenCV link, and the converted weights.
- `craft_smoke` — runs the CRAFT compute graph end-to-end on a synthetic input and prints the output shape + simple stats.
- `test_build_craft` — compares the GGML graph output against PyTorch reference dumps (synthetic ramp by default; `--image english` for a real image). Pass or fail at `atol=1e-4`.
- `detect` (Phase 3) — full pipeline: `imread` → resize + ImageNet normalize → `build_craft` → connected-components / box merge → prints aligned + unaligned text-box polygons. Optional `--debug-png debug.png` overlays the boxes on the source image.
- `test_detect_polygons` (Phase 3) — runs `detect` on `examples/english.png` and compares against EasyOCR Python's polygons (committed at `tests/reference/craft_real_english_polygons.json`). Hooked into `ctest`.
- `crnn_smoke` (Phase 4) — runs the CRNN gen-2 compute graph on a synthetic input and prints the logits' shape + min/max/mean.
- `test_build_crnn_gen2` (Phase 4) — feeds the same `np.linspace` ramp through both `build_crnn_gen2` and PyTorch, and compares the final `[1, T, num_classes]` logits at `atol=1e-4`. Hooked into `ctest`.
- `ocr-cli` (Phase 5 + 7) — full end-to-end OCR: `imread` → detect → box → crop → recognize → print recognized text (gen-2 recognizers). Flags: `--detail 0|1`, `--output-format standard|json`, `--lang en[,fr,...]`, `--paragraph`, `--mag-ratio 1.5`, `--debug-png debug.png`.
- `test_ocr_pipeline` (Phase 5) — runs the pipeline on `examples/english.png` and compares recognized text against EasyOCR Python's `readtext`; now also reports CER/WER and optional latency stats. Hooked into `ctest`.
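Consuming `ocr-cli --output-format json` from a script can be sketched as below. The exact JSON shape is an assumption here: this README only says it matches EasyOCR Python's `readtext` shape, which is a list of `[box, text, confidence]` entries with `box` as four `[x, y]` corner points — verify against your actual output before relying on it:

```python
# Hedged example: parse ocr-cli's JSON output, ASSUMING it mirrors EasyOCR's
# readtext shape — a list of [box, text, confidence] entries where box is a
# list of four [x, y] corner points. Check against real output first.
import json

def read_ocr_json(raw: str):
    results = json.loads(raw)
    return [(box, text, float(conf)) for box, text, conf in results]

# Normally raw would come from:
#   subprocess.run(["./build/ocr-cli", "--output-format", "json"],
#                  capture_output=True, text=True).stdout
raw = '[[[[10, 10], [90, 10], [90, 30], [10, 30]], "hello", 0.98]]'
for box, text, conf in read_ocr_json(raw):
    print(f"{text!r} ({conf:.2f}) at {box}")
```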
```shell
# 1. Fetch the ggml submodule (pinned commit recorded in .gitmodules)
git submodule update --init --recursive

# 2. Install OpenCV headers + libs (Ubuntu / Debian)
sudo apt update
sudo apt install -y libopencv-dev cmake build-essential

# 3. Configure & build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

```shell
# GGUF metadata smoke
./build/smoke models/craft_mlt_25k.gguf models/english_g2.gguf
# [ok] models/craft_mlt_25k.gguf arch=craft n_tensors=154 n_kv=12
# [ok] models/english_g2.gguf arch=crnn n_tensors=44 n_kv=18 vocab_bytes=98 num_classes=97

# CRAFT graph smoke at side=256
./build/craft_smoke models/craft_mlt_25k.gguf 256
# [ok] output ne = [2, 128, 128, 1] n_elements=32768 ...
```

```shell
# (a) Synthetic ramp at 64×64 (this is what ctest runs in CI)
./build/test_build_craft
# [ ok ] output_nhwc n=2048 max_abs=4.68e-08 ...
# 1 passed, 0 failed, 12 skipped (atol=1e-04)

# (b) Real image: examples/english.png
./build/test_build_craft --image english
# [input] real image english NCHW=[1,3,736,1376]
# [ ok ] output_nhwc n=506368 max_abs=5.36e-07 ...
# 1 passed, 0 failed, 12 skipped (atol=1e-04)
```

```shell
# One command: build + quality/latency evaluation + JSON reports.
./scripts/eval_phase8.sh
# Reports:
#   out/phase8/detect_metrics.json
#   out/phase8/ocr_metrics.json

# Or run tests directly:
./build/test_detect_polygons --report-json /tmp/detect_metrics.json
./build/test_ocr_pipeline \
    --warmup-runs 1 --bench-runs 5 \
    --report-json /tmp/ocr_metrics.json
```

```shell
# Default: detect + recognize on examples/english.png using English weights
./build/ocr-cli
# Reduce your risk of coronavirus infection:
# Clean hands with soap and water
# ...

# Detail mode: index + confidence + bounding box per line
./build/ocr-cli --detail 1

# JSON output (matches EasyOCR Python's readtext shape)
./build/ocr-cli --output-format json | jq .

# With annotated debug image
./build/ocr-cli --image examples/english.png \
    --debug-png /tmp/english_ocr.png

# Different gen-2 recognizer:
./build/ocr-cli --recognizer models/latin_g2.gguf --image my_french.jpg

# Q8_0 quantized recognizer + detector:
./build/ocr-cli \
    --detector models/craft_mlt_25k_q8_0.gguf \
    --recognizer models/english_g2_q8_0.gguf \
    --image examples/english.png

# Text-vs-EasyOCR test (9/12 exact, all within edit distance 3):
./build/test_ocr_pipeline
# 9/12 (75%) exact, worst_edit=3, CER/WER reported  PASS

# Or via ctest:
cmake --build build --target test
# 4/4 tests passed
```

```shell
# Pretty-print the boxes detected on examples/english.png
./build/detect
# [load] image examples/english.png 905x480x3
# [infer] textMap=688x368 linkMap=688x368 imgResizeRatio=1.3333
# [boxes] aligned=12 unaligned=0

# Save an annotated PNG with green boxes over the source image:
./build/detect --image examples/english.png \
    --debug-png /tmp/english_boxes.png

# Polygon-vs-EasyOCR test:
./build/test_detect_polygons
# 12/12 box count match, 11/12 within 3 px (PASS)

# Or via ctest:
cmake --build build --target test
```

```shell
# Smoke: synthetic input through the recognizer graph
./build/crnn_smoke models/english_g2.gguf 256
# [run] computing graph (input 256x64, 4910 nodes)...
# [ok] logits ne = [97, 63, 1, 1] (== PyTorch [1, T=63, num_classes=97])

# Logits-vs-PyTorch oracle test (regenerate references first if needed):
../EasyOCR/.venv/bin/python tests/reference/dump_crnn_reference.py
./build/test_build_crnn_gen2
# [ ok ] logits n=6111 max_abs=7.6e-06 ...
# PASS (atol=1e-04)
```

The real-image test (b) needs no Python or extra setup — the
pre-processed input (tests/reference/craft_real_english_input.npy,
12 MB) and expected heatmap (craft_real_english_output_nhwc.npy, 2 MB)
are committed alongside the source PNG (examples/english.png). The 12
"skip" lines are intentional: only the input + final output are
committed; per-layer dumps regenerate locally — see
Diagnosing which layer first diverges
for the bisect workflow.
Note on `vocab_bytes` / `num_classes` in the GGUF smoke output: `vocab_bytes` is the UTF-8 byte length of `crnn.vocab`; the character count is `num_classes − 1` (the −1 accounts for the CTC blank token), which is 96 for the bundled English gen-2 vocab. The two disagree by 2 because the € symbol takes 3 UTF-8 bytes.
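The arithmetic in that note can be checked in a couple of lines:

```python
# Checking the vocab_bytes / num_classes arithmetic from the note above.
# The gen-2 English vocab has 96 characters; '€' is 3 bytes in UTF-8 instead
# of 1, so the byte length is 96 + 2 = 98, while num_classes = 96 + 1 = 97
# (the +1 is the CTC blank) — matching the smoke output.
vocab_chars = 96
euro_extra_bytes = len("€".encode("utf-8")) - 1  # 3 bytes vs 1 → +2
vocab_bytes = vocab_chars + euro_extra_bytes
num_classes = vocab_chars + 1                    # +1 for the CTC blank
print(vocab_bytes, num_classes)  # → 98 97
```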
`build_craft` lives in `src/ggml/craft.cpp` and mirrors
`easyocr/craft.py` exactly. All BatchNorm parameters are pre-folded into the
preceding Conv2d at load time inside `CraftWeights`, so the runtime graph
contains no BN op.
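The fold itself is the standard identity: scale each output channel's conv weights by `γ / √(σ² + ε)` and shift the bias so that `conv → BN` equals the folded conv alone. The repo does this in C++ inside `CraftWeights`; a minimal pure-Python sketch of the math:

```python
# Illustrative BN-fold math (the repo's version is C++ in CraftWeights):
# w' = w · γ/√(σ²+ε),  b' = (b − μ) · γ/√(σ²+ε) + β, per output channel.
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """w: [out_ch][k] flattened conv weights, b: [out_ch] bias; BN params per channel."""
    w_f, b_f = [], []
    for oc in range(len(w)):
        s = gamma[oc] / math.sqrt(var[oc] + eps)
        w_f.append([x * s for x in w[oc]])
        b_f.append((b[oc] - mean[oc]) * s + beta[oc])
    return w_f, b_f

# Sanity check on one channel: conv-then-BN must equal the folded conv.
w, b = [[2.0, -1.0]], [0.5]
gamma, beta, mean, var = [1.5], [0.1], [0.2], [0.09]
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var, eps=0.0)
x = [3.0, 4.0]
conv = lambda W, B: sum(wi * xi for wi, xi in zip(W[0], x)) + B[0]
bn_out = (conv(w, b) - mean[0]) * gamma[0] / math.sqrt(var[0]) + beta[0]
assert abs(conv(w_f, b_f) - bn_out) < 1e-9
```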
The end-to-end correctness test (./build/test_build_craft) is documented
under Build above. This section covers the
two extra workflows: regenerating the references and bisecting a regression.
| Mode | Input | Committed dumps | Regen command |
|---|---|---|---|
| Synthetic | `np.linspace(-1, 1)` ramp at 64×64 | `craft_input.npy`, `craft_output_nhwc.npy` (~60 KB) | `dump_craft_reference.py` |
| Real image | `examples/english.png` via EasyOCR's `imgproc` (NCHW `[1, 3, 736, 1376]`) | `craft_real_english_input.npy`, `craft_real_english_output_nhwc.npy` (~14 MB) | `dump_craft_reference.py --image examples/english.png` |
Both pass at atol=1e-4; observed errors are at the FP32-noise floor
(synthetic max 4.7e-08, real-image max 5.36e-07). Synthetic regenerates in
<1 s; real-image runs the upstream PyTorch model once (~5 s on CPU) and
is rarely needed unless you change the dumper.
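In spirit, the comparison the C++ test performs is just a max-absolute-error check against the committed `.npy` dump. A hypothetical Python equivalent:

```python
# Hypothetical Python equivalent of the oracle check: load the committed
# reference, compare element-wise, pass when max |diff| is under atol=1e-4.
import numpy as np

def passes_oracle(expected: np.ndarray, actual: np.ndarray, atol: float = 1e-4):
    err = float(np.max(np.abs(expected - actual)))
    return err, err <= atol

# In the real test the reference comes from a committed dump, e.g.
#   expected = np.load("tests/reference/craft_output_nhwc.npy")
expected = np.linspace(-1.0, 1.0, 2048, dtype=np.float32)
actual = expected + np.float32(5e-7)   # FP32-noise-level deviation
err, ok = passes_oracle(expected, actual)
print(ok)  # → True: ~5e-7 is well under atol=1e-4
```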
```shell
# Regenerate the committed (minimal) references:
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py \
    --image examples/english.png

# Run synthetic test via ctest:
cmake --build build --target test
```

If `test_build_craft` reports a regression, regenerate every U-net stage
locally and re-run — the comparator reports max_abs per tap and
pinpoints where things drifted:
```shell
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py --per-layer
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py \
    --image examples/english.png --per-layer
./build/test_build_craft                  # 13 passed when the graph is healthy
./build/test_build_craft --image english
```

The per-layer dumps are sizable (~1.8 MB synthetic, ~360 MB real) and are git-ignored — they live only on the developer's machine until the next regeneration.
The committed examples/english.png is the canonical test, but the
workflow generalises:
```shell
# Drop a new image (PNG / JPG, RGB or grayscale)
cp /path/to/your/image.png examples/myimage.png

# Generate the pre-processed input + expected heatmap
../EasyOCR/.venv/bin/python tests/reference/dump_craft_reference.py \
    --image examples/myimage.png

# Run the test against it (note: --image takes the *stem*, not the path)
./build/test_build_craft --image myimage
```

`tests/reference/craft_real_<stem>_*.npy` are git-ignored by default
unless you choose to commit them. EasyOCR's own pre-processing
(imgproc.resize_aspect_ratio with mag_ratio=1.5, canvas_size=2560)
is applied — pass --mag-ratio / --canvas-size to the dumper to vary
those.
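The resize arithmetic behind those parameters can be sketched as follows. This is an assumption-labeled reading of EasyOCR's `imgproc.resize_aspect_ratio` (scale by `mag_ratio`, cap at `canvas_size`, pad each side up to a multiple of 32), not code from this repo:

```python
# Sketch of the resize arithmetic (as this author reads EasyOCR's
# imgproc.resize_aspect_ratio): scale the longer side by mag_ratio,
# cap at canvas_size, then pad height/width up to multiples of 32.
def resized_canvas(h, w, mag_ratio=1.5, canvas_size=2560):
    target = min(canvas_size, mag_ratio * max(h, w))
    ratio = target / max(h, w)
    th, tw = int(h * ratio), int(w * ratio)
    pad32 = lambda n: n if n % 32 == 0 else n + (32 - n % 32)
    return pad32(th), pad32(tw), ratio

# examples/english.png is 905×480; this reproduces the NCHW [1, 3, 736, 1376]
# input seen in the committed reference dumps.
print(resized_canvas(480, 905))  # → (736, 1376, 1.5)
```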
Phase 3 lifted @qvac/ocr-onnx's `resizeAspectRatio` +
`normalizeAndBuildCHW` C++ pre-processing into the runtime, with these
real-image references serving as the bit-exact ground truth.
```shell
../EasyOCR/.venv/bin/python -m gguf.scripts.gguf_dump models/english_g2.gguf | head -30
```

You should see metadata KVs including `general.architecture = crnn`,
`crnn.generation = 2`, `crnn.num_classes = 97`, and `crnn.vocab` containing
the 96-character set used by the gen-2 English CTC head.
The two QVAC OCR siblings differ only in the inference engine:
| | @qvac/ocr-onnx | EasyOcr-ggml (this) |
|---|---|---|
| Inference backend | ONNX Runtime | GGML |
| Weight format | `.onnx` | `.gguf` |
| Pre/post-processing | C++ + OpenCV | C++ + OpenCV (same code) |
| Quantization | per-EP (limited) | block-quantized (Q8_0, Q4_K, …) out of the box |
| Binary size | ONNX Runtime ~30 MB+ | libggml ~1–3 MB |
| Mobile / edge fit | good | smaller, faster cold start, no EP plumbing |
Measured on examples/english.png, CPU, warmup=1, runs=3 using test_ocr_pipeline.
| Variant | Detector GGUF | Recognizer GGUF | Total model size | OCR text parity vs F32 | Mean latency (ms) |
|---|---|---|---|---|---|
| F32 baseline | 80M | 15M | 95M | baseline | 23734.13 |
| Q8_0 | 80M | 7.8M | 87.8M | identical (12/12 lines) | 23930.10 |
| Q4_K | 80M | 15M | 95M | identical (12/12 lines) | 24074.04 |
Notes:
- Current CRAFT tensor layout does not hit block-quantization constraints in a way that reduces file size with this converter path, so detector size stays ~80M.
- `Q4_K` via the current `gguf` Python quantizer falls back to F32 for this model family (no size win).
- With the current bottleneck split (~80% detection side), quantizing only recognizer weights does not yet improve end-to-end latency.
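For intuition on where the Q8_0 size win comes from: GGML's Q8_0 (as this author understands the format) groups values into blocks of 32, storing one FP16 scale `d = max(|x|)/127` plus 32 int8 quants per block, i.e. roughly 8.5 bits per weight instead of 32. A pure-Python round-trip sketch:

```python
# Sketch of Q8_0-style block quantization (illustrative, not GGML's code):
# per block of 32 values, scale d = max(|x|)/127, quants q = round(x/d),
# dequant x̂ = q·d. Round-trip error per value is at most d/2.
def q8_0_roundtrip(values, block=32):
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0 if amax else 0.0
        out.extend((round(v / d) * d if d else 0.0) for v in chunk)
    return out

weights = [(-1) ** i * i / 31.0 for i in range(32)]   # amax = 1.0 → d = 1/127
deq = q8_0_roundtrip(weights)
worst = max(abs(a - b) for a, b in zip(weights, deq))
print(worst <= 0.5 / 127.0 + 1e-12)  # → True (error bounded by d/2)
```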
Apache-2.0 (matches upstream EasyOCR and @qvac/ocr-onnx).