Skip to content

I integrated DeepSeek-OCR-2 via GGUF + llama.cpp — 3 hours → 17 minutes #365

@Milor123

Description

@Milor123

I've been playing with pdf-craft and managed to swap the original OCR pipeline for DeepSeek-OCR-2 through llama.cpp GGUF. The result: a 190-page book went from ~3 hours down to ~17 minutes on my RTX 4070 (12 GB VRAM).

What I did:

  • Built llama.cpp with PR #20975 (mtmd: Add DeepSeekOCR 2 Support ggml-org/llama.cpp#20975) (adds deepseekocr2 projector support)
  • Downloaded the GGUF Q4_K_M model from HuggingFace
  • Created a server-based and a CLI-based GGUF page extractor that plugs into pdf-craft's page_extractor= hook
  • Built a simple CLI (epub.py) that auto-detects PDFs and picks the fastest backend

My setup:

  • GPU: RTX 4070 (12 GB VRAM)
  • CUDA 13.3
  • llama-server: ~4-5s/page (experimental)
  • llama-cli: ~5.3s/page (stable, tested with full books)

Full disclosure:

  • This is an experimental fork, not a maintained project. I made it for one purpose: PDF → EPUB with my GPU.
  • I have no idea if something else broke. The original pipeline still works if you don't pass page_extractor.
  • The whole thing was built with help from DeepSeek V4 Flash (AI) — I don't actually know C++ or Python internals, I just iterated with the model until it worked.
  • If something doesn't work, you'll need to figure it out yourself or ask another AI for help.

Known quirk: text detected as images

I noticed that sometimes the model confuses text blocks with images — paragraphs end up being treated as figures. I'm not sure if this is a code issue on my side or a model behavior, but it looks like it's a known limitation of DeepSeek-OCR-2's layout detector (it happens with tables, multi-column text, and decorative page elements). The same behavior appears regardless of backend, so it's likely the model, not the integration. If anyone has insights or knows how to improve this (different prompt? different model size? post-processing?), I'd love to hear.

Urls

Repo: https://github.com/Milor123/pdf-craft-experimental-v2-gguf
Release with pre-built Windows binaries: https://github.com/Milor123/pdf-craft-experimental-v2-gguf/releases/tag/v1.0.0-gguf
Hope some of this is useful for the official project. The GGUF approach might help people with lower VRAM who can't run the full Transformers pipeline.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions