I integrated DeepSeek-OCR-2 via GGUF + llama.cpp — 3 hours → 17 minutes

I've been playing with pdf-craft and managed to swap the original OCR pipeline for DeepSeek-OCR-2 running as a GGUF model through llama.cpp. The result: a 190-page book went from ~3 hours down to ~17 minutes on my RTX 4070 (12 GB VRAM).

## What I did:

- Built llama.cpp with PR #20975 (https://github.com/ggml-org/llama.cpp/pull/20975) (adds deepseekocr2 projector support)
- Downloaded the GGUF Q4_K_M model from HuggingFace
- Created two GGUF page extractors (one server-based, one CLI-based) that plug into pdf-craft's `page_extractor=` hook
- Built a simple CLI (epub.py) that auto-detects PDFs and picks the fastest backend
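For anyone curious what "picks the fastest backend" could look like, here is a minimal sketch, not the actual code in epub.py. It assumes llama-server is listening on its default address and exposes its `/health` endpoint, and that `llama-cli` is on the PATH. The function names `server_is_up` and `pick_backend`, and the backend labels, are hypothetical and only used for illustration.

```python
import shutil
import urllib.error
import urllib.request


def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a llama-server instance answers 200 on /health."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Not running, still loading the model (503), or a malformed URL.
        return False


def pick_backend(base_url="http://127.0.0.1:8080", cli_name="llama-cli",
                 probe=server_is_up, which=shutil.which) -> str:
    """Prefer the server (faster per page), then the CLI, then the original pipeline.

    The labels "server", "cli" and "original" are hypothetical; `probe` and
    `which` are injectable so the choice can be tested without a GPU.
    """
    if probe(base_url):
        return "server"
    if which(cli_name) is not None:
        return "cli"
    # Assumption: with no GGUF backend available, fall back to not passing
    # page_extractor at all, which keeps the original pipeline.
    return "original"


if __name__ == "__main__":
    print(pick_backend())
```

The ordering mirrors the timings I measured: the server is slightly faster per page, the CLI is the stable fallback.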

## My setup:
- GPU: RTX 4070 (12 GB VRAM)
- CUDA 13.3
- llama-server: ~4-5s/page (experimental)
- llama-cli: ~5.3s/page (stable, tested with full books)
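As a sanity check, the headline numbers are consistent with the per-page timing. A quick back-of-the-envelope calculation using only the figures from this post (190 pages, ~3 hours before, ~5.3 s/page with llama-cli); the variable names are just for illustration:

```python
PAGES = 190
ORIGINAL_TOTAL_S = 3 * 60 * 60  # ~3 hours with the original pipeline
CLI_S_PER_PAGE = 5.3            # measured llama-cli throughput

cli_total_s = PAGES * CLI_S_PER_PAGE
cli_total_min = cli_total_s / 60
original_s_per_page = ORIGINAL_TOTAL_S / PAGES
speedup = ORIGINAL_TOTAL_S / cli_total_s

print(f"llama-cli total:   {cli_total_min:.1f} min")          # 16.8 min
print(f"original per page: {original_s_per_page:.1f} s/page")  # 56.8 s/page
print(f"speedup:           {speedup:.1f}x")                    # 10.7x
```

So the original pipeline was averaging close to a minute per page, and the GGUF route is roughly 10x faster on the same book.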

## Full disclosure:
- This is an experimental fork, not a maintained project. I made it for one purpose: PDF → EPUB with my GPU.
- I have no idea if something else broke. The original pipeline still works if you don't pass page_extractor.
- The whole thing was built with help from DeepSeek V4 Flash (AI) — I don't actually know C++ or Python internals, I just iterated with the model until it worked.
- If something doesn't work, you'll need to figure it out yourself or ask another AI for help.

## Known quirk: text detected as images
Sometimes the model confuses text blocks with images, so paragraphs end up being treated as figures. I can't rule out a bug on my side, but it looks like a known limitation of DeepSeek-OCR-2's layout detector: it happens with tables, multi-column text, and decorative page elements, and it shows up regardless of backend, which points to the model rather than the integration. If anyone has insights or knows how to improve this (different prompt? different model size? post-processing?), I'd love to hear them.

## URLs

Repo: https://github.com/Milor123/pdf-craft-experimental-v2-gguf  
Release with pre-built Windows binaries: https://github.com/Milor123/pdf-craft-experimental-v2-gguf/releases/tag/v1.0.0-gguf

Hope some of this is useful for the official project. The GGUF approach might help people with lower VRAM who can't run the full Transformers pipeline.
