You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I've been playing with pdf-craft and managed to swap the original OCR pipeline for DeepSeek-OCR-2 through llama.cpp GGUF. The result: a 190-page book went from ~3 hours down to ~17 minutes on my RTX 4070 (12 GB VRAM).
Created a server-based and a CLI-based GGUF page extractor that plugs into pdf-craft's page_extractor= hook
Built a simple CLI (epub.py) that auto-detects PDFs and picks the fastest backend
My setup:
GPU: RTX 4070 (12 GB VRAM)
CUDA 13.3
llama-server: ~4-5s/page (experimental)
llama-cli: ~5.3s/page (stable, tested with full books)
Full disclosure:
This is an experimental fork, not a maintained project. I made it for one purpose: PDF → EPUB with my GPU.
I have no idea if something else broke. The original pipeline still works if you don't pass page_extractor.
The whole thing was built with help from DeepSeek V4 Flash (AI) — I don't actually know C++ or Python internals, I just iterated with the model until it worked.
If something doesn't work, you'll need to figure it out yourself or ask another AI for help.
Known quirk: text detected as images
I noticed that sometimes the model confuses text blocks with images — paragraphs end up being treated as figures. I'm not sure if this is a code issue on my side or a model behavior, but it looks like it's a known limitation of DeepSeek-OCR-2's layout detector (it happens with tables, multi-column text, and decorative page elements). The same behavior appears regardless of backend, so it's likely the model, not the integration. If anyone has insights or knows how to improve this (different prompt? different model size? post-processing?), I'd love to hear.
I've been playing with pdf-craft and managed to swap the original OCR pipeline for DeepSeek-OCR-2 through llama.cpp GGUF. The result: a 190-page book went from ~3 hours down to ~17 minutes on my RTX 4070 (12 GB VRAM).
What I did:
My setup:
Full disclosure:
Known quirk: text detected as images
I noticed that sometimes the model confuses text blocks with images — paragraphs end up being treated as figures. I'm not sure if this is a code issue on my side or a model behavior, but it looks like it's a known limitation of DeepSeek-OCR-2's layout detector (it happens with tables, multi-column text, and decorative page elements). The same behavior appears regardless of backend, so it's likely the model, not the integration. If anyone has insights or knows how to improve this (different prompt? different model size? post-processing?), I'd love to hear.
Urls
Repo: https://github.com/Milor123/pdf-craft-experimental-v2-gguf
Release with pre-built Windows binaries: https://github.com/Milor123/pdf-craft-experimental-v2-gguf/releases/tag/v1.0.0-gguf
Hope some of this is useful for the official project. The GGUF approach might help people with lower VRAM who can't run the full Transformers pipeline.