Uses Mistral OCR to convert a scanned PDF to Markdown and then to an EPUB. The main motivation is to be able to read scanned PDFs on my e-reader. There is also an optional automatic translation feature, but its accuracy may vary.
- Install `pandoc` from https://github.com/jgm/pandoc/releases/
  - check installation with `pandoc --version`
- Make a Python 3.14 virtual env and install requirements:

      python -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt

- Set the `MISTRAL_API_KEY` environment variable to your Mistral API key.
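
The two prerequisites above (pandoc on the `PATH` and the `MISTRAL_API_KEY` variable) can be checked before running a conversion. The following is a minimal sketch and not part of this repository; the function name `check_prerequisites` and its parameters are hypothetical.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of problems; an empty list means the setup looks ready.

    `env` and `which` are injectable so the check can be tested without
    touching the real environment (hypothetical helper, not from the repo).
    """
    env = os.environ if env is None else env
    problems = []
    # shutil.which returns None when the executable is not on PATH
    if which("pandoc") is None:
        problems.append("pandoc not found on PATH")
    # An unset or empty key is treated as missing
    if not env.get("MISTRAL_API_KEY"):
        problems.append("MISTRAL_API_KEY is not set")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the current shell environment; printing the returned list gives a quick report of anything still missing.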
Converting a PDF to EPUB:

    python main.py path/to/input.pdf path/to/output.epub

Optional: translate to a target language (e.g., French):

    python main.py path/to/input.pdf path/to/output.epub --translate_to french

The language parameter is flexible: `fr`, `fr-FR`, `french`, etc. should all work.
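
For readers who want to script around the tool, the command-line shape shown above (two positional paths plus an optional `--translate_to`) could be parsed roughly as follows. This is only an illustrative sketch: the argument names `input_pdf` and `output_epub` are assumptions, and the actual `main.py` may be written differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage shown above (hypothetical names)."""
    parser = argparse.ArgumentParser(description="Convert a scanned PDF to EPUB")
    parser.add_argument("input_pdf", help="path to the scanned PDF")
    parser.add_argument("output_epub", help="path of the EPUB to write")
    # The language is kept as free text, since values such as
    # "fr", "fr-FR" and "french" are all meant to be accepted.
    parser.add_argument("--translate_to", default=None,
                        help="optional target language, e.g. french")
    return parser


args = build_parser().parse_args(["in.pdf", "out.epub", "--translate_to", "french"])
```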
Costs depend on the document. In my experiments it is about 20 cents per 100 pages for OCR, and about 3-5 cents per 100 pages for translation.
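
As a rough back-of-the-envelope aid, the per-100-page figures above can be turned into an estimate for a given page count. This is a sketch based only on the approximate numbers quoted here; actual pricing depends on the document and on Mistral's current rates, and the function name is hypothetical.

```python
def estimate_cost_cents(pages, translate=False):
    """Return a (low, high) cost estimate in cents.

    Uses the approximate figures from this README: about 20 cents per
    100 pages for OCR and about 3-5 cents per 100 pages for translation.
    """
    ocr = 20 * pages / 100
    translation_low = 3 * pages / 100 if translate else 0
    translation_high = 5 * pages / 100 if translate else 0
    return ocr + translation_low, ocr + translation_high


# A 350-page book with translation: OCR is 70 cents, translation 10.5 to 17.5 cents
low, high = estimate_cost_cents(350, translate=True)
```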
- Mistral OCR to get markdown text from each page, including images.
- Mistral OCR Annotation on the first 10 pages of the document to extract book metadata, following pandoc's EPUB metadata format.
- Optional: use mistral-small-latest to translate the markdown in chunks (concurrently for speed).
- `pandoc` to convert markdown to epub
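
The translation and conversion steps above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `translate_chunk` is a hypothetical callable standing in for the call to `mistral-small-latest`, the chunk size and worker count are arbitrary, and passing metadata through pandoc's `--metadata-file` option is one possible choice rather than a confirmed detail of this tool.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(markdown, max_chars=4000):
    """Group paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars becomes its own chunk.
    """
    chunks, current = [], ""
    for paragraph in markdown.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_markdown(markdown, translate_chunk, max_chars=4000, workers=4):
    """Translate chunks concurrently and reassemble them in order.

    `translate_chunk` is a hypothetical function taking and returning a string.
    """
    chunks = split_into_chunks(markdown, max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the chunks
        # come back in document order even if they finish out of order.
        translated = list(pool.map(translate_chunk, chunks))
    return "\n\n".join(translated)


def pandoc_command(markdown_path, epub_path, metadata_path=None):
    """Build the pandoc argument list for a markdown-to-EPUB conversion."""
    cmd = ["pandoc", markdown_path, "-o", epub_path]
    if metadata_path:
        # Assumption: metadata is supplied as a separate YAML file
        cmd.append(f"--metadata-file={metadata_path}")
    return cmd
```

The command list returned by `pandoc_command` can be passed to `subprocess.run(cmd, check=True)`. Threads are used here because the translation calls are network-bound; splitting on paragraph boundaries keeps markdown blocks intact within each chunk.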