Scanned PDF to EPUB with Mistral OCR

Uses Mistral OCR to convert a scanned PDF to Markdown and then to EPUB. The main motivation is to be able to read scanned PDFs on my e-reader. There is also an optional automatic translation feature, but its accuracy may vary.

Install

Install pandoc from https://github.com/jgm/pandoc/releases/
- Check the installation with pandoc --version

Create a Python 3.14 virtual environment and install the requirements:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set the MISTRAL_API_KEY environment variable to your Mistral API key.
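
Before running a conversion, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of this repository: the function name missing_prerequisites and its parameters are hypothetical, and it only checks that pandoc is on the PATH and that MISTRAL_API_KEY is non-empty, not that the key is valid.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready to run.

    `env` and `which` are injectable so the check can be exercised without
    touching the real environment (hypothetical helper, not from main.py).
    """
    env = os.environ if env is None else env
    problems = []
    # pandoc must be discoverable on PATH for the markdown -> EPUB step.
    if which("pandoc") is None:
        problems.append("pandoc not found on PATH")
    # The Mistral client needs an API key; an empty string counts as missing.
    if not env.get("MISTRAL_API_KEY"):
        problems.append("MISTRAL_API_KEY is not set")
    return problems
```

Calling missing_prerequisites() with no arguments inspects the current environment; printing each returned string gives a quick checklist of what still needs to be set up.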

Usage

Convert a PDF to EPUB:

python main.py path/to/input.pdf path/to/output.epub

Optional: translate to target language (e.g., French):

python main.py path/to/input.pdf path/to/output.epub --translate_to french

The language parameter is flexible: fr, fr-FR, french, and similar forms should all work.

API Costs

Costs depend on the document. In my experiments, OCR costs about 20 cents per 100 pages, and translation about 3-5 cents per 100 pages.
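
As a rough worked example of the figures above, the sketch below turns a page count into a (low, high) estimate in cents. The function name estimate_cost_cents is hypothetical, and the rates are the approximate experimental numbers quoted in this section, not official pricing.

```python
def estimate_cost_cents(pages, translate=False):
    """Return a rough (low, high) cost estimate in cents for a document.

    Rates are the approximate figures from this README:
    about 20 cents per 100 pages for OCR, and about 3-5 cents
    per 100 pages for translation.
    """
    ocr = 20 * pages / 100
    low, high = ocr, ocr
    if translate:
        low += 3 * pages / 100
        high += 5 * pages / 100
    return low, high
```

For a 350-page book with translation, this gives roughly 80 to 88 cents; actual costs depend on the document.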

Approach

1. Use Mistral OCR to extract markdown text from each page, including images.
2. Use Mistral OCR Annotation on the first 10 pages of the document to extract book metadata, following pandoc's EPUB metadata format.
3. Optional: use mistral-small-latest to translate the markdown in chunks (concurrently, for speed).
4. Use pandoc to convert the markdown to EPUB.
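
The chunked translation and the final pandoc step can be sketched as follows. This is an illustration under stated assumptions, not the code in main.py or translate.py: the function names, the paragraph-based chunking, the max_chars and max_workers defaults, and the use of a separate metadata file are all hypothetical. The translate_chunk argument stands in for whatever function calls mistral-small-latest on one chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(markdown, max_chars=4000):
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence (assumed chunking strategy, for illustration only).
    """
    chunks, current = [], ""
    for paragraph in markdown.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_concurrently(chunks, translate_chunk, max_workers=8):
    """Translate chunks in parallel threads; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(translate_chunk, chunks))


def pandoc_command(markdown_path, epub_path, metadata_path=None):
    """Build the pandoc argument list for the markdown -> EPUB conversion."""
    command = ["pandoc", markdown_path, "-o", epub_path]
    if metadata_path is not None:
        # --metadata-file reads document metadata from a YAML or JSON file.
        command.append(f"--metadata-file={metadata_path}")
    return command
```

Threads suit this step because each translation call spends most of its time waiting on the network. The translated chunks can be rejoined with "\n\n".join(...) and the resulting command list passed to subprocess.run(command, check=True).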
