This guide walks you through setting up Versatile-OCR-Program from a fresh clone to a working OCR pipeline that processes PDFs through Stage 1 (layout detection + multi-API OCR) and Stage 2 (LLM-based correction).
Estimated time: 60–90 minutes for a first install (Docker image build dominates).
Versatile-OCR-Program is a two-stage OCR pipeline optimized for educational materials (figures, formulas, tables, multilingual text):
- Stage 1: Runs inside a Docker container. Detects regions with DocLayout-YOLO, then dispatches each region to a specialized API:
  - Text/title/list → Google Vision OCR
  - Formulas → MathPix
  - Figures & tables → Gemini (multimodal)
  - Results are uploaded directly to Google Cloud Storage (GCS-only architecture; no local output directory).
- Stage 2: Runs on the host (no Docker). Reads Stage 1 results from GCS, sends each page's regions to ChatGPT for OCR-error correction while strictly preserving the region structure, then writes the corrected results back to GCS.
Both stages have lightweight wrappers (auto_run_stage1.py, auto_run_stage2.py) for batch and parallel processing.
| Component | Minimum | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+, Arch, or similar) | Tested on Vertex AI / GCP Notebook. WSL2 also works with caveats. |
| Disk | ~13 GB free | CUDA base image (~5GB) + Python deps + model weights cache |
| RAM | 8 GB+ | 16 GB recommended for PDFs >100 pages |
| GPU | Optional (NVIDIA, CUDA 11.8 compatible) | Setting docker.gpu_enabled: true in config.yaml requires NVIDIA Container Toolkit |
| Docker | Required | Stage 1 runs inside a container |
| Python | 3.9+ on host | For running auto_run_*.py wrappers and Stage 2 |
| Network | Required | Pulls CUDA base image, HuggingFace models, calls Vision / Gemini / MathPix / OpenAI APIs |
Dependencies fall into three categories. Most are installed automatically; only Category 1 requires manual installation.
1. Host (you install manually — see §3 for Docker, §2 for the rest)
- Docker — runs the Stage 1 container.
- Python 3.9+: already present on most Linux systems. Used to run the wrapper scripts (auto_run_*.py, docker_build.py) and Stage 2.
- Host-side Python packages (one-time install):
  pip install pyyaml python-dotenv openai google-cloud-storage
  These are used by auto_run_*.py, docker_build.py, and ocr_stage_2.py (which runs on the host, not in Docker).
- (Optional) gsutil CLI: for GCS verification in §8. Install via the Google Cloud SDK, or skip and use the web console instead.
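To confirm the host-side packages are importable before continuing, a small check like the one below can help. It is a sketch and not part of the repository: missing_modules is a hypothetical helper, and the import names listed (yaml, dotenv, openai, google.cloud.storage) are the modules that the four pip packages above provide.

```python
import importlib.util

# Import names provided by: pyyaml, python-dotenv, openai, google-cloud-storage
HOST_MODULES = ["yaml", "dotenv", "openai", "google.cloud.storage"]


def missing_modules(names):
    """Return the module names from `names` that this interpreter cannot find."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing host modules:", missing_modules(HOST_MODULES) or "none")
```

If the list is non-empty, re-run the pip install command with the same interpreter you will use for the wrapper scripts.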
2. Inside the Docker container (automatic — docker_build.py handles this)
You don't install these yourself; they're declared in src/ocr/Dockerfile. Listed for reference:
- CUDA 11.8 base image + cuDNN 8 (Ubuntu 20.04)
- Python 3.9, PyTorch 2.0.1 + torchvision 0.15.2
- NumPy 1.26.4, Pillow 9.4.0, OpenCV 4.7.0.72, pdf2image 1.16.3
- google-cloud-storage 2.9.0, google-cloud-vision 3.4.0, google-genai
- huggingface_hub 0.19.4, ultralytics 8.0.196, doclayout-yolo (pinned commit 7c4be36)
- python-dotenv 1.2.2, PyYAML 6.0.1, protobuf 3.20.3
3. Auto-downloaded on first run
- CUDA base image — pulled from Docker Hub during build (~5 GB; cached after first build)
- DocLayout-YOLO model weights — pulled from HuggingFace Hub on the first Stage 1 run (~40 MB; cached locally inside the container)
First-time total: ~30–60 min for Docker build + a few minutes on first Stage 1 run for model download. After that, builds reuse cache and models are local.
git clone https://github.com/raphael-seo/Versatile-OCR-Program.git
cd Versatile-OCR-Program

The repository ships with two legacy folders (v1.0_initial/, v2.0_initial/) preserved for reference. All new work should use the root-level v3.0_initial layout (src/ocr/, src/stages/, auto_run_*.py, config.yaml, prompts/).
The pipeline calls 4 external APIs. Below is what each is for and the short path to get a key. Note: all 4 are paid services — see the cost note in §6 before processing large documents.
- Go to https://platform.openai.com → log in / sign up.
- Top-right user menu → View API keys (or "Dashboard → API keys").
- Click Create new secret key, name it, and copy the value (starts with sk-...).
- Add billing info if you haven't; keys without credit return 429/insufficient_quota.
- Go to https://aistudio.google.com → sign in with a Google account.
- Top-left Get API key → Create API key.
- Copy the value (starts with AIza...).
- Go to https://mathpix.com → sign up.
- Dashboard → API tokens (or "Account → API Keys").
- Create a new app — note both App ID and App Key (you need both).
The Google Cloud service account has the most steps; do it last.
- Go to https://console.cloud.google.com → create a new project (or pick an existing one).
- APIs & Services → Library → enable both Cloud Vision API and Cloud Storage API.
- IAM & Admin → Service Accounts → Create Service Account, name it (e.g. ocr-runner).
- Grant roles: Cloud Vision API User + Storage Object Admin.
- Click the new account → Keys tab → Add key → Create new key → JSON → download.
- Keep the downloaded JSON file — you'll move it in §2.2.
mkdir -p /home/jupyter/credentials
mv ~/Downloads/<service-account>.json /home/jupyter/credentials/Google_Vision_S.Account.json
⚠️ The filename must match what config.yaml expects (default: Google_Vision_S.Account.json). If you use a different filename, update credentials.google_vision_account in config.yaml.
The location of the .env file is controlled by config.yaml's env_file_path field (default: /home/jupyter/Program/.env).
cat > /home/jupyter/Program/.env <<'EOF'
OPENAI_API_KEY=sk-...your_key...
GEMINI_API_KEY=AIza...your_key...
MATHPIX_APP_ID=your_mathpix_app_id
MATHPIX_APP_KEY=your_mathpix_app_key
GOOGLE_APPLICATION_CREDENTIALS=/home/jupyter/credentials/Google_Vision_S.Account.json
# (Optional) Override GCS bucket without editing config.yaml:
# GCS_BUCKET_NAME=your-bucket-name
EOF
chmod 600 /home/jupyter/Program/.env # protect secrets
⚠️ Never commit .env to git. The shipped .gitignore excludes it.
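A quick way to catch a missing or empty key before the first run is to scan the .env file yourself. The sketch below is illustrative only: parse_env and missing_keys are hypothetical helpers, and the parser handles plain KEY=VALUE lines, not the full syntax python-dotenv accepts.

```python
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "GEMINI_API_KEY",
    "MATHPIX_APP_ID",
    "MATHPIX_APP_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
]


def parse_env(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the .env text."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

For example, missing_keys(open("/home/jupyter/Program/.env").read()) should return an empty list once every key above is filled in.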
# Using gcloud CLI (recommended)
gcloud storage buckets create gs://your-bucket-name --location=asia-northeast1

Then either edit config.yaml's gcs.bucket_name, or set GCS_BUCKET_NAME in .env to override.
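The override order described here (environment variable first, config.yaml second) can be pictured with a few lines of Python. This is a sketch of the precedence rule, not the repository's code; effective_bucket is a hypothetical name, and treating a blank GCS_BUCKET_NAME as unset is an assumption.

```python
import os


def effective_bucket(config, env=None):
    """Pick the bucket name: a non-empty GCS_BUCKET_NAME wins over config.yaml."""
    env = os.environ if env is None else env
    override = env.get("GCS_BUCKET_NAME", "").strip()
    if override:
        return override
    # Fall back to the value under gcs.bucket_name in the parsed config.yaml.
    return config["gcs"]["bucket_name"]
```

Here config stands for the dictionary obtained by parsing config.yaml (for example with yaml.safe_load).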
Stage 1 runs inside a Docker container. Stage 2 runs on the host (no container).
Ubuntu / Debian:
sudo apt-get update
sudo apt-get install -y docker.io

Arch / Manjaro:
sudo pacman -Syu --noconfirm docker runc containerd

Other distros: see https://docs.docker.com/engine/install/
# systemd-based systems:
sudo systemctl enable --now docker
# WSL2 / systemd-less environments:
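# (run the redirect inside the root shell; otherwise writing to /var/log fails for non-root users)
sudo sh -c 'nohup dockerd > /var/log/dockerd.log 2>&1 &'

# Add your user to the docker group so docker works without sudo:
sudo usermod -aG docker $USER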
# Log out and log back in, OR:
newgrp docker
⚠️ On Jupyter / Vertex AI environments, the docker group membership often resets per session. If docker info keeps failing with permission errors, run sudo usermod -aG docker jupyter and sudo reboot once; this is the persistent fix from patch_notes/v2.0_initial_patchnotes.md.
Verify:
docker run --rm hello-world

By default, config.yaml has docker.gpu_enabled: false, so the pipeline runs on CPU. Skip this section if you do not need GPU acceleration.
If you do enable GPU, install the NVIDIA Container Toolkit:
# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Add NVIDIA runtime to Docker daemon config:
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker

Verify:
docker run --gpus all --rm nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 nvidia-smi

The Dockerfile is pinned to CUDA 11.8. If your driver is older, either update the driver or stay on CPU.
Then set docker.gpu_enabled: true in config.yaml.
Open config.yaml and review/adjust. The values you most likely want to change are highlighted below.
# config.yaml
env_file_path: "/home/jupyter/Program/.env" # ← Adjust to your .env location
directories:
  credentials: "/home/jupyter/Program/credentials" # ← Where your service-account JSON lives
  docker_build: "/home/jupyter/Program/Versatile-OCR-Program/src/ocr" # ← Where Dockerfile is
gcs:
  bucket_name: "eju-ocr-results" # ← Your bucket
  stage1_prefix: "stage_1" # GCS path prefix for Stage 1 output
  stage2_prefix: "stage_2" # GCS path prefix for Stage 2 output
docker:
  image_name: "cantaloupe" # Docker image name (rename if you build under a different name)
  gpu_enabled: false # ← Set true only if you completed §3.4
ocr:
  language_hints: ["ja", "en", "ko"] # ← Vision OCR language priority (reorder for your data)
  pdf_dpi: 200 # PDF page rendering DPI (higher = slower, better accuracy)
image_processing:
  vision_max_dim: 1600 # Resize cap before Vision API call
  gemini_max_dim: 1024 # Resize cap before Gemini API call
  jpeg_quality: 85
gemini:
  model: "gemini-2.0-flash" # ← Gemini model for figure/table
  figure_prompt_path: "prompts/gemini_figure.txt"
  table_prompt_path: "prompts/gemini_table.txt"
stage2:
  model: "gpt-5-nano" # ← ChatGPT model for Stage 2 correction
  max_tokens: 100000
  system_prompt_path: "prompts/chatgpt_stage2.md"

# auto_run.yaml
stage1:
  input_directory: "src/input" # ← Directory tree to scan for PDFs
  mode: "recursive" # "recursive" (walk subdirs) or "direct" (single dir)
stage2:
  gcs_input_paths: # ← GCS paths to read Stage 1 results from
    - "gs://your-bucket/stage_1/subject_A"
    - "gs://your-bucket/stage_1/subject_B"
  parallel_workers: 70 # ThreadPoolExecutor size for parallel processing

Three prompt files are loaded at runtime (paths in config.yaml):
| File | Used by | Sent to |
|---|---|---|
| prompts/gemini_figure.txt | Stage 1: _process_figure_region | Gemini |
| prompts/gemini_table.txt | Stage 1: _process_table_region | Gemini |
| prompts/chatgpt_stage2.md | Stage 2: chatgpt_correct_text | ChatGPT |
Edit these files freely. The Stage 2 system prompt enforces strict region-structure preservation — read it before modifying.
Relative paths are resolved against the directory containing config.yaml (so prompts load correctly regardless of where the script is invoked from or inside Docker).
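The resolution rule above can be sketched in a few lines. This is an illustration of the idea rather than the repository's implementation; resolve_against_config is a hypothetical name, and leaving absolute paths untouched is an assumption.

```python
from pathlib import Path


def resolve_against_config(config_path, relative_path):
    """Resolve a path from config.yaml relative to the directory holding config.yaml."""
    candidate = Path(relative_path)
    if candidate.is_absolute():
        # Assumption: absolute paths are used as written.
        return candidate
    return Path(config_path).resolve().parent / candidate
```

With config.yaml at the repository root, resolve_against_config("config.yaml", "prompts/gemini_figure.txt") points at the prompts/ folder next to it, whatever the current working directory of the caller's later file operations.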
Use the safe build script that automatically gathers all required files into a temporary build context:
python src/ocr/docker_build.py

This will:
- Locate config.yaml (walks up the directory tree)
- Verify all docker.source_files exist (Dockerfile, advanced_ocr.py, custom_doclayout_yolo.py, config.yaml, prompts/)
- Copy them to a temporary build context
- Run docker build -t cantaloupe .
- Clean up the temporary context
Build time: 30–60 minutes on first run (downloads the CUDA 11.8 base image, ~5 GB; installs PyTorch; clones and installs DocLayout-YOLO).
Useful flags:
python src/ocr/docker_build.py --check-files # Verify source files exist, don't build
python src/ocr/docker_build.py --no-cache # Force rebuild from scratch
python src/ocr/docker_build.py --keep-context # Keep the temp build context for debugging

Verify:
docker images cantaloupe

The build script prints all docker build output to stdout. Save it to a file if you want to scroll through it:
python src/ocr/docker_build.py 2>&1 | tee /tmp/docker_build.log

Common failures:
| Error | Cause | Fix |
|---|---|---|
| This script does not work on Python 3.9 (in get-pip.py step) | The legacy https://bootstrap.pypa.io/get-pip.py URL dropped Python 3.9 support | The shipped Dockerfile pins the 3.9-compatible URL https://bootstrap.pypa.io/pip/3.9/get-pip.py. If you forked and edited the Dockerfile, restore this URL. |
| Killed mid-build, or WSL2 freezes | Out of memory during pip install (PyTorch resolve is heavy) | Increase RAM. On WSL2, edit %USERPROFILE%\.wslconfig ([wsl2] → memory=24GB, swap=16GB), then wsl --shutdown and reopen. |
| failed to retrieve ... from mirror.pkgbuild.com | Mirror serving a stale package list (Arch only) | sudo pacman -Syyu --noconfirm to refresh, then retry build. |
| runc: undefined symbol: seccomp_transaction_reject | Outdated libseccomp on host | Upgrade libseccomp (sudo pacman -S libseccomp on Arch; sudo apt-get install libseccomp2 on Ubuntu); restart dockerd. |
| Build slow / network timeouts on first run | Pulling 5 GB CUDA base image | Be patient (first build only). Subsequent builds use cache. |
The build is incremental — fixing a step doesn't rebuild from scratch. Docker reuses earlier-step layers automatically.
src/input/
├── subject_A/
│ ├── exam_2023.pdf
│ └── exam_2024.pdf
└── subject_B/
└── workbook.pdf
auto_run_stage1.py recursively walks src/input/ (or whatever directory you set in auto_run.yaml) and processes every directory containing PDFs. Hidden directories (.ipynb_checkpoints, etc.) are skipped automatically.
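The directory scan described above can be approximated as follows. This is a minimal sketch, not the code in auto_run_stage1.py: find_pdf_dirs is a hypothetical name, and "hidden" is assumed to mean any directory whose name starts with a dot.

```python
import os


def find_pdf_dirs(root):
    """Return sorted directories under `root` that directly contain at least one PDF."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories (e.g. .ipynb_checkpoints) so os.walk never enters them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        if any(name.lower().endswith(".pdf") for name in filenames):
            found.append(dirpath)
    return sorted(found)
```

Running find_pdf_dirs("src/input") on the tree shown above would list the subject_A and subject_B directories, which is a convenient way to preview what a run will pick up.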
⚠️ Korean / non-ASCII filenames on certain hosts: Some Linux environments (notably WSL2 / Arch with certain locales) may fail to open files whose names contain Korean / Japanese characters via poppler. If you hit I/O Error: Couldn't open file '...', rename the file to ASCII or use a copy with an ASCII name. The Docker container (Ubuntu 20.04 base) generally handles UTF-8 filenames correctly.
All 4 APIs are paid:
- OpenAI (Stage 2): billed per token, with one request per page (each page sends a JSON region array and gets a corrected one back). With gpt-5-nano (default), expect ~$0.0X per page; large pages cost more.
- Google Vision (Stage 1): billed per OCR call. Each text/title/list region = 1 call.
- Google Gemini (Stage 1): billed per image + token. Each figure/table region = 1 call with the cropped image.
- MathPix (Stage 1): billed per formula image. Free tier covers ~1000 calls/month; beyond that you pay.
A 500-page math textbook ≈ thousands of regions across the 4 APIs. Do not run a large PDF as your first test. Costs can easily reach tens of dollars in a single run.
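To get a feel for how many billable calls a document generates, you can count regions per API from Stage 1 output. The sketch below assumes one call per region, as the list above states, and uses the type-to-API mapping from the Stage 1 overview; stage1_calls is a hypothetical helper and deliberately contains no prices.

```python
from collections import Counter

# Region type → API, following the Stage 1 dispatch rules.
API_FOR_TYPE = {
    "text": "Google Vision",
    "title": "Google Vision",
    "list": "Google Vision",
    "formula": "MathPix",
    "figure": "Gemini",
    "table": "Gemini",
}


def stage1_calls(regions):
    """Count Stage 1 API calls implied by a list of regions (one call per region)."""
    return Counter(API_FOR_TYPE.get(region["type"], "unknown") for region in regions)
```

Summing these counters over the pages of a small sample and scaling by page count gives a rough call volume to compare against each provider's current pricing page.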
Before processing your real corpus:
1. Create a 1–2 page sample PDF (split from your real PDF, or use any short document).
2. Put it in src/input/_smoke_test/sample.pdf.
3. Set auto_run.yaml:
   stage1:
     input_directory: "src/input/_smoke_test"
     mode: "recursive"
4. Run Stage 1 (§7.1); it should finish in 1–3 minutes per page.
5. Verify the result in GCS (§8).
6. Set auto_run.yaml's stage2.gcs_input_paths to the GCS prefix from step 5.
7. Run Stage 2 (§7.2); it should finish in under a minute.
8. Verify the Stage 2 result (§8).
If both stages produce sensible output, switch input_directory back to your real input and re-run.
python auto_run_stage1.py --config auto_run.yaml

What happens:
- Reads auto_run.yaml → stage1.input_directory, stage1.mode.
- Recursively finds every directory containing PDFs.
- For each directory, invokes src/stages/ocr_stage_1.py, which spawns a Docker container (cantaloupe) that runs advanced_ocr.py inside.
- advanced_ocr.py processes each PDF page-by-page and uploads regions to GCS as stage_1/{relative_path}/{pdf_name}/page_{NNN}.json.
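The object naming pattern from the last step can be made concrete with a short sketch. It is illustrative only: stage1_object_name is a hypothetical helper, and the three-digit zero padding and the use of the PDF filename without its extension are inferred from the example paths in this guide (exam_2023.pdf → exam_2023/page_001.json).

```python
from pathlib import PurePosixPath


def stage1_object_name(prefix, relative_path, pdf_filename, page_number):
    """Build '{prefix}/{relative_path}/{pdf_name}/page_{NNN}.json' for one page."""
    pdf_name = PurePosixPath(pdf_filename).stem  # drop the .pdf extension
    page_file = f"page_{page_number:03d}.json"   # zero-padded to three digits
    return str(PurePosixPath(prefix) / relative_path / pdf_name / page_file)
```

Knowing the pattern makes it easy to predict where a given page will land in the bucket before you go looking for it.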
You can override config values at the command line — see python auto_run_stage1.py --help.
First run — what you should see:
INFO - === Auto-Run Stage 1: Recursive OCR Processing Starting ===
INFO - Input directory: /home/jupyter/.../src/input/_smoke_test
INFO - Mode: recursive
INFO - Scanning root directory: ...
INFO - Found 1 directories containing PDF files
INFO - Processing directory 1/1: ...
INFO - === OCR System (Docker) Starting ===
INFO - Number of PDF files found in input directory: 1
INFO - Running Docker container (command hidden for security)
... (Docker container output — DocLayout-YOLO model download on first run, page-by-page processing)
Processing page 1/N
Page 1: Detected K regions
Page 1: Processed K regions
GCS upload complete: gs://your-bucket/stage_1/_smoke_test/sample/page_001.json
...
INFO - Stage 1 OCR completed successfully for: ...
INFO - === Auto-Run Stage 1 Complete: 1/1 directories processed successfully ===
If the run fails, match the message you see:
- Found 0 directories → your PDF is in the wrong place.
- Docker container execution failed → check docker info (is the daemon running? is your user in the docker group?).
- OPENAI_API_KEY is not set → .env path mismatch. Stage 1 doesn't need OpenAI, but ChatGPT init runs anyway; set the key to silence the warning.
python auto_run_stage2.py --config auto_run.yaml

What happens:
- Reads auto_run.yaml → stage2.gcs_input_paths, stage2.parallel_workers.
- Spawns a ThreadPoolExecutor (default 70 workers) and processes each GCS path in parallel.
- For each path, src/stages/ocr_stage_2.py:
  - Discovers all directories containing JSON files under the prefix.
  - Skips directories where Stage 2 is already complete (file-count comparison, so re-runs are idempotent).
  - Loads each page's regions, sends them as a JSON array to ChatGPT with the strict structure-preserving system prompt.
  - Validates the response (region count + type/coords/id preservation). On validation failure, falls back to the original regions.
  - Re-assigns region IDs (page_{N}_region_{i}) and uploads to GCS as stage_2/{relative_path}/{pdf_name}/page_{NNN}.json.
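The validate-then-fall-back behavior described above can be sketched as follows. This is not the code in ocr_stage_2.py: the function names are hypothetical, and the sketch assumes the model's reply is a JSON array of regions in the original order.

```python
import json

# Fields that must come back unchanged for every region.
PRESERVED_FIELDS = ("type", "coords", "id")


def structure_preserved(original, corrected):
    """True if `corrected` has the same regions, in order, with type/coords/id unchanged."""
    if not isinstance(corrected, list) or len(corrected) != len(original):
        return False
    return all(
        isinstance(new, dict)
        and all(new.get(field) == old.get(field) for field in PRESERVED_FIELDS)
        for old, new in zip(original, corrected)
    )


def apply_correction(original, response_text):
    """Return the corrected regions, or the originals if parsing or validation fails."""
    try:
        corrected = json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        return original  # reply is not valid JSON
    if not structure_preserved(original, corrected):
        return original  # region count or a preserved field changed
    return corrected
```

The design choice to return the originals on any failure means a bad model reply can cost you a correction but never a region.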
Useful flags:
python auto_run_stage2.py --gcs-path gs://your-bucket/stage_1/subject_A # Process only one path
python auto_run_stage2.py --workers 10 # Limit parallelism
python auto_run_stage2.py --dry-run # List paths, don't process

gs://your-bucket/
├── stage_1/
│ └── subject_A/exam_2023/page_001.json ← Stage 1 raw OCR
└── stage_2/
└── subject_A/exam_2023/page_001.json ← Stage 2 corrected
Each page JSON has the shape:
{
"page": 1,
"regions": [
{ "type": "text", "coords": {...}, "text": "...", "id": "page_1_region_0" },
{ "type": "title", "coords": {...}, "text": "...", "id": "page_1_region_1" },
{ "type": "formula", "coords": {...}, "text": "LaTeX: ...\nText: ...", "id": "page_1_region_2" },
{ "type": "figure", "coords": {...}, "text": "## Image Description: ...", "id": "page_1_region_3" },
{ "type": "table", "coords": {...}, "text": "## Table Content: ...", "id": "page_1_region_4" }
]
}

gsutil ls gs://your-bucket/stage_1/
gsutil ls gs://your-bucket/stage_1/subject_A/exam_2023/ | wc -l # page count

gsutil cat gs://your-bucket/stage_1/subject_A/exam_2023/page_001.json | jq '.regions[0:3]'

Look for:
- regions array is non-empty
- Each region has type, coords, text, id
- Text content matches the source PDF (sanity check a few pages)
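The first two checks are mechanical and can be scripted once a page JSON has been downloaded and parsed. The sketch below is a hypothetical helper (page_problems), assuming the page shape shown earlier in this section; the third check, comparing text against the source PDF, still needs a human eye.

```python
REQUIRED_FIELDS = ("type", "coords", "text", "id")


def page_problems(page):
    """Return a list of human-readable problems found in one parsed page JSON."""
    regions = page.get("regions")
    if not isinstance(regions, list) or not regions:
        return ["regions array is missing or empty"]
    problems = []
    for index, region in enumerate(regions):
        absent = [field for field in REQUIRED_FIELDS if field not in region]
        if absent:
            problems.append(f"region {index} is missing: {', '.join(absent)}")
    return problems
```

An empty list means the page passed both structural checks.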
gsutil cat gs://your-bucket/stage_1/.../page_001.json | jq '.regions | length'
gsutil cat gs://your-bucket/stage_2/.../page_001.json | jq '.regions | length'

The counts must match. Stage 2 validates the region count and falls back to the original regions when validation fails, so a mismatch means something went wrong outside that safety net; check the Stage 2 log.
| Symptom | Likely cause | Fix |
|---|---|---|
| docker: command not found | Docker not installed | §3.1 |
| permission denied while trying to connect to Docker daemon | User not in docker group | §3.3 (sudo usermod -aG docker $USER + new login or newgrp docker) |
| Container fails with runc did not terminate successfully and a libseccomp error | Outdated libseccomp on host | Upgrade libseccomp (sudo pacman -S libseccomp on Arch; sudo apt-get install libseccomp2 on Ubuntu) |
| I/O Error: Couldn't open file '...pdf' | Korean / non-ASCII filename on a host with locale issues | Rename to ASCII or work inside the Docker container (the container handles UTF-8 correctly) |
| OPENAI_API_KEY is not set. ChatGPT calls may fail. | .env not loaded or wrong path | Check env_file_path in config.yaml matches your .env location |
| Failed to initialize Google Cloud Storage client | Service-account JSON missing or wrong path | Check GOOGLE_APPLICATION_CREDENTIALS in .env points to a real file |
| Error loading prompt file prompts/gemini_figure.txt | Wrong working directory or path | Prompt paths in config.yaml are resolved against the config file's directory; verify the prompts/ folder sits next to config.yaml |
| Stage 2 returns the same number of regions but the text is unchanged | ChatGPT response parse error → fallback to originals | Check the Stage 2 log for Structure validation failed or Failed to parse ChatGPT response as JSON; consider lowering max_tokens or splitting large pages |
| Stage 2 hits OpenAI rate limits | 70 workers too aggressive | Reduce stage2.parallel_workers in auto_run.yaml |
| Docker image build fails at pip install | Network / PyPI rate-limit | Retry; or use --no-cache |
| DocLayout-YOLO model download is slow on first run | HuggingFace anonymous rate limit | Set HF_TOKEN env var inside the container (rebuild with the token baked in, or pass via -e) |
| GPU not detected inside container | NVIDIA Container Toolkit not installed / configured | §3.4. Or set docker.gpu_enabled: false to run on CPU |
- Stage 1 host-side log: stdout from auto_run_stage1.py
- Stage 1 container-side log: stdout from docker run (visible in auto_run_stage1.py output)
- Stage 2 log: stdout from auto_run_stage2.py (per-worker prefixed [Worker N])
- Docker daemon log: /var/log/dockerd.log or journalctl -u docker
PDFs (src/input/)
│
▼
auto_run_stage1.py (recursive directory scan, host)
│
▼ one Docker container per directory
ocr_stage_1.py → docker run cantaloupe
│
▼ inside container
advanced_ocr.py
│ - DocLayout-YOLO region detection
│ - Google Vision (text/title/list)
│ - MathPix (formulas)
│ - Gemini (figures/tables)
│
▼
GCS: gs://bucket/stage_1/.../page_NNN.json
│
▼
auto_run_stage2.py (ThreadPoolExecutor, host — no container)
│
▼ N workers in parallel
ocr_stage_2.py
│ - Reads stage_1 JSONs from GCS
│ - ChatGPT correction (region-structure-preserving prompt)
│ - 3-tier safety net (JSON parse / structure validation / exception → original fallback)
│
▼
GCS: gs://bucket/stage_2/.../page_NNN.json
If you were running v2.0_initial/ocr_stage1.py directly, see changes/2026-05-13_v3.0_release.md for the full migration guide. Quick summary:
- Move to the root-level layout (src/ocr/, src/stages/, root config.yaml + auto_run.yaml).
- Use auto_run_stage1.py / auto_run_stage2.py instead of calling ocr_stage*.py directly.
- Expect GCS-only output (no local output directory).
- Stage 1 output is now a regions-only array (the flattened text field and metadata fields have been removed; Stage 2 prompt enforces structure preservation).
- Special-content placeholders ([Formula Start] / [Formula End] etc.) are no longer used or recognized.
- Open an issue: https://github.com/raphael-seo/Versatile-OCR-Program/issues
- Email: raphael.es.seo@gmail.com