Setup Guide — Versatile-OCR-Program v3.0_initial

This guide walks you through setting up Versatile-OCR-Program from a fresh clone to a working OCR pipeline that processes PDFs through Stage 1 (layout detection + multi-API OCR) and Stage 2 (LLM-based correction).

Estimated time: 60–90 minutes for a first install (Docker image build dominates).

Overview

Versatile-OCR-Program is a two-stage OCR pipeline optimized for educational materials (figures, formulas, tables, multilingual text):

Stage 1 — Runs inside a Docker container. Detects regions with DocLayout-YOLO, then dispatches each region to a specialized API:
- Text/title/list → Google Vision OCR
- Formulas → MathPix
- Figures & tables → Gemini (multimodal)
- Results are uploaded directly to Google Cloud Storage (GCS-only architecture; no local output directory).
Stage 2 — Runs on the host (no Docker). Reads Stage 1 results from GCS, sends each page's regions to ChatGPT for OCR-error correction while strictly preserving the region structure, then writes the corrected results back to GCS.

Both stages have lightweight wrappers (auto_run_stage1.py, auto_run_stage2.py) for batch and parallel processing.
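
To make the Stage 1 dispatch concrete, here is a minimal sketch of the routing described above. The table and the backend labels are hypothetical names chosen for illustration; they are not identifiers taken from advanced_ocr.py.

```python
# Hypothetical routing table mirroring the Stage 1 dispatch in the overview.
# Keys are layout-region types; values are illustrative backend labels.
ROUTES = {
    "text": "google_vision",
    "title": "google_vision",
    "list": "google_vision",
    "formula": "mathpix",
    "figure": "gemini",
    "table": "gemini",
}


def route_region(region_type: str) -> str:
    """Return the backend label for a region type, or raise for unknown types."""
    try:
        return ROUTES[region_type]
    except KeyError:
        raise ValueError(f"no backend configured for region type {region_type!r}") from None
```

Raising on an unknown type is a design choice of this sketch; the real pipeline may skip or log such regions instead.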

Requirements

| Component | Minimum | Notes |
| --- | --- | --- |
| OS | Linux (Ubuntu 20.04+, Arch, or similar) | Tested on Vertex AI / GCP Notebook. WSL2 also works with caveats. |
| Disk | ~13 GB free | CUDA base image (~5 GB) + Python deps + model weights cache |
| RAM | 8 GB+ | 16 GB recommended for PDFs >100 pages |
| GPU | Optional (NVIDIA, CUDA 11.8 compatible) | Setting `docker.gpu_enabled: true` in `config.yaml` requires NVIDIA Container Toolkit |
| Docker | Required | Stage 1 runs inside a container |
| Python | 3.9+ on host | For running `auto_run_*.py` wrappers and Stage 2 |
| Network | Required | Pulls CUDA base image, HuggingFace models, calls Vision / Gemini / MathPix / OpenAI APIs |

What gets installed where

There are 3 categories. Most are automatic — only Category 1 needs you to install manually.

1. Host (you install manually — see §3 for Docker, §2 for the rest)

Docker — runs the Stage 1 container.
Python 3.9+ — already on most Linux. Used to run wrapper scripts (auto_run_*.py, docker_build.py) and Stage 2.
Host-side Python packages (one-time install):
```
pip install pyyaml python-dotenv openai google-cloud-storage
```
These are used by auto_run_*.py, docker_build.py, and ocr_stage_2.py (which runs on the host, not in Docker).
(Optional) gsutil CLI — for GCS verification in §8. Install via the Google Cloud SDK, or skip and use the web console instead.

2. Inside the Docker container (automatic — docker_build.py handles this)

You don't install these yourself; they're declared in src/ocr/Dockerfile. Listed for reference:

CUDA 11.8 base image + cuDNN 8 (Ubuntu 20.04)
Python 3.9, PyTorch 2.0.1 + torchvision 0.15.2
NumPy 1.26.4, Pillow 9.4.0, OpenCV 4.7.0.72, pdf2image 1.16.3
google-cloud-storage 2.9.0, google-cloud-vision 3.4.0, google-genai
huggingface_hub 0.19.4, ultralytics 8.0.196, doclayout-yolo (pinned commit 7c4be36)
python-dotenv 1.2.2, PyYAML 6.0.1, protobuf 3.20.3

3. Auto-downloaded on first run

CUDA base image — pulled from Docker Hub during build (~5 GB; cached after first build)
DocLayout-YOLO model weights — pulled from HuggingFace Hub on the first Stage 1 run (~40 MB; cached locally inside the container)

First-time total: ~30–60 min for Docker build + a few minutes on first Stage 1 run for model download. After that, builds reuse cache and models are local.

1. Clone the repository

git clone https://github.com/raphael-seo/Versatile-OCR-Program.git
cd Versatile-OCR-Program

The repository ships with two legacy folders (v1.0_initial/, v2.0_initial/) preserved for reference. All new work should use the root-level v3.0_initial layout (src/ocr/, src/stages/, auto_run_*.py, config.yaml, prompts/).

2. API keys and credentials

2.1 Obtain API keys from these services

The pipeline calls 4 external APIs. Below is what each is for and the short path to get a key. Note: all 4 are paid services; see the cost note in §6.2 before processing large documents.

OpenAI (Stage 2 — ChatGPT correction)

Go to https://platform.openai.com → log in / sign up.
Top-right user menu → View API keys (or "Dashboard → API keys").
Click Create new secret key, name it, copy the value (starts with sk-...).
Add billing info if you haven't — keys without credit return 429/insufficient_quota.

Google Gemini (Stage 1 — figure & table analysis)

Go to https://aistudio.google.com → sign in with a Google account.
Top-left Get API key → Create API key.
Copy the value (starts with AIza...).

MathPix (Stage 1 — formula recognition)

Go to https://mathpix.com → sign up.
Dashboard → API tokens (or "Account → API Keys").
Create a new app — note both App ID and App Key (you need both).

Google Cloud Service Account (Stage 1 — Vision OCR + GCS access)

This one has the most steps; do it last.

Go to https://console.cloud.google.com → create a new project (or pick an existing one).
APIs & Services → Library → enable both Cloud Vision API and Cloud Storage API.
IAM & Admin → Service Accounts → Create Service Account, name it (e.g. ocr-runner).
Grant roles: Cloud Vision API User + Storage Object Admin.
Click the new account → Keys tab → Add key → Create new key → JSON → download.
Keep the downloaded JSON file — you'll move it in §2.2.

2.2 Place the service-account JSON

mkdir -p /home/jupyter/credentials
mv ~/Downloads/<service-account>.json /home/jupyter/credentials/Google_Vision_S.Account.json

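⚠️ The filename must match what config.yaml expects (default: Google_Vision_S.Account.json). If you use a different filename, update credentials.google_vision_account in config.yaml. The directory must match too: the config.yaml example in §4.1 sets directories.credentials to /home/jupyter/Program/credentials, so either place the JSON there or change that value (and GOOGLE_APPLICATION_CREDENTIALS in .env) to the directory you used above.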

2.3 Create the `.env` file

The path is controlled by config.yaml's env_file_path field (default: /home/jupyter/Program/.env).

mkdir -p /home/jupyter/Program   # the redirect below fails if the directory is missing
cat > /home/jupyter/Program/.env <<'EOF'
OPENAI_API_KEY=sk-...your_key...
GEMINI_API_KEY=AIza...your_key...
MATHPIX_APP_ID=your_mathpix_app_id
MATHPIX_APP_KEY=your_mathpix_app_key
GOOGLE_APPLICATION_CREDENTIALS=/home/jupyter/credentials/Google_Vision_S.Account.json
# (Optional) Override GCS bucket without editing config.yaml:
# GCS_BUCKET_NAME=your-bucket-name
EOF
chmod 600 /home/jupyter/Program/.env  # protect secrets

⚠️ Never commit .env to git. The shipped .gitignore excludes it.
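
Before building anything, it can help to confirm that the file actually defines every key listed above. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the parser handles only plain KEY=VALUE lines (no quoting or variable expansion), which is narrower than what python-dotenv accepts.

```python
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "GEMINI_API_KEY",
    "MATHPIX_APP_ID",
    "MATHPIX_APP_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list:
    """Return required keys that are absent or empty, in declaration order."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

A typical use would be `missing_keys(open("/home/jupyter/Program/.env").read())`, which returns an empty list when nothing is missing.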

2.4 Create the GCS bucket

# Using gcloud CLI (recommended)
gcloud storage buckets create gs://your-bucket-name --location=asia-northeast1

Then either edit config.yaml's gcs.bucket_name, or set GCS_BUCKET_NAME in .env to override.
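
The override order can be pictured with a short sketch. It assumes the environment variable wins whenever it is set to a non-empty value; the function name is hypothetical and the real lookup in the pipeline may differ in detail.

```python
def resolve_bucket(config: dict, env: dict) -> str:
    """Prefer a non-empty GCS_BUCKET_NAME from the environment, else gcs.bucket_name."""
    override = env.get("GCS_BUCKET_NAME", "").strip()
    if override:
        return override
    return config["gcs"]["bucket_name"]
```

In practice `env` would be `os.environ` after the `.env` file has been loaded, and `config` the parsed config.yaml.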

3. Docker setup

Stage 1 runs inside a Docker container. Stage 2 runs on the host (no container).

3.1 Install Docker

Ubuntu / Debian:

sudo apt-get update
sudo apt-get install -y docker.io

Arch / Manjaro:

sudo pacman -Syu --noconfirm docker runc containerd

Other distros: see https://docs.docker.com/engine/install/

3.2 Start the Docker daemon

# systemd-based systems:
sudo systemctl enable --now docker

# WSL2 / systemd-less environments:
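# Run the redirect inside the root shell; otherwise your own shell tries to open /var/log/dockerd.log and is denied.
sudo sh -c 'nohup dockerd > /var/log/dockerd.log 2>&1 &'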

3.3 Add your user to the `docker` group (avoid `sudo` for every command)

sudo usermod -aG docker $USER
# Log out and log back in, OR:
newgrp docker

⚠️ On Jupyter / Vertex AI environments, the docker group membership often resets per session. If docker info keeps failing with permission errors, run sudo usermod -aG docker jupyter and sudo reboot once — this is the persistent fix from patch_notes/v2.0_initial_patchnotes.md.

Verify:

docker run --rm hello-world

3.4 (Optional) GPU support — only if `gpu_enabled: true`

By default, config.yaml has docker.gpu_enabled: false. The pipeline runs on CPU. Skip this section if you do not need GPU acceleration.

If you do enable GPU, install the NVIDIA Container Toolkit:

# Ubuntu/Debian
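# apt-key is deprecated; register the repository with a dedicated keyring instead.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list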
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

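Add the NVIDIA runtime to the Docker daemon config. The command below overwrites /etc/docker/daemon.json, so if that file already contains other settings, merge the "runtimes" entry into it by hand instead: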

sudo tee /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker

Verify:

docker run --gpus all --rm nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 nvidia-smi

The Dockerfile is pinned to CUDA 11.8. If your NVIDIA driver is too old to run CUDA 11.8 containers, either update the driver or stay on CPU.

Then set docker.gpu_enabled: true in config.yaml.

4. Configuration

4.1 `config.yaml` — global settings

Open config.yaml and review/adjust. The values you most likely want to change are highlighted below.

env_file_path: "/home/jupyter/Program/.env"   # ← Adjust to your .env location
directories:
  credentials: "/home/jupyter/Program/credentials"   # ← Where your service-account JSON lives
  docker_build: "/home/jupyter/Program/Versatile-OCR-Program/src/ocr"  # ← Where Dockerfile is

gcs:
  bucket_name: "eju-ocr-results"                     # ← Your bucket
  stage1_prefix: "stage_1"                           # GCS path prefix for Stage 1 output
  stage2_prefix: "stage_2"                           # GCS path prefix for Stage 2 output

docker:
  image_name: "cantaloupe"                           # Docker image name (rename if you build under a different name)
  gpu_enabled: false                                 # ← Set true only if you completed §3.4

ocr:
  language_hints: ["ja", "en", "ko"]                 # ← Vision OCR language priority (reorder for your data)
  pdf_dpi: 200                                       # PDF page rendering DPI (higher = slower, better accuracy)
  image_processing:
    vision_max_dim: 1600                             # Resize cap before Vision API call
    gemini_max_dim: 1024                             # Resize cap before Gemini API call
    jpeg_quality: 85

gemini:
  model: "gemini-2.0-flash"                          # ← Gemini model for figure/table
  figure_prompt_path: "prompts/gemini_figure.txt"
  table_prompt_path: "prompts/gemini_table.txt"

stage2:
  model: "gpt-5-nano"                                # ← ChatGPT model for Stage 2 correction
  max_tokens: 100000
  system_prompt_path: "prompts/chatgpt_stage2.md"
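
As an illustration of what a resize cap such as vision_max_dim or gemini_max_dim implies, here is a minimal sketch. It assumes the cap applies to the longer side, that the aspect ratio is preserved, and that small images are never upscaled; none of this is confirmed by the source, and the function name is hypothetical.

```python
def fit_within(width: int, height: int, max_dim: int) -> tuple:
    """Scale (width, height) down so the longer side is at most max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # already small enough; never upscale
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these assumptions, a 3200x1600 crop sent to Vision with the default cap of 1600 would be reduced to 1600x800.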

4.2 `auto_run.yaml` — batch execution parameters

stage1:
  input_directory: "src/input"   # ← Directory tree to scan for PDFs
  mode: "recursive"              # "recursive" (walk subdirs) or "direct" (single dir)

stage2:
  gcs_input_paths:               # ← GCS paths to read Stage 1 results from
    - "gs://your-bucket/stage_1/subject_A"
    - "gs://your-bucket/stage_1/subject_B"
  parallel_workers: 70           # ThreadPoolExecutor size for parallel processing
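
Each entry in gcs_input_paths combines a bucket name and an object prefix. A small sketch of how such a path can be split is shown below; the helper name is hypothetical and is not part of the shipped scripts.

```python
def split_gcs_path(path: str) -> tuple:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not path.startswith(scheme):
        raise ValueError(f"not a GCS path: {path!r}")
    bucket, _, prefix = path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {path!r}")
    return bucket, prefix.strip("/")
```

Checking the configured paths this way before a run catches a missing gs:// scheme early, before any API is called.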

4.3 `prompts/` directory

Three prompt files are loaded at runtime (paths in config.yaml):

| File | Used by | Sent to |
| --- | --- | --- |
| `prompts/gemini_figure.txt` | Stage 1 — `_process_figure_region` | Gemini |
| `prompts/gemini_table.txt` | Stage 1 — `_process_table_region` | Gemini |
| `prompts/chatgpt_stage2.md` | Stage 2 — `chatgpt_correct_text` | ChatGPT |

Edit these files freely. The Stage 2 system prompt enforces strict region-structure preservation — read it before modifying.

Relative paths are resolved against the directory containing config.yaml (so prompts load correctly regardless of where the script is invoked from or inside Docker).
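
The resolution rule in the previous paragraph can be sketched as follows. This is an illustration of the rule as stated, not the project's code; the function name is hypothetical.

```python
from pathlib import Path


def resolve_against_config(config_path: str, value: str) -> Path:
    """Resolve a path from config.yaml relative to the config file's directory."""
    candidate = Path(value)
    if candidate.is_absolute():
        return candidate  # absolute paths are used as written
    return Path(config_path).parent / candidate
```

Because the base is the config file's directory rather than the current working directory, the same prompt path works from the host and from inside the container.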

5. Build the Docker image

Use the safe build script that automatically gathers all required files into a temporary build context:

python src/ocr/docker_build.py

This will:

Locate config.yaml (walks up the directory tree)
Verify all docker.source_files exist (Dockerfile, advanced_ocr.py, custom_doclayout_yolo.py, config.yaml, prompts/)
Copy them to a temporary build context
Run docker build -t cantaloupe .
Clean up the temporary context
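
The first step, locating config.yaml by walking up the directory tree, might look roughly like the sketch below. The function name and the stop-at-filesystem-root behaviour are assumptions; docker_build.py may bound the search differently.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, name: str = "config.yaml") -> Optional[Path]:
    """Walk from start up to the filesystem root and return the first match."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Returning None instead of raising lets the caller print a targeted message about where the search started.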

Build time: 30–60 minutes on first run (downloads CUDA 11.8 base ~5 GB, installs PyTorch, clones and installs DocLayout-YOLO).

Useful flags:

python src/ocr/docker_build.py --check-files     # Verify source files exist, don't build
python src/ocr/docker_build.py --no-cache        # Force rebuild from scratch
python src/ocr/docker_build.py --keep-context    # Keep the temp build context for debugging

Verify:

docker images cantaloupe

5.1 If the build fails

The build script prints all docker build output to stdout. Save it to a file if you want to scroll through it:

python src/ocr/docker_build.py 2>&1 | tee /tmp/docker_build.log

Common failures:

| Error | Cause | Fix |
| --- | --- | --- |
| `This script does not work on Python 3.9` (in `get-pip.py` step) | The legacy `https://bootstrap.pypa.io/get-pip.py` URL dropped Python 3.9 support | The shipped Dockerfile pins the 3.9-compatible URL `https://bootstrap.pypa.io/pip/3.9/get-pip.py`. If you forked and edited the Dockerfile, restore this URL. |
| `Killed` mid-build, or WSL2 freezes | Out of memory during `pip install` (PyTorch resolve is heavy) | Increase RAM. On WSL2, edit `%USERPROFILE%\.wslconfig` (`[wsl2]` → `memory=24GB`, `swap=16GB`), then `wsl --shutdown` and reopen. |
| `failed to retrieve ... from mirror.pkgbuild.com` | Mirror serving a stale package list (Arch only) | `sudo pacman -Syyu --noconfirm` to refresh, then retry build. |
| `runc: undefined symbol: seccomp_transaction_reject` | Outdated `libseccomp` on host | Upgrade libseccomp (`sudo pacman -S libseccomp` on Arch; `sudo apt-get install libseccomp2` on Ubuntu); restart `dockerd`. |
| Build slow / network timeouts on first run | Pulling 5 GB CUDA base image | Be patient (first build only). Subsequent builds use cache. |

The build is incremental — fixing a step doesn't rebuild from scratch. Docker reuses earlier-step layers automatically.

6. Prepare input

6.1 Place PDFs

src/input/
├── subject_A/
│   ├── exam_2023.pdf
│   └── exam_2024.pdf
└── subject_B/
    └── workbook.pdf

auto_run_stage1.py recursively walks src/input/ (or whatever directory you set in auto_run.yaml) and processes every directory containing PDFs. Hidden directories (.ipynb_checkpoints, etc.) are skipped automatically.
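
The scan described above can be approximated with the standard library. This is a sketch under the assumption that "hidden" means a directory name starting with a dot; the function name is hypothetical and the real wrapper may apply further filters.

```python
import os


def find_pdf_dirs(root: str) -> list:
    """Return sorted directories under root that directly contain a .pdf file."""
    found = []
    for current, subdirs, files in os.walk(root):
        # Prune hidden directories in place so os.walk never descends into them.
        subdirs[:] = [d for d in subdirs if not d.startswith(".")]
        if any(name.lower().endswith(".pdf") for name in files):
            found.append(current)
    return sorted(found)
```

Running this against your input directory before Stage 1 shows which directories would be picked up, at no API cost.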

⚠️ Korean / non-ASCII filenames on certain hosts: Some Linux environments (notably WSL2 / Arch with certain locales) may fail to open files whose names contain Korean / Japanese characters via poppler. If you hit I/O Error: Couldn't open file '...', rename the file to ASCII or use a copy with an ASCII name. The Docker container (Ubuntu 20.04 base) generally handles UTF-8 filenames correctly.

6.2 Cost note — read before processing large documents

All 4 APIs are paid:

OpenAI (Stage 2) — per-token. Charged per page (each page sends a JSON region array, gets a corrected one back). With gpt-5-nano (default), expect ~$0.0X per page; large pages cost more.
Google Vision (Stage 1) — per OCR call. Each text/title/list region = 1 call.
Google Gemini (Stage 1) — per image + token. Each figure/table region = 1 call with the cropped image.
MathPix (Stage 1) — per formula image. Free tier covers ~1000 calls/month; beyond that you pay.

A 500-page math textbook ≈ thousands of regions across the 4 APIs. Do not run a large PDF as your first test. Costs can easily reach tens of dollars in a single run.
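
Once the smoke test has produced a few Stage 1 page files, you can count how many billable calls they represent before scaling up. The sketch below follows the one-call-per-region and one-request-per-page rules from the list above; it deliberately contains no prices, and the names are hypothetical.

```python
from collections import Counter

# Region type -> API that serves it, as listed in the cost note.
API_FOR_TYPE = {
    "text": "vision", "title": "vision", "list": "vision",
    "formula": "mathpix",
    "figure": "gemini", "table": "gemini",
}


def count_api_calls(pages: list) -> Counter:
    """Count Stage 1 calls per API plus one Stage 2 (OpenAI) request per page."""
    calls = Counter()
    for page in pages:
        calls["openai"] += 1
        for region in page.get("regions", []):
            api = API_FOR_TYPE.get(region.get("type"))
            if api is not None:  # region types outside the table are ignored
                calls[api] += 1
    return calls
```

Multiplying the per-page averages by the page count of the full document gives a rough call volume to compare against each provider's current pricing.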

6.3 Smoke test — validate the full pipeline with 1 page first

Before processing your real corpus:

1. Create a 1–2 page sample PDF (split from your real PDF, or use any short document).
2. Put it in src/input/_smoke_test/sample.pdf.
3. Set auto_run.yaml:

stage1:
  input_directory: "src/input/_smoke_test"
  mode: "recursive"

4. Run Stage 1 (§7.1). It should finish in 1–3 minutes per page.
5. Verify the result in GCS (§8).
6. Set auto_run.yaml's stage2.gcs_input_paths to the GCS prefix from step 5.
7. Run Stage 2 (§7.2). It should finish in under a minute.
8. Verify the Stage 2 result (§8).

If both stages produce sensible output, switch input_directory back to your real input and re-run.

7. Run

7.1 Stage 1

python auto_run_stage1.py --config auto_run.yaml

What happens:

Reads auto_run.yaml → stage1.input_directory, stage1.mode.
Recursively finds every directory containing PDFs.
For each directory, invokes src/stages/ocr_stage_1.py, which spawns a Docker container (cantaloupe) that runs advanced_ocr.py inside.
advanced_ocr.py processes each PDF page-by-page, uploads regions to GCS as stage_1/{relative_path}/{pdf_name}/page_{NNN}.json.
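
The object naming pattern in the last step can be sketched as below. The three-digit zero padding is inferred from the page_001.json examples in this guide, and the function name is hypothetical.

```python
import posixpath


def stage1_object_name(relative_path: str, pdf_name: str, page: int,
                       prefix: str = "stage_1") -> str:
    """Build '<prefix>/<relative_path>/<pdf_name>/page_NNN.json' with a 3-digit page."""
    # Drop a trailing .pdf so the folder is named after the document, not the file.
    stem = pdf_name[:-4] if pdf_name.lower().endswith(".pdf") else pdf_name
    return posixpath.join(prefix, relative_path, stem, f"page_{page:03d}.json")
```

posixpath is used instead of os.path so that object names always use forward slashes, whatever the host OS.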

You can override config values at the command line — see python auto_run_stage1.py --help.

First run — what you should see:

INFO - === Auto-Run Stage 1: Recursive OCR Processing Starting ===
INFO - Input directory: /home/jupyter/.../src/input/_smoke_test
INFO - Mode: recursive
INFO - Scanning root directory: ...
INFO - Found 1 directories containing PDF files
INFO - Processing directory 1/1: ...
INFO - === OCR System (Docker) Starting ===
INFO - Number of PDF files found in input directory: 1
INFO - Running Docker container (command hidden for security)
... (Docker container output — DocLayout-YOLO model download on first run, page-by-page processing)
Processing page 1/N
Page 1: Detected K regions
Page 1: Processed K regions
GCS upload complete: gs://your-bucket/stage_1/_smoke_test/sample/page_001.json
...
INFO - Stage 1 OCR completed successfully for: ...
INFO - === Auto-Run Stage 1 Complete: 1/1 directories processed successfully ===

If the output differs:
- Found 0 directories: the PDF is not under the configured input_directory.
- Docker container execution failed: check docker info (is the daemon running, and is your user in the docker group?).
- OPENAI_API_KEY is not set: the .env path does not match env_file_path. Stage 1 does not call OpenAI, but the ChatGPT client is initialized anyway, so set the key to silence the warning.

7.2 Stage 2

python auto_run_stage2.py --config auto_run.yaml

What happens:

Reads auto_run.yaml → stage2.gcs_input_paths, stage2.parallel_workers.
Spawns a ThreadPoolExecutor (default 70 workers) and processes each GCS path in parallel.
For each path, src/stages/ocr_stage_2.py:
- Discovers all directories containing JSON files under the prefix.
- Skips directories where Stage 2 is already complete (file-count comparison — re-runs are idempotent).
- Loads each page's regions, sends them as a JSON array to ChatGPT with the strict structure-preserving system prompt.
- Validates the response (region count + type/coords/id preservation). On validation failure, falls back to the original regions.
- Re-assigns region IDs (page_{N}_region_{i}) and uploads to GCS as stage_2/{relative_path}/{pdf_name}/page_{NNN}.json.
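
The validate-then-fall-back behaviour can be illustrated with a short sketch. It assumes regions are plain dicts and that ID indices start at 0, as in the page_1_region_0 example in §7.3; the function names are hypothetical and this is not the code in ocr_stage_2.py.

```python
def structure_preserved(original: list, corrected: list) -> bool:
    """True when corrected has the same regions with type, coords and id untouched."""
    if len(original) != len(corrected):
        return False
    return all(
        before.get(key) == after.get(key)
        for before, after in zip(original, corrected)
        for key in ("type", "coords", "id")
    )


def accept_or_fall_back(original: list, corrected: list, page: int) -> list:
    """Keep the corrected regions only if validation passes, then renumber IDs."""
    chosen = corrected if structure_preserved(original, corrected) else original
    return [
        {**region, "id": f"page_{page}_region_{index}"}
        for index, region in enumerate(chosen)
    ]
```

The key property is that a bad model response can never change the number, type, or position of regions; at worst the page keeps its Stage 1 text.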

Useful flags:

python auto_run_stage2.py --gcs-path gs://your-bucket/stage_1/subject_A    # Process only one path
python auto_run_stage2.py --workers 10                                     # Limit parallelism
python auto_run_stage2.py --dry-run                                        # List paths, don't process
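
For readers unfamiliar with ThreadPoolExecutor, the fan-out that --workers controls can be sketched like this. The `worker` callable stands in for the per-path Stage 2 processing and, like the function name, is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(paths: list, worker, max_workers: int = 70) -> dict:
    """Run worker(path) for every path and map each path to a result or an exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one failing path must not stop the others
                results[path] = exc
    return results
```

Threads suit this workload because the time is spent waiting on network calls, which is also why a lower worker count is the first thing to try when the API starts rate-limiting.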

7.3 Output structure (GCS)

gs://your-bucket/
├── stage_1/
│   └── subject_A/exam_2023/page_001.json    ← Stage 1 raw OCR
└── stage_2/
    └── subject_A/exam_2023/page_001.json    ← Stage 2 corrected

Each page JSON has the shape:

{
  "page": 1,
  "regions": [
    { "type": "text",  "coords": {...}, "text": "...", "id": "page_1_region_0" },
    { "type": "title", "coords": {...}, "text": "...", "id": "page_1_region_1" },
    { "type": "formula", "coords": {...}, "text": "LaTeX: ...\nText: ...", "id": "page_1_region_2" },
    { "type": "figure", "coords": {...}, "text": "## Image Description: ...", "id": "page_1_region_3" },
    { "type": "table",  "coords": {...}, "text": "## Table Content: ...", "id": "page_1_region_4" }
  ]
}

8. Verification

8.1 Check GCS for Stage 1 results

gsutil ls gs://your-bucket/stage_1/
gsutil ls gs://your-bucket/stage_1/subject_A/exam_2023/ | wc -l   # page count

8.2 Inspect a sample page

gsutil cat gs://your-bucket/stage_1/subject_A/exam_2023/page_001.json | jq '.regions[0:3]'

Look for:

regions array is non-empty
Each region has type, coords, text, id
Text content matches the source PDF (sanity check a few pages)
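
The first two checks are mechanical and can be automated once a page has been downloaded and parsed with json.loads. A minimal sketch, with hypothetical names:

```python
REQUIRED_FIELDS = ("type", "coords", "text", "id")


def page_problems(page: dict) -> list:
    """Return human-readable problems found in one page JSON; empty means it looks fine."""
    regions = page.get("regions")
    if not isinstance(regions, list) or not regions:
        return ["regions array is missing or empty"]
    problems = []
    for index, region in enumerate(regions):
        missing = [field for field in REQUIRED_FIELDS if field not in region]
        if missing:
            problems.append(f"region {index} is missing: {', '.join(missing)}")
    return problems
```

The third check, comparing text against the source PDF, still needs a human reading a few pages.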

8.3 Compare Stage 1 vs Stage 2

gsutil cat gs://your-bucket/stage_1/.../page_001.json | jq '.regions | length'
gsutil cat gs://your-bucket/stage_2/.../page_001.json | jq '.regions | length'

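The counts must match. Stage 2 validates the region count and falls back to the original regions when validation fails (§7.2), so a mismatch points to a bug or an incomplete upload rather than to a rejected correction; check the Stage 2 log. Matching counts do not prove that any correction happened, so also compare the text fields of a few regions.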
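
Region counts alone do not show whether Stage 2 changed anything. Given the two page files parsed into dicts, the sketch below lists the regions whose text differs; the function name is hypothetical.

```python
def changed_region_ids(stage1: dict, stage2: dict) -> list:
    """Return the ids of regions whose text differs between the two page JSONs."""
    before = stage1.get("regions", [])
    after = stage2.get("regions", [])
    if len(before) != len(after):
        raise ValueError(f"region count mismatch: {len(before)} vs {len(after)}")
    return [new["id"] for old, new in zip(before, after) if old["text"] != new["text"]]
```

An empty list on every page of a noisy scan is a hint that Stage 2 fell back to the originals and that the log deserves a look.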

9. Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `docker: command not found` | Docker not installed | §3.1 |
| `permission denied while trying to connect to Docker daemon` | User not in docker group | §3.3 (`sudo usermod -aG docker $USER` + new login or `newgrp docker`) |
| Container fails with `runc did not terminate successfully` and a libseccomp error | Outdated libseccomp on host | Upgrade libseccomp (`sudo pacman -S libseccomp` on Arch; `sudo apt-get install libseccomp2` on Ubuntu) |
| `I/O Error: Couldn't open file '...pdf'` | Korean / non-ASCII filename on a host with locale issues | Rename to ASCII or work inside the Docker container (the container handles UTF-8 correctly) |
| `OPENAI_API_KEY is not set. ChatGPT calls may fail.` | `.env` not loaded or wrong path | Check `env_file_path` in `config.yaml` matches your `.env` location |
| `Failed to initialize Google Cloud Storage client` | Service-account JSON missing or wrong path | Check `GOOGLE_APPLICATION_CREDENTIALS` in `.env` points to a real file |
| `Error loading prompt file prompts/gemini_figure.txt` | Wrong working directory or path | Prompts paths in `config.yaml` are resolved against the config file's directory — verify the `prompts/` folder sits next to `config.yaml` |
| Stage 2 returns the same number of regions but the text is unchanged | ChatGPT response parse error → fallback to originals | Check the Stage 2 log for `Structure validation failed` or `Failed to parse ChatGPT response as JSON`; consider lowering `max_tokens` or splitting large pages |
| Stage 2 hits OpenAI rate limits | 70 workers too aggressive | Reduce `stage2.parallel_workers` in `auto_run.yaml` |
| Docker image build fails at `pip install` | Network / PyPI rate-limit | Retry; or use `--no-cache` |
| DocLayout-YOLO model download is slow on first run | HuggingFace anonymous rate limit | Set `HF_TOKEN` env var inside the container (rebuild with the token baked in, or pass via `-e`) |
| GPU not detected inside container | NVIDIA Container Toolkit not installed / configured | §3.4. Or set `docker.gpu_enabled: false` to run on CPU |

Logs to look at

Stage 1 host-side log: stdout from auto_run_stage1.py
Stage 1 container-side log: stdout from docker run (visible in auto_run_stage1.py output)
Stage 2 log: stdout from auto_run_stage2.py (per-worker prefixed [Worker N])
Docker daemon log: /var/log/dockerd.log or journalctl -u docker

Appendix

A. Architecture at a glance

PDFs (src/input/)
   │
   ▼
auto_run_stage1.py  (recursive directory scan, host)
   │
   ▼  one Docker container per directory
ocr_stage_1.py → docker run cantaloupe
                    │
                    ▼  inside container
                advanced_ocr.py
                    │  - DocLayout-YOLO region detection
                    │  - Google Vision (text/title/list)
                    │  - MathPix (formulas)
                    │  - Gemini (figures/tables)
                    │
                    ▼
                GCS: gs://bucket/stage_1/.../page_NNN.json
                    │
   ┌────────────────┘
   │
   ▼
auto_run_stage2.py  (ThreadPoolExecutor, host — no container)
   │
   ▼  N workers in parallel
ocr_stage_2.py
   │  - Reads stage_1 JSONs from GCS
   │  - ChatGPT correction (region-structure-preserving prompt)
   │  - 3-tier safety net (JSON parse / structure validation / exception → original fallback)
   │
   ▼
GCS: gs://bucket/stage_2/.../page_NNN.json

B. Migration from v2.0_initial

If you were running v2.0_initial/ocr_stage1.py directly, see changes/2026-05-13_v3.0_release.md for the full migration guide. Quick summary:

Move to the root-level layout (src/ocr/, src/stages/, root config.yaml + auto_run.yaml).
Use auto_run_stage1.py / auto_run_stage2.py instead of calling ocr_stage*.py directly.
Expect GCS-only output (no local output directory).
Stage 1 output is now a regions-only array (the flattened text field and metadata fields have been removed; Stage 2 prompt enforces structure preservation).
Special-content placeholders ([Formula Start]/[Formula End] etc.) are no longer used or recognized.

C. Where to ask for help

Open an issue: https://github.com/raphael-seo/Versatile-OCR-Program/issues
Email: raphael.es.seo@gmail.com
