Usage Guide — Versatile-OCR-Program v3.0_initial

How to run the pipeline once setup is complete. If you haven't completed setup yet, see setup_guide.md first.

Quick start

After setup, processing a corpus of PDFs takes four steps:

# 1. (one time) Place PDFs in src/input/
ls src/input/                                            # confirm your PDFs are there

# 2. Edit auto_run.yaml — point stage1.input_directory at src/input
nano auto_run.yaml

# 3. Run Stage 1 (Docker container → OCR → upload to GCS)
python auto_run_stage1.py --config auto_run.yaml

# 4. Run Stage 2 (read from GCS → ChatGPT correction → write back to GCS)
python auto_run_stage2.py --config auto_run.yaml

Results land in gs://<bucket>/stage_1/... (raw) and gs://<bucket>/stage_2/... (corrected).

Pipeline at a glance

PDFs (src/input/)
   │
   ▼
[Stage 1]  auto_run_stage1.py
              │  Recursively scans for PDF directories
              │
              ▼  (one Docker container per directory)
           ocr_stage_1.py  →  docker run cantaloupe  →  advanced_ocr.py (inside container)
                                                          │
                                                          ▼  Per page:
                                                          │   - DocLayout-YOLO region detection
                                                          │   - text/title/list → Google Vision OCR
                                                          │   - formula        → MathPix
                                                          │   - figure / table → Gemini
                                                          ▼
                                                  GCS: stage_1/{relpath}/{pdf}/page_NNN.json
   ────────────────────────────────────────────────────────┘
   │
   ▼
[Stage 2]  auto_run_stage2.py
              │  ThreadPoolExecutor — processes multiple GCS paths in parallel
              │
              ▼
           ocr_stage_2.py
              │  Reads each page JSON from GCS
              │  → ChatGPT with strict structure-preserving prompt
              │  → 3-tier safety net (parse / structure validation / exception → fallback to originals)
              │  → Re-assigns region IDs
              ▼
        GCS: stage_2/{relpath}/{pdf}/page_NNN.json

Stage 1 — `auto_run_stage1.py`

Wrapper around src/stages/ocr_stage_1.py that scans a directory tree for PDFs and launches a Docker container per directory.

Command

python auto_run_stage1.py [--config CONFIG_PATH]

Arguments

Flag	Default	Description
`--config`	`auto_run.yaml`	Path to the auto-run configuration file.

That's the only CLI flag — everything else is read from auto_run.yaml.

`auto_run.yaml` — Stage 1 fields

stage1:
  input_directory: "src/input"   # Where to look for PDFs (relative to cwd or absolute)
  mode: "recursive"              # "recursive" or "direct"

Field	Values	Description
`stage1.input_directory`	path	Directory to scan. Can be relative or absolute.
`stage1.mode`	`recursive` / `direct`	recursive: walks the entire tree, finds every directory that contains PDFs, and runs Stage 1 once per directory. direct: treats `input_directory` as the only directory to process.

Scenarios

Scenario A — Process a corpus organized into subdirectories

src/input/
├── corpus_A/document_001.pdf
├── corpus_A/document_002.pdf
└── corpus_B/document_001.pdf

stage1:
  input_directory: "src/input"
  mode: "recursive"

python auto_run_stage1.py

Spawns 2 Docker containers (one per subdirectory). GCS output: stage_1/corpus_A/document_001/page_NNN.json, etc.

Scenario B — Process only one directory

stage1:
  input_directory: "src/input/corpus_A"
  mode: "direct"

One container, only the corpus_A/ directory.

Scenario C — Re-run a single PDF

There's no native single-file mode. Copy the PDF to a fresh directory and use direct mode:

mkdir -p src/input/_rerun
cp src/input/corpus_A/document_001.pdf src/input/_rerun/
# auto_run.yaml: input_directory: "src/input/_rerun", mode: "direct"
python auto_run_stage1.py

The GCS path will be stage_1/_rerun/document_001/page_NNN.json. To keep the original path, copy the result there manually with gsutil cp.

Scenario D — Recursive but skip a sub-tree

The scanner automatically skips hidden directories (any whose name starts with ., e.g. .ipynb_checkpoints). To exclude a regular directory, rename it with a leading dot:

mv src/input/old_data src/input/.old_data
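
The scanning behaviour described in the Stage 1 scenarios can be sketched as follows. This is a minimal illustration, not the shipped scanner: the function name find_pdf_directories and its signature are hypothetical, and it assumes a directory counts as a "PDF directory" when it directly contains at least one .pdf file.

```python
import os
import tempfile
from pathlib import Path


def find_pdf_directories(root: str, mode: str = "recursive") -> list[str]:
    """Return the directories Stage 1 would process (hypothetical helper)."""
    if mode == "direct":
        # direct mode: input_directory is the only directory to process.
        return [root]
    found = []
    for current, subdirs, files in os.walk(root):
        # Prune hidden directories in place so os.walk never descends into them.
        subdirs[:] = sorted(d for d in subdirs if not d.startswith("."))
        if any(name.lower().endswith(".pdf") for name in files):
            found.append(current)
    return found


def demo() -> list[str]:
    """Build a throwaway tree that mirrors Scenario A plus one hidden directory."""
    with tempfile.TemporaryDirectory() as tmp:
        for rel in (
            "corpus_A/document_001.pdf",
            "corpus_B/document_001.pdf",
            ".old_data/skipped.pdf",
        ):
            path = Path(tmp, rel)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
        return [os.path.relpath(d, tmp) for d in find_pdf_directories(tmp)]
```

Calling demo() lists corpus_A and corpus_B only, since .old_data is pruned before the walk reaches it.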

What you'll see at runtime

See setup_guide.md §7.1 — the "First run — what you should see" section.

When Stage 1 finishes

Each PDF's pages end up as JSON files in GCS:

gs://<bucket>/stage_1/{relative_path}/{pdf_name}/page_001.json
gs://<bucket>/stage_1/{relative_path}/{pdf_name}/page_002.json
...

The {relative_path} mirrors your input_directory tree.
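
To make the path layout concrete, here is a small sketch of how an object name could be derived from a PDF path. The helper stage1_object_name is hypothetical. It assumes paths are taken relative to src/input (which matches the outputs shown in Scenarios A and C) and that page numbers are zero-padded to three digits, as in page_001.json.

```python
from pathlib import PurePosixPath


def stage1_object_name(
    base_directory: str,
    pdf_path: str,
    page_number: int,
    prefix: str = "stage_1",
) -> str:
    """Build '{prefix}/{relative_path}/{pdf_name}/page_NNN.json' (hypothetical helper)."""
    pdf = PurePosixPath(pdf_path)
    # Directory of the PDF, expressed relative to the assumed base directory.
    relative_dir = pdf.parent.relative_to(base_directory)
    return str(
        PurePosixPath(prefix) / relative_dir / pdf.stem / f"page_{page_number:03d}.json"
    )
```

For example, stage1_object_name("src/input", "src/input/corpus_A/document_001.pdf", 1) yields stage_1/corpus_A/document_001/page_001.json.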

Stage 2 — `auto_run_stage2.py`

Wrapper around src/stages/ocr_stage_2.py that processes multiple GCS paths in parallel via ThreadPoolExecutor.

Command

python auto_run_stage2.py [--config CONFIG_PATH] [--gcs-path PATH] [--workers N] [--dry-run]

Arguments

Flag	Default	Description
`--config`	`auto_run.yaml`	Path to the auto-run configuration file.
`--gcs-path`	(from config)	Override the GCS paths from config — process only this single path.
`--workers`	(from config)	Override `stage2.parallel_workers`.
`--dry-run`	off	Don't actually call ChatGPT. Lists the directories that would be processed.

`auto_run.yaml` — Stage 2 fields

stage2:
  gcs_input_paths:
    - "gs://your-bucket/stage_1/corpus_A"
    - "gs://your-bucket/stage_1/corpus_B"
  parallel_workers: 70

Field	Description
`stage2.gcs_input_paths`	List of GCS prefixes to process. Each gets its own worker.
`stage2.parallel_workers`	ThreadPoolExecutor size.
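
The relationship between gcs_input_paths and parallel_workers can be illustrated with a short sketch. It is not the wrapper's actual code: process_path is a placeholder for the real per-path work, and run_stage2 is a hypothetical name. One task is submitted per path and the pool size caps how many run at once.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_path(gcs_path: str) -> str:
    """Placeholder for the real per-path correction work (assumption)."""
    return f"done: {gcs_path}"


def run_stage2(gcs_paths: list[str], parallel_workers: int) -> dict[str, str]:
    """Submit one task per GCS path and collect a status string for each."""
    results = {}
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        futures = {pool.submit(process_path, path): path for path in gcs_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One failing path should not abort the others.
                results[path] = f"failed: {exc}"
    return results
```

With fewer paths than workers, the extra workers simply stay idle, so a large parallel_workers value only matters when the path list is long.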

Scenarios

Scenario A — Process every Stage 1 output

stage2:
  gcs_input_paths:
    - "gs://your-bucket/stage_1/corpus_A"
    - "gs://your-bucket/stage_1/corpus_B"
  parallel_workers: 30

python auto_run_stage2.py

Scenario B — Re-run only one path (CLI override)

python auto_run_stage2.py --gcs-path "gs://your-bucket/stage_1/corpus_A/document_001"

Scenario C — Validate the config without making API calls

python auto_run_stage2.py --dry-run

Prints the directories that would be processed; no OpenAI calls.

Scenario D — Reduce parallelism

python auto_run_stage2.py --workers 10

Idempotent re-runs

Stage 2 automatically skips directories where it's already complete (compares file counts in stage_1/ vs stage_2/):

INFO - Stage 2 already complete for gs://.../stage_1/corpus_A/document_001, skipping.
INFO -   (Found 50 files in Stage 1 and 50 in Stage 2)

To force re-processing, delete the existing Stage 2 result first:

gsutil -m rm -r gs://your-bucket/stage_2/corpus_A/document_001
python auto_run_stage2.py --gcs-path gs://your-bucket/stage_1/corpus_A/document_001
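
The skip decision described above amounts to a count comparison. The sketch below shows one way to express it over two lists of object names. The function stage2_is_complete is hypothetical, and both the restriction to .json files and the treatment of an empty Stage 1 directory as "not complete" are assumptions.

```python
def stage2_is_complete(stage1_names: list[str], stage2_names: list[str]) -> bool:
    """Return True when Stage 2 holds as many page JSONs as Stage 1 (hypothetical helper)."""
    stage1_count = sum(name.endswith(".json") for name in stage1_names)
    stage2_count = sum(name.endswith(".json") for name in stage2_names)
    # Assumption: a directory with no Stage 1 output is never considered complete.
    return stage1_count > 0 and stage1_count == stage2_count
```

Because only counts are compared, a Stage 2 directory holding stale files with the right count would also be skipped, which is why deleting the Stage 2 result is the reliable way to force a re-run.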

`config.yaml` reference

A short description of each user-facing field. Full default values are in the shipped config.yaml.

OCR

Key	Default	Description
`ocr.file_extensions`	`[".pdf"]`	File extensions Stage 1 will look for in the input directory.
`ocr.max_display_files`	`20`	Maximum number of filenames listed in the log output.
`ocr.confidence_threshold`	`0.5`	Minimum confidence to retain a region after DocLayout-YOLO inference.
`ocr.use_cache`	`true`	Toggle the image-hash–keyed cache for repeated OCR calls.
`ocr.cache_dir`	`"cache"`	Directory used for the cache.
`ocr.language_hints`	`["ja","en","ko"]`	Language priority hint passed to Google Vision OCR.
`ocr.pdf_dpi`	`200`	DPI used when rasterizing PDF pages with `pdf2image`.
`ocr.iou_threshold`	`0.5`	IoU threshold for merging duplicate same-type regions.
`ocr.image_processing.vision_max_dim`	`1600`	Max image dimension before calling Google Vision.
`ocr.image_processing.gemini_max_dim`	`1024`	Max image dimension before calling Gemini.
`ocr.image_processing.jpeg_quality`	`85`	JPEG re-encoding quality before API calls.
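
For readers unfamiliar with the ocr.iou_threshold setting above, the following sketch computes intersection-over-union for two boxes in the coords format used by the output JSON (x, y, width, height). It shows the standard IoU formula only; how the pipeline actually merges overlapping regions is not covered here, and the function name iou is illustrative.

```python
def iou(a: dict, b: dict) -> float:
    """Intersection-over-union of two {x, y, width, height} boxes."""
    left = max(a["x"], b["x"])
    top = max(a["y"], b["y"])
    right = min(a["x"] + a["width"], b["x"] + b["width"])
    bottom = min(a["y"] + a["height"], b["y"] + b["height"])
    # Clamp at zero so disjoint boxes give an empty intersection.
    intersection = max(0, right - left) * max(0, bottom - top)
    union = a["width"] * a["height"] + b["width"] * b["height"] - intersection
    return intersection / union if union > 0 else 0.0
```

Two 10×10 boxes offset horizontally by 5 overlap in a 5×10 strip, giving 50 / 150, which is below the default threshold of 0.5.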

Models

Key	Default	Description
`gemini.model`	`"gemini-2.0-flash"`	Gemini model used for figure/table analysis.
`gemini.figure_prompt_path`	`"prompts/gemini_figure.txt"`	Prompt file for figure regions.
`gemini.table_prompt_path`	`"prompts/gemini_table.txt"`	Prompt file for table regions.
`stage2.model`	`"gpt-5-nano"`	ChatGPT model used for Stage 2 correction.
`stage2.max_tokens`	`100000`	Max completion tokens per ChatGPT call.
`stage2.system_prompt_path`	`"prompts/chatgpt_stage2.md"`	System prompt for ChatGPT correction.

DocLayout-YOLO

Key	Default	Description
`doclayout_yolo.model_path`	`null`	Local path to a model file. `null` means load from Hugging Face.
`doclayout_yolo.device`	`"auto"`	`"auto"` / `"cuda:0"` / `"cpu"`.
`doclayout_yolo.huggingface_repo_id`	`"juliozhao/DocLayout-YOLO-DocStructBench"`	HF repo for the layout model.
`doclayout_yolo.huggingface_filename`	`"doclayout_yolo_docstructbench_imgsz1024.pt"`	Model weight filename.
`doclayout_yolo.fallback_model`	`"yolov8n.pt"`	Fallback to ultralytics YOLOv8 if DocLayout-YOLO fails to load.
`doclayout_yolo.default_imgsz`	`1024`	Input image size for layout inference.
`doclayout_yolo.default_conf`	`0.25`	Detection confidence threshold inside the YOLO model.

GCS

Key	Default	Description
`gcs.bucket_name`	`"eju-ocr-results"`	Target GCS bucket. Can be overridden via the `GCS_BUCKET_NAME` env var.
`gcs.stage1_prefix`	`"stage_1"`	GCS path prefix for Stage 1 output.
`gcs.stage2_prefix`	`"stage_2"`	GCS path prefix for Stage 2 output.
`gcs.validate_max_results`	`5`	Sample size when validating that a GCS path contains JSON files.

Docker

Key	Default	Description
`docker.image_name`	`"cantaloupe"`	Name of the Docker image built by `docker_build.py`.
`docker.gpu_enabled`	`false`	Whether to pass `--gpus all` to `docker run`. Requires NVIDIA Container Toolkit.
`docker.runtime`	`"nvidia"`	Runtime used when `gpu_enabled` is true.
`docker.source_files`	(map)	Files copied into the build context by `docker_build.py`.

Paths

Key	Default	Description
`env_file_path`	`"/home/jupyter/Program/.env"`	Location of the `.env` file containing API keys.
`directories.credentials`	`"/home/jupyter/Program/credentials"`	Directory containing the Google Cloud service-account JSON.
`directories.docker_build`	`"/home/jupyter/Program/OCR/src/ocr"`	Directory used as the Docker build context root.
`credentials.google_vision_account`	`"Google_Vision_S.Account.json"`	Filename of the service-account JSON inside `directories.credentials`.

Runtime settings (models, thresholds, prompts) are read fresh on each run — no rebuild needed. Anything that affects the Docker image (e.g. docker.source_files) does require re-running docker_build.py.

Editing the prompts

Three prompt files in prompts/ control what each LLM is told:

File	Sent to	Purpose
`prompts/gemini_figure.txt`	Gemini	Figure region → JSON with description, related topics, characteristics
`prompts/gemini_table.txt`	Gemini	Table region → JSON with markdown table, headers, row/col counts
`prompts/chatgpt_stage2.md`	ChatGPT	Region-array correction with strict structure preservation

Edit these as plain text. Changes take effect on the next run — no rebuild needed.

When editing `chatgpt_stage2.md` (Stage 2 system prompt)

This prompt is load-bearing for safety. The pipeline depends on ChatGPT returning:

The same number of regions as input
Same type, coords, and id values
Changes confined to the text fields

If your edits relax this requirement, the structure validation in ocr_stage_2.py:validate_region_structure will reject the response and fall back to the original regions — silently losing your corrections.

Keep the "CRITICAL REQUIREMENTS" and "VALIDATION CHECKLIST" sections intact unless you're also editing the validator.
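
The structural requirements above can be pictured as a check like the one below. This is a sketch of the idea, not the code in ocr_stage_2.py: regions_match_structure and apply_correction are hypothetical names, and the exact fields the real validator compares may differ.

```python
def regions_match_structure(original: list[dict], corrected: list[dict]) -> bool:
    """Check that only text may differ between the two region lists (hypothetical helper)."""
    if len(original) != len(corrected):
        return False
    for before, after in zip(original, corrected):
        # type, coords and id must come back unchanged.
        for key in ("type", "coords", "id"):
            if before.get(key) != after.get(key):
                return False
        if not isinstance(after.get("text"), str):
            return False
    return True


def apply_correction(original: list[dict], corrected: list[dict]) -> list[dict]:
    """Keep the corrected regions only when the structure is intact, else fall back."""
    return corrected if regions_match_structure(original, corrected) else original
```

A response that drops or merges a region fails the length check, so the original regions are returned and the corrected text is discarded.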

When editing Gemini prompts

The Gemini prompts ask for JSON output. The code uses a regex r'(\{.*\})' to extract the JSON portion from the response. If you change the prompt format such that JSON is no longer the dominant structure, parsing will fail and the result text will degrade.
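
A minimal sketch of this extraction step shows why prompt format matters. The pattern is the one quoted above. The re.DOTALL flag, the function name extract_json, and the None return on failure are assumptions made for this illustration.

```python
import json
import re

# Greedy match from the first "{" to the last "}" in the response.
# re.DOTALL is an assumption here so that multi-line JSON is captured.
JSON_PATTERN = re.compile(r'(\{.*\})', re.DOTALL)


def extract_json(response_text: str):
    """Return the parsed JSON object found in the response, or None (hypothetical helper)."""
    match = JSON_PATTERN.search(response_text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

Because the match is greedy, a response containing two separate JSON objects with prose between them is captured as one span and fails to parse, which is one way a loosened prompt can break extraction.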

Working with results

Output JSON structure

Each page JSON has this shape:

{
  "page": 1,
  "regions": [
    {
      "type": "text",
      "coords": {"x": 100, "y": 200, "width": 300, "height": 50},
      "text": "Actual recognized text.",
      "id": "page_1_region_0"
    },
    {
      "type": "title",
      "coords": {"x": 100, "y": 300, "width": 400, "height": 30},
      "text": "Chapter heading",
      "id": "page_1_region_1"
    },
    {
      "type": "formula",
      "coords": {...},
      "text": "LaTeX: \\sqrt{a^2 + b^2}\nText: square root of a squared plus b squared",
      "id": "page_1_region_2"
    },
    {
      "type": "figure",
      "coords": {...},
      "text": "## Image Description:\n...\n\n## Related Topics:\n...\n\n## characteristics:\n...",
      "id": "page_1_region_3"
    },
    {
      "type": "table",
      "coords": {...},
      "text": "## Table Description: ...\n\n## Table Content:\n| col1 | col2 |\n|------|------|\n| ... | ... |\n\n## Table Info: 5 rows × 2 columns",
      "id": "page_1_region_4"
    }
  ]
}

Region types: text, title, list, formula, figure, table. Regions are sorted by Y-coordinate (top to bottom).
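
As an illustration of the ordering and ID scheme, the sketch below sorts a page's regions by their Y coordinate and assigns IDs in the page_N_region_M form shown in the example. The function sort_and_reassign_ids is hypothetical, and the zero-based index is inferred from the sample output rather than from the pipeline code.

```python
def sort_and_reassign_ids(page: dict) -> dict:
    """Return a copy of the page with regions ordered top to bottom and renumbered."""
    ordered = sorted(page["regions"], key=lambda region: region["coords"]["y"])
    regions = [
        {**region, "id": f"page_{page['page']}_region_{index}"}
        for index, region in enumerate(ordered)
    ]
    return {**page, "regions": regions}
```

The input page is left unmodified; each region is copied before its id is replaced.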

Reading results in Python

from google.cloud import storage
import json

client = storage.Client()
bucket = client.bucket('your-bucket')
blob = bucket.blob('stage_2/corpus_A/document_001/page_001.json')
page = json.loads(blob.download_as_text())

# All text content, in reading order
combined = "\n\n".join(r['text'] for r in page['regions'])
print(combined)

# Only formulas
formulas = [r for r in page['regions'] if r['type'] == 'formula']

Bulk-download a directory

mkdir -p ./local_results
gsutil -m cp -r gs://your-bucket/stage_2/corpus_A/document_001 ./local_results/

Convert each page to plain text

for f in ./local_results/document_001/page_*.json; do
  out="${f%.json}.txt"
  jq -r '.regions[].text' "$f" > "$out"
done

Use as ML training data

The Stage 2 output is already structured (JSON, per-region with type tags). Common downstream uses:

Build an (image_path, text) dataset for OCR fine-tuning.
Extract only formula regions for a math-specific corpus.
Use figure / table regions' Gemini descriptions as captions for multimodal training.
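
As one possible starting point for these uses, the sketch below walks a locally downloaded results directory and collects every region of a given type. The function collect_regions and the record layout (source, id, text) are assumptions for illustration; it expects files named page_*.json in the shape documented under "Output JSON structure".

```python
import json
from pathlib import Path


def collect_regions(results_dir: str, region_type: str) -> list[dict]:
    """Gather all regions of one type from downloaded page JSON files (hypothetical helper)."""
    records = []
    for page_file in sorted(Path(results_dir).glob("**/page_*.json")):
        page = json.loads(page_file.read_text(encoding="utf-8"))
        for region in page["regions"]:
            if region["type"] == region_type:
                records.append(
                    {"source": page_file.name, "id": region["id"], "text": region["text"]}
                )
    return records
```

For example, collect_regions("./local_results", "formula") returns one record per formula region across all downloaded pages, in file-name order.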

Re-execution and idempotency

Stage 1

Not idempotent. Re-running Stage 1 on the same input re-processes everything and re-uploads to the same GCS paths (overwriting).

Stage 2

Idempotent by file count. Re-running Stage 2 against a Stage 1 path that has already been fully processed by Stage 2 skips that path.

If Stage 1 has more files than Stage 2 (e.g. Stage 2 was interrupted), Stage 2 will re-process the whole directory (not just the missing files). To force a clean re-run, delete the partial Stage 2 result:

gsutil -m rm -r gs://your-bucket/stage_2/corpus_A/document_001

FAQ — common scenarios

"I only want to OCR a single PDF, not a whole directory."

Copy it to a fresh directory and use mode: direct. See Stage 1 Scenario C.

"I want to add a new language to OCR."

Edit ocr.language_hints in config.yaml. Google Vision accepts ISO 639-1 codes (en, ja, ko, zh, de, fr, ...). Order matters — first listed has priority.

"I want to use a different GCS bucket."

Either edit gcs.bucket_name in config.yaml, or set GCS_BUCKET_NAME=other-bucket in .env (env var overrides config).
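
The precedence described here can be sketched in a few lines. The helper resolve_bucket_name is hypothetical, and it assumes the .env file has already been loaded into the process environment by the time it runs.

```python
import os
from typing import Mapping, Optional


def resolve_bucket_name(config: dict, env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer GCS_BUCKET_NAME from the environment, else gcs.bucket_name from config."""
    env = os.environ if env is None else env
    # An unset or empty variable falls through to the config value.
    return env.get("GCS_BUCKET_NAME") or config["gcs"]["bucket_name"]
```

Passing env explicitly, as in resolve_bucket_name(config, {"GCS_BUCKET_NAME": "other-bucket"}), makes the override easy to check without touching the real environment.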

"I want to use a different ChatGPT or Gemini model."

Edit stage2.model or gemini.model. Make sure the model name is one your API key can access.

"The output path structure doesn't match my downstream tool."

Edit gcs.stage1_prefix and gcs.stage2_prefix. The defaults stage_1 / stage_2 become whatever you set.

"Stage 1 ran but Stage 2 says no JSON files found."

Either the --gcs-path doesn't match what Stage 1 actually wrote, or Unicode normalization is involved (Korean / Japanese paths uploaded from macOS arrive as NFD). ocr_stage_2.py tries NFC then NFD; if neither matches, check gsutil ls output and copy-paste the exact path.
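
To see why the two normalization forms produce different paths, the sketch below generates the NFC and NFD variants of a prefix using the standard library. The function candidate_prefixes is a hypothetical illustration of the "try NFC, then NFD" idea, not the code in ocr_stage_2.py.

```python
import unicodedata


def candidate_prefixes(path: str) -> list[str]:
    """Return the NFC form of the path, followed by the NFD form if it differs."""
    candidates = []
    for form in ("NFC", "NFD"):
        normalized = unicodedata.normalize(form, path)
        if normalized not in candidates:
            candidates.append(normalized)
    return candidates
```

An ASCII-only path yields a single candidate, while a path containing Hangul yields two strings that look identical on screen but differ in length, which is why a copy-pasted path from gsutil ls is the safest input.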

"How do I cancel a long-running job?"

Stage 1: Ctrl+C on the host; the running Docker container is killed. Partially uploaded page JSONs remain in GCS.

Stage 2: Ctrl+C; worker threads are cancelled. Partially uploaded results remain in GCS. On the next run, any directory whose Stage 2 file count is lower than its Stage 1 count is re-processed in full, not just the missing pages.

"ChatGPT is hitting rate limits with 70 workers."

Reduce stage2.parallel_workers to 10–30, or override at the CLI: python auto_run_stage2.py --workers 15.

"Can I run Stage 2 without GCS?"

Not currently. Stage 2 is hardcoded to read from and write to GCS. Stage 1 must also upload to GCS for Stage 2 to find anything to correct.

"Can I run Stage 1 without Docker?"

The shipped wrapper goes through Docker. You can import advanced_ocr directly on a host with the right Python environment, but that bypasses the per-directory container isolation and reproducibility the project is designed around.

Where to ask for help

Issues: https://github.com/raphael-seo/Versatile-OCR-Program/issues
Email: raphael.es.seo@gmail.com
