How to run the pipeline once setup is complete. If you haven't completed setup yet, see setup_guide.md first.
After setup, processing a corpus of PDFs takes 4 commands:
# 1. (one time) Place PDFs in src/input/
ls src/input/ # confirm your PDFs are there
# 2. Edit auto_run.yaml — point stage1.input_directory at src/input
nano auto_run.yaml
# 3. Run Stage 1 (Docker container → OCR → upload to GCS)
python auto_run_stage1.py --config auto_run.yaml
# 4. Run Stage 2 (read from GCS → ChatGPT correction → write back to GCS)
python auto_run_stage2.py --config auto_run.yaml

Results land in gs://<bucket>/stage_1/... (raw) and gs://<bucket>/stage_2/... (corrected).
PDFs (src/input/)
│
▼
[Stage 1] auto_run_stage1.py
│ Recursively scans for PDF directories
│
▼ (one Docker container per directory)
ocr_stage_1.py → docker run cantaloupe → advanced_ocr.py (inside container)
│
▼ Per page:
│ - DocLayout-YOLO region detection
│ - text/title/list → Google Vision OCR
│ - formula → MathPix
│ - figure / table → Gemini
▼
GCS: stage_1/{relpath}/{pdf}/page_NNN.json
────────────────────────────────────────────────────────┘
│
▼
[Stage 2] auto_run_stage2.py
│ ThreadPoolExecutor — processes multiple GCS paths in parallel
│
▼
ocr_stage_2.py
│ Reads each page JSON from GCS
│ → ChatGPT with strict structure-preserving prompt
│ → 3-tier safety net (parse / structure validation / exception → fallback to originals)
│ → Re-assigns region IDs
▼
GCS: stage_2/{relpath}/{pdf}/page_NNN.json
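The three-tier safety net in the Stage 2 box can be pictured roughly as follows. This is a minimal sketch, not the code in ocr_stage_2.py: `correct_with_fallback`, `call_model`, and `is_valid` are hypothetical names, and the model call is injected as a plain function so the example runs without an API key.

```python
import json
from typing import Callable

Region = dict


def correct_with_fallback(
    regions: list[Region],
    call_model: Callable[[list[Region]], str],
    is_valid: Callable[[list[Region], list[Region]], bool],
) -> list[Region]:
    """Return corrected regions, or the untouched originals if any tier fails."""
    try:
        raw = call_model(regions)  # may raise (network, quota, ...)
        try:
            corrected = json.loads(raw)  # tier 1: the reply must parse as JSON
        except json.JSONDecodeError:
            return regions
        # tier 2: the reply must keep the region structure intact
        if not isinstance(corrected, list) or not is_valid(regions, corrected):
            return regions
        return corrected
    except Exception:
        # tier 3: any other failure also falls back to the originals
        return regions
```

Whatever goes wrong, the caller still gets a region list with the original structure, at the cost of losing the correction for that page.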
Wrapper around src/stages/ocr_stage_1.py that scans a directory tree for PDFs and launches a Docker container per directory.
python auto_run_stage1.py [--config CONFIG_PATH]

| Flag | Default | Description |
|---|---|---|
| --config | auto_run.yaml | Path to the auto-run configuration file. |
That's the only CLI flag — everything else is read from auto_run.yaml.
stage1:
  input_directory: "src/input"   # Where to look for PDFs (relative to cwd or absolute)
  mode: "recursive"              # "recursive" or "direct"

| Field | Values | Description |
|---|---|---|
| stage1.input_directory | path | Directory to scan. Can be relative or absolute. |
| stage1.mode | recursive / direct | recursive: walks the entire tree, finds every directory that contains PDFs, and runs Stage 1 once per directory. direct: treats input_directory as the only directory to process. |
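The difference between the two modes can be sketched in a few lines. This is an illustration under stated assumptions, not the scanner in auto_run_stage1.py: `find_pdf_dirs` is a hypothetical name, the `.pdf` match is case-insensitive by assumption, and the hidden-directory pruning follows the behaviour described under Scenario D.

```python
import os
from pathlib import Path


def find_pdf_dirs(root: str, mode: str = "recursive") -> list[Path]:
    """Return the directories that would each get one Stage 1 run."""
    root_path = Path(root)
    if mode == "direct":
        # direct: the input directory itself is the only one processed
        return [root_path]
    found = []
    for dirpath, dirnames, filenames in os.walk(root_path):
        # Prune hidden directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        if any(name.lower().endswith(".pdf") for name in filenames):
            found.append(Path(dirpath))
    return found
```

Each returned directory corresponds to one Docker container in the real pipeline.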
Scenario A — Process a corpus organized into subdirectories
src/input/
├── corpus_A/document_001.pdf
├── corpus_A/document_002.pdf
└── corpus_B/document_001.pdf
stage1:
  input_directory: "src/input"
  mode: "recursive"

python auto_run_stage1.py

Spawns 2 Docker containers (one per subdirectory). GCS output: stage_1/corpus_A/document_001/page_NNN.json, etc.
Scenario B — Process only one directory
stage1:
  input_directory: "src/input/corpus_A"
  mode: "direct"

One container, only the corpus_A/ directory.
Scenario C — Re-run a single PDF
There's no native single-file mode. Move the PDF to a fresh directory and use direct mode:
mkdir -p src/input/_rerun
cp src/input/corpus_A/document_001.pdf src/input/_rerun/
# auto_run.yaml: input_directory: "src/input/_rerun", mode: "direct"
python auto_run_stage1.py

The GCS path will be stage_1/_rerun/document_001/page_NNN.json. To keep the original path, copy the result back manually with gsutil cp.
Scenario D — Recursive but skip a sub-tree
The scanner automatically skips hidden directories (any starting with ., e.g. .ipynb_checkpoints). To exclude a regular directory, rename it with a leading dot:
mv src/input/old_data src/input/.old_data

For the expected output of a first run, see setup_guide.md §7.1 (the "First run — what you should see" section).
Each PDF's pages end up as JSON files in GCS:
gs://<bucket>/stage_1/{relative_path}/{pdf_name}/page_001.json
gs://<bucket>/stage_1/{relative_path}/{pdf_name}/page_002.json
...
The {relative_path} mirrors your input_directory tree.
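The path layout above can be reproduced with a small helper. This is a sketch only: `stage1_object_name` is a hypothetical function, and which base directory the pipeline uses to compute {relative_path} is an assumption here, so the sketch takes it as a parameter. The three-digit zero padding follows the page_001.json examples.

```python
from pathlib import PurePosixPath


def stage1_object_name(base_dir: str, pdf_path: str, page: int, prefix: str = "stage_1") -> str:
    """Build '<prefix>/<relative_path>/<pdf_name>/page_NNN.json' for one page."""
    pdf = PurePosixPath(pdf_path)
    # Directory of the PDF relative to the assumed base directory.
    rel_dir = pdf.parent.relative_to(PurePosixPath(base_dir))
    return str(PurePosixPath(prefix) / rel_dir / pdf.stem / f"page_{page:03d}.json")


print(stage1_object_name("src/input", "src/input/corpus_A/document_001.pdf", 1))
# stage_1/corpus_A/document_001/page_001.json
```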
Wrapper around src/stages/ocr_stage_2.py that processes multiple GCS paths in parallel via ThreadPoolExecutor.
python auto_run_stage2.py [--config CONFIG_PATH] [--gcs-path PATH] [--workers N] [--dry-run]

| Flag | Default | Description |
|---|---|---|
| --config | auto_run.yaml | Path to the auto-run configuration file. |
| --gcs-path | (from config) | Override the GCS paths from config — process only this single path. |
| --workers | (from config) | Override stage2.parallel_workers. |
| --dry-run | off | Don't actually call ChatGPT. Lists the directories that would be processed. |
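How the CLI flags could override the config is sketched below with argparse. This is not the parser shipped in auto_run_stage2.py: `build_parser` and `effective_settings` are hypothetical names, and the fallback of one worker when the config has no parallel_workers is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four flags listed in the table above."""
    parser = argparse.ArgumentParser(prog="auto_run_stage2.py")
    parser.add_argument("--config", default="auto_run.yaml")
    parser.add_argument("--gcs-path", default=None)
    parser.add_argument("--workers", type=int, default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def effective_settings(args: argparse.Namespace, stage2_cfg: dict) -> dict:
    """CLI values win over the stage2 section of the config when they are given."""
    paths = [args.gcs_path] if args.gcs_path else list(stage2_cfg.get("gcs_input_paths", []))
    workers = args.workers if args.workers is not None else stage2_cfg.get("parallel_workers", 1)
    return {"gcs_input_paths": paths, "parallel_workers": workers, "dry_run": args.dry_run}
```

With `--gcs-path` set, the list from the config is replaced by that single path rather than extended.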
stage2:
  gcs_input_paths:
    - "gs://your-bucket/stage_1/corpus_A"
    - "gs://your-bucket/stage_1/corpus_B"
  parallel_workers: 70

| Field | Description |
|---|---|
| stage2.gcs_input_paths | List of GCS prefixes to process. Each gets its own worker. |
| stage2.parallel_workers | ThreadPoolExecutor size. |
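The fan-out over GCS paths might look something like this. It is a sketch with a hypothetical `run_in_parallel` helper and an injected `process` function standing in for the per-path Stage 2 work; whether the real wrapper isolates per-path failures this way is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def run_in_parallel(paths: list[str], process: Callable[[str], str], workers: int) -> dict[str, str]:
    """Run process(path) for every path on a thread pool and collect the outcomes."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:
                # One failing path is recorded and does not stop the others.
                results[path] = f"failed: {exc}"
    return results
```

Threads suit this workload because each worker spends most of its time waiting on network calls rather than computing.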
Scenario A — Process every Stage 1 output
stage2:
  gcs_input_paths:
    - "gs://your-bucket/stage_1/corpus_A"
    - "gs://your-bucket/stage_1/corpus_B"
  parallel_workers: 30

python auto_run_stage2.py

Scenario B — Re-run only one path (CLI override)
python auto_run_stage2.py --gcs-path "gs://your-bucket/stage_1/corpus_A/document_001"

Scenario C — Validate the config without making API calls
python auto_run_stage2.py --dry-run

Prints the directories that would be processed; no OpenAI calls.

Scenario D — Reduce parallelism
python auto_run_stage2.py --workers 10

Stage 2 automatically skips directories where it's already complete (compares file counts in stage_1/ vs stage_2/):
INFO - Stage 2 already complete for gs://.../stage_1/corpus_A/document_001, skipping.
INFO - (Found 50 files in Stage 1 and 50 in Stage 2)
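The skip decision behind that log message can be approximated as below. Both helpers are hypothetical; counting only .json objects and treating "Stage 2 has at least as many files" as complete are assumptions, since the text only says that file counts are compared.

```python
def stage2_prefix_for(stage1_path: str, stage1_prefix: str = "stage_1", stage2_prefix: str = "stage_2") -> str:
    """Map gs://bucket/stage_1/... to gs://bucket/stage_2/... (first occurrence only)."""
    return stage1_path.replace(f"/{stage1_prefix}/", f"/{stage2_prefix}/", 1)


def should_skip(stage1_files: list[str], stage2_files: list[str]) -> bool:
    """Skip when Stage 1 produced pages and Stage 2 already has as many."""
    n1 = sum(name.endswith(".json") for name in stage1_files)
    n2 = sum(name.endswith(".json") for name in stage2_files)
    return n1 > 0 and n2 >= n1
```

A count comparison is cheap but coarse: it cannot tell which pages are missing, which is consistent with a partially processed directory being redone in full.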
To force re-processing, delete the existing Stage 2 result first:
gsutil -m rm -r gs://your-bucket/stage_2/corpus_A/document_001
python auto_run_stage2.py --gcs-path gs://your-bucket/stage_1/corpus_A/document_001

A short description of each user-facing field. Full default values are in the shipped config.yaml.
| Key | Default | Description |
|---|---|---|
| ocr.file_extensions | [".pdf"] | File extensions Stage 1 will look for in the input directory. |
| ocr.max_display_files | 20 | Maximum number of filenames listed in the log output. |
| ocr.confidence_threshold | 0.5 | Minimum confidence to retain a region after DocLayout-YOLO inference. |
| ocr.use_cache | true | Toggle the image-hash–keyed cache for repeated OCR calls. |
| ocr.cache_dir | "cache" | Directory used for the cache. |
| ocr.language_hints | ["ja","en","ko"] | Language priority hint passed to Google Vision OCR. |
| ocr.pdf_dpi | 200 | DPI used when rasterizing PDF pages with pdf2image. |
| ocr.iou_threshold | 0.5 | IoU threshold for merging duplicate same-type regions. |
| ocr.image_processing.vision_max_dim | 1600 | Max image dimension before calling Google Vision. |
| ocr.image_processing.gemini_max_dim | 1024 | Max image dimension before calling Gemini. |
| ocr.image_processing.jpeg_quality | 85 | JPEG re-encoding quality before API calls. |
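To make ocr.iou_threshold concrete, here is how intersection over union is computed for two boxes in the {x, y, width, height} form used by the page JSON. This is a generic sketch rather than the pipeline's merge code; `iou` and `is_duplicate` are hypothetical names, and using >= for the threshold comparison is an assumption.

```python
def iou(a: dict, b: dict) -> float:
    """Intersection over union of two boxes given as {x, y, width, height}."""
    # Overlap along each axis, clamped at zero when the boxes do not touch.
    ix = max(0, min(a["x"] + a["width"], b["x"] + b["width"]) - max(a["x"], b["x"]))
    iy = max(0, min(a["y"] + a["height"], b["y"] + b["height"]) - max(a["y"], b["y"]))
    inter = ix * iy
    union = a["width"] * a["height"] + b["width"] * b["height"] - inter
    return inter / union if union > 0 else 0.0


def is_duplicate(r1: dict, r2: dict, threshold: float = 0.5) -> bool:
    """Two regions count as duplicates when they share a type and overlap enough."""
    return r1["type"] == r2["type"] and iou(r1["coords"], r2["coords"]) >= threshold
```

Two 10×10 boxes offset by 5 pixels horizontally overlap in a 5×10 strip, giving 50 / 150 ≈ 0.33, which is below the default threshold of 0.5.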
| Key | Default | Description |
|---|---|---|
| gemini.model | "gemini-2.0-flash" | Gemini model used for figure/table analysis. |
| gemini.figure_prompt_path | "prompts/gemini_figure.txt" | Prompt file for figure regions. |
| gemini.table_prompt_path | "prompts/gemini_table.txt" | Prompt file for table regions. |
| stage2.model | "gpt-5-nano" | ChatGPT model used for Stage 2 correction. |
| stage2.max_tokens | 100000 | Max completion tokens per ChatGPT call. |
| stage2.system_prompt_path | "prompts/chatgpt_stage2.md" | System prompt for ChatGPT correction. |
| Key | Default | Description |
|---|---|---|
| doclayout_yolo.model_path | null | Local path to a model file. null means load from Hugging Face. |
| doclayout_yolo.device | "auto" | "auto" / "cuda:0" / "cpu". |
| doclayout_yolo.huggingface_repo_id | "juliozhao/DocLayout-YOLO-DocStructBench" | HF repo for the layout model. |
| doclayout_yolo.huggingface_filename | "doclayout_yolo_docstructbench_imgsz1024.pt" | Model weight filename. |
| doclayout_yolo.fallback_model | "yolov8n.pt" | Fallback to ultralytics YOLOv8 if DocLayout-YOLO fails to load. |
| doclayout_yolo.default_imgsz | 1024 | Input image size for layout inference. |
| doclayout_yolo.default_conf | 0.25 | Detection confidence threshold inside the YOLO model. |
| Key | Default | Description |
|---|---|---|
| gcs.bucket_name | "eju-ocr-results" | Target GCS bucket. Can be overridden via the GCS_BUCKET_NAME env var. |
| gcs.stage1_prefix | "stage_1" | GCS path prefix for Stage 1 output. |
| gcs.stage2_prefix | "stage_2" | GCS path prefix for Stage 2 output. |
| gcs.validate_max_results | 5 | Sample size when validating that a GCS path contains JSON files. |
| Key | Default | Description |
|---|---|---|
| docker.image_name | "cantaloupe" | Name of the Docker image built by docker_build.py. |
| docker.gpu_enabled | false | Whether to pass --gpus all to docker run. Requires NVIDIA Container Toolkit. |
| docker.runtime | "nvidia" | Runtime used when gpu_enabled is true. |
| docker.source_files | (map) | Files copied into the build context by docker_build.py. |
| Key | Default | Description |
|---|---|---|
| env_file_path | "/home/jupyter/Program/.env" | Location of the .env file containing API keys. |
| directories.credentials | "/home/jupyter/Program/credentials" | Directory containing the Google Cloud service-account JSON. |
| directories.docker_build | "/home/jupyter/Program/OCR/src/ocr" | Directory used as the Docker build context root. |
| credentials.google_vision_account | "Google_Vision_S.Account.json" | Filename of the service-account JSON inside directories.credentials. |
Runtime settings (models, thresholds, prompts) are read fresh on each run — no rebuild needed. Anything that affects the Docker image (e.g. docker.source_files) does require re-running docker_build.py.
Three prompt files in prompts/ control what each LLM is told:
| File | Sent to | Purpose |
|---|---|---|
| prompts/gemini_figure.txt | Gemini | Figure region → JSON with description, related topics, characteristics |
| prompts/gemini_table.txt | Gemini | Table region → JSON with markdown table, headers, row/col counts |
| prompts/chatgpt_stage2.md | ChatGPT | Region-array correction with strict structure preservation |
Edit these as plain text. Changes take effect on the next run — no rebuild needed.
The prompts/chatgpt_stage2.md prompt is load-bearing for safety. The pipeline depends on ChatGPT returning:
- The same number of regions as the input
- The same type, coords, and id values
- Only modified text fields
If your edits relax this requirement, the structure validation in ocr_stage_2.py:validate_region_structure will reject the response and fall back to the original regions — silently losing your corrections.
Keep the "CRITICAL REQUIREMENTS" and "VALIDATION CHECKLIST" sections intact unless you're also editing the validator.
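The kind of check the validator performs can be sketched as follows. This is not the body of validate_region_structure; `structure_preserved` is a hypothetical stand-in written from the three requirements listed above, and requiring text to be a string is an added assumption.

```python
def structure_preserved(original: list[dict], corrected: list[dict]) -> bool:
    """True when the two region lists differ at most in their text fields."""
    if len(original) != len(corrected):
        return False  # a region was added or dropped
    for before, after in zip(original, corrected):
        for key in ("type", "coords", "id"):
            if before.get(key) != after.get(key):
                return False  # a structural field was changed
        if not isinstance(after.get("text"), str):
            return False  # text must still be present as a string
    return True
```

A prompt edit that lets the model merge, split, or reorder regions would fail a check like this on every page, which is why the fallback to the originals would hide the problem rather than surface it.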
The Gemini prompts ask for JSON output. The code uses a regex r'(\{.*\})' to extract the JSON portion from the response. If you change the prompt format such that JSON is no longer the dominant structure, parsing will fail and the result text will degrade.
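The effect of that regex is easy to see in isolation. The sketch below uses the same pattern; `extract_json` is a hypothetical helper, and compiling with re.DOTALL is an assumption (without it the pattern cannot match JSON that spans several lines).

```python
import json
import re

# Greedy match from the first "{" to the last "}" in the response.
JSON_BLOCK = re.compile(r"(\{.*\})", re.DOTALL)


def extract_json(response_text: str):
    """Return the parsed JSON object embedded in a response, or None."""
    match = JSON_BLOCK.search(response_text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

Because the match is greedy, a response with stray braces in the surrounding prose, such as `note {like this} and {"a": 1}`, captures everything between the first and last brace and no longer parses. That is the failure mode to keep in mind when editing the Gemini prompts.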
Each page JSON has this shape:
{
  "page": 1,
  "regions": [
    {
      "type": "text",
      "coords": {"x": 100, "y": 200, "width": 300, "height": 50},
      "text": "Actual recognized text.",
      "id": "page_1_region_0"
    },
    {
      "type": "title",
      "coords": {"x": 100, "y": 300, "width": 400, "height": 30},
      "text": "Chapter heading",
      "id": "page_1_region_1"
    },
    {
      "type": "formula",
      "coords": {...},
      "text": "LaTeX: \\sqrt{a^2 + b^2}\nText: square root of a squared plus b squared",
      "id": "page_1_region_2"
    },
    {
      "type": "figure",
      "coords": {...},
      "text": "## Image Description:\n...\n\n## Related Topics:\n...\n\n## characteristics:\n...",
      "id": "page_1_region_3"
    },
    {
      "type": "table",
      "coords": {...},
      "text": "## Table Description: ...\n\n## Table Content:\n| col1 | col2 |\n|------|------|\n| ... | ... |\n\n## Table Info: 5 rows × 2 columns",
      "id": "page_1_region_4"
    }
  ]
}

Region types: text, title, list, formula, figure, table. Regions are sorted by Y-coordinate (top to bottom).
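The ordering and ID scheme can be reproduced in a couple of lines. This is a sketch, not the pipeline's code: `assign_region_ids` is a hypothetical name, the ID format is inferred from the example above, and how ties between regions with the same y are broken is an assumption (here the input order is kept).

```python
def assign_region_ids(page: int, regions: list[dict]) -> list[dict]:
    """Sort regions top to bottom and number them page_<page>_region_<index>."""
    # sorted() is stable, so regions with equal y keep their input order.
    ordered = sorted(regions, key=lambda r: r["coords"]["y"])
    return [{**r, "id": f"page_{page}_region_{i}"} for i, r in enumerate(ordered)]
```

The function returns new dictionaries and leaves the input list untouched.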
from google.cloud import storage
import json
client = storage.Client()
bucket = client.bucket('your-bucket')
blob = bucket.blob('stage_2/corpus_A/document_001/page_001.json')
page = json.loads(blob.download_as_text())
# All text content, in reading order
combined = "\n\n".join(r['text'] for r in page['regions'])
print(combined)
# Only formulas
formulas = [r for r in page['regions'] if r['type'] == 'formula']

mkdir -p ./local_results
gsutil -m cp -r gs://your-bucket/stage_2/corpus_A/document_001 ./local_results/

for f in ./local_results/document_001/page_*.json; do
  out="${f%.json}.txt"
  jq -r '.regions[].text' "$f" > "$out"
done

The Stage 2 output is already structured (JSON, per-region with type tags). Common downstream uses:
- Build a (image_path, text) dataset for OCR fine-tuning.
- Extract only formula regions for a math-specific corpus.
- Use figure/table regions' Gemini descriptions as captions for multimodal training.
Stage 1 is not idempotent. Re-running it on the same input re-processes everything and re-uploads to the same GCS paths, overwriting the previous results.
Stage 2 is idempotent by file count. Re-running it against a Stage 1 path that has already been fully processed skips that path.
If Stage 1 has more files than Stage 2 (e.g. Stage 2 was interrupted), Stage 2 will re-process the whole directory (not just the missing files). To force a clean re-run, delete the partial Stage 2 result:
gsutil -m rm -r gs://your-bucket/stage_2/corpus_A/document_001

To re-run a single PDF, move it to a fresh directory and use mode: direct. See Stage 1 Scenario C.
Edit ocr.language_hints in config.yaml. Google Vision accepts ISO 639-1 codes (en, ja, ko, zh, de, fr, ...). Order matters — first listed has priority.
Either edit gcs.bucket_name in config.yaml, or set GCS_BUCKET_NAME=other-bucket in .env (env var overrides config).
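The precedence between the environment variable and the config can be expressed as a small helper. This is a sketch with a hypothetical `resolve_bucket_name` function; treating a blank GCS_BUCKET_NAME as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_bucket_name(config: dict, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the bucket from GCS_BUCKET_NAME if set, else gcs.bucket_name from config."""
    env = os.environ if env is None else env
    override = env.get("GCS_BUCKET_NAME", "").strip()
    return override or config["gcs"]["bucket_name"]
```

Passing `env` explicitly keeps the function easy to test; in normal use it reads the process environment.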
Edit stage2.model or gemini.model. Make sure the model name is one your API key can access.
Edit gcs.stage1_prefix and gcs.stage2_prefix. The defaults stage_1 / stage_2 become whatever you set.
If Stage 2 cannot find the path you gave it, either the --gcs-path doesn't match what Stage 1 actually wrote, or Unicode normalization is involved (Korean / Japanese paths uploaded from macOS arrive as NFD). ocr_stage_2.py tries NFC then NFD; if neither matches, check the gsutil ls output and copy-paste the exact path.
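The NFC-then-NFD lookup can be illustrated with the standard library. This sketch does not reproduce ocr_stage_2.py: `candidate_paths` and `find_existing` are hypothetical helpers, and the set of existing paths stands in for a GCS listing.

```python
import unicodedata
from typing import Optional


def candidate_paths(path: str) -> list[str]:
    """Return the NFC form first, then the NFD form if it differs."""
    nfc = unicodedata.normalize("NFC", path)
    nfd = unicodedata.normalize("NFD", path)
    return [nfc] if nfc == nfd else [nfc, nfd]


def find_existing(path: str, existing: set[str]) -> Optional[str]:
    """Return the first normalization of path that is present in existing."""
    for candidate in candidate_paths(path):
        if candidate in existing:
            return candidate
    return None
```

Pure ASCII paths have identical NFC and NFD forms, so only one lookup is needed for them; Hangul and kana with voicing marks are the typical cases where the two forms differ.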
Stage 1: Ctrl+C on the host — the running Docker container is killed. Partially uploaded page JSONs remain in GCS.
Stage 2: Ctrl+C cancels the worker threads. Partially uploaded results remain in GCS; on a re-run, directories that are already complete are skipped and any partially processed directory is re-processed in full.
Reduce stage2.parallel_workers to 10–30, or override at the CLI: python auto_run_stage2.py --workers 15.
Running the pipeline without GCS is not currently supported. Stage 2 is hardcoded to read from and write to GCS. Stage 1 must also upload to GCS for Stage 2 to find anything to correct.
The shipped wrapper goes through Docker. You can import advanced_ocr directly on a host with the right Python environment, but that bypasses the per-directory container isolation and reproducibility the project is designed around.