Usage Guide — Versatile-OCR-Program v3.0_initial

How to run the pipeline once setup is complete. If you haven't completed setup yet, see setup_guide.md first.


Quick start

After setup, processing a corpus of PDFs takes four steps:

# 1. (one time) Place PDFs in src/input/
ls src/input/                                            # confirm your PDFs are there

# 2. Edit auto_run.yaml — point stage1.input_directory at src/input
nano auto_run.yaml

# 3. Run Stage 1 (Docker container → OCR → upload to GCS)
python auto_run_stage1.py --config auto_run.yaml

# 4. Run Stage 2 (read from GCS → ChatGPT correction → write back to GCS)
python auto_run_stage2.py --config auto_run.yaml

Results land in gs://<bucket>/stage_1/... (raw) and gs://<bucket>/stage_2/... (corrected).


Pipeline at a glance

PDFs (src/input/)
   │
   ▼
[Stage 1]  auto_run_stage1.py
              │  Recursively scans for PDF directories
              │
              ▼  (one Docker container per directory)
           ocr_stage_1.py  →  docker run cantaloupe  →  advanced_ocr.py (inside container)
                                                          │
                                                          ▼  Per page:
                                                          │   - DocLayout-YOLO region detection
                                                          │   - text/title/list → Google Vision OCR
                                                          │   - formula        → MathPix
                                                          │   - figure / table → Gemini
                                                          ▼
                                                  GCS: stage_1/{relpath}/{pdf}/page_NNN.json
   ────────────────────────────────────────────────────────┘
   │
   ▼
[Stage 2]  auto_run_stage2.py
              │  ThreadPoolExecutor — processes multiple GCS paths in parallel
              │
              ▼
           ocr_stage_2.py
              │  Reads each page JSON from GCS
              │  → ChatGPT with strict structure-preserving prompt
              │  → 3-tier safety net (parse / structure validation / exception → fallback to originals)
              │  → Re-assigns region IDs
              ▼
        GCS: stage_2/{relpath}/{pdf}/page_NNN.json

Stage 1 — auto_run_stage1.py

Wrapper around src/stages/ocr_stage_1.py that scans a directory tree for PDFs and launches a Docker container per directory.

Command

python auto_run_stage1.py [--config CONFIG_PATH]

Arguments

| Flag | Default | Description |
|------|---------|-------------|
| --config | auto_run.yaml | Path to the auto-run configuration file. |

That's the only CLI flag — everything else is read from auto_run.yaml.

auto_run.yaml — Stage 1 fields

stage1:
  input_directory: "src/input"   # Where to look for PDFs (relative to cwd or absolute)
  mode: "recursive"              # "recursive" or "direct"

| Field | Values | Description |
|-------|--------|-------------|
| stage1.input_directory | path | Directory to scan. Can be relative or absolute. |
| stage1.mode | recursive / direct | recursive: walks the entire tree, finds every directory that contains PDFs, and runs Stage 1 once per directory. direct: treats input_directory as the only directory to process. |
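
The two modes can be pictured with a small directory scan. This is a minimal sketch, not the code in auto_run_stage1.py: the helper name find_pdf_directories and its parameters are hypothetical, and it assumes the hidden-directory rule (names starting with a dot are skipped) applies to every level of the tree.

```python
from pathlib import Path


def find_pdf_directories(input_directory, mode="recursive", extensions=(".pdf",)):
    """Return the directories that would each get one Stage 1 run (hypothetical helper)."""
    root = Path(input_directory)
    if mode == "direct":
        # direct: input_directory is the only directory processed
        return [root]
    if mode != "recursive":
        raise ValueError(f"unknown mode: {mode!r}")

    found = []
    candidates = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    for directory in candidates:
        # Assumption: anything at or below a dot-prefixed directory is skipped
        relative_parts = directory.relative_to(root).parts
        if any(part.startswith(".") for part in relative_parts):
            continue
        has_pdf = any(
            child.is_file() and child.suffix.lower() in extensions
            for child in directory.iterdir()
        )
        if has_pdf:
            found.append(directory)
    return found
```

In recursive mode, each returned directory corresponds to one Docker container; directories without PDFs are simply not returned.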

Scenarios

Scenario A — Process a corpus organized into subdirectories

src/input/
├── corpus_A/document_001.pdf
├── corpus_A/document_002.pdf
└── corpus_B/document_001.pdf

stage1:
  input_directory: "src/input"
  mode: "recursive"

python auto_run_stage1.py

Spawns 2 Docker containers (one per subdirectory). GCS output: stage_1/corpus_A/document_001/page_NNN.json, etc.

Scenario B — Process only one directory

stage1:
  input_directory: "src/input/corpus_A"
  mode: "direct"

One container, only the corpus_A/ directory.

Scenario C — Re-run a single PDF

There's no native single-file mode. Copy the PDF to a fresh directory and use direct mode:

mkdir -p src/input/_rerun
cp src/input/corpus_A/document_001.pdf src/input/_rerun/
# auto_run.yaml: input_directory: "src/input/_rerun", mode: "direct"
python auto_run_stage1.py

GCS path will be stage_1/_rerun/document_001/page_NNN.json — to keep the original path, manually gsutil cp the result.

Scenario D — Recursive but skip a sub-tree

The scanner automatically skips hidden directories (any whose name starts with ., e.g. .ipynb_checkpoints). To exclude a regular directory, rename it with a leading dot:

mv src/input/old_data src/input/.old_data

What you'll see at runtime

See setup_guide.md §7.1 — the "First run — what you should see" section.

When Stage 1 finishes

Each PDF's pages end up as JSON files in GCS:

gs://<bucket>/stage_1/{relative_path}/{pdf_name}/page_001.json
gs://<bucket>/stage_1/{relative_path}/{pdf_name}/page_002.json
...

The {relative_path} mirrors your input_directory tree.
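
The path layout above can be reproduced with a small formatting helper, which is handy when a downstream script needs to predict where a page will land. This is only a sketch: the function name page_object_name is hypothetical, the three-digit zero padding is inferred from the page_001.json examples, and how the pipeline derives relative_path from input_directory is not shown here, so it is passed in explicitly.

```python
def page_object_name(relative_path, pdf_name, page_number, prefix="stage_1"):
    """Build the object name (without the gs://<bucket>/ part) for one page JSON."""
    parts = [
        prefix,
        relative_path.strip("/"),
        pdf_name,
        f"page_{page_number:03d}.json",  # assumed 3-digit padding, as in page_001.json
    ]
    # Drop empty components so an empty relative_path does not produce "//"
    return "/".join(part for part in parts if part)


print(page_object_name("corpus_A", "document_001", 1))
```

Passing prefix="stage_2" gives the matching Stage 2 location for the same page.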


Stage 2 — auto_run_stage2.py

Wrapper around src/stages/ocr_stage_2.py that processes multiple GCS paths in parallel via ThreadPoolExecutor.

Command

python auto_run_stage2.py [--config CONFIG_PATH] [--gcs-path PATH] [--workers N] [--dry-run]

Arguments

| Flag | Default | Description |
|------|---------|-------------|
| --config | auto_run.yaml | Path to the auto-run configuration file. |
| --gcs-path | (from config) | Override the GCS paths from config — process only this single path. |
| --workers | (from config) | Override stage2.parallel_workers. |
| --dry-run | off | Don't actually call ChatGPT. Lists the directories that would be processed. |

auto_run.yaml — Stage 2 fields

stage2:
  gcs_input_paths:
    - "gs://your-bucket/stage_1/corpus_A"
    - "gs://your-bucket/stage_1/corpus_B"
  parallel_workers: 70

| Field | Description |
|-------|-------------|
| stage2.gcs_input_paths | List of GCS prefixes to process. Each gets its own worker. |
| stage2.parallel_workers | ThreadPoolExecutor size. |
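
The relationship between the two fields can be sketched with the standard library's ThreadPoolExecutor. This is an illustration under assumptions, not the code in auto_run_stage2.py: run_in_parallel and process_path are hypothetical names, and the real wrapper may handle errors differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(gcs_paths, process_path, parallel_workers):
    """Submit one task per GCS path; at most parallel_workers run at the same time."""
    results = {}
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        futures = {pool.submit(process_path, path): path for path in gcs_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # Assumption: a failure on one path should not stop the others
                results[path] = exc
    return results
```

With fewer workers than paths, the remaining paths wait in the pool's queue until a worker is free, which is why lowering parallel_workers reduces the number of simultaneous API calls.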

Scenarios

Scenario A — Process every Stage 1 output

stage2:
  gcs_input_paths:
    - "gs://your-bucket/stage_1/corpus_A"
    - "gs://your-bucket/stage_1/corpus_B"
  parallel_workers: 30

python auto_run_stage2.py

Scenario B — Re-run only one path (CLI override)

python auto_run_stage2.py --gcs-path "gs://your-bucket/stage_1/corpus_A/document_001"

Scenario C — Validate the config without making API calls

python auto_run_stage2.py --dry-run

Prints the directories that would be processed; no OpenAI calls.

Scenario D — Reduce parallelism

python auto_run_stage2.py --workers 10

Idempotent re-runs

Stage 2 automatically skips directories where it's already complete (compares file counts in stage_1/ vs stage_2/):

INFO - Stage 2 already complete for gs://.../stage_1/corpus_A/document_001, skipping.
INFO -   (Found 50 files in Stage 1 and 50 in Stage 2)

To force re-processing, delete the existing Stage 2 result first:

gsutil -m rm -r gs://your-bucket/stage_2/corpus_A/document_001
python auto_run_stage2.py --gcs-path gs://your-bucket/stage_1/corpus_A/document_001
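
The file-count comparison can be sketched as a pure function over object listings. This is a simplified illustration: the function names are hypothetical, the listings are passed in as plain lists of object names rather than fetched from GCS, and treating "complete" as "non-empty and equal counts" is an assumption based on the log lines above.

```python
def count_page_files(object_names):
    """Count the JSON objects in a listing."""
    return sum(1 for name in object_names if name.endswith(".json"))


def should_skip_stage2(stage1_objects, stage2_objects):
    """Return True when Stage 2 looks complete for this directory."""
    stage1_count = count_page_files(stage1_objects)
    stage2_count = count_page_files(stage2_objects)
    # Assumption: an empty Stage 1 listing is never treated as "complete"
    return stage1_count > 0 and stage1_count == stage2_count
```

Because only counts are compared, deleting the Stage 2 directory (as in the gsutil command above) is enough to make the check fail and trigger re-processing.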

config.yaml reference

A short description of each user-facing field. Full default values are in the shipped config.yaml.

OCR

| Key | Default | Description |
|-----|---------|-------------|
| ocr.file_extensions | [".pdf"] | File extensions Stage 1 will look for in the input directory. |
| ocr.max_display_files | 20 | Maximum number of filenames listed in the log output. |
| ocr.confidence_threshold | 0.5 | Minimum confidence to retain a region after DocLayout-YOLO inference. |
| ocr.use_cache | true | Toggle the image-hash–keyed cache for repeated OCR calls. |
| ocr.cache_dir | "cache" | Directory used for the cache. |
| ocr.language_hints | ["ja","en","ko"] | Language priority hint passed to Google Vision OCR. |
| ocr.pdf_dpi | 200 | DPI used when rasterizing PDF pages with pdf2image. |
| ocr.iou_threshold | 0.5 | IoU threshold for merging duplicate same-type regions. |
| ocr.image_processing.vision_max_dim | 1600 | Max image dimension before calling Google Vision. |
| ocr.image_processing.gemini_max_dim | 1024 | Max image dimension before calling Gemini. |
| ocr.image_processing.jpeg_quality | 85 | JPEG re-encoding quality before API calls. |
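
To make ocr.iou_threshold concrete, the sketch below computes intersection-over-union for two boxes in the {x, y, width, height} form used by the output JSON, then applies the threshold. The helper names are hypothetical, and whether the pipeline compares with >= or > is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as {x, y, width, height} dicts."""
    inter_w = max(0, min(a["x"] + a["width"], b["x"] + b["width"]) - max(a["x"], b["x"]))
    inter_h = max(0, min(a["y"] + a["height"], b["y"] + b["height"]) - max(a["y"], b["y"]))
    intersection = inter_w * inter_h
    union = a["width"] * a["height"] + b["width"] * b["height"] - intersection
    return intersection / union if union > 0 else 0.0


def is_duplicate(region_a, region_b, iou_threshold=0.5):
    """Two regions count as duplicates if they share a type and overlap enough."""
    if region_a["type"] != region_b["type"]:
        return False
    # Assumption: the threshold is inclusive
    return iou(region_a["coords"], region_b["coords"]) >= iou_threshold
```

For example, two 10×10 boxes offset horizontally by 5 overlap in a 5×10 strip, so their IoU is 50 / 150 ≈ 0.33, below the default threshold of 0.5.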

Models

| Key | Default | Description |
|-----|---------|-------------|
| gemini.model | "gemini-2.0-flash" | Gemini model used for figure/table analysis. |
| gemini.figure_prompt_path | "prompts/gemini_figure.txt" | Prompt file for figure regions. |
| gemini.table_prompt_path | "prompts/gemini_table.txt" | Prompt file for table regions. |
| stage2.model | "gpt-5-nano" | ChatGPT model used for Stage 2 correction. |
| stage2.max_tokens | 100000 | Max completion tokens per ChatGPT call. |
| stage2.system_prompt_path | "prompts/chatgpt_stage2.md" | System prompt for ChatGPT correction. |

DocLayout-YOLO

| Key | Default | Description |
|-----|---------|-------------|
| doclayout_yolo.model_path | null | Local path to a model file. null means load from Hugging Face. |
| doclayout_yolo.device | "auto" | "auto" / "cuda:0" / "cpu". |
| doclayout_yolo.huggingface_repo_id | "juliozhao/DocLayout-YOLO-DocStructBench" | HF repo for the layout model. |
| doclayout_yolo.huggingface_filename | "doclayout_yolo_docstructbench_imgsz1024.pt" | Model weight filename. |
| doclayout_yolo.fallback_model | "yolov8n.pt" | Fallback to ultralytics YOLOv8 if DocLayout-YOLO fails to load. |
| doclayout_yolo.default_imgsz | 1024 | Input image size for layout inference. |
| doclayout_yolo.default_conf | 0.25 | Detection confidence threshold inside the YOLO model. |

GCS

| Key | Default | Description |
|-----|---------|-------------|
| gcs.bucket_name | "eju-ocr-results" | Target GCS bucket. Can be overridden via the GCS_BUCKET_NAME env var. |
| gcs.stage1_prefix | "stage_1" | GCS path prefix for Stage 1 output. |
| gcs.stage2_prefix | "stage_2" | GCS path prefix for Stage 2 output. |
| gcs.validate_max_results | 5 | Sample size when validating that a GCS path contains JSON files. |

Docker

| Key | Default | Description |
|-----|---------|-------------|
| docker.image_name | "cantaloupe" | Name of the Docker image built by docker_build.py. |
| docker.gpu_enabled | false | Whether to pass --gpus all to docker run. Requires NVIDIA Container Toolkit. |
| docker.runtime | "nvidia" | Runtime used when gpu_enabled is true. |
| docker.source_files | (map) | Files copied into the build context by docker_build.py. |

Paths

| Key | Default | Description |
|-----|---------|-------------|
| env_file_path | "/home/jupyter/Program/.env" | Location of the .env file containing API keys. |
| directories.credentials | "/home/jupyter/Program/credentials" | Directory containing the Google Cloud service-account JSON. |
| directories.docker_build | "/home/jupyter/Program/OCR/src/ocr" | Directory used as the Docker build context root. |
| credentials.google_vision_account | "Google_Vision_S.Account.json" | Filename of the service-account JSON inside directories.credentials. |

Runtime settings (models, thresholds, prompts) are read fresh on each run — no rebuild needed. Anything that affects the Docker image (e.g. docker.source_files) does require re-running docker_build.py.


Editing the prompts

Three prompt files in prompts/ control what each LLM is told:

| File | Sent to | Purpose |
|------|---------|---------|
| prompts/gemini_figure.txt | Gemini | Figure region → JSON with description, related topics, characteristics |
| prompts/gemini_table.txt | Gemini | Table region → JSON with markdown table, headers, row/col counts |
| prompts/chatgpt_stage2.md | ChatGPT | Region-array correction with strict structure preservation |

Edit these as plain text. Changes take effect on the next run — no rebuild needed.

When editing chatgpt_stage2.md (Stage 2 system prompt)

This prompt is load-bearing for safety. The pipeline depends on ChatGPT returning:

  • The same number of regions as input
  • Same type, coords, and id values
  • Only modified text fields

If your edits relax this requirement, the structure validation in ocr_stage_2.py:validate_region_structure will reject the response and fall back to the original regions — silently losing your corrections.

Keep the "CRITICAL REQUIREMENTS" and "VALIDATION CHECKLIST" sections intact unless you're also editing the validator.
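
To see why a relaxed prompt loses corrections, here is a minimal sketch of a structure check with fallback. It is not the project's validate_region_structure: the function names are hypothetical, and it assumes the model's reply is a JSON array of region objects.

```python
import json


def same_structure(original_regions, corrected_regions):
    """True if the two lists differ, at most, in their text fields."""
    if not isinstance(corrected_regions, list):
        return False
    if len(original_regions) != len(corrected_regions):
        return False
    for original, corrected in zip(original_regions, corrected_regions):
        if not isinstance(corrected, dict):
            return False
        for key in ("type", "coords", "id"):
            if original.get(key) != corrected.get(key):
                return False
    return True


def correct_or_fallback(original_regions, raw_response):
    """Return the corrected regions, or the originals if anything looks wrong."""
    try:
        corrected = json.loads(raw_response)  # tier 1: the reply must parse
    except (TypeError, ValueError):
        return original_regions
    try:
        if not same_structure(original_regions, corrected):  # tier 2: structure
            return original_regions
        return corrected
    except Exception:  # tier 3: any unexpected error also falls back
        return original_regions
```

Note that the fallback returns the uncorrected regions without raising, so a prompt that lets the model merge, drop, or re-type regions produces output that looks valid but contains no corrections.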

When editing Gemini prompts

The Gemini prompts ask for JSON output. The code uses a regex r'(\{.*\})' to extract the JSON portion from the response. If you change the prompt format such that JSON is no longer the dominant structure, parsing will fail and the result text will degrade.
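
The behaviour of that pattern is easy to check in isolation. The sketch below uses the same regex; the re.DOTALL flag and the decision to return None on failure are assumptions for illustration, not confirmed details of the pipeline.

```python
import json
import re

# Same pattern as quoted above; DOTALL (an assumption) lets "." span newlines
JSON_PATTERN = re.compile(r'(\{.*\})', re.DOTALL)


def extract_json(response_text):
    """Return the parsed JSON object found in a model reply, or None."""
    match = JSON_PATTERN.search(response_text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

Because the pattern is greedy, it captures everything from the first { to the last }. A reply containing one JSON object surrounded by prose parses fine, but a reply with two separate objects, or with braces in the surrounding prose, yields a span that is not valid JSON.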


Working with results

Output JSON structure

Each page JSON has this shape:

{
  "page": 1,
  "regions": [
    {
      "type": "text",
      "coords": {"x": 100, "y": 200, "width": 300, "height": 50},
      "text": "Actual recognized text.",
      "id": "page_1_region_0"
    },
    {
      "type": "title",
      "coords": {"x": 100, "y": 300, "width": 400, "height": 30},
      "text": "Chapter heading",
      "id": "page_1_region_1"
    },
    {
      "type": "formula",
      "coords": {...},
      "text": "LaTeX: \\sqrt{a^2 + b^2}\nText: square root of a squared plus b squared",
      "id": "page_1_region_2"
    },
    {
      "type": "figure",
      "coords": {...},
      "text": "## Image Description:\n...\n\n## Related Topics:\n...\n\n## characteristics:\n...",
      "id": "page_1_region_3"
    },
    {
      "type": "table",
      "coords": {...},
      "text": "## Table Description: ...\n\n## Table Content:\n| col1 | col2 |\n|------|------|\n| ... | ... |\n\n## Table Info: 5 rows × 2 columns",
      "id": "page_1_region_4"
    }
  ]
}

Region types: text, title, list, formula, figure, table. Regions are sorted by Y-coordinate (top to bottom).
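
The ordering and ID scheme can be reproduced on downloaded pages, for example after filtering regions. This is a sketch under assumptions: renumber_regions is a hypothetical helper, and the page_{page}_region_{index} format is inferred from the sample above rather than taken from the pipeline code.

```python
def renumber_regions(page_number, regions):
    """Sort regions top to bottom and assign sequential IDs."""
    # sorted() is stable, so regions with the same y keep their relative order
    ordered = sorted(regions, key=lambda region: region["coords"]["y"])
    return [
        {**region, "id": f"page_{page_number}_region_{index}"}
        for index, region in enumerate(ordered)
    ]
```

The function returns new dictionaries, so the input list is left unchanged.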

Reading results in Python

from google.cloud import storage
import json

client = storage.Client()
bucket = client.bucket('your-bucket')
blob = bucket.blob('stage_2/corpus_A/document_001/page_001.json')
page = json.loads(blob.download_as_text())

# All text content, in reading order
combined = "\n\n".join(r['text'] for r in page['regions'])
print(combined)

# Only formulas
formulas = [r for r in page['regions'] if r['type'] == 'formula']

Bulk-download a directory

mkdir -p ./local_results
gsutil -m cp -r gs://your-bucket/stage_2/corpus_A/document_001 ./local_results/

Convert each page to plain text

for f in ./local_results/document_001/page_*.json; do
  out="${f%.json}.txt"
  jq -r '.regions[].text' "$f" > "$out"
done
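
Where jq is not available, the same conversion can be done in Python. This is a sketch that mirrors the loop above; pages_to_text is a hypothetical helper name, and it assumes every region has a text field.

```python
import json
from pathlib import Path


def pages_to_text(directory):
    """Write a .txt file next to each page_*.json and return the new paths."""
    written = []
    for json_path in sorted(Path(directory).glob("page_*.json")):
        page = json.loads(json_path.read_text(encoding="utf-8"))
        # One region per line block, matching jq -r '.regions[].text'
        text = "".join(region["text"] + "\n" for region in page["regions"])
        out_path = json_path.with_suffix(".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Usage: pages_to_text("./local_results/document_001").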

Use as ML training data

The Stage 2 output is already structured (JSON, per-region with type tags). Common downstream uses:

  • Build a (image_path, text) dataset for OCR fine-tuning.
  • Extract only formula regions for a math-specific corpus.
  • Use figure / table regions' Gemini descriptions as captions for multimodal training.
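
As an illustration of the formula use case, the sketch below pulls the LaTeX and plain-text parts out of formula regions. It assumes the "LaTeX: ... / Text: ..." layout shown in the sample page JSON, with each part on a single line; the function names are hypothetical.

```python
def split_formula_text(text):
    """Split a formula region's text into its LaTeX and spoken-text parts."""
    fields = {"latex": None, "text": None}
    for line in text.splitlines():
        if line.startswith("LaTeX:"):
            fields["latex"] = line[len("LaTeX:"):].strip()
        elif line.startswith("Text:"):
            fields["text"] = line[len("Text:"):].strip()
    return fields


def formula_corpus(pages):
    """Collect the parsed formulas from a list of page dictionaries."""
    return [
        split_formula_text(region["text"])
        for page in pages
        for region in page["regions"]
        if region["type"] == "formula"
    ]
```

Multi-line LaTeX would need a more careful parser; this version keeps only the first line after each label.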

Re-execution and idempotency

Stage 1

Not idempotent. Re-running Stage 1 on the same input re-processes everything and re-uploads to the same GCS paths (overwriting).

Stage 2

Idempotent by file count. Re-running Stage 2 against a Stage 1 path that has already been fully processed by Stage 2 is skipped.

If Stage 1 has more files than Stage 2 (e.g. Stage 2 was interrupted), Stage 2 will re-process the whole directory (not just the missing files). To force a clean re-run, delete the partial Stage 2 result:

gsutil -m rm -r gs://your-bucket/stage_2/corpus_A/document_001

FAQ — common scenarios

"I only want to OCR a single PDF, not a whole directory."

Move it to a fresh directory and use mode: direct. See Stage 1 Scenario C.

"I want to add a new language to OCR."

Edit ocr.language_hints in config.yaml. Google Vision language hints are BCP-47 language codes, which include the two-letter ISO 639-1 codes (en, ja, ko, zh, de, fr, ...). Order matters: the first language listed has priority.

"I want to use a different GCS bucket."

Either edit gcs.bucket_name in config.yaml, or set GCS_BUCKET_NAME=other-bucket in .env (env var overrides config).
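
The precedence can be sketched as follows. This is an illustration only: resolve_bucket_name is a hypothetical helper, the config is assumed to be the parsed config.yaml as a nested dictionary, and treating an empty GCS_BUCKET_NAME as "not set" is an assumption.

```python
import os


def resolve_bucket_name(config, environ=None):
    """Return the bucket name, letting GCS_BUCKET_NAME override config.yaml."""
    environ = os.environ if environ is None else environ
    override = environ.get("GCS_BUCKET_NAME")
    if override:  # assumption: an empty value does not count as an override
        return override
    return config["gcs"]["bucket_name"]
```

Passing environ explicitly makes the function easy to test without touching the real environment.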

"I want to use a different ChatGPT or Gemini model."

Edit stage2.model or gemini.model. Make sure the model name is one your API key can access.

"The output path structure doesn't match my downstream tool."

Edit gcs.stage1_prefix and gcs.stage2_prefix. The defaults stage_1 / stage_2 become whatever you set.

"Stage 1 ran but Stage 2 says no JSON files found."

Either the --gcs-path doesn't match what Stage 1 actually wrote, or Unicode normalization is involved (Korean / Japanese paths uploaded from macOS arrive as NFD). ocr_stage_2.py tries NFC then NFD; if neither matches, check gsutil ls output and copy-paste the exact path.
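
To check whether normalization is the culprit, both forms of a path can be generated with the standard library and compared against the gsutil ls output. This is a sketch of the idea, not the lookup code in ocr_stage_2.py; normalization_candidates is a hypothetical name.

```python
import unicodedata


def normalization_candidates(path):
    """Return the NFC and NFD forms of a path, without duplicates, NFC first."""
    candidates = []
    for form in ("NFC", "NFD"):
        normalized = unicodedata.normalize(form, path)
        if normalized not in candidates:
            candidates.append(normalized)
    return candidates


for candidate in normalization_candidates("stage_1/한국어/document_001"):
    # The two forms render identically but differ in length
    print(len(candidate), candidate.encode("utf-8"))
```

Pure ASCII paths produce a single candidate, so two candidates in the output means the path is sensitive to normalization.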

"How do I cancel a long-running job?"

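Stage 1: press Ctrl+C on the host; the running Docker container is killed. Partially uploaded page JSONs remain in GCS.

Stage 2: press Ctrl+C; worker threads are cancelled. Partially uploaded results remain in GCS. On a re-run, directories that are already complete are skipped, and any incomplete directory is re-processed from the start (not just its missing files).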

"ChatGPT is hitting rate limits with 70 workers."

Reduce stage2.parallel_workers to 10–30, or override at the CLI: python auto_run_stage2.py --workers 15.

"Can I run Stage 2 without GCS?"

Not currently. Stage 2 is hardcoded to read from and write to GCS. Stage 1 must also upload to GCS for Stage 2 to find anything to correct.

"Can I run Stage 1 without Docker?"

The shipped wrapper goes through Docker. You can import advanced_ocr directly on a host with the right Python environment, but that bypasses the per-directory container isolation and reproducibility the project is designed around.


Where to ask for help