VLM4OCR v0.5.0

daviden1013 released this 05 May 04:45

📐Documentation Site

The user guide and API reference are available on the documentation site.

⚡Highlights

New bbox output mode

OCREngine now supports output_mode="bbox", which returns the OCR text together with bounding-box coordinates and labels for each detected region. This unlocks two complementary workflows:

  • Full-text bbox OCR — leave user_prompt empty (or pass None). The model transcribes the entire page and returns one box per region (line/word/segment, depending on the VLM family).
  • Targeted extraction — set user_prompt to a free-text instruction such as "Extract patient name and date of birth". Only regions matching that instruction are returned.

from vlm4ocr import VLLMVLMEngine, OCREngine

vlm_engine = VLLMVLMEngine(model="Qwen/Qwen3-VL-30B-A3B-Instruct")
image_path = "path/to/scan.png"  # placeholder: replace with the file to OCR

# Full-text OCR with bounding boxes
ocr = OCREngine(vlm_engine=vlm_engine, output_mode="bbox")
ocr_results = ocr.sequential_ocr(image_path)

# Targeted extraction with bounding boxes
ocr = OCREngine(
    vlm_engine=vlm_engine,
    output_mode="bbox",
    user_prompt="Extract patient name and date of birth",
)
ocr_results = ocr.sequential_ocr(image_path)

Each page in the result exposes a .bboxes list of BBoxItem records (bbox=[x1, y1, x2, y2], label, text) and a .plot_bboxes() helper that returns a PIL.Image.Image with the boxes drawn on top — handy for quickly visualizing or saving annotated pages.

# Inspect bbox results
for page_num, page in enumerate(ocr_results[0].pages):
    for item in page.bboxes:
        print(item.label, item.bbox, item.text)

    # Visualize on the source page
    annotated = page.plot_bboxes()
    annotated.save(f"annotated_page{page_num}.png")

Note: the example output below is synthesized data and contains no real PHI.

[
	{"bbox_2d": [508, 143, 620, 158], "label": "patient name", "text": "Mary Johnson"},
	{"bbox_2d": [508, 160, 610, 176], "label": "date of birth", "text": "Jan 29, 1990"},
	{"bbox_2d": [484, 405, 569, 421], "label": "Platelets", "text": "280 x10⁹/L"},
	{"bbox_2d": [484, 455, 560, 470], "label": "CMP glucose", "text": "90 mg/dL"},
	{"bbox_2d": [484, 540, 562, 555], "label": "BUN", "text": "12 mg/dL"},
	{"bbox_2d": [484, 778, 690, 794], "label": "RBCs", "text": "0 - 1 per high power field"}
]
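
Whether the coordinates in a raw response are pixels or values on a normalized grid depends on the VLM family. As a rough sketch, assuming a response whose boxes sit on a 0 to 1000 grid in [x1, y1, x2, y2] order, the conversion to pixel coordinates might look like the following. The to_pixels helper and the 2000 x 3000 page size are hypothetical and not part of vlm4ocr; the library handles this step internally.

```python
import json

# Two entries copied from the synthesized example above
raw_response = """[
  {"bbox_2d": [508, 143, 620, 158], "label": "patient name", "text": "Mary Johnson"},
  {"bbox_2d": [508, 160, 610, 176], "label": "date of birth", "text": "Jan 29, 1990"}
]"""


def to_pixels(bbox, image_width, image_height, scale=1000):
    """Map [x1, y1, x2, y2] from a 0..scale grid to pixel coordinates.

    Hypothetical helper: assumes x-first axis order and a shared scale
    for both axes.
    """
    x1, y1, x2, y2 = bbox
    return [
        round(x1 * image_width / scale),
        round(y1 * image_height / scale),
        round(x2 * image_width / scale),
        round(y2 * image_height / scale),
    ]


items = json.loads(raw_response)

# Assumed page size in pixels, chosen only for illustration
pixel_boxes = {
    item["label"]: to_pixels(item["bbox_2d"], image_width=2000, image_height=3000)
    for item in items
}

for label, box in pixel_boxes.items():
    print(label, box)
```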

Per-VLM bbox formats

Different VLM families emit boxes with different conventions (key names, axis order, coordinate scale). vlm4ocr resolves the right format automatically from the model name through a registry, with built-in support for Qwen3-VL, Gemma 3/4, and GPT-4.1. You can override the resolution by passing a custom BBoxFormat:

from vlm4ocr import OCREngine, BBoxFormat

ocr = OCREngine(
    vlm_engine=vlm_engine,
    output_mode="bbox",
    bbox_format=BBoxFormat(
        coord_scale="normalized_1000",
        axis_order="x0y0x1y1",
        bbox_key="bbox_2d",
        label_key="label",
        text_key="text",
    ),
)

If the model name matches no registered pattern, a default BBoxFormat() is used and a warning is logged.
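
To illustrate the idea of resolving a format from the model name, here is a minimal sketch of a pattern-based registry with a logged fallback. It is not the library's implementation: BoxFormatSketch, FORMAT_REGISTRY, resolve_format, and the patterns and field values in the registry are all hypothetical placeholders.

```python
import logging
import re
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class BoxFormatSketch:
    # Hypothetical stand-in for BBoxFormat; defaults mirror the example above
    coord_scale: str = "normalized_1000"
    axis_order: str = "x0y0x1y1"
    bbox_key: str = "bbox_2d"


# Hypothetical registry: patterns and values are placeholders,
# not the settings vlm4ocr actually ships with.
FORMAT_REGISTRY = [
    (re.compile(r"qwen3-vl", re.IGNORECASE), BoxFormatSketch()),
    (
        re.compile(r"gemma", re.IGNORECASE),
        BoxFormatSketch(axis_order="y0x0y1x1", bbox_key="box_2d"),
    ),
]


def resolve_format(model_name):
    """Return the first registered format whose pattern matches the model name."""
    for pattern, fmt in FORMAT_REGISTRY:
        if pattern.search(model_name):
            return fmt
    # No match: fall back to the default format and log a warning
    logger.warning("No bbox format registered for %r; using default", model_name)
    return BoxFormatSketch()


print(resolve_format("Qwen/Qwen3-VL-30B-A3B-Instruct"))
print(resolve_format("some-unknown-model"))
```

Keeping the registry as an ordered list means more specific patterns can be placed ahead of general ones.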

Web app integration

The Flask web app now exposes BBox as an output format in both the single-file and batch tabs.

  • The output area shows an Image | Raw response pill toggle. The Image tab renders bounding boxes on top of the input preview at OCR resolution; the Raw tab streams the model's raw JSON response live during OCR.
  • The user-prompt field becomes optional (full-text OCR if empty, targeted extraction if filled).
  • Batch mode writes one annotated PNG per page (<stem>_page_<idx>_bbox.png) plus one consolidated <stem>_bbox.json per input file. Both are included in the Download All zip.
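
When post-processing a Download All zip, it can help to predict the file names for a given input. The sketch below builds them from the naming pattern described above. The batch_output_names helper is hypothetical, and zero-based page indices are an assumption; check the actual output if the numbering matters.

```python
from pathlib import Path


def batch_output_names(input_file, num_pages):
    """Build expected batch output names for one input file.

    Hypothetical helper. Assumes page indices start at 0.
    """
    stem = Path(input_file).stem
    png_names = [f"{stem}_page_{idx}_bbox.png" for idx in range(num_pages)]
    json_name = f"{stem}_bbox.json"
    return png_names, json_name


pngs, consolidated = batch_output_names("scans/report.pdf", num_pages=2)
print(pngs)
print(consolidated)
```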

OCRPage is now a dataclass

Each entry in OCRResult.pages is now an OCRPage dataclass instead of a plain dict. Attribute access is the recommended pattern going forward; dict-style access is preserved for backward compatibility:

page = ocr_results[0].pages[0]

# New attribute-style access (recommended)
page.text
page.image_processing_status
page.bboxes              # populated only in bbox mode
page.image_width         # populated only in bbox mode
page.image_height        # populated only in bbox mode

# Dict-style access still works
page["text"]
page.get("bboxes")

OCRPage also exposes .get_bboxes() and .plot_bboxes() for the bbox workflow.
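
For readers curious how a dataclass can keep dict-style access working, the following is one possible sketch of the pattern. PageSketch is a hypothetical stand-in with only two fields; it is not the actual OCRPage definition, which may be implemented differently.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PageSketch:
    # Hypothetical stand-in for OCRPage, reduced to two fields
    text: str = ""
    bboxes: list = field(default_factory=list)

    def __getitem__(self, key):
        # Allow page["text"] by mapping keys onto dataclass fields
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        return getattr(self, key)

    def get(self, key, default=None):
        # Mirror dict.get: return the default for unknown keys
        try:
            return self[key]
        except KeyError:
            return default


page = PageSketch(text="hello")
print(page.text, page["text"], page.get("bboxes"), page.get("missing", "n/a"))
```

Restricting __getitem__ to declared fields keeps the dict-style view from exposing methods or private attributes.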

Changes

  • New: output_mode="bbox" on OCREngine for region-level OCR with bounding boxes.
  • New: BBoxItem, BBoxFormat, OCRPage exported from the top-level vlm4ocr namespace.
  • New: OCRPage.plot_bboxes() returns an annotated PIL.Image.Image for quick visualization.
  • New: Web app — BBox output format with image/raw tabbed view, client-side bbox rendering on the input preview, and batch-mode PNG + JSON outputs.
  • Compat: OCRResult.pages entries are now OCRPage dataclasses. Dict-style access (page["text"], page.get("bboxes")) continues to work; prefer attribute access (page.text, page.bboxes) for new code.