📐Documentation Site
User guide, API reference, and documentation are available at Documentation Page.
⚡Highlights
New bbox output mode
OCREngine now supports output_mode="bbox", which returns the OCR text together with bounding-box coordinates and labels for each detected region. This unlocks two complementary workflows:
- Full-text bbox OCR: leave user_prompt empty (or pass None). The model transcribes the entire page and returns one box per region (line/word/segment, depending on the VLM family).
- Targeted extraction: set user_prompt to a free-text instruction such as "Extract patient name and date of birth". Only regions matching that instruction are returned.
from vlm4ocr import VLLMVLMEngine, OCREngine
vlm_engine = VLLMVLMEngine(model="Qwen/Qwen3-VL-30B-A3B-Instruct")
# Placeholder path: point this at your own input file
image_path = "path/to/document.pdf"
# Full-text OCR with bounding boxes
ocr = OCREngine(vlm_engine=vlm_engine, output_mode="bbox")
ocr_results = ocr.sequential_ocr(image_path)
# Targeted extraction with bounding boxes
ocr = OCREngine(
    vlm_engine=vlm_engine,
    output_mode="bbox",
    user_prompt="Extract patient name and date of birth",
)
ocr_results = ocr.sequential_ocr(image_path)
Each page in the result exposes a .bboxes list of BBoxItem records (bbox=[x1, y1, x2, y2], label, text) and a .plot_bboxes() helper that returns a PIL.Image.Image with the boxes drawn on top, which is handy for quickly visualizing or saving annotated pages.
# Inspect bbox results
for page_num, page in enumerate(ocr_results[0].pages):
    for item in page.bboxes:
        print(item.label, item.bbox, item.text)
    # Visualize on the source page
    annotated = page.plot_bboxes()
    annotated.save(f"annotated_page{page_num}.png")
Note that this is synthesized data containing no real PHI.
[
{"bbox_2d": [508, 143, 620, 158], "label": "patient name", "text": "Mary Johnson"},
{"bbox_2d": [508, 160, 610, 176], "label": "date of birth", "text": "Jan 29, 1990"},
{"bbox_2d": [484, 405, 569, 421], "label": "Platelets", "text": "280 x10⁹/L"},
{"bbox_2d": [484, 455, 560, 470], "label": "CMP glucose", "text": "90 mg/dL"},
{"bbox_2d": [484, 540, 562, 555], "label": "BUN", "text": "12 mg/dL"},
{"bbox_2d": [484, 778, 690, 794], "label": "RBCs", "text": "0 - 1 per high power field"}
]
Per-VLM bbox formats
Different VLM families emit boxes with different conventions (key names, axis order, coordinate scale). vlm4ocr resolves the right format automatically from the model name through a registry, with built-in support for Qwen3-VL, Gemma 3/4, and GPT-4.1. You can override the resolution by passing a custom BBoxFormat:
from vlm4ocr import OCREngine, BBoxFormat
ocr = OCREngine(
    vlm_engine=vlm_engine,
    output_mode="bbox",
    bbox_format=BBoxFormat(
        coord_scale="normalized_1000",
        axis_order="x0y0x1y1",
        bbox_key="bbox_2d",
        label_key="label",
        text_key="text",
    ),
)
If the model name matches no registered pattern, a default BBoxFormat() is used and a warning is logged.
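The name-based resolution and fallback described above can be pictured with a small standalone sketch. Everything here is hypothetical and for illustration only: SketchFormat, REGISTRY, resolve_format, to_pixels, and the "family-a"/"family-b" patterns are not vlm4ocr names, and the sketch assumes that "normalized_1000" means coordinates on a 0 to 1000 grid relative to the page size.

```python
import logging
import re
from dataclasses import dataclass

logger = logging.getLogger("bbox_sketch")


@dataclass(frozen=True)
class SketchFormat:
    """Hypothetical stand-in for BBoxFormat; field values are illustrative."""
    coord_scale: str = "normalized_1000"  # assumed: 0-1000 grid relative to the page
    axis_order: str = "x0y0x1y1"          # x first; "y0x0y1x1" would mean y first
    bbox_key: str = "bbox_2d"


# Hypothetical registry: regex on the model name -> format.
# The patterns and values are placeholders, not vlm4ocr's real table.
REGISTRY = [
    (re.compile(r"family-a", re.IGNORECASE), SketchFormat()),
    (re.compile(r"family-b", re.IGNORECASE),
     SketchFormat(axis_order="y0x0y1x1", bbox_key="box_2d")),
]


def resolve_format(model_name: str) -> SketchFormat:
    """Return the first format whose pattern matches, else warn and use the default."""
    for pattern, fmt in REGISTRY:
        if pattern.search(model_name):
            return fmt
    logger.warning("No bbox format registered for %r; using default", model_name)
    return SketchFormat()


def to_pixels(raw: dict, fmt: SketchFormat, width: int, height: int) -> list[int]:
    """Convert one raw region dict to pixel coordinates [x1, y1, x2, y2]."""
    a, b, c, d = raw[fmt.bbox_key]
    # Reorder to x-first if the model emits y-first coordinates.
    x1, y1, x2, y2 = (b, a, d, c) if fmt.axis_order == "y0x0y1x1" else (a, b, c, d)
    if fmt.coord_scale == "normalized_1000":
        sx, sy = width / 1000, height / 1000
    else:  # this sketch treats any other scale as absolute pixels
        sx = sy = 1.0
    return [round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy)]


fmt = resolve_format("org/family-a-7b")
region = {"bbox_2d": [508, 143, 620, 158], "label": "patient name", "text": "Mary Johnson"}
pixel_box = to_pixels(region, fmt, width=2000, height=3000)
print(pixel_box)  # [1016, 429, 1240, 474]
```

Keeping the per-family conventions in one table means the conversion code stays the same for every model; only the registry entry changes.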
Web app integration
The Flask web app now exposes BBox as an output format in both the single-file and batch tabs.
- The output area shows an Image | Raw response pill toggle. The Image tab renders bounding boxes on top of the input preview at OCR resolution; the Raw tab streams the model's raw JSON response live during OCR.
- The user-prompt field becomes optional (full-text OCR if empty, targeted extraction if filled).
- Batch mode writes one annotated PNG per page (<stem>_page_<idx>_bbox.png) plus one consolidated <stem>_bbox.json per input file. Both are included in the Download All zip.
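As a rough illustration of the batch naming scheme, the sketch below lists the files one input would produce. The helper bbox_output_names is hypothetical, and two details are assumptions because these notes do not specify them: that <idx> starts at 0 and that it is not zero-padded.

```python
from pathlib import Path


def bbox_output_names(input_file: str, num_pages: int, first_index: int = 0) -> list[str]:
    """List the batch-mode outputs for one input file, following the naming scheme.

    first_index is an assumption: the notes do not say whether <idx> starts at 0 or 1.
    """
    stem = Path(input_file).stem
    pages = [
        f"{stem}_page_{idx}_bbox.png"
        for idx in range(first_index, first_index + num_pages)
    ]
    # One consolidated JSON per input file, after the per-page PNGs.
    return pages + [f"{stem}_bbox.json"]


names = bbox_output_names("scans/lab_report.pdf", num_pages=2)
print(names)
# ['lab_report_page_0_bbox.png', 'lab_report_page_1_bbox.png', 'lab_report_bbox.json']
```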
OCRPage is now a dataclass
Each entry in OCRResult.pages is now an OCRPage dataclass instead of a plain dict. Attribute access is the canonical pattern going forward; dict-style access is preserved for backward compatibility:
page = ocr_results[0].pages[0]
# New attribute-style access (recommended)
page.text
page.image_processing_status
page.bboxes # populated only in bbox mode
page.image_width # populated only in bbox mode
page.image_height # populated only in bbox mode
# Dict-style access still works
page["text"]
page.get("bboxes")
OCRPage also exposes .get_bboxes() and .plot_bboxes() for the bbox workflow.
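For readers curious how a dataclass can keep dict-style access working, here is a minimal sketch of one common approach. PageSketch is a hypothetical stand-in, not the actual OCRPage implementation, and it carries only two of the fields listed above.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class PageSketch:
    """Hypothetical stand-in for OCRPage with a dict-compatible interface."""
    text: str
    bboxes: list = field(default_factory=list)

    def __getitem__(self, key: str) -> Any:
        # Only dataclass fields are valid keys, so page["get"] raises KeyError
        # instead of returning a bound method.
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        return getattr(self, key)

    def get(self, key: str, default: Any = None) -> Any:
        # Mirrors dict.get: return the default instead of raising.
        try:
            return self[key]
        except KeyError:
            return default


page = PageSketch(text="Platelets 280")
print(page.text, page["text"], page.get("bboxes"), page.get("missing"))
```

Routing both access styles through the same fields means old code that indexes pages like dicts and new code that uses attributes always see the same values.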
Changes
- New: output_mode="bbox" on OCREngine for region-level OCR with bounding boxes.
- New: BBoxItem, BBoxFormat, OCRPage exported from the top-level vlm4ocr namespace.
- New: OCRPage.plot_bboxes() returns an annotated PIL.Image.Image for quick visualization.
- New: Web app BBox output format with image/raw tabbed view, client-side bbox rendering on the input preview, and batch-mode PNG + JSON outputs.
- Compat: OCRResult.pages entries are now OCRPage dataclasses. Dict-style access (page["text"], page.get("bboxes")) continues to work; prefer attribute access (page.text, page.bboxes) for new code.

