Commit 95e1ba1

new version
1 parent 163387f commit 95e1ba1

2 files changed

Lines changed: 47 additions & 4 deletions

README.md

Lines changed: 46 additions & 3 deletions
@@ -8,7 +8,7 @@ Vision Language Models (VLMs) for Optical Character Recognition (OCR).
 | :---------------------- | :---------------------------------------------------------------------- |
 | **File Types** | :white_check_mark: PDF, TIFF, PNG, JPG/JPEG, BMP, GIF, WEBP |
 | **VLM Engines** | :white_check_mark: Ollama, OpenAI Compatible (vLLM, SGLang, OpenRouter), OpenAI, Azure OpenAI |
-| **Output Modes** | :white_check_mark: Markdown, HTML, plain text |
+| **Output Modes** | :white_check_mark: Markdown, HTML, plain text, JSON, BBox |
 | **Batch OCR** | :white_check_mark: Processes many files concurrently with Python, CLI, and web app |
 
 ## 🆕Recent Updates
@@ -26,10 +26,10 @@ Vision Language Models (VLMs) for Optical Character Recognition (OCR).
   - **Few-shot examples**: Added support for few-shot examples to improve OCR accuracy.
 - [v0.4.3](https://github.com/daviden1013/vlm4ocr/releases/tag/v0.4.3):
   - **SGLang support**: Added `SGLangVLMEngine` for serving VLMs with SGLang.
-  - **Optional prompts**: `OCREngine` now accepts `system_prompt=False` / `user_prompt=False` for models that don't need them (e.g., PaddleOCR, LightOn-OCR).
-  - **Graceful shutdown**: `concurrent_ocr()` cancels in-flight VLM calls when the consumer stops iterating. CLI Ctrl+C and the web app Stop button now abort cleanly.
 - [v0.4.4](https://github.com/daviden1013/vlm4ocr/releases/tag/v0.4.4):
   - **VLM-based rotation correction**: `rotate_correction` now accepts `"tesseract"`, `"vlm"`, or `False`. Use `"vlm"` when Tesseract isn't installed or struggles with noisy scans.
+- [v0.5.0](https://github.com/daviden1013/vlm4ocr/releases/tag/v0.5.0) (May 4, 2026):
+  - **BBox output mode**: New `output_mode="bbox"` returns OCR text with bounding-box coordinates and labels per region. Leave `user_prompt` empty for full-text bbox OCR or set it to a free-text instruction (e.g., `"patient name and DOB"`) for targeted extraction. Built-in format registry covers Qwen3-VL, Gemma 3/4, and GPT-4.1.
 
 ## Table of Contents
 - [Overview](#overview)
@@ -243,6 +243,49 @@ async def run_ocr():
 asyncio.run(run_ocr())
 ```
 
+Run OCR with bounding boxes (`output_mode="bbox"`). Leave `user_prompt` empty for full-text bbox OCR, or set it to a free-text instruction for targeted extraction:
+
+```python
+from vlm4ocr import VLLMVLMEngine, OCREngine
+
+vlm_engine = VLLMVLMEngine(model="Qwen/Qwen3.5-35B-A3B")
+
+# Full-text OCR with bounding boxes
+ocr = OCREngine(vlm_engine=vlm_engine, output_mode="bbox")
+ocr_results = ocr.sequential_ocr(image_path)
+
+# Targeted extraction with bounding boxes
+ocr = OCREngine(
+    vlm_engine=vlm_engine,
+    output_mode="bbox",
+    user_prompt="Extract patient name, date of birth, Platelets, CMP glucose, BUN, and RBCs values.",
+)
+ocr_results = ocr.sequential_ocr(image_path)
+
+# Inspect bbox results
+for page_num, page in enumerate(ocr_results[0].pages):
+    for item in page.bboxes:
+        print(item.label, item.bbox, item.text)
+
+    # Visualize on the source page (targeted extraction: color by label, show both)
+    annotated = page.plot_bboxes(show_label=True, show_text=True, color="label")
+    annotated.save(f"annotated_page{page_num}.png")
+```
+
+Note that this is **synthesized data with NO real PHI!!**
+```json
+[
+  {"bbox_2d": [508, 143, 620, 158], "label": "patient name", "text": "Mary Johnson"},
+  {"bbox_2d": [508, 160, 610, 176], "label": "date of birth", "text": "Jan 29, 1990"},
+  {"bbox_2d": [484, 405, 569, 421], "label": "Platelets", "text": "280 x10⁹/L"},
+  {"bbox_2d": [484, 455, 560, 470], "label": "CMP glucose", "text": "90 mg/dL"},
+  {"bbox_2d": [484, 540, 562, 555], "label": "BUN", "text": "12 mg/dL"},
+  {"bbox_2d": [484, 778, 690, 794], "label": "RBCs", "text": "0 - 1 per high power field"}
+]
+```
+
+<div align="left"><img src=docs/readme_img/bbox.png width=500 ></div>
+
 Supply few-shot examples to improve OCR accuracy:
 ```python
 from PIL import Image

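The bounding-box records in the README example added by this commit are plain JSON, so they can be post-processed without the library. The following is a minimal sketch and not part of the commit. It assumes each record carries the `bbox_2d`, `label`, and `text` keys shown in the sample output, and that `bbox_2d` is ordered `[x_min, y_min, x_max, y_max]`; the diff states neither the ordering nor the coordinate units, so treat both as assumptions. The helper name `index_by_label` is hypothetical.

```python
import json

# Two records copied from the README sample output (synthesized data, no real PHI).
raw = """
[
  {"bbox_2d": [508, 143, 620, 158], "label": "patient name", "text": "Mary Johnson"},
  {"bbox_2d": [484, 540, 562, 555], "label": "BUN", "text": "12 mg/dL"}
]
"""


def index_by_label(records):
    """Map each label to its text and box size.

    Assumes bbox_2d is [x_min, y_min, x_max, y_max]; that ordering is an
    assumption for this sketch, not something the commit documents.
    """
    index = {}
    for rec in records:
        x_min, y_min, x_max, y_max = rec["bbox_2d"]
        index[rec["label"]] = {
            "text": rec["text"],
            "width": x_max - x_min,
            "height": y_max - y_min,
        }
    return index


fields = index_by_label(json.loads(raw))
print(fields["BUN"]["text"])
```

Keying by label suits targeted extraction, where each requested field is expected once; for full-text bbox OCR, where labels may repeat, a list per label would be the safer structure.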
packages/vlm4ocr/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "vlm4ocr"
-version = "0.4.4"
+version = "0.5.0"
 description = "Python package and Web App for OCR with vision language models."
 authors = ["Enshuo (David) Hsu"]
 license = "MIT"
