This document defines the standard HTTP API that OCR servers must implement to work with LiteParse.
LiteParse expects a simple HTTP endpoint that accepts an image and returns text with bounding boxes. Your OCR server can internally use any OCR engine (EasyOCR, PaddleOCR, Tesseract, Cloud APIs, etc.) as long as it conforms to this API.
POST /ocr
Content-Type: multipart/form-data
Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `file` | binary | Yes | Image file (PNG, JPG, etc.) |
| `language` | string | No | Language code (default: `en`) |
| `page_number` | integer | No | Page number metadata for OCR servers that preserve page context |
| `strict_bboxes` | boolean string | No | Optional server-specific hint to drop OCR regions without usable bounding boxes |
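
For clients that do not use curl, the request can be assembled with the Python standard library alone. The following is a minimal sketch, not part of LiteParse: the function name `build_ocr_request`, the default filename, and the generic `application/octet-stream` part type are assumptions made for illustration.

```python
import uuid
from urllib import request


def build_ocr_request(url, image_bytes, filename="page.png", language="en"):
    """Build a multipart/form-data POST carrying the `file` and `language` fields."""
    boundary = uuid.uuid4().hex
    # Text part for `language`, then the headers that open the binary `file` part.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="language"\r\n\r\n'
        f"{language}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    # Closing boundary that terminates the multipart body.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return request.Request(
        url,
        data=head + image_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; the response body can then be decoded with `json.load`.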
Use ISO 639-1 two-letter codes:
- `en` - English
- `zh` - Chinese
- `ja` - Japanese
- `ko` - Korean
- `fr` - French
- `de` - German
- `es` - Spanish
- `ar` - Arabic
- etc.
Your server should map these to whatever format your underlying OCR engine expects.
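
As an illustration of that mapping, the sketch below translates the two-letter codes into Tesseract-style three-letter codes. The table and the function name `to_engine_language` are assumptions for this example; in particular, mapping `zh` to `chi_sim` (Simplified Chinese) is a choice your server may need to make differently.

```python
# ISO 639-1 codes mapped to Tesseract-style language codes.
# The zh -> chi_sim choice is an assumption; chi_tra may fit some deployments better.
TESSERACT_LANGUAGES = {
    "en": "eng",
    "zh": "chi_sim",
    "ja": "jpn",
    "ko": "kor",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "ar": "ara",
}


def to_engine_language(code=None, default="en"):
    """Translate a request language code into the engine's own code."""
    # A missing or empty field falls back to the documented default.
    key = (code or default).strip().lower()
    try:
        return TESSERACT_LANGUAGES[key]
    except KeyError:
        # The caller can turn this into a 400 Bad Request.
        raise ValueError(f"unsupported language: {code!r}") from None
```

Raising on unknown codes, rather than silently falling back to English, keeps an invalid `language` value visible to the caller.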
Content-Type: application/json
Structure:
{
  "results": [
    {
      "text": "recognized text",
      "bbox": [x1, y1, x2, y2],
      "confidence": 0.95
    }
  ]
}

Fields:
| Field | Type | Description |
|---|---|---|
| `results` | array | Array of text detection results |
| `results[].text` | string | Recognized text content |
| `results[].bbox` | [number, number, number, number] | Bounding box `[x1, y1, x2, y2]` where (x1, y1) is top-left and (x2, y2) is bottom-right |
| `results[].confidence` | number | Confidence score between 0.0 and 1.0 |
Servers may include extra top-level metadata such as engine, model, or warnings; LiteParse clients must continue to rely on the baseline results[] contract.
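
A small contract check can help during server development. The sketch below is one possible validator, not something shipped with LiteParse; `validate_ocr_response` is a hypothetical name. It checks only the baseline `results[]` fields and ignores any extra top-level metadata.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_ocr_response(payload):
    """Return a list of contract violations; an empty list means the payload conforms."""
    results = payload.get("results") if isinstance(payload, dict) else None
    if not isinstance(results, list):
        return ["'results' must be an array"]
    problems = []
    for i, item in enumerate(results):
        if not isinstance(item, dict):
            problems.append(f"results[{i}] must be an object")
            continue
        if not isinstance(item.get("text"), str):
            problems.append(f"results[{i}].text must be a string")
        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4
                and all(_is_number(v) for v in bbox)):
            problems.append(f"results[{i}].bbox must be four numbers")
        elif not (bbox[2] > bbox[0] and bbox[3] > bbox[1]):
            problems.append(f"results[{i}].bbox must satisfy x2 > x1 and y2 > y1")
        conf = item.get("confidence")
        if not (_is_number(conf) and 0.0 <= conf <= 1.0):
            problems.append(f"results[{i}].confidence must be between 0.0 and 1.0")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to see every deviation in a single test run.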
curl -X POST http://localhost:8080/ocr \
  -F "file=@document.png" \
  -F "language=en"

{
  "results": [
    {
      "text": "Hello",
      "bbox": [10, 20, 60, 40],
      "confidence": 0.98
    },
    {
      "text": "World",
      "bbox": [70, 20, 130, 40],
      "confidence": 0.97
    }
  ]
}

Return appropriate HTTP status codes:
- `200 OK` - Success
- `400 Bad Request` - Invalid request (missing file, invalid language, etc.)
- `500 Internal Server Error` - OCR processing failed
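
One way to keep these status codes consistent is to route validation and OCR failures through a single function. The sketch below is framework-agnostic and uses hypothetical names (`handle_ocr`, `run_ocr`, `SUPPORTED_LANGUAGES`); it returns a status code with a JSON body, and error bodies carry a single `error` string as this document specifies.

```python
import json

# Assumed set of languages this particular server supports.
SUPPORTED_LANGUAGES = {"en", "zh", "ja", "ko", "fr", "de", "es", "ar"}


def handle_ocr(file_bytes, language, run_ocr):
    """Return (status_code, json_body) for one POST /ocr request.

    `run_ocr` is a hypothetical callable (image bytes, language) -> list of results.
    """
    language = language or "en"  # documented default
    if not file_bytes:
        return 400, json.dumps({"error": "Missing required field: file"})
    if language not in SUPPORTED_LANGUAGES:
        return 400, json.dumps({"error": f"Unsupported language: {language}"})
    try:
        results = run_ocr(file_bytes, language)
    except Exception as exc:  # any engine failure becomes a 500
        return 500, json.dumps({"error": f"OCR processing failed: {exc}"})
    return 200, json.dumps({"results": results})
```

Whichever web framework you use, its route handler can unpack the form fields, call a function like this, and forward the status and body unchanged.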
Error response format:
{
  "error": "Description of the error"
}

- Origin (0,0) is at the top-left of the image
- X increases to the right
- Y increases downward
- All coordinates are in pixels
Always return axis-aligned bounding boxes as [x1, y1, x2, y2]:
- `x1, y1` = top-left corner
- `x2, y2` = bottom-right corner
- `x2 > x1` and `y2 > y1`
If your OCR engine returns rotated boxes or polygon coordinates, convert them to axis-aligned boxes by taking min/max coordinates.
- Normalize to range 0.0 to 1.0
- 1.0 = 100% confident
- 0.0 = 0% confident
- If your OCR engine doesn't provide confidence, use `1.0`
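
The normalization rules above fit in a few lines. This is a sketch under the assumption that the engine reports scores on a linear scale from 0 to `scale` (for example 0 to 100); `normalize_confidence` is a hypothetical helper name.

```python
def normalize_confidence(raw, scale=1.0):
    """Map an engine score on [0, scale] to [0.0, 1.0]; missing scores become 1.0."""
    if raw is None:
        # Engine gave no confidence: use the documented fallback.
        return 1.0
    value = float(raw) / scale
    # Clamp so out-of-range engine values never leave the 0.0-1.0 contract.
    return min(1.0, max(0.0, value))
```

For an engine that reports percentages, `normalize_confidence(87, scale=100)` yields `0.87`.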
Results should be ordered by reading order (top-to-bottom, left-to-right for most languages).
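
If your engine returns detections in arbitrary order, a simple line-grouping sort is often enough for left-to-right scripts. The sketch below is one heuristic, not the ordering LiteParse itself applies; `sort_reading_order` and the `line_tolerance` parameter are assumptions made for illustration.

```python
def sort_reading_order(results, line_tolerance=0.5):
    """Order results top-to-bottom, then left-to-right within each text line."""
    lines = []  # each entry is a list of results that share one text line
    for item in sorted(results, key=lambda r: r["bbox"][1]):
        top = item["bbox"][1]
        if lines:
            # Compare against the first (topmost) box of the current line.
            _, line_top, _, line_bottom = lines[-1][0]["bbox"]
            # Same line if the top edge sits within a fraction of that box's height.
            if top - line_top <= line_tolerance * (line_bottom - line_top):
                lines[-1].append(item)
                continue
        lines.append([item])
    # Within each line, order by the left edge.
    return [r for line in lines for r in sorted(line, key=lambda r: r["bbox"][0])]
```

Right-to-left scripts and multi-column layouts need a different strategy; this heuristic only covers the common single-column case.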
See the /ocr directory for reference implementations:
- `ocr/easyocr/` - Wrapper for EasyOCR
- `ocr/paddleocr/` - Wrapper for PaddleOCR
The custom V2 Node package also includes a Codex SDK OCR server:
cd packages/node
node dist/cli.js codex-ocr-server \
  --host 127.0.0.1 \
  --port 8833 \
  --codex-home "$HOME/.codex-test"

It exposes:
- `GET /health` with readiness, package version, backend `sdk`, model, reasoning effort, resolved `codex_home`, and boolean auth/config readability.
- `POST /ocr` with the baseline LiteParse `results[]` response plus warning metadata.
- `POST /ocr/analyze` with the full Codex OCR artifact.
For this fork, ~/.codex-test/auth.json and ~/.codex-test/config.toml are the live-test auth/config files. Do not copy their contents into tracked files, package artifacts, or logs.
Codex bounding boxes are model-inferred visual localization evidence. They are not deterministic layout-detector boxes, and successful responses include codex_bboxes_are_model_inferred in warning context.
Quick test:
# 1. Start your server
python server.py
# 2. Test with curl
curl -X POST http://localhost:8080/ocr \
-F "file=@test.png" \
-F "language=en" \
| jq .
# 3. Expected output:
# {
#   "results": [
#     {
#       "text": "...",
#       "bbox": [x1, y1, x2, y2],
#       "confidence": 0.xx
#     }
#   ]
# }

Use with LiteParse:
lit parse document.pdf --ocr-server-url http://localhost:8080/ocr

Convert to axis-aligned boxes:
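def polygon_to_bbox(polygon):
    """Convert polygon [[x1,y1], [x2,y2], ...] to [x1, y1, x2, y2]"""
    # Collect the x and y coordinates of every vertex
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    # The axis-aligned box spans the min and max on each axis
    return [min(xs), min(ys), max(xs), max(ys)]

If your OCR engine does not provide confidence scores, return 1.0 for all results.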
An empty result set is valid: return {"results": []} if no text is detected.
You can filter results on the server, but LiteParse will also handle filtering based on its own thresholds.
Supported image formats: at minimum PNG and JPG; optionally TIFF, WebP, BMP, and GIF.
Rotation handling is optional. If your OCR engine supports it, you can auto-correct rotation before processing.
LiteParse handles page splitting. Your server only needs to process single images.
- Keep server response time under 10 seconds per image
- Support concurrent requests
- Consider GPU acceleration for better performance
- Cache OCR models in memory (don't reload per request)
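
The last point, keeping models in memory, can be sketched with `functools.lru_cache`. Here `load_engine` is a hypothetical stand-in for whatever expensive initialization your OCR library performs, and `LOAD_CALLS` exists only to make the caching visible.

```python
from functools import lru_cache

LOAD_CALLS = []  # records each real load, for demonstration only


def load_engine(language):
    """Hypothetical stand-in for an expensive model load."""
    LOAD_CALLS.append(language)
    return object()


@lru_cache(maxsize=8)
def get_engine(language):
    """Return a cached engine; the model loads once per language, not per request."""
    return load_engine(language)
```

Under concurrent first requests for the same language, `lru_cache` may still invoke the loader more than once, so servers with very heavy models may prefer to preload engines at startup instead.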
- Accepts `POST /ocr` endpoint
- Accepts `file` and `language` form fields
- Returns JSON with `results` array
- Each result has `text`, `bbox`, and `confidence`
- Bounding boxes in `[x1, y1, x2, y2]` format
- Confidence normalized to 0.0-1.0 range
- Returns 200 status on success
- Returns appropriate error codes and messages
- Handles common image formats (PNG, JPG)
- Processes images in under 10 seconds
Questions? Open an issue on GitHub or refer to the example implementations in /ocr.