Looking for LiteParse V1? Follow this link to the old code
This branch is a custom V2 fork of upstream LiteParse crates-v2.0.6. It keeps the Rust/napi core close to upstream and changes the Node package identity to @zzwz/liteparse-vllm@2.0.6-custom.0.
LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. The baseline parser and built-in OCR run locally on your machine. This custom fork also adds an optional, authenticated Codex SDK OCR server for documents that need model-backed OCR diagnostics.
Hitting the limits of local parsing? For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you'll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.
- Fast Text Parsing: Spatial text parsing using PDFium
- Flexible OCR System:
- Built-in: Tesseract (zero setup, bundled with the library)
- HTTP Servers: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
- Standard API: Simple, well-defined OCR API specification
- Screenshot Generation: Generate high-quality page screenshots for LLM agents
- Multiple Output Formats: JSON and Text
- Bounding Boxes: Precise text positioning information
- Multi-language: Use from Rust, Node.js/TypeScript, Python, or the browser (WASM)
- Multi-platform: Linux, macOS (Intel/ARM), Windows
flowchart LR
subgraph Input["Input Formats"]
direction TB
PDF["PDF"]
DOCX["DOCX"]
XLSX["XLSX"]
PPTX["PPTX"]
IMG["Images"]
end
subgraph Core["Rust Core"]
direction TB
CONV["Format Conversion\nLibreOffice / ImageMagick"]
EXTRACT["Text Extraction\nPDFium C library"]
OCR["Selective OCR\nTesseract / HTTP / Custom"]
MERGE["OCR Merge\nNative text + OCR results"]
PROJ["Grid Projection\nSpatial layout reconstruction"]
CONV --> EXTRACT
EXTRACT --> OCR --> MERGE --> PROJ
EXTRACT --> MERGE
end
subgraph Output[" Output "]
direction TB
JSON["Structured JSON\ntext + bounding boxes"]
TEXT["Plain Text\nlayout-preserved"]
SCREEN["Screenshots\nPNG rendering"]
end
subgraph Bindings["Language Bindings"]
direction TB
NAPI["Node.js / TypeScript\nnapi-rs"]
PYO3["Python\nPyO3"]
WASM["Browser / WASM\nwasm-bindgen"]
CLI["CLI\ncargo / npm / pip"]
NAPI ~~~ PYO3 ~~~ WASM ~~~ CLI
end
PDF --> EXTRACT
DOCX & XLSX & PPTX & IMG --> CONV
PROJ --> JSON & TEXT & SCREEN
JSON & TEXT & SCREEN --> Bindings
style Input fill:#F5F5F5,color:#000000,stroke:#37D7FA,stroke-width:2px
style Core fill:#F5F5F5,color:#000000,stroke:#3E18F9,stroke-width:2px
style Output fill:#F5F5F5,color:#000000,stroke:#FF8705,stroke-width:2px
style Bindings fill:#F5F5F5,color:#000000,stroke:#FF8DF2,stroke-width:2px
style PDF fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
style DOCX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
style XLSX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
style PPTX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
style IMG fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
style CONV fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
style EXTRACT fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
style OCR fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
style MERGE fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
style PROJ fill:#4B72FE,color:#FFFFFF,stroke:#3E18F9,stroke-width:2px
style JSON fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
style TEXT fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
style SCREEN fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
style NAPI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
style PYO3 fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
style WASM fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
style CLI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
Install via your preferred package manager. All versions (except WASM) ship with the same lit CLI.
| Language | Install | Library Docs |
|---|---|---|
| Node.js / TypeScript | `npm i @zzwz/liteparse-vllm` | Node.js README |
| Python | `pip install liteparse` | Python README |
| Rust | `cargo install liteparse` (CLI) / `cargo add liteparse` (lib) | Rust README (crates.io) |
| Browser (WASM) | `npm i @llamaindex/liteparse-wasm` | WASM README |
This fork maintains its V2 skill source in
skills/liteparse-ocr-vllm/SKILL.md.
It covers local LiteParse V2 parsing plus the custom Codex SDK OCR server pipe.
Validate the repo-maintained skill source with:
python3 skills/scripts/ensure_frontmatter.py
python3 skills/scripts/validate_liteparse_ocr_vllm_skills.py

The upstream generic LiteParse skill is useful context, but it targets
@llamaindex/liteparse. Use this repo's liteparse-ocr-vllm skill for
@zzwz/liteparse-vllm and Codex OCR server workflows.
The CLI is the same across all installations (npm, pip, cargo install).
# Basic parsing
lit parse document.pdf
# Parse with specific format
lit parse document.pdf --format json -o output.json
# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"
# Parse without OCR
lit parse document.pdf --no-ocr
# Parse a remote PDF
curl -sL https://example.com/report.pdf | lit parse -

Parse an entire directory of documents:
lit batch-parse ./input-directory ./output-directory

Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.
# Screenshot all pages
lit screenshot document.pdf -o ./screenshots
# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots

lit parse [OPTIONS] <file>
Options:
-o, --output <file> Output file path
--format <format> Output format: json|text [default: text]
--no-ocr Disable OCR
--ocr-language <lang> OCR language, Tesseract format [default: eng]
--ocr-server-url <url> HTTP OCR server URL (uses Tesseract if not provided)
--tessdata-path <path> Path to tessdata directory
--max-pages <n> Max pages to parse [default: 1000]
--target-pages <pages> Pages to parse (e.g., "1-5,10,15-20")
--dpi <dpi> Rendering DPI [default: 150]
--preserve-small-text Keep very small text
--password <password> Password for encrypted documents
--num-workers <n> Concurrent OCR workers [default: CPU cores - 1]
-q, --quiet Suppress progress output
-h, --help Print help
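
The `--target-pages` value combines single pages and ranges, as in "1-5,10,15-20". The following is a minimal sketch of how such a spec could be expanded, assuming 1-based page numbers, inclusive ranges, and that duplicates collapse. The function name `parse_target_pages` is hypothetical, and LiteParse's own parser may handle edge cases differently.

```python
def parse_target_pages(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10,15-20" into a sorted list of page numbers.

    Assumptions (not confirmed by the LiteParse docs): pages are 1-based,
    ranges are inclusive, and repeated pages are reported once.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

This can be handy for checking a page spec in a wrapper script before handing it to `lit parse`.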
lit batch-parse [OPTIONS] <input-dir> <output-dir>
Options:
--format <format> Output format: json|text [default: text]
--no-ocr Disable OCR
--ocr-language <lang> OCR language [default: eng]
--ocr-server-url <url> HTTP OCR server URL
--tessdata-path <path> Path to tessdata directory
--max-pages <n> Max pages per file [default: 1000]
--dpi <dpi> Rendering DPI [default: 150]
--recursive Recursively search input directory
--extension <ext> Only process files with this extension (e.g., ".pdf")
--password <password> Password for encrypted documents
--num-workers <n> Concurrent OCR workers
-q, --quiet Suppress progress output
-h, --help Print help
lit screenshot [OPTIONS] <file>
Options:
-o, --output-dir <dir> Output directory [default: ./screenshots]
--target-pages <pages> Pages to screenshot (e.g., "1,3,5" or "1-5")
--dpi <dpi> Rendering DPI [default: 150]
--password <password> Password for encrypted documents
-q, --quiet Suppress progress output
-h, --help Print help
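
When driving the CLI from a script, it can help to assemble the argument list in one place. The sketch below builds a `lit parse` invocation using only flags from the option list above. The helper name `build_parse_command` and its parameter names are hypothetical and not part of any LiteParse binding.

```python
from typing import Optional


def build_parse_command(
    file: str,
    output: Optional[str] = None,
    fmt: str = "text",
    target_pages: Optional[str] = None,
    ocr: bool = True,
    ocr_server_url: Optional[str] = None,
    dpi: Optional[int] = None,
) -> list[str]:
    """Return an argv list for `lit parse`, using only documented flags."""
    if fmt not in ("json", "text"):
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = ["lit", "parse", file, "--format", fmt]
    if output:
        cmd += ["-o", output]
    if target_pages:
        cmd += ["--target-pages", target_pages]
    if not ocr:
        cmd.append("--no-ocr")
    elif ocr_server_url:
        # Only meaningful when OCR is enabled.
        cmd += ["--ocr-server-url", ocr_server_url]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with values such as "1-5,10".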
Tesseract is bundled and works out of the box:
lit parse document.pdf # OCR enabled by default
lit parse document.pdf --ocr-language fra # Specify language
lit parse document.pdf --no-ocr # Disable OCR

For offline or air-gapped environments, set TESSDATA_PREFIX to a directory containing pre-downloaded .traineddata files:
export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng

Or pass the path directly:
lit parse document.pdf --tessdata-path /path/to/tessdata

For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines:
You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).
The API requires:
- POST `/ocr` endpoint
- Accepts `file` and `language` parameters
- Returns JSON: `{ results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }`
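
When writing a custom OCR server, a quick shape check on the response can catch integration mistakes early. The sketch below validates a decoded JSON payload against the shape listed above. It assumes `text` is a string and that the `bbox` values and `confidence` are numeric. The docs shown here do not state a value range for `confidence`, so none is enforced. The function name `validate_ocr_response` is hypothetical, and OCR_API_SPEC.md remains the authoritative reference.

```python
from numbers import Real


def _is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_ocr_response(payload: object) -> list[dict]:
    """Check that payload looks like {results: [{text, bbox: [x1,y1,x2,y2], confidence}]}."""
    if not isinstance(payload, dict):
        raise ValueError("response must be a JSON object")
    results = payload.get("results")
    if not isinstance(results, list):
        raise ValueError("'results' must be a list")
    for index, item in enumerate(results):
        if not isinstance(item, dict):
            raise ValueError(f"results[{index}] must be an object")
        if not isinstance(item.get("text"), str):
            raise ValueError(f"results[{index}].text must be a string")
        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4 and all(map(_is_number, bbox))):
            raise ValueError(f"results[{index}].bbox must be four numbers")
        if not _is_number(item.get("confidence")):
            raise ValueError(f"results[{index}].confidence must be a number")
    return results
```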
The custom Node package includes lit codex-ocr and lit codex-ocr-server. This path is SDK-only and uses @openai/codex-sdk; it is online/authenticated and must be configured with a readable Codex home.
For live tests in this fork, use ~/.codex-test as the Codex home root:
cd packages/node
npm run build
node dist/cli.js codex-ocr-server \
--host 127.0.0.1 \
--port 8833 \
--codex-home "$HOME/.codex-test"

The server exposes:
- `GET /health` for readiness, model, reasoning effort, resolved `codex_home`, and auth/config readability.
- `POST /ocr` for LiteParse-compatible OCR results.
- `POST /ocr/analyze` for the richer Codex OCR artifact.
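
Before pointing `--ocr-server-url` at the server, a script may want to confirm that it is up. The sketch below derives the three endpoint URLs from the host and port used in the example command and fetches `/health` with the standard library. It assumes the server speaks plain HTTP and returns a JSON body. The helper names are hypothetical, and no field names in the health payload are assumed.

```python
import json
import urllib.request


def endpoint_urls(host: str = "127.0.0.1", port: int = 8833) -> dict[str, str]:
    """Build the endpoint URLs listed above for a given host and port."""
    base = f"http://{host}:{port}"
    return {
        "health": f"{base}/health",
        "ocr": f"{base}/ocr",
        "analyze": f"{base}/ocr/analyze",
    }


def fetch_health(host: str = "127.0.0.1", port: int = 8833, timeout: float = 5.0) -> dict:
    """GET /health and return the decoded JSON body.

    Raises urllib.error.URLError if the server is unreachable.
    """
    url = endpoint_urls(host, port)["health"]
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

The `ocr` entry is the value to pass to `--ocr-server-url`.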
Use it through the standard LiteParse OCR server option:
node dist/cli.js parse ../../integration_tests_data/receipt.png \
--ocr-server-url http://127.0.0.1:8833/ocr \
--format json

Codex bounding boxes are model-inferred visual localization evidence, not deterministic layout-detector output. Successful Codex OCR warning context includes codex_bboxes_are_model_inferred.
LiteParse supports automatic conversion of various document formats to PDF before parsing.
- Word: `.doc`, `.docx`, `.docm`, `.odt`, `.rtf`, `.pages`
- PowerPoint: `.ppt`, `.pptx`, `.pptm`, `.odp`, `.key`
- Spreadsheets: `.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv`, `.numbers`
Install LibreOffice for automatic conversion:
# macOS
brew install --cask libreoffice
# Ubuntu/Debian
apt-get install libreoffice
# Windows
choco install libreoffice-fresh

On Windows, you may need to add LibreOffice's program directory (usually C:\Program Files\LibreOffice\program) to your PATH.
- Formats: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg`
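
To decide up front which external tool a batch of files will need, the extension lists in this section can be turned into a small lookup. This is a rough sketch: the function name `required_converter` is hypothetical, case-insensitive matching is an assumption, and LiteParse's internal detection may use other signals than the file extension.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the lists in this section.
OFFICE_EXTENSIONS = {
    ".doc", ".docx", ".docm", ".odt", ".rtf", ".pages",
    ".ppt", ".pptx", ".pptm", ".odp", ".key",
    ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv", ".numbers",
}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg"}


def required_converter(path: str) -> Optional[str]:
    """Return "libreoffice", "imagemagick", or None when no conversion is needed (PDF)."""
    suffix = Path(path).suffix.lower()  # assumption: matching ignores case
    if suffix == ".pdf":
        return None
    if suffix in OFFICE_EXTENSIONS:
        return "libreoffice"
    if suffix in IMAGE_EXTENSIONS:
        return "imagemagick"
    raise ValueError(f"unsupported file extension: {suffix!r}")
```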
Install ImageMagick for image-to-PDF conversion:
# macOS
brew install imagemagick
# Ubuntu/Debian
apt-get install imagemagick
# Windows
choco install imagemagick.app

| Variable | Description |
|---|---|
| `TESSDATA_PREFIX` | Path to a directory containing Tesseract .traineddata files. Used for offline/air-gapped environments. |
| `LITEPARSE_CODEX_HOME` | Codex home directory for the custom Codex SDK OCR server. For local live tests in this fork, use $HOME/.codex-test. |
The project is a Rust workspace with the core library and language-specific binding crates.
crates/
├── liteparse/ # Core library + CLI binary
├── liteparse-napi/ # Node.js bindings (napi-rs)
├── liteparse-python/ # Python bindings (PyO3)
├── liteparse-wasm/ # WASM bindings (wasm-bindgen)
├── pdfium/ # PDFium Rust wrapper
└── pdfium-sys/ # PDFium FFI bindings
packages/
├── node/ # npm package (TS wrapper + native binary)
├── python/ # PyPI package (Python wrapper + native binary)
└── wasm/ # WASM npm package
# Build the CLI
cargo build --release -p liteparse
# Build Node.js bindings
cd packages/node && npm run build
# Build Python bindings
cd packages/python && maturin develop --release
# Build WASM
cd packages/wasm && npm run build

We provide a detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.
Apache 2.0
Built on top of: