lwyBZss8924d/liteparse-ocr-vllm

LiteParse V2 Custom Codex OCR

English | 简体中文

Looking for LiteParse V1? Follow this link to the old code

This branch is a custom V2 fork of upstream LiteParse crates-v2.0.6. It keeps the Rust/napi core close to upstream and changes the Node package identity to @zzwz/liteparse-vllm@2.0.6-custom.0.

LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. The baseline parser and built-in OCR run locally on your machine. This custom fork also adds an optional, authenticated Codex SDK OCR server for documents that need model-backed OCR diagnostics.

Hitting the limits of local parsing? For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you'll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.

Sign up for LlamaParse free

Overview

  • Fast Text Parsing: Spatial text parsing using PDFium
  • Flexible OCR System:
    • Built-in: Tesseract (zero setup, bundled with the library)
    • HTTP Servers: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
    • Standard API: Simple, well-defined OCR API specification
  • Screenshot Generation: Generate high-quality page screenshots for LLM agents
  • Multiple Output Formats: JSON and Text
  • Bounding Boxes: Precise text positioning information
  • Multi-language: Use from Rust, Node.js/TypeScript, Python, or the browser (WASM)
  • Multi-platform: Linux, macOS (Intel/ARM), Windows
flowchart LR
      subgraph Input["Input Formats"]
          direction TB
          PDF["PDF"]
          DOCX["DOCX"]
          XLSX["XLSX"]
          PPTX["PPTX"]
          IMG["Images"]
      end

      subgraph Core["Rust Core"]
          direction TB
          CONV["Format Conversion\nLibreOffice / ImageMagick"]
          EXTRACT["Text Extraction\nPDFium C library"]
          OCR["Selective OCR\nTesseract / HTTP / Custom"]
          MERGE["OCR Merge\nNative text + OCR results"]
          PROJ["Grid Projection\nSpatial layout reconstruction"]
          CONV --> EXTRACT
          EXTRACT --> OCR --> MERGE --> PROJ
          EXTRACT --> MERGE
      end

      subgraph Output[" Output "]
          direction TB
          JSON["Structured JSON\ntext + bounding boxes"]
          TEXT["Plain Text\nlayout-preserved"]
          SCREEN["Screenshots\nPNG rendering"]
      end

      subgraph Bindings["Language Bindings"]
          direction TB
          NAPI["Node.js / TypeScript\nnapi-rs"]
          PYO3["Python\nPyO3"]
          WASM["Browser / WASM\nwasm-bindgen"]
          CLI["CLI\ncargo / npm / pip"]
          NAPI ~~~ PYO3 ~~~ WASM ~~~ CLI
      end

      PDF --> EXTRACT
      DOCX & XLSX & PPTX & IMG --> CONV
      PROJ --> JSON & TEXT & SCREEN
      JSON & TEXT & SCREEN --> Bindings

      style Input fill:#F5F5F5,color:#000000,stroke:#37D7FA,stroke-width:2px
      style Core fill:#F5F5F5,color:#000000,stroke:#3E18F9,stroke-width:2px
      style Output fill:#F5F5F5,color:#000000,stroke:#FF8705,stroke-width:2px
      style Bindings fill:#F5F5F5,color:#000000,stroke:#FF8DF2,stroke-width:2px

      style PDF fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
      style DOCX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
      style XLSX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
      style PPTX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
      style IMG fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px

      style CONV fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
      style EXTRACT fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
      style OCR fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
      style MERGE fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
      style PROJ fill:#4B72FE,color:#FFFFFF,stroke:#3E18F9,stroke-width:2px

      style JSON fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
      style TEXT fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
      style SCREEN fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px

      style NAPI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
      style PYO3 fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
      style WASM fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
      style CLI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px

Installation

Install via your preferred package manager. All versions (except WASM) ship with the same lit CLI.

| Language | Install | Library Docs |
| --- | --- | --- |
| Node.js / TypeScript | npm i @zzwz/liteparse-vllm | Node.js README |
| Python | pip install liteparse | Python README |
| Rust | cargo install liteparse (CLI) / cargo add liteparse (lib) | Rust README (crates.io) |
| Browser (WASM) | npm i @llamaindex/liteparse-wasm | WASM README |

Agent Skill

This fork maintains its V2 skill source in skills/liteparse-ocr-vllm/SKILL.md. It covers local LiteParse V2 parsing plus the custom Codex SDK OCR server pipe.

Validate the repo-maintained skill source with:

python3 skills/scripts/ensure_frontmatter.py
python3 skills/scripts/validate_liteparse_ocr_vllm_skills.py

The upstream generic LiteParse skill is useful context, but it targets @llamaindex/liteparse. Use this repo's liteparse-ocr-vllm skill for @zzwz/liteparse-vllm and Codex OCR server workflows.

CLI Usage

The CLI is the same across all installations (npm, pip, cargo install).

Parse Files

# Basic parsing
lit parse document.pdf

# Parse with specific format
lit parse document.pdf --format json -o output.json

# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"

# Parse without OCR
lit parse document.pdf --no-ocr

# Parse a remote PDF
curl -sL https://example.com/report.pdf | lit parse -
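
The --target-pages value shown above combines single pages and inclusive ranges separated by commas. The following is a minimal Python sketch of how such a string could be expanded when preparing arguments in a script. The function name parse_target_pages is hypothetical, and the 1-based numbering and rejection of descending ranges are assumptions; this is not LiteParse's own parser.

```python
from __future__ import annotations


def parse_target_pages(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10,15-20" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    # Assumption: pages are numbered from 1, as in the README examples.
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)
```

Expanding the string ahead of time can help a script verify that the requested pages exist before invoking lit parse.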

Batch Parsing

Parse an entire directory of documents:

lit batch-parse ./input-directory ./output-directory
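
To preview which files a batch run might pick up, the directory can be listed beforehand. The sketch below mirrors the --recursive and --extension options in plain Python. The helper name find_inputs is hypothetical, and the case-insensitive extension match is an assumption rather than documented LiteParse behavior.

```python
from pathlib import Path


def find_inputs(input_dir, extension=None, recursive=False):
    """List candidate files, optionally filtered by extension such as ".pdf"."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.iterdir()
    # Assumption: extensions are compared case-insensitively.
    wanted = extension.lower() if extension else None
    return sorted(
        path
        for path in candidates
        if path.is_file() and (wanted is None or path.suffix.lower() == wanted)
    )
```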

Generate Screenshots

Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.

# Screenshot all pages
lit screenshot document.pdf -o ./screenshots

# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots

# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots
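
When choosing a --dpi value, it helps to estimate the pixel size of the resulting image. PDF page dimensions are expressed in points (72 per inch), so the arithmetic is straightforward. The sketch below is illustrative only: the function name rendered_size is hypothetical, and the rounding LiteParse applies internally may differ.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt, height_pt, dpi=150):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))           # default 150 DPI
print(rendered_size(612, 792, dpi=300))  # doubling DPI doubles each dimension
```

Doubling the DPI quadruples the pixel count, which matters when screenshots are passed to an LLM with image-size limits.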

CLI Reference

Parse Command

lit parse [OPTIONS] <file>

Options:
  -o, --output <file>          Output file path
      --format <format>        Output format: json|text [default: text]
      --no-ocr                 Disable OCR
      --ocr-language <lang>    OCR language, Tesseract format [default: eng]
      --ocr-server-url <url>   HTTP OCR server URL (uses Tesseract if not provided)
      --tessdata-path <path>   Path to tessdata directory
      --max-pages <n>          Max pages to parse [default: 1000]
      --target-pages <pages>   Pages to parse (e.g., "1-5,10,15-20")
      --dpi <dpi>              Rendering DPI [default: 150]
      --preserve-small-text    Keep very small text
      --password <password>    Password for encrypted documents
      --num-workers <n>        Concurrent OCR workers [default: CPU cores - 1]
  -q, --quiet                  Suppress progress output
  -h, --help                   Print help

Batch Parse Command

lit batch-parse [OPTIONS] <input-dir> <output-dir>

Options:
      --format <format>        Output format: json|text [default: text]
      --no-ocr                 Disable OCR
      --ocr-language <lang>    OCR language [default: eng]
      --ocr-server-url <url>   HTTP OCR server URL
      --tessdata-path <path>   Path to tessdata directory
      --max-pages <n>          Max pages per file [default: 1000]
      --dpi <dpi>              Rendering DPI [default: 150]
      --recursive              Recursively search input directory
      --extension <ext>        Only process files with this extension (e.g., ".pdf")
      --password <password>    Password for encrypted documents
      --num-workers <n>        Concurrent OCR workers
  -q, --quiet                  Suppress progress output
  -h, --help                   Print help

Screenshot Command

lit screenshot [OPTIONS] <file>

Options:
  -o, --output-dir <dir>       Output directory [default: ./screenshots]
      --target-pages <pages>   Pages to screenshot (e.g., "1,3,5" or "1-5")
      --dpi <dpi>              Rendering DPI [default: 150]
      --password <password>    Password for encrypted documents
  -q, --quiet                  Suppress progress output
  -h, --help                   Print help

OCR Setup

Default: Tesseract

Tesseract is bundled and works out of the box:

lit parse document.pdf                    # OCR enabled by default
lit parse document.pdf --ocr-language fra # Specify language
lit parse document.pdf --no-ocr           # Disable OCR

For offline or air-gapped environments, set TESSDATA_PREFIX to a directory containing pre-downloaded .traineddata files:

export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng

Or pass the path directly:

lit parse document.pdf --tessdata-path /path/to/tessdata
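
In an offline setup it can be useful to confirm that the tessdata directory actually contains the language files you intend to use before running a parse. A minimal sketch follows, assuming the usual Tesseract conventions of one <lang>.traineddata file per language and "+" as the separator for multiple languages. The helper name missing_traineddata is hypothetical.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, ocr_language="eng"):
    """Return the language codes that have no .traineddata file in the directory."""
    root = Path(tessdata_dir)
    # Assumption: multiple languages are joined with "+", e.g. "eng+fra".
    codes = [code for code in ocr_language.split("+") if code]
    return [code for code in codes if not (root / f"{code}.traineddata").is_file()]
```

An empty list means every requested language file was found.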

Optional: HTTP OCR Servers

For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines.

You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).

The API requires:

  • POST /ocr endpoint
  • Accepts file and language parameters
  • Returns JSON: { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }
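
When writing a custom server, checking the response shape before pointing LiteParse at it can save debugging time. The sketch below validates a decoded JSON payload against the three fields listed above. It checks structure only; the function name validate_ocr_response is hypothetical, and any further constraints in OCR_API_SPEC.md (value ranges, optional fields) are not covered.

```python
from numbers import Real


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_ocr_response(payload):
    """Return a list of problems; an empty list means the shape looks right."""
    results = payload.get("results") if isinstance(payload, dict) else None
    if not isinstance(results, list):
        return ["top-level 'results' must be a list"]
    problems = []
    for index, item in enumerate(results):
        if not isinstance(item, dict):
            problems.append(f"results[{index}] must be an object")
            continue
        if not isinstance(item.get("text"), str):
            problems.append(f"results[{index}].text must be a string")
        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4 and all(map(_is_number, bbox))):
            problems.append(f"results[{index}].bbox must be four numbers [x1, y1, x2, y2]")
        if not _is_number(item.get("confidence")):
            problems.append(f"results[{index}].confidence must be a number")
    return problems
```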

Custom Codex SDK OCR Server

The custom Node package includes lit codex-ocr and lit codex-ocr-server. This path is SDK-only and uses @openai/codex-sdk; it is online/authenticated and must be configured with a readable Codex home.

For live tests in this fork, use ~/.codex-test as the Codex home root:

cd packages/node
npm run build
node dist/cli.js codex-ocr-server \
  --host 127.0.0.1 \
  --port 8833 \
  --codex-home "$HOME/.codex-test"

The server exposes:

  • GET /health for readiness, model, reasoning effort, resolved codex_home, and auth/config readability.
  • POST /ocr for LiteParse-compatible OCR results.
  • POST /ocr/analyze for the richer Codex OCR artifact.
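
A script that starts the server may want to poll GET /health before sending documents. The following sketch builds the three endpoint URLs from the host and port used in the example above and fetches the health document with the Python standard library. It assumes /health responds with a JSON body, and it does not assume any particular key names in that body; the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def endpoint_urls(host="127.0.0.1", port=8833):
    """Build the URLs for the three endpoints listed above."""
    base = f"http://{host}:{port}"
    return {
        "health": f"{base}/health",
        "ocr": f"{base}/ocr",
        "analyze": f"{base}/ocr/analyze",
    }


def fetch_health(host="127.0.0.1", port=8833, timeout=5.0):
    """Fetch and decode the health document (assumed to be JSON)."""
    with urlopen(endpoint_urls(host, port)["health"], timeout=timeout) as response:
        return json.load(response)
```

fetch_health raises urllib.error.URLError while the server is not yet reachable, so callers can retry in a loop until it succeeds.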

Use it through the standard LiteParse OCR server option:

node dist/cli.js parse ../../integration_tests_data/receipt.png \
  --ocr-server-url http://127.0.0.1:8833/ocr \
  --format json

Codex bounding boxes are model-inferred visual localization evidence, not deterministic layout-detector output. Successful Codex OCR results include codex_bboxes_are_model_inferred in their warning context.
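
Because these boxes are inferred by a model, downstream code may want a basic sanity check before trusting them. The sketch below separates results whose boxes are well-formed and inside the image from those that are not. It assumes bbox is [x1, y1, x2, y2] in pixel coordinates of the submitted image with the origin at the top-left; the function names are hypothetical and this is not part of the fork's API.

```python
def plausible_bbox(bbox, image_width, image_height):
    """True if the box has positive area and lies within the image bounds."""
    x1, y1, x2, y2 = bbox
    return 0 <= x1 < x2 <= image_width and 0 <= y1 < y2 <= image_height


def split_by_plausibility(results, image_width, image_height):
    """Partition OCR results into (kept, rejected) by bounding-box plausibility."""
    kept, rejected = [], []
    for item in results:
        target = kept if plausible_bbox(item["bbox"], image_width, image_height) else rejected
        target.append(item)
    return kept, rejected
```

Rejected items can still carry useful text; the check only flags localization that should not be relied on.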

Multi-Format Input Support

LiteParse supports automatic conversion of various document formats to PDF before parsing.

Supported Input Formats

Office Documents (via LibreOffice)

  • Word: .doc, .docx, .docm, .odt, .rtf, .pages
  • PowerPoint: .ppt, .pptx, .pptm, .odp, .key
  • Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv, .numbers

Install LibreOffice for automatic conversion:

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice

# Windows
choco install libreoffice-fresh

On Windows, you may need to add LibreOffice's program directory (usually C:\Program Files\LibreOffice\program) to your PATH.

Images (via ImageMagick)

  • Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg

Install ImageMagick for image-to-PDF conversion:

# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick

# Windows
choco install imagemagick.app
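
To see ahead of time which external tool a given file would need, the extension lists above can be encoded in a small lookup. The sketch below copies the extensions from this section. The function name conversion_route is hypothetical, the case-insensitive match is an assumption, and LiteParse's own detection logic may differ.

```python
from pathlib import Path

# Extensions copied from the lists in this section.
LIBREOFFICE_EXTENSIONS = {
    ".doc", ".docx", ".docm", ".odt", ".rtf", ".pages",
    ".ppt", ".pptx", ".pptm", ".odp", ".key",
    ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv", ".numbers",
}
IMAGEMAGICK_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
}


def conversion_route(filename):
    """Return which converter a file would need before parsing."""
    suffix = Path(filename).suffix.lower()  # assumption: case-insensitive match
    if suffix == ".pdf":
        return "none"  # PDFs are parsed directly
    if suffix in LIBREOFFICE_EXTENSIONS:
        return "libreoffice"
    if suffix in IMAGEMAGICK_EXTENSIONS:
        return "imagemagick"
    return "unsupported"
```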

Environment Variables

| Variable | Description |
| --- | --- |
| TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. Used for offline/air-gapped environments. |
| LITEPARSE_CODEX_HOME | Codex home directory for the custom Codex SDK OCR server. For local live tests in this fork, use $HOME/.codex-test. |

Development

The project is a Rust workspace with the core library and language-specific binding crates.

crates/
├── liteparse/          # Core library + CLI binary
├── liteparse-napi/     # Node.js bindings (napi-rs)
├── liteparse-python/   # Python bindings (PyO3)
├── liteparse-wasm/     # WASM bindings (wasm-bindgen)
├── pdfium/             # PDFium Rust wrapper
└── pdfium-sys/         # PDFium FFI bindings
packages/
├── node/               # npm package (TS wrapper + native binary)
├── python/             # PyPI package (Python wrapper + native binary)
└── wasm/               # WASM npm package

Building

# Build the CLI
cargo build --release -p liteparse

# Build Node.js bindings
cd packages/node && npm run build

# Build Python bindings
cd packages/python && maturin develop --release

# Build WASM
cd packages/wasm && npm run build

The repository includes a detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.

License

Apache 2.0

Credits

Built on top of:

About

PDF and Multi-Format Document Conversion with Spatial Text Extraction for Agentic AI: A Liteparse Fork with Custom vLLM OCR Servers – Codex GPT OCR Pipeline
