Rust library and CLI for fast, lightweight PDF and document parsing with spatial text extraction. Runs entirely locally with zero cloud dependencies.
LiteParse is also available for Node.js/TypeScript, Python, and the browser (WASM). See the project README for all options.
Add to your Cargo.toml:
[dependencies]
liteparse = "2"Or install the CLI:
cargo install liteparseuse liteparse::{LiteParse, LiteParseConfig};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let parser = LiteParse::new(LiteParseConfig::default());
let result = parser.parse("document.pdf").await?;
println!("{}", result.text);
for page in &result.pages {
println!("Page {}: {} text items", page.page_num, page.text_items.len());
}
Ok(())
}use liteparse::{LiteParse, LiteParseConfig, OutputFormat};
let config = LiteParseConfig {
ocr_enabled: true, // Enable OCR (default: true)
ocr_language: "eng".to_string(), // Tesseract language code
ocr_server_url: None, // HTTP OCR server URL (optional)
tessdata_path: None, // Path to tessdata directory (optional)
max_pages: 1000, // Max pages to parse
target_pages: Some("1-5,10".into()), // Specific pages (optional)
dpi: 150.0, // Rendering DPI
output_format: OutputFormat::Json, // Output format
preserve_very_small_text: false, // Keep tiny text
password: None, // Password for protected documents
quiet: false, // Suppress progress output
..Default::default()
};
let parser = LiteParse::new(config);use liteparse::types::PdfInput;
let pdf_bytes: Vec<u8> = std::fs::read("document.pdf")?;
let result = parser.parse_input(PdfInput::Bytes(pdf_bytes)).await?;
println!("{}", result.text);Implement the OcrEngine trait to plug in your own OCR backend:
use liteparse::ocr::OcrEngine;
use std::sync::Arc;
let parser = LiteParse::new(LiteParseConfig::default())
.with_ocr_engine(Arc::new(my_engine));tesseract(default) — Built-in Tesseract OCR viatesseract-rs. Disable withdefault-features = falseif you don't need OCR or want to use an HTTP OCR server instead.
- PDF (
.pdf) - Microsoft Office (
.docx,.xlsx,.pptx, etc.) — requires LibreOffice - OpenDocument (
.odt,.ods,.odp) — requires LibreOffice - Images (
.png,.jpg,.tiff, etc.) — requires ImageMagick
The crate also builds the lit CLI binary:
lit parse document.pdf
lit parse document.pdf --format json -o output.json
lit screenshot document.pdf -o ./screenshots
lit batch-parse ./input ./outputSee lit --help for all options.
Apache-2.0