This project automatically extracts semantic outlines from PDF files: the title and document headings, with their hierarchy and page numbers. It is optimized for offline, CPU-only environments and is fully containerized (Docker, AMD64) for reproducible deployment in real-world or hackathon scenarios.
- Smart Section Heading Detection: Extracts headings using a hybrid of font size, position, layout clustering, and transformer-based semantic classification (DistilBERT-ONNX).
- Hierarchical JSON Output: Emits the document title and H1/H2/H3... headings, each with its page number.
- CPU-Optimized and Under 200MB: Quantized ONNX transformer model ensures rapid, memory-light inference.
- Batch, Hands-free Processing: Scans all PDFs from `/app/input/` and writes `[filename].json` for each into `/app/output/`.
- Runs Completely Offline: Zero internet required after build, which suits secure, censored, or constrained environments.
- DevOps Ready: Packaged as an AMD64 Docker image with support for host folder mounts.
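To make the font-size signal and the hierarchical output more concrete, here is a minimal sketch of one way heading levels could be derived and serialized. It covers only the font-size part of the hybrid approach (no position, layout clustering, or DistilBERT classification), and the function name `build_outline` and the JSON keys `title`, `outline`, `level`, `text`, and `page` are assumptions for illustration, not the project's confirmed schema.

```python
import json
from collections import Counter


def build_outline(title, spans, max_levels=3):
    """Assign H1/H2/... by font size and return a JSON-ready dict.

    `spans` is a list of dicts with 'text', 'size', and 'page' keys
    (a hypothetical intermediate format, not the project's own).
    """
    sizes = [round(s["size"], 1) for s in spans]
    # Assume the most frequent font size is body text.
    body_size = Counter(sizes).most_common(1)[0][0]
    # Distinct sizes above body text, largest first, become H1, H2, ...
    heading_sizes = sorted({sz for sz in sizes if sz > body_size}, reverse=True)
    level_of = {sz: f"H{i + 1}" for i, sz in enumerate(heading_sizes[:max_levels])}

    outline = [
        {"level": level_of[sz], "text": s["text"].strip(), "page": s["page"]}
        for s, sz in zip(spans, sizes)
        if sz in level_of
    ]
    return {"title": title, "outline": outline}


example_spans = [
    {"text": "Introduction", "size": 18.0, "page": 1},
    {"text": "Some body text.", "size": 10.0, "page": 1},
    {"text": "Background", "size": 14.0, "page": 1},
    {"text": "More body text.", "size": 10.0, "page": 2},
    {"text": "Even more body text.", "size": 10.0, "page": 2},
    {"text": "Methods", "size": 18.0, "page": 3},
]
result = build_outline("Sample Document", example_spans)
print(json.dumps(result, indent=2))
```

A pure font-size ranking like this misfires on documents that use bold body-sized headings or decorative large text, which is presumably why the project combines it with layout and semantic signals.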
- Put your `.pdf` files in a host folder called `input` in the same directory as your Dockerfile.
- Build the image:

      docker build --platform linux/amd64 -t pdf-outline-extractor:latest .

- Run the container:

      docker run --rm \
        -v $(pwd)/input:/app/input \
        -v $(pwd)/output:/app/output \
        --network none \
        pdf-outline-extractor:latest

- All JSON output will be in `./output/`, one file per input PDF.
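The batch behavior described above (one JSON file per input PDF) could be sketched roughly as follows. This is not the project's actual entry point: `process_folder` is a hypothetical name, and `extract_outline` is a stand-in placeholder for the real extraction step.

```python
import json
from pathlib import Path


def extract_outline(pdf_path):
    # Hypothetical placeholder for the real heading extraction.
    return {"title": pdf_path.stem, "outline": []}


def process_folder(input_dir="/app/input", output_dir="/app/output",
                   extract=extract_outline):
    """Write <name>.json into output_dir for every <name>.pdf in input_dir."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() gives a deterministic processing order.
    for pdf_path in sorted(in_dir.glob("*.pdf")):
        out_path = out_dir / f"{pdf_path.stem}.json"
        payload = json.dumps(extract(pdf_path), indent=2, ensure_ascii=False)
        out_path.write_text(payload, encoding="utf-8")
        written.append(out_path)
    return written
```

Inside the container this would be called as `process_folder()` with the default mount points. Note that `glob("*.pdf")` is case-sensitive on Linux, so files ending in `.PDF` would be skipped by this sketch.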
- PyMuPDF (fitz): Fast PDF layout/geometry parsing—including font/style/position info.
- ONNX Runtime: Executes a quantized DistilBERT for heading/paragraph discrimination.
- Python 3.10 (slim base): Fast, minimal footprint.
- No external API calls or model downloads at runtime: everything is bundled into the image at build time.
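As an illustration of the font, style, and position information PyMuPDF exposes, the sketch below flattens the dictionary returned by `page.get_text("dict")` into simple span records. The helper names `flatten_spans` and `read_pdf_spans` and the record keys are assumptions for this example; the project's own feature extraction may look different.

```python
def flatten_spans(page_dict, page_number):
    """Flatten PyMuPDF's page.get_text("dict") output into span records."""
    records = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block; image blocks carry no lines
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if not text:
                    continue
                records.append({
                    "text": text,
                    "size": span["size"],
                    "font": span["font"],
                    "bold": bool(span["flags"] & 16),  # bit 4 of the flags marks bold
                    "y": span["bbox"][1],              # top edge of the span
                    "page": page_number,
                })
    return records


def read_pdf_spans(pdf_path):
    """Collect span records for every page, using 1-based page numbers."""
    import fitz  # PyMuPDF; imported lazily so flatten_spans has no dependency

    records = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            records.extend(flatten_spans(page.get_text("dict"), index + 1))
    return records
```

Records like these could feed a font-size heuristic or serve as inputs to a heading/paragraph classifier.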