ishahmshah1025/Adobe-Hackathon-Round-1A

Team Name: Fruits of Binary Tree

Challenge: Adobe Hackathon Round 1A

This project automatically extracts semantic outlines from PDF files: the title and the document headings, with their hierarchy and page numbers. It is optimized for offline, CPU-only environments and is fully containerized (Docker, AMD64) for reproducible deployment in real-world or hackathon scenarios.

Key Features

  • Smart Section Heading Detection: Extracts headings using a hybrid of font size, position, layout clustering, and transformer-based semantic classification (DistilBERT-ONNX).
  • Hierarchical JSON Output: Emits a structured result for each PDF: the title plus H1/H2/H3... headings, each with its page number.
  • CPU-Optimized and Under 200MB: Quantized ONNX transformer model ensures rapid, memory-light inference.
  • Batch, Hands-free Processing: Scans all PDFs from /app/input/, writes [filename].json for each into /app/output/.
  • Runs Completely Offline: Zero internet required after build—perfect for secure/censored or constrained environments.
  • DevOps Ready: Packaged as an AMD64 CPU Docker image, with host folder mounts for input and output.
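
The font-size part of the heading detection and the hierarchical output can be illustrated with a minimal sketch. This is not the project's actual pipeline (which also uses position, layout clustering, and a DistilBERT classifier). It assumes that the most frequent font size is body text, that the largest size is the title, and that the next largest sizes map to H1, H2, H3. The function name `build_outline` and the JSON keys (`title`, `outline`, `level`, `text`, `page`) are hypothetical.

```python
import json
from collections import Counter


def build_outline(spans, max_levels=3):
    """Map font sizes to a title and H1..Hn headings.

    `spans` is a list of dicts with hypothetical keys 'text', 'size', 'page'.
    """
    if not spans:
        return {"title": "", "outline": []}

    sizes = [round(s["size"], 1) for s in spans]
    # Assumption: the most frequent font size is body text.
    body_size = Counter(sizes).most_common(1)[0][0]
    # Distinct sizes larger than body text, largest first.
    larger = sorted({sz for sz in sizes if sz > body_size}, reverse=True)
    if not larger:
        return {"title": "", "outline": []}

    # Assumption: largest size is the title, the next ones are H1, H2, ...
    title_size, heading_sizes = larger[0], larger[1:1 + max_levels]
    level_of = {sz: f"H{i + 1}" for i, sz in enumerate(heading_sizes)}

    title = " ".join(s["text"] for s, sz in zip(spans, sizes) if sz == title_size)
    outline = [
        {"level": level_of[sz], "text": s["text"], "page": s["page"]}
        for s, sz in zip(spans, sizes)
        if sz in level_of
    ]
    return {"title": title, "outline": outline}


# Made-up sample data, for illustration only.
sample_spans = [
    {"text": "Annual Report", "size": 24.0, "page": 1},
    {"text": "Introduction", "size": 18.0, "page": 1},
    {"text": "Body text a", "size": 11.0, "page": 1},
    {"text": "Background", "size": 14.0, "page": 2},
    {"text": "Body text b", "size": 11.0, "page": 2},
    {"text": "Body text c", "size": 11.0, "page": 2},
]
result = build_outline(sample_spans)
print(json.dumps(result, indent=2))
```

Font size alone misfires on documents that use bold body-size headings or decorative large text, which is presumably why the project combines it with the other signals.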

Quickstart: Build & Run

1. Place your input PDFs

  • Put your .pdf files in a host folder called input, in the same directory as your Dockerfile.

2. Build the Docker Image

docker build --platform linux/amd64 -t pdf-outline-extractor:latest .

3. Run the Container

docker run --rm \
    -v "$(pwd)/input:/app/input" \
    -v "$(pwd)/output:/app/output" \
    --network none \
    pdf-outline-extractor:latest
  • All JSON output will be in ./output/, one file per input PDF.
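
Inside the container, the batch step amounts to a loop over the input folder. The sketch below shows one way such a loop could look; it is an illustration, not the repository's entry point. `process_all` and `extract_outline` are hypothetical names, and `extract_outline` is a placeholder that does no real PDF parsing.

```python
import json
from pathlib import Path


def extract_outline(pdf_path: Path) -> dict:
    """Placeholder for the real extractor (hypothetical)."""
    return {"title": pdf_path.stem, "outline": []}


def process_all(input_dir, output_dir, extractor=extract_outline):
    """Write one <name>.json into output_dir for every *.pdf in input_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Note: this pattern is case-sensitive on Linux, so "X.PDF" is skipped.
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        out_path = output_dir / f"{pdf_path.stem}.json"
        payload = json.dumps(extractor(pdf_path), indent=2, ensure_ascii=False)
        out_path.write_text(payload, encoding="utf-8")
        written.append(out_path)
    return written


if __name__ == "__main__":
    # Paths taken from the README; they exist only inside the container.
    for path in process_all("/app/input", "/app/output"):
        print(f"wrote {path}")
```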

Project Workflow Overview

[Figure: project workflow diagram]

Core Tech & Model Choices

  • PyMuPDF (fitz): Fast PDF layout/geometry parsing—including font/style/position info.
  • ONNX Runtime: Executes a quantized DistilBERT for heading/paragraph discrimination.
  • Python 3.10 (slim base): Fast, minimal footprint.
  • No external API/model downloads—everything included during build.
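
To make the PyMuPDF step concrete, here is a hedged sketch of turning a page's layout dictionary into flat span records carrying font size, boldness, and vertical position. It works on the nested structure that PyMuPDF's `page.get_text("dict")` returns (blocks, then lines, then spans), so it runs without PyMuPDF installed. The helper name `iter_spans` and the record keys are hypothetical, and the bold test assumes bit 4 (value 16) of the span `flags` field marks bold text.

```python
def iter_spans(page_dict, page_number):
    """Yield one flat record per non-empty text span on a page.

    `page_dict` is expected to look like the result of
    fitz.open(path)[i].get_text("dict") in PyMuPDF.
    """
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is text; skip image blocks
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span.get("text", "").strip()
                if not text:
                    continue
                yield {
                    "text": text,
                    "size": span["size"],
                    # Assumption: flag bit 4 (value 16) means bold.
                    "bold": bool(span.get("flags", 0) & 16),
                    "y": span["bbox"][1],  # top edge of the span
                    "page": page_number,
                }


# Hand-written sample in the same shape, for illustration only.
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "1. Introduction ", "size": 16.0, "flags": 16,
                         "bbox": (72.0, 90.0, 300.0, 110.0)},
                        {"text": "   ", "size": 11.0, "flags": 0,
                         "bbox": (300.0, 90.0, 310.0, 110.0)},
                    ]
                }
            ],
        },
        {"type": 1, "bbox": (72.0, 200.0, 400.0, 500.0)},  # image block
    ]
}
records = list(iter_spans(sample_page, page_number=1))
print(records)
```

Records like these could feed both the layout heuristics and the text classifier; how the project actually combines them is not stated here.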
