Challenge: Adobe Hackathon Round 1A

Team Name: Fruits of Binary Tree

This project automatically extracts semantic outlines from PDF files: the document title and its headings, with their hierarchy and page numbers. It is optimized for offline, CPU-only environments and is fully containerized (Docker, AMD64) for reproducible deployment in real-world or hackathon settings.

Key Features

Smart Section Heading Detection: Extracts headings using a hybrid of font size, position, layout clustering, and transformer-based semantic classification (DistilBERT-ONNX).
Hierarchical JSON Output: Emits the document title plus H1/H2/H3... headings, each with its page number.
CPU-Optimized and Under 200MB: A quantized ONNX transformer model keeps inference fast and memory-light.
Batch, Hands-free Processing: Scans all PDFs in /app/input/ and writes [filename].json for each into /app/output/.
Runs Completely Offline: No internet access is required after the build, which suits secure, censored, or otherwise constrained environments.
DevOps Ready: Packaged for Docker, supporting host folder mounts and AMD64 CPU image.
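To make the "Hierarchical JSON Output" feature concrete, here is a minimal sketch of what one output record could look like. The field names (`title`, `outline`, `level`, `text`, `page`) and the `Heading`/`Outline` classes are assumptions made for illustration; the actual schema written by this project may differ.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Heading:
    level: str  # e.g. "H1", "H2", "H3" (assumed label format)
    text: str   # heading text as it appears in the PDF
    page: int   # page number where the heading was found


@dataclass
class Outline:
    title: str
    outline: list[Heading] = field(default_factory=list)


def to_json(doc: Outline) -> str:
    # asdict() converts nested dataclasses (including those in lists) to dicts.
    return json.dumps(asdict(doc), indent=2, ensure_ascii=False)


# Hypothetical example data, not taken from a real run.
example = Outline(
    title="Sample Report",
    outline=[Heading("H1", "Introduction", 1), Heading("H2", "Background", 2)],
)
print(to_json(example))
```

Keeping the headings in a flat, ordered list with an explicit level per entry is one simple way to express hierarchy without nesting; a consumer can rebuild the tree from the level labels if needed.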

Quickstart: Build & Run

1. Place your input PDFs

Put your .pdf files in a host folder called input, in the same directory as your Dockerfile.

2. Build the Docker Image

docker build --platform linux/amd64 -t pdf-outline-extractor:latest .

3. Run the Container

docker run --rm \
    -v "$(pwd)/input:/app/input" \
    -v "$(pwd)/output:/app/output" \
    --network none \
    pdf-outline-extractor:latest

All JSON output will be in ./output/, one file per input PDF.
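The one-JSON-per-PDF behavior described above can be sketched as a small batch loop. This is an illustration only, not the project's app.py: `process_folder` and `placeholder_extract` are hypothetical names, and the extraction step is passed in as a function so the sketch stays independent of any PDF library.

```python
import json
from pathlib import Path
from typing import Callable


def process_folder(
    input_dir: Path,
    output_dir: Path,
    extract: Callable[[Path], dict],
) -> list[Path]:
    """Run `extract` on every .pdf in input_dir and write <stem>.json files."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() keeps the processing order deterministic across runs.
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        result = extract(pdf_path)
        out_path = output_dir / f"{pdf_path.stem}.json"
        out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
        written.append(out_path)
    return written


def placeholder_extract(pdf_path: Path) -> dict:
    # Stand-in for the real outline extraction (assumption for this sketch).
    return {"title": pdf_path.stem, "outline": []}
```

Inside the container, a call such as `process_folder(Path("/app/input"), Path("/app/output"), placeholder_extract)` would mirror the mounted folders used in the run command.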

Project Workflow Overview

Core Tech & Model Choices

PyMuPDF (fitz): Fast PDF layout/geometry parsing, including font, style, and position info.
ONNX Runtime: Executes a quantized DistilBERT for heading/paragraph discrimination.
Python 3.10 (slim base): Fast, minimal footprint.
No external API/model downloads: everything is included during the build.
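As a rough illustration of the font-size signal alone, the sketch below ranks text spans by size and maps the largest sizes above the body-text size to H1, H2, and so on. It is a simplified heuristic written for this README, not the project's actual pipeline, which also uses position, layout clustering, and the DistilBERT classifier. The function name `assign_heading_levels` and the `(text, font_size, page)` tuple format are assumptions; with PyMuPDF, comparable values are available from the `size` and `text` fields of spans returned by `page.get_text("dict")`.

```python
from collections import Counter


def assign_heading_levels(spans, max_levels=3):
    """Map spans to heading levels by font size.

    spans: list of (text, font_size, page) tuples (assumed input format).
    Returns a list of {"level", "text", "page"} dicts for heading spans only.
    """
    if not spans:
        return []
    # Treat the most frequent (rounded) font size as body text.
    sizes = [round(size, 1) for _, size, _ in spans]
    body_size = Counter(sizes).most_common(1)[0][0]
    # Distinct sizes larger than body text, largest first, become H1, H2, ...
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    level_of = {
        s: f"H{i}" for i, s in enumerate(heading_sizes[:max_levels], start=1)
    }
    return [
        {"level": level_of[round(size, 1)], "text": text.strip(), "page": page}
        for text, size, page in spans
        if round(size, 1) in level_of
    ]


# Hypothetical spans for demonstration.
sample_spans = [
    ("Intro", 18.0, 1),
    ("body a", 10.0, 1),
    ("body b", 10.0, 1),
    ("Details", 14.0, 2),
    ("body c", 10.0, 2),
    ("footnote", 8.0, 2),
]
print(assign_heading_levels(sample_spans))
```

A size-only rule like this misfires on documents where headings share the body font size, which is one reason to combine it with a semantic classifier as the feature list describes.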

Repository: ishahmshah1025/Adobe-Hackathon-Round-1A