
Docling Studio


A visual document analysis studio powered by Docling. Upload a PDF, configure the extraction pipeline, and visualize the results — text, tables, images, formulas, bounding boxes — all from your browser.


Features

  • Home page with quick upload and recent documents
  • PDF viewer with page navigation, bounding box overlay, and resizable results panel
  • Configurable Docling pipeline — OCR, table extraction, code/formula enrichment, picture classification & description, image generation
  • Bounding box visualization — color-coded element overlay directly on the PDF
  • Per-page results — right panel syncs with the current PDF page
  • Chunking — split extracted content into semantic chunks (hierarchical, hybrid, or page-based) with configurable token limits and inline editing
  • Ingestion pipeline — Docling → chunking → embedding → OpenSearch vector indexing (one-click from Studio)
  • Markdown & HTML export of extracted content
  • Document management — upload, list, delete, search, filter by indexing status
  • Analysis history — re-visit and open past analyses
  • Upload limits — configurable max file size and max page count per document
  • Rate limiting — configurable requests per minute per IP
  • Dark / Light theme and FR / EN localization
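To illustrate the chunking feature above, here is a minimal sketch of token-capped chunking. This is an assumption for illustration only: the actual pipeline uses Docling's hierarchical/hybrid chunkers, and real token counts depend on the embedding model's tokenizer, not whitespace splitting.

```python
def chunk_text(text: str, max_tokens: int = 128) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace tokens.

    Crude approximation of a page-based chunking mode; real
    tokenization depends on the embedding model's tokenizer.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

chunks = chunk_text("one two three four five six", max_tokens=2)
# chunks == ["one two", "three four", "five six"]
```

In Studio the limit is configurable per analysis, and chunks can be edited inline before ingestion.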

Architecture

┌────────────┐         ┌──────────────────────┐
│  Frontend  │────────▶│   Document Parser    │
│  Vue 3     │  /api/* │ FastAPI + Docling    │
│  port 3000 │         │ SQLite + file storage│
└────────────┘         │   port 8000          │
                       └──────────────────────┘
| Service | Stack | Role |
|---|---|---|
| frontend | Vue 3, TypeScript, Vite, Pinia | UI, PDF viewer, results display |
| document-parser | FastAPI, Docling, SQLite, pdf2image | REST API, document parsing, storage |

Backend structure (clean architecture)

document-parser/
├── main.py                   # FastAPI app, CORS, lifespan
├── domain/                   # Pure domain — no HTTP, no DB
│   ├── models.py             # Document, AnalysisJob dataclasses
│   ├── ports.py              # Abstract protocols (converter, chunker)
│   └── value_objects.py      # ConversionResult, PageDetail, ChunkResult
├── api/                      # HTTP layer (FastAPI routers)
│   ├── schemas.py            # Pydantic DTOs (camelCase serialization)
│   ├── documents.py          # /api/documents endpoints
│   └── analyses.py           # /api/analyses endpoints
├── persistence/              # Data layer (SQLite via aiosqlite)
│   ├── database.py           # Connection management, schema init
│   ├── document_repo.py      # Document CRUD
│   └── analysis_repo.py      # AnalysisJob CRUD
├── services/                 # Use case orchestration
│   ├── document_service.py   # Upload, delete, preview
│   └── analysis_service.py   # Async Docling processing
└── tests/                    # 377 tests (pytest)
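The "no HTTP, no DB" rule in the domain layer is enforced through abstract ports. A minimal sketch of what a converter port in domain/ports.py might look like (the names below are illustrative, not copied from the source):

```python
import asyncio
from typing import Protocol


class DocumentConverter(Protocol):
    """Abstract conversion port; adapters for in-process Docling or a
    remote Docling Serve instance both satisfy this shape."""

    async def convert(self, pdf_path: str) -> dict:
        """Return extracted content for the document at pdf_path."""
        ...


class StubConverter:
    """Illustrative adapter satisfying the protocol (stubbed result)."""

    async def convert(self, pdf_path: str) -> dict:
        return {"source": pdf_path, "engine": "local"}


# Structural typing: the adapter needs no inheritance to satisfy the port.
converter: DocumentConverter = StubConverter()
result = asyncio.run(converter.convert("sample.pdf"))
```

Because services depend only on the protocol, swapping local for remote conversion (the CONVERSION_ENGINE setting) never touches the domain layer.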

Frontend structure (feature-based)

frontend/src/
├── app/                      # App shell, router, global styles
├── pages/                    # Route-level pages
│   ├── HomePage.vue          # Landing page with upload & stats
│   ├── StudioPage.vue        # PDF viewer + config + results
│   ├── DocumentsPage.vue     # Document management
│   ├── HistoryPage.vue       # Past analyses
│   └── SettingsPage.vue      # Theme, language, API URL
├── features/                 # Feature modules
│   ├── analysis/             # Analysis store, API, bbox, UI components
│   ├── document/             # Document store, API, upload, list
│   ├── history/              # History store, API, navigation
│   └── settings/             # Settings store
└── shared/                   # Shared utilities (types, i18n, http, format)

Quick Start

One command, nothing else to install:

docker run -p 3000:3000 ghcr.io/scub-france/docling-studio:latest-local

Open http://localhost:3000, upload a PDF, and get results. That's it.

Note: The first analysis takes longer as Docling downloads its ML models (~400 MB). Subsequent runs are fast.

Image variants

| Variant | Image tag | Size | Description |
|---|---|---|---|
| local | latest-local | ~1.9 GB | Full — runs Docling in-process, CPU-only |
| remote | latest-remote | ~270 MB | Lightweight — delegates to an external Docling Serve instance |

For remote mode:

docker run -p 3000:3000 \
  -e DOCLING_SERVE_URL=http://your-docling-serve:5001 \
  ghcr.io/scub-france/docling-studio:latest-remote

Docker Compose

git clone https://github.com/scub-france/Docling-Studio.git
cd Docling-Studio

# Simple mode (backend + frontend only)
docker compose up --build

# With ingestion pipeline (OpenSearch + embeddings)
docker compose --profile ingestion -f docker-compose.yml -f docker-compose.ingestion.yml up --build

Local Development

Backend (Python 3.12+):

cd document-parser
python -m venv .venv && source .venv/bin/activate

# Remote mode (lightweight)
pip install -r requirements.txt

# Local mode (with Docling)
pip install -r requirements-local.txt

uvicorn main:app --reload --port 8000

Frontend (Node 20+):

cd frontend
npm install
npm run dev

Running Tests

# Backend (377 tests)
cd document-parser
pip install pytest pytest-asyncio httpx
pytest tests/ -v

# Frontend (156 tests)
cd frontend
npm run test:run

Pipeline Options

These options map directly to Docling's PdfPipelineOptions. See the Docling documentation for details on each feature.

| Option | Default | Description |
|---|---|---|
| do_ocr | true | OCR for scanned pages and embedded images |
| do_table_structure | true | Table detection and row/column reconstruction |
| table_mode | accurate | accurate (TableFormer) or fast |
| do_code_enrichment | false | Specialized OCR for code blocks |
| do_formula_enrichment | false | Math formula recognition (LaTeX output) |
| do_picture_classification | false | Classify images by type (chart, photo, diagram…) |
| do_picture_description | false | Generate image descriptions via VLM |
| generate_picture_images | false | Extract detected images as separate files |
| generate_page_images | false | Rasterize each page as an image |
| images_scale | 1.0 | Scale factor for generated images (0.1–10) |
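As an example, a pipeline configuration could be expressed as a JSON body like the following. The exact request shape expected by the Studio API is an assumption here; only the option names and defaults come from the table above.

```python
# Hypothetical analysis request body; option names mirror the table above.
pipeline_options = {
    "do_ocr": True,
    "do_table_structure": True,
    "table_mode": "accurate",       # or "fast"
    "do_formula_enrichment": True,  # LaTeX output for math
    "generate_page_images": False,
    "images_scale": 2.0,            # must stay within 0.1-10
}

# Basic client-side sanity check before submitting.
assert 0.1 <= pipeline_options["images_scale"] <= 10
```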

Configuration

All configuration is done via environment variables. See .env.example.

| Variable | Default | Description |
|---|---|---|
| CONVERSION_ENGINE | local | local (in-process Docling) or remote (Docling Serve) |
| DOCLING_SERVE_URL | http://localhost:5001 | Docling Serve endpoint (remote mode only) |
| DOCLING_SERVE_API_KEY | (empty) | API key for Docling Serve (optional) |
| CORS_ORIGINS | http://localhost:3000,... | CORS allowed origins (comma-separated) |
| UPLOAD_DIR | ./uploads | File storage directory |
| DB_PATH | ./data/docling_studio.db | SQLite database path |
| CONVERSION_TIMEOUT | 600 | Max seconds for a single Docling conversion |
| BATCH_PAGE_SIZE | 10 | Pages per batch (0 = process all at once) |
| MAX_FILE_SIZE_MB | 50 | Maximum upload file size in MB (0 = unlimited) |
| MAX_PAGE_COUNT | 0 | Maximum number of pages per document (0 = unlimited) |
| RATE_LIMIT_RPM | 100 | Max requests per minute per IP (0 = disabled) |
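For instance, a remote-mode deployment might use a .env along these lines (values are illustrative, not a recommended production setup):

```env
CONVERSION_ENGINE=remote
DOCLING_SERVE_URL=http://docling-serve:5001
CORS_ORIGINS=http://localhost:3000
MAX_FILE_SIZE_MB=50
RATE_LIMIT_RPM=100
```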

Upload Limits

Docling Studio enforces configurable limits on uploaded documents to protect the server against oversized files and long-running analyses:

  • MAX_FILE_SIZE_MB (default 50) — rejects uploads exceeding this size. Validated at two levels: early Content-Length check and streaming byte count.
  • MAX_PAGE_COUNT (default 0 = unlimited) — rejects documents with more pages than allowed. Useful on shared instances or Hugging Face Spaces to cap processing time.

Both limits are exposed in the /api/health endpoint so the frontend can display them to the user before upload. Set either to 0 to disable the corresponding check.

Ingestion Pipeline (opt-in)

Docling Studio can optionally index extracted chunks into OpenSearch for vector and full-text search. This requires two additional services (OpenSearch + embedding) and is disabled by default.

To enable ingestion with Docker Compose:

docker compose --profile ingestion \
  -f docker-compose.yml -f docker-compose.ingestion.yml \
  up --build

When ingestion is enabled, the UI shows:

  • An Ingest button in Studio to push chunks to OpenSearch
  • An OpenSearch connection status badge in the sidebar
  • Indexed / Not indexed filters on the Documents page
  • A Search page for full-text and vector search across indexed documents
| Variable | Default | Description |
|---|---|---|
| OPENSEARCH_URL | (empty) | OpenSearch endpoint (empty = ingestion disabled) |
| EMBEDDING_URL | (empty) | Embedding service endpoint (empty = ingestion disabled) |
| EMBEDDING_DIMENSION | 384 | Vector dimension (must match embedding model) |

CI / Release

GitHub Actions pipelines (see .github/workflows/):

| Workflow | Trigger | What it does |
|---|---|---|
| CI | push to main, pull requests | Lint + type check + backend tests + frontend tests + build |
| Release | push tag v* | Build & push two multi-arch Docker images (remote + local) to ghcr.io |
| Docs | push to main (docs changes) | Build & deploy MkDocs to GitHub Pages |

We follow Semantic Versioning with a simplified Git Flow. See CONTRIBUTING.md for the full release process.

Performance & System Requirements

| Document type | Pages | Approx. time (CPU) |
|---|---|---|
| Simple report | 5–10 | ~30 s–1 min |
| Research paper | 10–30 | ~1–2 min |
| Large document | 100+ | ~2–5 min |

Docker Desktop settings

| | Remote image | Local image |
|---|---|---|
| Image size | ~270 MB | ~1.9 GB |
| Memory | 2 GB | 6 GB (recommended 8 GB+) |
| CPUs | 2 | 4 (recommended 8+) |

Platform support

All Docker images are multi-arch (linux/amd64 + linux/arm64). No GPU required.

Tech Stack

  • Frontend: Vue 3, TypeScript, Vite, Pinia, DOMPurify
  • Backend: FastAPI, Docling 2.x, SQLite (aiosqlite), pdf2image
  • CI: GitHub Actions
  • Infra: Docker Compose + Nginx

Contributing

Contributions are welcome! Please open an issue first to discuss what you'd like to change.

License

MIT — Pier-Jean Malandrino