A visual document analysis studio powered by Docling. Upload a PDF, configure the extraction pipeline, and visualize the results — text, tables, images, formulas, bounding boxes — all from your browser.
- Home page with quick upload and recent documents
- PDF viewer with page navigation, bounding box overlay, and resizable results panel
- Configurable Docling pipeline — OCR, table extraction, code/formula enrichment, picture classification & description, image generation
- Bounding box visualization — color-coded element overlay directly on the PDF
- Per-page results — right panel syncs with the current PDF page
- Chunking — split extracted content into semantic chunks (hierarchical, hybrid, or page-based) with configurable token limits and inline editing
- Ingestion pipeline — Docling → chunking → embedding → OpenSearch vector indexing (one-click from Studio)
- Markdown & HTML export of extracted content
- Document management — upload, list, delete, search, filter by indexing status
- Analysis history — re-visit and open past analyses
- Upload limits — configurable max file size and max page count per document
- Rate limiting — configurable requests per minute per IP
- Dark / Light theme and FR / EN localization
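The chunking feature above can be pictured as a token-limited splitter. The sketch below is a minimal illustration only, not Docling Studio's actual chunkers (the hierarchical and hybrid modes are structure-aware, and real token counting uses a tokenizer, not whitespace); the function name is hypothetical:

```python
from typing import Iterator

def chunk_by_tokens(text: str, max_tokens: int = 200) -> Iterator[str]:
    """Split text into chunks of at most `max_tokens` whitespace tokens.

    Toy stand-in for the page-based chunking mode with a token limit;
    the real chunkers respect document hierarchy, not just counts.
    """
    tokens = text.split()
    for start in range(0, len(tokens), max_tokens):
        yield " ".join(tokens[start:start + max_tokens])

chunks = list(chunk_by_tokens("one two three four five", max_tokens=2))
# chunks == ["one two", "three four", "five"]
```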
```
┌────────────┐          ┌──────────────────────┐
│  Frontend  │────────▶ │  Document Parser     │
│  Vue 3     │  /api/*  │  FastAPI + Docling   │
│  port 3000 │          │  SQLite + file storage│
└────────────┘          │  port 8000           │
                        └──────────────────────┘
```
| Service | Stack | Role |
|---|---|---|
| frontend | Vue 3, TypeScript, Vite, Pinia | UI, PDF viewer, results display |
| document-parser | FastAPI, Docling, SQLite, pdf2image | REST API, document parsing, storage |
```
document-parser/
├── main.py                  # FastAPI app, CORS, lifespan
├── domain/                  # Pure domain — no HTTP, no DB
│   ├── models.py            # Document, AnalysisJob dataclasses
│   ├── ports.py             # Abstract protocols (converter, chunker)
│   └── value_objects.py     # ConversionResult, PageDetail, ChunkResult
├── api/                     # HTTP layer (FastAPI routers)
│   ├── schemas.py           # Pydantic DTOs (camelCase serialization)
│   ├── documents.py         # /api/documents endpoints
│   └── analyses.py          # /api/analyses endpoints
├── persistence/             # Data layer (SQLite via aiosqlite)
│   ├── database.py          # Connection management, schema init
│   ├── document_repo.py     # Document CRUD
│   └── analysis_repo.py     # AnalysisJob CRUD
├── services/                # Use case orchestration
│   ├── document_service.py  # Upload, delete, preview
│   └── analysis_service.py  # Async Docling processing
└── tests/                   # 377 tests (pytest)
```
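The ports in `domain/ports.py` can be pictured as structural interfaces that the conversion engines implement. The sketch below uses `typing.Protocol` and is hypothetical: the actual protocol names, signatures, and value-object fields in the codebase may differ.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ConversionResult:
    """Simplified stand-in for the real value object."""
    markdown: str
    page_count: int

class DocumentConverter(Protocol):
    """Port: any conversion engine (local Docling, remote Docling Serve)."""
    def convert(self, pdf_path: str) -> ConversionResult: ...

class FakeConverter:
    """Test double satisfying the port structurally (no inheritance needed)."""
    def convert(self, pdf_path: str) -> ConversionResult:
        return ConversionResult(markdown=f"# {pdf_path}", page_count=1)

def analyze(converter: DocumentConverter, path: str) -> int:
    """Service-layer code depends only on the port, not on Docling."""
    return converter.convert(path).page_count
```

Structural typing is what lets the `local`/`remote` engines swap behind the same service code without a shared base class.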
```
frontend/src/
├── app/                   # App shell, router, global styles
├── pages/                 # Route-level pages
│   ├── HomePage.vue       # Landing page with upload & stats
│   ├── StudioPage.vue     # PDF viewer + config + results
│   ├── DocumentsPage.vue  # Document management
│   ├── HistoryPage.vue    # Past analyses
│   └── SettingsPage.vue   # Theme, language, API URL
├── features/              # Feature modules
│   ├── analysis/          # Analysis store, API, bbox, UI components
│   ├── document/          # Document store, API, upload, list
│   ├── history/           # History store, API, navigation
│   └── settings/          # Settings store
└── shared/                # Shared utilities (types, i18n, http, format)
```
One command, nothing else to install:

```bash
docker run -p 3000:3000 ghcr.io/scub-france/docling-studio:latest-local
```

Open http://localhost:3000, upload a PDF, and get results. That's it.
Note: The first analysis takes longer as Docling downloads its ML models (~400 MB). Subsequent runs are fast.
| Variant | Image tag | Size | Description |
|---|---|---|---|
| local | `latest-local` | ~1.9 GB | Full — runs Docling in-process, CPU-only |
| remote | `latest-remote` | ~270 MB | Lightweight — delegates to an external Docling Serve instance |
For remote mode:

```bash
docker run -p 3000:3000 \
  -e DOCLING_SERVE_URL=http://your-docling-serve:5001 \
  ghcr.io/scub-france/docling-studio:latest-remote
```

```bash
git clone https://github.com/scub-france/Docling-Studio.git
cd Docling-Studio

# Simple mode (backend + frontend only)
docker compose up --build

# With ingestion pipeline (OpenSearch + embeddings)
docker compose --profile ingestion -f docker-compose.yml -f docker-compose.ingestion.yml up --build
```

Backend (Python 3.12+):
```bash
cd document-parser
python -m venv .venv && source .venv/bin/activate

# Remote mode (lightweight)
pip install -r requirements.txt

# Local mode (with Docling)
pip install -r requirements-local.txt

uvicorn main:app --reload --port 8000
```

Frontend (Node 20+):
```bash
cd frontend
npm install
npm run dev
```

```bash
# Backend (377 tests)
cd document-parser
pip install pytest pytest-asyncio httpx
pytest tests/ -v

# Frontend (156 tests)
cd frontend
npm run test:run
```

These options map directly to Docling's `PdfPipelineOptions`. See the Docling documentation for details on each feature.
| Option | Default | Description |
|---|---|---|
| `do_ocr` | `true` | OCR for scanned pages and embedded images |
| `do_table_structure` | `true` | Table detection and row/column reconstruction |
| `table_mode` | `accurate` | `accurate` (TableFormer) or `fast` |
| `do_code_enrichment` | `false` | Specialized OCR for code blocks |
| `do_formula_enrichment` | `false` | Math formula recognition (LaTeX output) |
| `do_picture_classification` | `false` | Classify images by type (chart, photo, diagram…) |
| `do_picture_description` | `false` | Generate image descriptions via VLM |
| `generate_picture_images` | `false` | Extract detected images as separate files |
| `generate_page_images` | `false` | Rasterize each page as an image |
| `images_scale` | `1.0` | Scale factor for generated images (0.1–10) |
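As a rough illustration, a request configuration mirroring the defaults in the table might be expressed as a plain mapping. The payload shape below is an assumption made for illustration; only the option names and defaults come from the table:

```python
# Hypothetical analysis-request config mirroring the table's defaults.
default_pipeline_options = {
    "do_ocr": True,
    "do_table_structure": True,
    "table_mode": "accurate",
    "do_code_enrichment": False,
    "do_formula_enrichment": False,
    "do_picture_classification": False,
    "do_picture_description": False,
    "generate_picture_images": False,
    "generate_page_images": False,
    "images_scale": 1.0,
}

def validate_images_scale(options: dict) -> None:
    """Enforce the documented 0.1–10 range for images_scale."""
    scale = options["images_scale"]
    if not 0.1 <= scale <= 10:
        raise ValueError(f"images_scale out of range: {scale}")
```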
All configuration is done via environment variables. See `.env.example`.

| Variable | Default | Description |
|---|---|---|
| `CONVERSION_ENGINE` | `local` | `local` (in-process Docling) or `remote` (Docling Serve) |
| `DOCLING_SERVE_URL` | `http://localhost:5001` | Docling Serve endpoint (remote mode only) |
| `DOCLING_SERVE_API_KEY` | — | API key for Docling Serve (optional) |
| `CORS_ORIGINS` | `http://localhost:3000,...` | CORS allowed origins (comma-separated) |
| `UPLOAD_DIR` | `./uploads` | File storage directory |
| `DB_PATH` | `./data/docling_studio.db` | SQLite database path |
| `CONVERSION_TIMEOUT` | `600` | Max seconds for a single Docling conversion |
| `BATCH_PAGE_SIZE` | `10` | Pages per batch (0 = process all at once) |
| `MAX_FILE_SIZE_MB` | `50` | Maximum upload file size in MB (0 = unlimited) |
| `MAX_PAGE_COUNT` | `0` | Maximum number of pages per document (0 = unlimited) |
| `RATE_LIMIT_RPM` | `100` | Max requests per minute per IP (0 = disabled) |
Docling Studio enforces configurable limits on uploaded documents to protect the server against oversized files and long-running analyses:
- `MAX_FILE_SIZE_MB` (default `50`) — rejects uploads exceeding this size. Validated at two levels: an early `Content-Length` check and a streaming byte count.
- `MAX_PAGE_COUNT` (default `0` = unlimited) — rejects documents with more pages than allowed. Useful on shared instances or Hugging Face Spaces to cap processing time.
Both limits are exposed in the `/api/health` endpoint so the frontend can display them to the user before upload. Set either to `0` to disable the corresponding check.
Docling Studio can optionally index extracted chunks into OpenSearch for vector and full-text search. This requires two additional services (OpenSearch + embedding) and is disabled by default.
To enable ingestion with Docker Compose:
```bash
docker compose --profile ingestion \
  -f docker-compose.yml -f docker-compose.ingestion.yml \
  up --build
```

When ingestion is enabled, the UI shows:
- An Ingest button in Studio to push chunks to OpenSearch
- An OpenSearch connection status badge in the sidebar
- Indexed / Not indexed filters on the Documents page
- A Search page for full-text and vector search across indexed documents
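The Docling → chunking → embedding → indexing flow boils down to pairing each chunk with its vector and sending the pairs to OpenSearch as bulk actions. A sketch with a fake embedder; the index name, field names, and document shape are assumptions, not the project's actual schema:

```python
from typing import Callable

def build_bulk_actions(
    chunks: list,
    embed: Callable[[str], list],
    index: str = "documents",  # hypothetical index name
) -> list:
    """Pair each chunk with its embedding as OpenSearch bulk actions."""
    actions = []
    for i, chunk in enumerate(chunks):
        actions.append({"index": {"_index": index, "_id": str(i)}})
        actions.append({"text": chunk, "embedding": embed(chunk)})
    return actions

def fake_embed(text: str) -> list:
    """Stand-in for the real embedding service behind EMBEDDING_URL."""
    return [float(len(text)), 0.0]

actions = build_bulk_actions(["alpha", "beta"], fake_embed)
# Two chunks yield four bulk lines: action + source, action + source.
```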
| Variable | Default | Description |
|---|---|---|
| `OPENSEARCH_URL` | — | OpenSearch endpoint (empty = ingestion disabled) |
| `EMBEDDING_URL` | — | Embedding service endpoint (empty = ingestion disabled) |
| `EMBEDDING_DIMENSION` | `384` | Vector dimension (must match the embedding model) |
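`EMBEDDING_DIMENSION` must match the `dimension` of the k-NN vector field in the OpenSearch index mapping. A sketch of such a mapping body; the index settings shown use OpenSearch's k-NN plugin conventions, and the field names are assumptions:

```python
EMBEDDING_DIMENSION = 384  # must match the embedding model's output size

def knn_index_mapping(dimension: int) -> dict:
    """OpenSearch k-NN index body: a text field plus a knn_vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }

mapping = knn_index_mapping(EMBEDDING_DIMENSION)
```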
GitHub Actions pipelines (see `.github/workflows/`):

| Workflow | Trigger | What it does |
|---|---|---|
| CI | push to `main`, pull requests | Lint + type check + backend tests + frontend tests + build |
| Release | push tag `v*` | Build & push two multi-arch Docker images (remote + local) to ghcr.io |
| Docs | push to `main` (docs changes) | Build & deploy MkDocs to GitHub Pages |
We follow Semantic Versioning with a simplified Git Flow. See CONTRIBUTING.md for the full release process.
| Document type | Pages | Approx. time (CPU) |
|---|---|---|
| Simple report | 5–10 | ~30s–1 min |
| Research paper | 10–30 | ~1–2 min |
| Large document | 100+ | ~2–5 min |
|  | Remote image | Local image |
|---|---|---|
| Image size | ~270 MB | ~1.9 GB |
| Memory | 2 GB | 6 GB (8 GB+ recommended) |
| CPUs | 2 | 4 (8+ recommended) |
All Docker images are multi-arch (linux/amd64 + linux/arm64). No GPU required.
- Frontend: Vue 3, TypeScript, Vite, Pinia, DOMPurify
- Backend: FastAPI, Docling 2.x, SQLite (aiosqlite), pdf2image
- CI: GitHub Actions
- Infra: Docker Compose + Nginx
Contributions are welcome! Please open an issue first to discuss what you'd like to change.
MIT — Pier-Jean Malandrino
