Kreuzberg Development Roadmap #24

Goldziher · 2025-03-02T14:15:28Z

Goldziher
Mar 2, 2025
Maintainer

The following is my current thinking about the roadmap. It shouldn't be seen as set in stone, and the time boxing is also tentative and might be a bit overambitious.

Please give your feedback!

Note About Versioning

Kreuzberg follows a rapid development cycle with regular major version releases. Why? Because using major versions (1.x, 2.x, etc.) rather than pre-1.0 versioning (0.1.11, etc.) allows us to evolve the library's interfaces confidently while being explicit about breaking changes. We follow SemVer and always will.

Current: Version 2.x

Core Functionality

Unified async/sync API for document text extraction
Support for PDF, images, Office documents, and markup formats
OCR capabilities via Tesseract integration
Text extraction and metadata extraction via Pandoc
Efficient batch processing
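
To illustrate what a unified async/sync API with batch processing can look like, here is a minimal sketch using only the standard library. The names `ExtractionResult`, `extract_text`, `extract_text_sync`, and `batch_extract_text` are hypothetical and chosen for this example; they are not taken from Kreuzberg's actual interface, and the "extraction" is just a UTF-8 decode.

```python
import asyncio
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExtractionResult:
    """Hypothetical result type for this sketch (not Kreuzberg's class)."""

    content: str
    mime_type: str


async def extract_text(path: Path) -> ExtractionResult:
    """Async entry point: read the file in a worker thread so the event loop stays free."""
    raw = await asyncio.to_thread(path.read_bytes)
    # Placeholder for real extraction: decode the bytes as UTF-8 text.
    return ExtractionResult(content=raw.decode("utf-8", errors="replace"), mime_type="text/plain")


def extract_text_sync(path: Path) -> ExtractionResult:
    """Sync entry point: a thin wrapper that drives the async implementation."""
    return asyncio.run(extract_text(path))


async def batch_extract_text(paths: list[Path]) -> list[ExtractionResult]:
    """Batch entry point: run all extractions concurrently; results keep input order."""
    return list(await asyncio.gather(*(extract_text(p) for p in paths)))
```

The design choice shown here is that the async function is the single implementation, and the sync and batch variants are thin wrappers around it, so behavior cannot drift between the two styles.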

Version 3.x (Q2 2025)

Extensibility

Architecture Update:

Support for creating and using custom extractors for any file format
Capability to override existing extractors
Pre-processing, validation, and post-processing hooks
Extended metadata extraction
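
As a rough illustration of the extensibility items above, the following sketch shows one possible shape for a registry that accepts custom extractors, allows explicit overrides, and runs validation and post-processing hooks. Everything here (`Extractor`, `ExtractorRegistry`, `register`, `add_post_hook`, `PlainTextExtractor`) is a hypothetical name for this example, since the 3.x design is not specified in this post.

```python
from typing import Callable, Protocol


class Extractor(Protocol):
    """Anything with an extract(bytes) -> str method can be registered."""

    def extract(self, data: bytes) -> str: ...


Hook = Callable[[str], str]


class ExtractorRegistry:
    """Hypothetical registry keyed by MIME type (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._extractors: dict[str, Extractor] = {}
        self._post_hooks: list[Hook] = []

    def register(self, mime_type: str, extractor: Extractor, *, override: bool = False) -> None:
        # Replacing an existing extractor must be requested explicitly.
        if mime_type in self._extractors and not override:
            raise ValueError(f"extractor for {mime_type!r} is already registered")
        self._extractors[mime_type] = extractor

    def add_post_hook(self, hook: Hook) -> None:
        self._post_hooks.append(hook)

    def extract(self, mime_type: str, data: bytes) -> str:
        # Validation step: reject empty input and unknown formats early.
        if not data:
            raise ValueError("empty input")
        if mime_type not in self._extractors:
            raise LookupError(f"no extractor registered for {mime_type!r}")
        text = self._extractors[mime_type].extract(data)
        # Post-processing hooks run in registration order.
        for hook in self._post_hooks:
            text = hook(text)
        return text


class PlainTextExtractor:
    """Example custom extractor: decodes UTF-8 bytes."""

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8")
```

A caller would create an `ExtractorRegistry`, call `register("text/plain", PlainTextExtractor())`, optionally add hooks such as `str.strip`, and then call `extract`. Passing `override=True` replaces a previously registered extractor for the same type.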

Enhanced Document Structure

Optional Features (available via extra install groups):

Multiple OCR backends (Paddle OCR, EasyOCR, etc.) with Tesseract becoming optional
Table extraction and representation (awaiting release as V3.1)
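
One common way to make backends optional is to check at runtime which backend packages are importable and fail with a clear message otherwise. The sketch below assumes a hypothetical mapping from backend names to top-level module names; the function names and the mapping are illustrative, not Kreuzberg's actual mechanism.

```python
from importlib.util import find_spec

# Hypothetical mapping: backend name -> top-level module an extra would install.
OCR_BACKEND_MODULES = {
    "easyocr": "easyocr",
    "paddleocr": "paddleocr",
}


def available_backends(modules: dict[str, str]) -> list[str]:
    """Return the backend names whose module can be imported in this environment."""
    return [name for name, module in modules.items() if find_spec(module) is not None]


def select_backend(preferred: str, modules: dict[str, str]) -> str:
    """Return the preferred backend if installed, otherwise raise a helpful ImportError."""
    installed = available_backends(modules)
    if preferred in installed:
        return preferred
    raise ImportError(
        f"OCR backend {preferred!r} is not installed; "
        f"install the matching optional extra. Installed backends: {installed}"
    )
```

Because `find_spec` only looks the module up without importing it, the check stays cheap even when a backend pulls in heavy dependencies.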

Possible path forward (TBD):

Version 4.x (Q3 2025)

Automatic language detection
Entity/keyword extraction

Model-Based Processing

Optional Vision Model Integration:

Structured text extraction using open source vision models (Qwen 2.5, Phi-3 Vision, etc.)
Plug-and-play support for both CPU and GPU (via HF transformers or ONNX)
Custom prompting with structured output generation (similar to Pydantic for document extraction)

Optional Specialized OCR:

Support for advanced OCR models (TrOCR, Donut, etc.)
Auto-finetuning capabilities for improved accuracy with user data
Lightweight deployment options for serverless environments

Optional Heuristics:

Model-based heuristics for automatic pipeline optimization
Automatic document type detection and processing selection
Result validation and quality assessment
Parameter optimization through automated feedback
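
To make "automatic document type detection and processing selection" concrete, here is a deliberately simple rule-based stand-in that uses file signatures (magic bytes) rather than a model. The pipeline names and the functions `detect_mime_type` and `select_pipeline` are hypothetical; a model-based heuristic as described above would replace the lookup table with a learned classifier.

```python
from typing import Optional

# Well-known file signatures (magic bytes) and their MIME types.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]

# Hypothetical pipeline names, used only for this sketch.
PIPELINES = {
    "application/pdf": "pdf-text-then-ocr",
    "image/png": "ocr",
    "image/jpeg": "ocr",
}


def detect_mime_type(data: bytes) -> Optional[str]:
    """Return the MIME type whose signature prefixes the data, or None if unknown."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None


def select_pipeline(data: bytes) -> str:
    """Pick a processing pipeline from the detected type, with a generic fallback."""
    mime_type = detect_mime_type(data)
    if mime_type is None:
        return "fallback"
    return PIPELINES[mime_type]
```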

Version 5.x (Q4 2025)

Integration & Ecosystem

Optional Enterprise Integrations:

Connectors for major cloud document platforms:
- Azure Document Intelligence
- AWS Textract
- Google Cloud Document AI
- NVIDIA Document Understanding
User-provided credential management
Standardized response format using Kreuzberg's data types
Integration with Kreuzberg's intelligent processing heuristics

TBD...
