feat: add Djot format support #263

dereuromark · 2026-01-04T15:45:13Z

Summary

Add new DjotExtractor for parsing Djot markup documents
Uses the jotdown crate (pull-parser similar to pulldown-cmark)
Supports YAML frontmatter metadata extraction
Supports table extraction as structured data
Registered under the office feature flag

Implementation Details

Djot is a modern markup language designed by John MacFarlane (creator of Pandoc and CommonMark). It has simpler parsing rules than CommonMark while supporting similar features.

The implementation follows the same pattern as the existing MarkdownExtractor:

YAML frontmatter parsing for metadata (title, author, date, keywords, etc.)
Event-based text extraction from the AST
Table extraction with markdown output format
Title extraction from first heading if not in frontmatter

MIME types: text/djot, text/x-djot
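As an illustration of the "table extraction with markdown output format" item above, here is a minimal sketch of rendering extracted cells as a Markdown pipe table. The function name `table_to_markdown` and the `Vec<Vec<String>>` row layout are assumptions made for this example, not the PR's actual types, and the first row is assumed to be the header.

```rust
/// Render a table as a Markdown pipe table.
/// Assumption for this sketch: the first row is the header row.
fn table_to_markdown(rows: &[Vec<String>]) -> String {
    let Some(header) = rows.first() else {
        return String::new();
    };
    let cols = header.len();
    if cols == 0 {
        return String::new();
    }

    let mut out = String::new();
    for (i, row) in rows.iter().enumerate() {
        out.push('|');
        for c in 0..cols {
            // Short rows are padded with empty cells; extra cells are dropped.
            let cell = row.get(c).map(String::as_str).unwrap_or("");
            out.push(' ');
            // Escape pipes so cell content cannot break the column layout.
            out.push_str(&cell.replace('|', "\\|"));
            out.push_str(" |");
        }
        out.push('\n');

        // Emit the separator line directly after the header row.
        if i == 0 {
            out.push('|');
            for _ in 0..cols {
                out.push_str(" --- |");
            }
            out.push('\n');
        }
    }
    out
}
```

Keeping the structured rows alongside the rendered string lets callers use either representation.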

Test plan

Unit tests included in the PR
cargo test -p kreuzberg --features office passes
Manual testing with sample .djot files

Closes #262

Add a new DjotExtractor that parses Djot markup documents using the jotdown crate. Djot is a modern markup language with simpler parsing rules than CommonMark.

Features:
- YAML frontmatter metadata extraction
- Table extraction as structured data
- Heading structure preservation
- Code block and link extraction
- Smart punctuation handling

The implementation follows the same pattern as the Markdown extractor, making it consistent with the existing codebase.

MIME types: text/djot, text/x-djot

Closes kreuzberg-dev#262

Goldziher · 2026-01-04T15:55:50Z

crates/kreuzberg/src/extractors/djot.rs

+//! Djot is a modern markup language with simpler parsing rules than CommonMark.
+//! See https://djot.net for the specification.
+//!
+//! Requires the `office` feature.


Q: why is this needed?

This doc comment follows the pattern used in other extractors (e.g. markdown.rs line 13). It documents that the extractor is only compiled when the office feature is enabled, which is useful for users reading the docs.

That said, it's not strictly necessary - I can remove it if you prefer a cleaner module doc.

No, I meant: why do we need the office feature?

This is a separate format, and as far as I saw (cursory glance) you don't really need anything from that group.

Good point - refactored to a separate djot feature. It now only pulls in jotdown + tokio-runtime, independent of office.

Move Djot extractor to its own feature flag since it only needs jotdown and serde_yaml_ng (already a core dep), without requiring the full office feature dependencies.

- Add `djot` feature with just `dep:jotdown` + `tokio-runtime`
- Include `djot` in the `full` feature
- Update all cfg attributes from `office` to `djot`

Goldziher

Thanks for this contribution! The implementation is solid overall, but there are several issues that need to be addressed before merge:

Critical Issues

1. Doc comment (line 13)

The doc comment still says Requires the 'office' feature but should say Requires the 'djot' feature after your refactor.

2. Code Duplication

extract_frontmatter() and extract_metadata_from_yaml() are identical to the markdown extractor (142 lines duplicated). This creates maintenance burden.

3. Inefficient Double Parsing (lines 348-363)

The content is parsed twice - once for text extraction, once for tables. You're already collecting events into a Vec, so they could be processed in a single pass.
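A rough sketch of what single-pass processing could look like. The `Event` enum here is a hypothetical, heavily simplified stand-in and is not jotdown's real event type; `Extracted` and `extract_single_pass` are likewise names invented for this example. In this sketch, text inside a table goes only to the table cells, which may differ from what the extractor should actually do.

```rust
/// Hypothetical, simplified event stream (NOT jotdown's actual `Event` type).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Text(String),
    BlockEnd, // end of a paragraph or heading
    TableStart,
    CellEnd,
    RowEnd,
    TableEnd,
}

/// Hypothetical result type: plain text plus tables as rows of cells.
#[derive(Debug, Default, PartialEq)]
struct Extracted {
    text: String,
    tables: Vec<Vec<Vec<String>>>,
}

/// Walk the already-collected events once, filling both outputs.
fn extract_single_pass(events: &[Event]) -> Extracted {
    let mut out = Extracted::default();
    // `Some` while we are inside a table, `None` otherwise.
    let mut table: Option<Vec<Vec<String>>> = None;
    let mut row: Vec<String> = Vec::new();
    let mut cell = String::new();

    for ev in events {
        match ev {
            Event::Text(t) => {
                if table.is_some() {
                    cell.push_str(t);
                } else {
                    out.text.push_str(t);
                }
            }
            Event::BlockEnd => {
                if table.is_none() {
                    out.text.push('\n');
                }
            }
            Event::TableStart => table = Some(Vec::new()),
            Event::CellEnd => row.push(std::mem::take(&mut cell)),
            Event::RowEnd => {
                if let Some(t) = table.as_mut() {
                    t.push(std::mem::take(&mut row));
                }
            }
            Event::TableEnd => {
                if let Some(t) = table.take() {
                    out.tables.push(t);
                }
            }
        }
    }
    out
}
```

The table state lives in a few local variables, so no second parse or second walk over the event `Vec` is needed.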

4. Test Plan Incomplete

The PR description shows unchecked test boxes, and one still references the wrong feature (office instead of djot).

5. Frontmatter Parser Edge Case

The parser breaks if YAML content contains --- on its own line or uses ... as a terminator (both valid YAML). This is also a bug in the markdown extractor, but worth fixing here.
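One possible way to handle these cases is sketched below. `split_frontmatter` is a hypothetical helper, not the function in the PR. It assumes frontmatter must open with `---` on the first line, accepts either `---` or `...` as the closing marker, and only treats a marker as a terminator when it starts at column 0, so an indented `---` inside a YAML block scalar stays part of the metadata. An unindented `---` is still read as the end of the frontmatter.

```rust
/// Split a document into (frontmatter YAML, body).
/// Returns `(None, input)` when no complete frontmatter block is found.
fn split_frontmatter(input: &str) -> (Option<&str>, &str) {
    // The opening marker must be the very first line.
    let Some(first_nl) = input.find('\n') else {
        return (None, input);
    };
    if input[..first_nl].trim_end() != "---" {
        return (None, input);
    }

    let yaml_start = first_nl + 1;
    let mut offset = yaml_start;
    for line in input[yaml_start..].split_inclusive('\n') {
        // Only trailing whitespace (including `\r`) is trimmed, so an
        // indented `---` or `...` is treated as YAML content, not a marker.
        let marker = line.trim_end();
        if marker == "---" || marker == "..." {
            let yaml = &input[yaml_start..offset];
            let body = &input[offset + line.len()..];
            return (Some(yaml), body);
        }
        offset += line.len();
    }

    // Unterminated frontmatter: treat the whole input as body.
    (None, input)
}
```

Since the markdown extractor has the same bug, a helper along these lines could live in a shared module used by both extractors, which would also address the duplication noted in issue 2.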

Also, please update the root README and the documentation under docs as required.

Goldziher reviewed Jan 4, 2026

Goldziher reviewed Jan 5, 2026