@dereuromark

Summary

  • Add new DjotExtractor for parsing Djot markup documents
  • Uses the jotdown crate (pull-parser similar to pulldown-cmark)
  • Supports YAML frontmatter metadata extraction
  • Supports table extraction as structured data
  • Registered under the office feature flag

Implementation Details

Djot is a modern markup language designed by John MacFarlane (creator of Pandoc and CommonMark). It has simpler parsing rules than CommonMark while supporting similar features.

The implementation follows the same pattern as the existing MarkdownExtractor:

  • YAML frontmatter parsing for metadata (title, author, date, keywords, etc.)
  • Event-based text extraction from the AST
  • Table extraction with markdown output format
  • Title extraction from first heading if not in frontmatter
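The title fallback in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Event` enum is a hypothetical, simplified stand-in for the parser's event stream (the real jotdown event type is richer), and `resolve_title` is an assumed helper name.

```rust
/// Hypothetical, simplified stand-in for a pull-parser event stream.
/// Illustrative only; this is not the jotdown event type.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    StartHeading(u8),
    EndHeading,
    Text(String),
}

/// Prefer a non-empty frontmatter title; otherwise fall back to the
/// concatenated text of the first heading found in the event stream.
pub fn resolve_title(frontmatter_title: Option<&str>, events: &[Event]) -> Option<String> {
    if let Some(title) = frontmatter_title.map(str::trim).filter(|t| !t.is_empty()) {
        return Some(title.to_string());
    }

    let mut in_heading = false;
    let mut buf = String::new();
    for event in events {
        match event {
            Event::StartHeading(_) => in_heading = true,
            // Only text inside the heading contributes to the title.
            Event::Text(t) if in_heading => buf.push_str(t),
            // Stop at the end of the first heading, whatever its level.
            Event::EndHeading if in_heading => {
                let title = buf.trim();
                return if title.is_empty() { None } else { Some(title.to_string()) };
            }
            _ => {}
        }
    }
    None
}
```

Under these assumptions, a document with no frontmatter title whose first heading is "Hello Djot" would resolve to that heading text, while a frontmatter title always takes precedence.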

MIME types: text/djot, text/x-djot

Test plan

  • Unit tests included in the PR
  • cargo test -p kreuzberg --features office passes
  • Manual testing with sample .djot files

Closes #262

Add a new DjotExtractor that parses Djot markup documents using the
jotdown crate. Djot is a modern markup language with simpler parsing
rules than CommonMark.

Features:
- YAML frontmatter metadata extraction
- Table extraction as structured data
- Heading structure preservation
- Code block and link extraction
- Smart punctuation handling

The implementation follows the same pattern as the Markdown extractor,
making it consistent with the existing codebase.

MIME types: text/djot, text/x-djot

Closes kreuzberg-dev#262

//! Djot is a modern markup language with simpler parsing rules than CommonMark.
//! See https://djot.net for the specification.
//!
//! Requires the `office` feature.

Collaborator

Q: why is this needed?

Author

This doc comment follows the pattern used in other extractors (e.g. markdown.rs line 13). It documents that the extractor is only compiled when the office feature is enabled, which is useful for users reading the docs.

That said, it's not strictly necessary - I can remove it if you prefer a cleaner module doc.

Collaborator

No, I meant: why do we need the office feature?

This is a separate format, and as far as I saw (cursory glance), you don't really need anything from that group.

Author

Good point - refactored to a separate djot feature. It now only pulls in jotdown + tokio-runtime, independent of office.

Move Djot extractor to its own feature flag since it only needs
jotdown and serde_yaml_ng (already a core dep), without requiring
the full office feature dependencies.

- Add `djot` feature with just `dep:jotdown` + `tokio-runtime`
- Include `djot` in the `full` feature
- Update all cfg attributes from `office` to `djot`

Collaborator

@Goldziher left a comment

Thanks for this contribution! The implementation is solid overall, but there are several issues that need to be addressed before merge:

Critical Issues

1. Doc comment (line 13)

The doc comment still says "Requires the `office` feature" but should say "Requires the `djot` feature" after your refactor.

2. Code Duplication

extract_frontmatter() and extract_metadata_from_yaml() are identical to those in the markdown extractor (142 lines duplicated). This creates a maintenance burden.

3. Inefficient Double Parsing (lines 348-363)

The content is parsed twice - once for text extraction, once for tables. You're already collecting events into a Vec, so they could be processed in a single pass.
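A single-pass version might look roughly like the sketch below. It assumes a hypothetical, simplified `DocEvent` enum in place of the real parser events, and `extract_single_pass` is an illustrative name rather than anything in the PR. The idea is only that one loop over the already-collected events can fill both the text buffer and the table list.

```rust
/// Hypothetical, simplified event type standing in for the parser's events.
#[derive(Debug, Clone, PartialEq)]
pub enum DocEvent {
    Text(String),
    ParagraphEnd,
    TableStart,
    RowStart,
    CellEnd,
    RowEnd,
    TableEnd,
}

/// A table as rows of cell strings.
pub type Table = Vec<Vec<String>>;

/// Walk the events once, producing both the plain text and the tables.
/// In this sketch, text inside table cells goes only to the table output.
pub fn extract_single_pass(events: &[DocEvent]) -> (String, Vec<Table>) {
    let mut text = String::new();
    let mut tables: Vec<Table> = Vec::new();
    let mut table: Option<Table> = None; // table currently being built, if any
    let mut row: Vec<String> = Vec::new();
    let mut cell = String::new();

    for event in events {
        match event {
            DocEvent::Text(t) => {
                if table.is_some() {
                    cell.push_str(t);
                } else {
                    text.push_str(t);
                }
            }
            DocEvent::ParagraphEnd => text.push('\n'),
            DocEvent::TableStart => table = Some(Vec::new()),
            DocEvent::RowStart => row.clear(),
            // Finish the current cell and reset the cell buffer.
            DocEvent::CellEnd => row.push(std::mem::take(&mut cell).trim().to_string()),
            DocEvent::RowEnd => {
                if let Some(t) = table.as_mut() {
                    t.push(std::mem::take(&mut row));
                }
            }
            DocEvent::TableEnd => {
                if let Some(t) = table.take() {
                    tables.push(t);
                }
            }
        }
    }
    (text, tables)
}
```

Whether table text should also appear in the plain-text output is a design choice the extractor would have to make; the sketch keeps the two outputs separate for clarity.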

4. Test Plan Incomplete

The PR description shows unchecked test boxes, and one still references the wrong feature (office instead of djot).

5. Frontmatter Parser Edge Case

The parser breaks if YAML content contains --- on its own line or uses ... as a terminator (both valid YAML). This is also a bug in the markdown extractor, but worth fixing here.
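One possible way to harden the split is to scan line by line and treat only an unindented `---` or `...` line as the closing marker. The sketch below is an assumption about how that could be done, not the extractor's actual code, and `split_frontmatter` is a hypothetical helper name. It accepts `...` as a terminator and ignores `---` that appears indented (for example inside a block scalar) or mid-line. An unindented `---` still ends the block, which matches YAML's own reading of that line as a document boundary.

```rust
/// Split a document into (frontmatter, body).
/// The frontmatter is `None` when the document does not open with a `---`
/// line or when no closing marker is found; the body is then the whole input.
pub fn split_frontmatter(input: &str) -> (Option<&str>, &str) {
    // The opening marker must be the very first line.
    let Some(first_line_end) = input.find('\n') else {
        return (None, input);
    };
    if input[..first_line_end].trim_end() != "---" {
        return (None, input);
    }

    let yaml_start = first_line_end + 1;
    let mut offset = yaml_start; // byte offset of the current line in `input`
    for line in input[yaml_start..].split_inclusive('\n') {
        // `trim_end` drops "\n" / "\r\n" but keeps leading indentation, so an
        // indented `---` inside a block scalar is not mistaken for a marker.
        let marker = line.trim_end();
        if marker == "---" || marker == "..." {
            let body_start = offset + line.len();
            return (Some(&input[yaml_start..offset]), &input[body_start..]);
        }
        offset += line.len();
    }
    (None, input)
}
```

Because the function only borrows from its input, the same helper could in principle be shared by the Djot and markdown extractors, which would also address the duplication noted above.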

Also, please update the root README and the documentation under docs as required.

Development

Successfully merging this pull request may close these issues.

feat: add Djot format support