feat: add Djot format support #263
base: main
Add a new DjotExtractor that parses Djot markup documents using the jotdown crate. Djot is a modern markup language with simpler parsing rules than CommonMark.

Features:
- YAML frontmatter metadata extraction
- Table extraction as structured data
- Heading structure preservation
- Code block and link extraction
- Smart punctuation handling

The implementation follows the same pattern as the Markdown extractor, making it consistent with the existing codebase.

MIME types: text/djot, text/x-djot

Closes kreuzberg-dev#262
//! Djot is a modern markup language with simpler parsing rules than CommonMark.
//! See https://djot.net for the specification.
//!
//! Requires the `office` feature.
Q: why is this needed?
This doc comment follows the pattern used in other extractors (e.g. markdown.rs line 13). It documents that the extractor is only compiled when the office feature is enabled, which is useful for users reading the docs.
That said, it's not strictly necessary - I can remove it if you prefer a cleaner module doc.
No, I meant - why do we need the office feature?
This is a separate format, and as far as I saw (cursory glance) you don't really need anything from that group.
Good point - refactored to a separate djot feature. It now only pulls in jotdown + tokio-runtime, independent of office.
Move Djot extractor to its own feature flag since it only needs jotdown and serde_yaml_ng (already a core dep), without requiring the full office feature dependencies.
- Add `djot` feature with just `dep:jotdown` + `tokio-runtime`
- Include `djot` in the `full` feature
- Update all cfg attributes from `office` to `djot`
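For readers unfamiliar with feature gating, the cfg change described in this commit can be pictured roughly as follows. This is a minimal sketch, not the PR's actual code: the module layout and the `djot_supported` helper are hypothetical, and only the feature name and MIME types come from this thread.

```rust
// Compiled only when the crate is built with `--features djot`
// (previously this gate would have read `feature = "office"`).
#[cfg(feature = "djot")]
pub mod djot {
    /// MIME types listed in the PR description.
    pub const MIME_TYPES: &[&str] = &["text/djot", "text/x-djot"];
}

/// Hypothetical helper (not part of the PR): reports at runtime whether
/// Djot support was compiled into this build.
pub fn djot_supported() -> bool {
    cfg!(feature = "djot")
}
```

Without the `djot` feature the module is simply absent from the build, so nothing in it is compiled and `djot_supported()` returns `false`.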
Thanks for this contribution! The implementation is solid overall, but there are several issues that need to be addressed before merge:
Critical Issues
1. Doc comment (line 13)
The doc comment still says Requires the 'office' feature but should say Requires the 'djot' feature after your refactor.
2. Code Duplication
extract_frontmatter() and extract_metadata_from_yaml() are identical to the markdown extractor (142 lines duplicated). This creates a maintenance burden.
3. Inefficient Double Parsing (lines 348-363)
The content is parsed twice - once for text extraction, once for tables. You're already collecting events into a Vec, so they could be processed in a single pass.
4. Test Plan Incomplete
The PR description shows unchecked test boxes, and one still references the wrong feature (office instead of djot).
5. Frontmatter Parser Edge Case
The parser breaks if YAML content contains --- on its own line or uses ... as a terminator (both valid YAML). This is also a bug in the markdown extractor, but worth fixing here.
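One way the frontmatter edge case in item 5 could be handled is sketched below. It is an illustration under assumptions, not the extractor's current code: `split_frontmatter` is a hypothetical name, and the rule it applies (a terminator is a line that is exactly `---` or `...` with no leading indentation) is one reasonable reading of the YAML document markers.

```rust
/// Hypothetical helper (not the PR's function): splits a document into
/// optional YAML frontmatter and the remaining body.
fn split_frontmatter(input: &str) -> (Option<&str>, &str) {
    // Frontmatter must open with `---` on the very first line.
    let Some(first_end) = input.find('\n') else {
        return (None, input);
    };
    if input[..first_end].trim_end() != "---" {
        return (None, input);
    }

    let yaml_start = first_end + 1;
    let mut offset = yaml_start;
    for line in input[yaml_start..].split_inclusive('\n') {
        // Only trailing whitespace is trimmed, so an indented `---`
        // inside a block scalar is not mistaken for the terminator.
        let marker = line.trim_end();
        if marker == "---" || marker == "..." {
            let yaml = &input[yaml_start..offset];
            let body = &input[offset + line.len()..];
            return (Some(yaml), body);
        }
        offset += line.len();
    }

    // No terminator found: treat the whole input as body.
    (None, input)
}
```

With this rule, `---\ntitle: A\n...\nBody\n` yields the frontmatter `title: A\n` and the body `Body\n`, and an unterminated block falls back to plain content instead of swallowing the document.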
Also, please update the root README and the documentation under docs as required.
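To illustrate the single-pass suggestion in item 3, here is a rough sketch of collecting text and tables in one walk over the events. The `Event` enum is a simplified stand-in invented for this example and is not jotdown's actual event type; `Extracted` and `extract_single_pass` are likewise hypothetical names.

```rust
/// Simplified stand-in for parser events; NOT jotdown's real event type.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Text(String),
    ParagraphEnd,
    TableStart,
    RowStart,
    CellStart,
    CellEnd,
    RowEnd,
    TableEnd,
}

/// Hypothetical result type: plain text plus tables as rows of cells.
#[derive(Debug, Default, PartialEq)]
struct Extracted {
    text: String,
    tables: Vec<Vec<Vec<String>>>,
}

/// Walks the event list once, filling both the text and the tables.
fn extract_single_pass(events: &[Event]) -> Extracted {
    let mut out = Extracted::default();
    let mut table: Option<Vec<Vec<String>>> = None;
    let mut row: Vec<String> = Vec::new();
    let mut cell: Option<String> = None;

    for event in events {
        match event {
            Event::Text(s) => {
                out.text.push_str(s);
                // Inside a cell, the same text also feeds the table.
                if let Some(c) = cell.as_mut() {
                    c.push_str(s);
                }
            }
            Event::ParagraphEnd => out.text.push('\n'),
            Event::TableStart => table = Some(Vec::new()),
            Event::RowStart => row.clear(),
            Event::CellStart => cell = Some(String::new()),
            Event::CellEnd => {
                if let Some(c) = cell.take() {
                    row.push(c);
                }
                out.text.push('\t');
            }
            Event::RowEnd => {
                if let Some(t) = table.as_mut() {
                    t.push(std::mem::take(&mut row));
                }
                out.text.push('\n');
            }
            Event::TableEnd => {
                if let Some(t) = table.take() {
                    out.tables.push(t);
                }
            }
        }
    }
    out
}
```

The point of the design is that table state lives alongside the text accumulator, so the already-collected Vec of events never needs a second traversal.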
Summary
- `DjotExtractor` for parsing Djot markup documents
- `jotdown` crate (pull-parser similar to `pulldown-cmark`)
- `office` feature flag
Implementation Details
Djot is a modern markup language designed by John MacFarlane (creator of Pandoc and CommonMark). It has simpler parsing rules than CommonMark while supporting similar features.
The implementation follows the same pattern as the existing `MarkdownExtractor`:
MIME types:
`text/djot`, `text/x-djot`
Test plan
- `cargo test -p kreuzberg --features office` passes
- `.djot` files
Closes #262