Skip to content

Conversation

@OGuggenbuehl
Copy link

@OGuggenbuehl OGuggenbuehl commented Jul 29, 2025

Proposed Changes:

Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers

How did you test it?

unit tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@CLAassistant
Copy link

CLAassistant commented Jul 29, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jul 29, 2025
@OGuggenbuehl OGuggenbuehl changed the title Feature/md header splitter feat:MarkdownHeaderSplitter Jul 29, 2025
@sjrl sjrl self-assigned this Aug 19, 2025
@sjrl
Copy link
Contributor

sjrl commented Aug 19, 2025

@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the DocumentSplitter https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples.

@sjrl sjrl changed the title feat:MarkdownHeaderSplitter feat: MarkdownHeaderSplitter Aug 27, 2025
@OGuggenbuehl OGuggenbuehl force-pushed the feature/md-header-splitter branch from 61a8396 to bcbbf9a Compare September 16, 2025 13:57
@sjrl
Copy link
Contributor

sjrl commented Sep 18, 2025

Thanks for your continued work on this @OGuggenbuehl!

Some general comments. Could you:

  • Add a release note for this PR following the instructions here
  • Could you make sure to include our license header to the beginning of each file you've added. You can find an example of the license header here
  • Please make sure to sign the CLA agreement (docs about it here) from this comment
  • If you haven't already please also set up pre-commit hooks using pre-commit install. You can find more info about that in this section of our contribution guidelines.
  • Also in the future feel free to open branches directly in Haystack instead of using a fork. This makes it slightly easier to pull down your code to review locally.

@coveralls
Copy link
Collaborator

coveralls commented Sep 19, 2025

Pull Request Test Coverage Report for Build 19575164151

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 91.467%

Totals Coverage Status
Change from base Build 19573565148: 0.03%
Covered Lines: 13860
Relevant Lines: 15153

💛 - Coveralls

@OGuggenbuehl OGuggenbuehl marked this pull request as ready for review September 19, 2025 16:05
@OGuggenbuehl OGuggenbuehl requested review from a team as code owners September 19, 2025 16:05
@OGuggenbuehl OGuggenbuehl removed the request for review from a team September 19, 2025 16:05
OGuggenbuehl and others added 28 commits November 21, 2025 16:24
minor commenting
@OGuggenbuehl OGuggenbuehl force-pushed the feature/md-header-splitter branch from 70e46d4 to c7264e6 Compare November 21, 2025 15:25
@OGuggenbuehl
Copy link
Author

@sjrl mypy has started complaining about a file that is unrelated to this pr:

haystack/tracing/datadog.py:34: error: Argument 2 to "set_tag" of "Span" has incompatible type "Union[str, int, float]"; expected "Optional[str]"  [arg-type]

I assume this is an upstream issue and I can simply rebase after it's been fixed upstream?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants