@dluc dluc commented Feb 6, 2025

The original chunkers, ported from SK, had bugs introduced during refactoring that led to incorrect splits. This is a full rewrite that follows the original logic, with some changes:

  • remove the MaxTokensPerLine setting
  • overlap no longer uses sentences; it copies raw tokens from the previous chunk instead
  • the markdown chunker uses better splitting logic, although it should be rewritten to use a markdown parser
  • chunkers now work with a Chunk class, which is also used by the file parsers. This will allow porting properties from files to chunks, such as page number and other metadata
  • chunkers now take a dependency on tokenizers directly, rather than just on TokenCount
  • chunkers have moved out of Core into a dedicated NuGet package, for future reuse outside KM
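
To illustrate the token-based overlap described above, here is a minimal sketch in Python. It is not the code from this PR: the function name `chunk_tokens` and its parameters are hypothetical, and the input is assumed to be a list of tokens already produced by some tokenizer. Each chunk after the first begins with the last `overlap` raw tokens of the previous chunk, without looking at sentence boundaries.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int, overlap: int) -> List[List[str]]:
    """Split `tokens` into chunks of at most `max_tokens` tokens.

    Every chunk after the first starts with the last `overlap` tokens of the
    previous chunk. Hypothetical helper for illustration only.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    chunks: List[List[str]] = []
    step = max_tokens - overlap  # number of new tokens each later chunk adds
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the chunk just added reached the end of the input
        start += step
    return chunks


# Whitespace splitting stands in for a real tokenizer here (an assumption).
sample = "a b c d e f g".split()
result = chunk_tokens(sample, max_tokens=4, overlap=1)
```

With the sample above, the second chunk starts with `d`, the last token of the first chunk. A real chunker would also need to map tokens back to text and carry metadata such as page numbers, which this sketch leaves out.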

@dluc dluc force-pushed the extendedchunks branch 4 times, most recently from 291fd6b to 87adf99 Compare February 6, 2025 22:36
@dluc dluc merged commit a490102 into microsoft:main Feb 6, 2025
6 checks passed
@dluc dluc deleted the extendedchunks branch February 6, 2025 22:53