New chunking classes #996

dluc · 2025-02-06T21:55:24Z

The original chunkers ported from SK (Semantic Kernel) had some bugs introduced during refactoring, which led to incorrect splits. This is a full rewrite that follows the original logic, with some changes:

remove the MaxTokensPerLine setting
overlap no longer uses sentences; it copies raw tokens from the previous chunk instead
the markdown chunker uses better splitting logic, although it should be rewritten to use a markdown parser
chunkers now work with a Chunk class, which is also used by the file parsers. This will allow porting properties from files to chunks, such as page number and other metadata
chunkers now take a dependency on tokenizers directly, rather than just on TokenCount
chunkers have moved out of Core into a dedicated NuGet package, for future reuse outside KM
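
To illustrate the token-based overlap described above, here is a minimal sketch in Python. It is not the code from this PR (which is C#): the `Chunk` fields, the `chunk_with_overlap` function, its parameters, and the whitespace tokenizer are all hypothetical stand-ins chosen for the example. It only shows the general idea of starting each chunk with the last raw tokens of the previous one, and of a chunk object that can carry file metadata such as a page number.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Chunk:
    # Hypothetical shape: the PR only states that chunks can carry
    # properties from files, such as page number and other metadata.
    number: int
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption). A real chunker would depend on
    # a model-specific tokenizer instead of splitting on whitespace.
    return text.split()


def chunk_with_overlap(
    text: str,
    max_tokens: int,
    overlap_tokens: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    metadata: Optional[Dict[str, str]] = None,
) -> List[Chunk]:
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")

    tokens = tokenize(text)
    step = max_tokens - overlap_tokens  # new tokens contributed by each chunk
    chunks: List[Chunk] = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(
            Chunk(
                number=len(chunks),
                text=" ".join(window),
                metadata=dict(metadata or {}),  # copy so chunks don't share state
            )
        )
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
        # The next window starts overlap_tokens before the end of this
        # one, so the overlap is copied as raw tokens, not as sentences.
        start += step
    return chunks


if __name__ == "__main__":
    for c in chunk_with_overlap("a b c d e f g h", 4, 1, metadata={"page": "1"}):
        print(c.number, repr(c.text), c.metadata)
```

Because the overlap is counted in tokens, its size is predictable regardless of sentence length, which is one plausible reason to prefer it over sentence-based overlap.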

Rewrite text and markdown chunkers

94b5d0d

dluc force-pushed the extendedchunks branch 4 times, most recently from 291fd6b to 87adf99 · February 6, 2025 22:36

Refactoring: merge Fragment into Chunk

e0d474b

dluc force-pushed the extendedchunks branch from 87adf99 to e0d474b · February 6, 2025 22:44

dluc merged commit a490102 into microsoft:main Feb 6, 2025
6 checks passed

dluc deleted the extendedchunks branch February 6, 2025 22:53
