Commit 3097645
feat: add code chunking functionality (#398)
* initial code chunking for docling-core
* DCO Remediation Commit for Bridget McGinn <[email protected]>
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 334811a
Signed-off-by: Bridget McGinn <[email protected]>
* include language detections, add code chunking into hierarchical chunker
* add serializer, internal marking of chunkers, typing
* Update pyproject.toml
Co-authored-by: Panos Vagenas <[email protected]>
Signed-off-by: Bridget <[email protected]>
* Update docling_core/transforms/chunker/hierarchical_chunker.py
Co-authored-by: Panos Vagenas <[email protected]>
Signed-off-by: Bridget <[email protected]>
* run all pre-commit less pytest
* update test files for code ID
* DCO Remediation Commit for Bridget McGinn <[email protected]>
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 46bb88a
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 10e9ed8
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: d9827c7
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 814dc61
Signed-off-by: Bridget McGinn <[email protected]>
* update uv.lock
Signed-off-by: Bridget McGinn <[email protected]>
* revert to stricter treesitter versioning due to compatibility
Signed-off-by: Bridget McGinn <[email protected]>
* DCO Remediation Commit for Bridget McGinn <[email protected]>
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: a4a21e9
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 0266c63
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 336dd6a
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 68890e9
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 3c65eef
Signed-off-by: Bridget McGinn <[email protected]>
* remove language detection (to be run by client, i.e. docling)
Signed-off-by: Panos Vagenas <[email protected]>
* align new dependency specs
Signed-off-by: Panos Vagenas <[email protected]>
* address backticks, ABC, and supported languages feedback
Signed-off-by: Bridget McGinn <[email protected]>
* remove Language class and reuse CodeLanguageLabel
Signed-off-by: Bridget McGinn <[email protected]>
* DCO Remediation Commit for Bridget McGinn <[email protected]>
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 63c7739
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 431d357
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: f3175c2
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 1a01de8
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 025aea3
Signed-off-by: Bridget McGinn <[email protected]>
* refactoring and improvements
- encapsulated code chunking specifics to separate package
- clearly separated public vs internal API via module and method naming conventions
- simplified or removed parts not strictly necessary for public API (e.g. lang support querying, noopstrategy)
- split chunk data model to separate modules to prevent circular dependencies
- renamed DefaultCodeChunkingStrategy to Standard... for clarity as it need not be the default strategy
- fixed some issues (e.g. gen flag in test)
Signed-off-by: Panos Vagenas <[email protected]>
---------
Signed-off-by: Bridget McGinn <[email protected]>
Signed-off-by: Bridget <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]>
1 parent a54f6f0 commit 3097645
File tree
74 files changed (+10987, -304 lines)
- docling_core
- transforms
- chunker
- code_chunking
- serializer
- types/doc
- examples
- test
- data
- chunker_repo
- C
- JavaScript
- Java
- Python
- TypeScript
- repos
- acmeair
- docling
- jquery
- json-c
- outline
- chunker
- doc
- repo_chunking