Commit 4faaf16

Author: speedyk-005
feat: Modularize CodeChunker, add BaseChunker inheritance, and fix decorator separation bug
- Split CodeChunker into CodeChunker and _CodeStructureExtractor for better modularity and reduced cognitive complexity.
- Introduced BaseChunker abstract class for consistent chunker interfaces across CodeChunker, PlainTextChunker, and DocumentChunker.
- Fixed bug in CodeChunker where decorators (e.g., @property) were separated from their associated functions into different chunks.
1 parent ec41faa commit 4faaf16

10 files changed

Lines changed: 738 additions & 534 deletions

CHANGELOG.md

Lines changed: 8 additions & 3 deletions
@@ -10,11 +10,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [2.1.0] - 2025-12-11
 
 ### Changed
-- Changed default `include_comments` to True
+- **Default `include_comments`:** Changed the default value of the `include_comments` parameter to `True` in the `CodeChunker.chunk()` method to align with most developer expectations for comprehensive code processing.
+- **Code Chunker Modularization:** Refactored the `CodeChunker` class for better maintainability.
+  - Split into two files: `code_chunker.py` (main chunker logic) and `_code_structure_extractor.py` (structure extraction).
+  - Modularized the complex `extract_code_structure` method by extracting helper functions to reduce cognitive load.
+- **Base Chunker Inheritance:** Introduced a new `BaseChunker` abstract base class in `base_chunker.py` to standardize the interface for all chunkers.
 
 ### Fixed
-- Fixed late-binding issue in code chunker by modifying lambda pattern substitution
-- Fixed duplicate line de-annotation logic in code chunker
+- **Late-Binding Closure Bug:** Fixed a classic Python closure bug in the code annotation loop of `CodeChunker`. The original `pattern.sub(lambda match: self._annotate_block(tag, match), code)` caused the lambda to reference the final value of `tag` after the loop completed. Resolved by changing to `pattern.sub(lambda match, tag=tag: self._annotate_block(tag, match), code)`, using the default argument trick to capture the current `tag` value at definition time.
+- **Duplicate Line De-annotation:** Removed redundant string slicing logic in `CodeChunker`'s internal processing. The line de-annotation was being called twice (once during regex substitution and again via manual slicing), creating ambiguity and potential "ghost slicing" where lines could be misinterpreted. Now relies solely on regex substitution for de-annotation, simplifying the control flow.
+- **Decorator Separation Bug:** Fixed an issue in `CodeChunker` where decorators (e.g., `@property`) were incorrectly separated from their associated functions into different chunks. Added a flush condition in `extract_code_structure` to handle the first decorator/attribute (`len(buffer["META"]) == 1`) and non-consecutive DOC lines, ensuring decorators group with their functions for better semantic chunking.
 
 ---
 
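
The late-binding entry above is easier to follow with a small standalone example. The sketch below is not taken from `CodeChunker`: `annotate` is a hypothetical stand-in for `_annotate_block`, and the tag names are made up. It stores the lambdas in lists so that they run after the loop has finished, which is when the difference between the two forms becomes visible.

```python
import re


def annotate(tag, match):
    # Hypothetical stand-in for CodeChunker._annotate_block
    return f"[{tag}:{match.group(0)}]"


tags = ["DOC", "META"]  # illustrative tag names
late, bound = [], []

for tag in tags:
    # Late binding: `tag` is looked up when the lambda is called.
    late.append(lambda match: annotate(tag, match))
    # Default-argument capture: `tag=tag` stores the current value now.
    bound.append(lambda match, tag=tag: annotate(tag, match))

pattern = re.compile(r"x")
late_result = [pattern.sub(fn, "x") for fn in late]
bound_result = [pattern.sub(fn, "x") for fn in bound]

print(late_result)   # ['[META:x]', '[META:x]']
print(bound_result)  # ['[DOC:x]', '[META:x]']
```

Every late-bound lambda sees the last value of `tag`, while the `tag=tag` versions keep the value from their own iteration.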

demo.py

Lines changed: 47 additions & 19 deletions
@@ -1,29 +1,57 @@
-from chunklet.plain_text_chunker import PlainTextChunker
-from chunklet.common.token_utils import count_tokens
+from chunklet.code_chunker import CodeChunker
 
 
-def simple_token_counter(text: str) -> int:
-    """A simple token counter that splits by spaces."""
-    return len(text.split())
+# Python code sample with decorators
+code_sample = '''
+"""Module docstring for demo."""
 
+import os
 
-# Text from the example
-haystack = "I am writing a letter ! Sometimes, I forget to put spaces and do weird stuff with punctuation ?"
+class Calculator:
+    """A simple calculator class."""
 
-# Instantiate the chunker with a simple token counter
-chunker = PlainTextChunker(token_counter=simple_token_counter)
+    def __init__(self):
+        self._value = 0
+        self._verbose = True
 
-# Chunk the text with a max_tokens limit that will likely split the text
-# The goal is to see if the span of the second chunk is correctly identified.
-chunk_boxes = chunker.chunk(text=haystack, max_tokens=12)
+    @property
+    def current_value(self):
+        """Get the current value."""
+        return self._value
+
+    @current_value.setter
+    def current_value(self, value):
+        """Set the current value."""
+        self._value = value
+
+    def add(self, x, y):
+        """Add two numbers."""
+        result = x + y
+        return result
+
+    def multiply(self, x, y):
+        """Multiply two numbers."""
+        return x * y
+
+def standalone_function():
+    """A standalone function."""
+    return True
+'''
+
+# Instantiate the chunker
+chunker = CodeChunker(verbose=True)
+
+# Chunk the code with max_functions=1 to see splitting
+chunk_boxes = chunker.chunk(source=code_sample, max_functions=1)
 
 # Print the results
-print(f"Original Text: '{haystack}'")
-print("-" * 20)
+print("=" * 50)
 for i, chunk_box in enumerate(chunk_boxes):
     print(f"Chunk #{i+1}:")
-    print(f" Content: '{chunk_box.content}'")
-    print(f" Metadata Span: {chunk_box.metadata.span}")
-    start, end = chunk_box.metadata.span
-    print(f" Span in Original: '{haystack[start:end]}'")
-    print("-" * 20)
+    print(f" Content:\n{chunk_box.content}")
+    print(f" Tree: {chunk_box.metadata.tree}")
+    print(f" Start Line: {chunk_box.metadata.start_line}")
+    print(f" End Line: {chunk_box.metadata.end_line}")
+    print(f" Span: {chunk_box.metadata.span}")
+    print(f" Source: {chunk_box.metadata.source}")
+    print("=" * 50)
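
The demo exercises the decorator fix, so a rough picture of what "keeping decorators with their functions" means may help. The function below is a simplified illustration only, not the logic in `extract_code_structure`: `group_blocks` is a hypothetical helper that buffers decorator lines and attaches them to the next `def` or `class` line instead of flushing them as a separate block.

```python
def group_blocks(lines):
    """Group source lines so decorators stay attached to the next def/class."""
    blocks, pending = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines in this sketch
        if stripped.startswith("@"):
            pending.append(line)  # buffer decorators instead of flushing them
        elif stripped.startswith(("def ", "class ")):
            blocks.append(pending + [line])  # decorators open the new block
            pending = []
        elif blocks:
            blocks[-1].append(line)  # body line belongs to the current block
        else:
            blocks.append([line])  # code that appears before any definition
    if pending:
        blocks.append(pending)  # trailing decorators with nothing to attach to
    return blocks


source = """\
@property
def current_value(self):
    return self._value

@current_value.setter
def current_value(self, value):
    self._value = value
"""

blocks = group_blocks(source.splitlines())
for block in blocks:
    print(block[0].strip(), "->", len(block), "lines")
```

Each resulting block starts with its decorator, so a limit such as `max_functions=1` would split between the two blocks rather than between a decorator and its function.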

src/chunklet/base_chunker.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+"""
+Base Chunker Abstract Class
+
+Defines the interface for chunkers.
+"""
+
+from abc import ABC, abstractmethod
+from typing import Generator
+from box import Box
+from loguru import logger
+
+
+class BaseChunker(ABC):
+    """
+    Abstract base class for chunkers.
+
+    Defines the standard interface for chunking content into units.
+    """
+
+    def __init__(self, verbose: bool = False):
+        self.verbose = verbose
+
+    @abstractmethod
+    def chunk(self, *args, **kwargs) -> list[Box]:
+        """
+        Extract chunks.
+
+        Returns:
+            list[Box]: List of chunks with content and metadata.
+        """
+        pass
+
+    @abstractmethod
+    def batch_chunk(self, *args, **kwargs) -> Generator[Box, None, None]:
+        """
+        Process multiple items in parallel.
+
+        Yields:
+            Box: `Box` object, representing a chunk with its content and metadata.
+        """
+        pass
+
+    def log_info(self, *args, **kwargs) -> None:
+        """Log an info message if verbose is enabled."""
+        if self.verbose:
+            logger.info(*args, **kwargs)
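
As a rough sketch of how a concrete chunker might plug into this interface, the example below re-declares a stand-in `BaseChunker` so that it runs without third-party packages. It assumes plain `dict` objects in place of `Box` and the standard `logging` module in place of `loguru`. `LineChunker` is hypothetical and not part of the commit.

```python
import logging
from abc import ABC, abstractmethod
from typing import Generator, Iterable

logger = logging.getLogger(__name__)


class BaseChunker(ABC):
    """Stand-in for the class above (dict replaces Box, logging replaces loguru)."""

    def __init__(self, verbose: bool = False):
        self.verbose = verbose

    @abstractmethod
    def chunk(self, *args, **kwargs) -> list[dict]: ...

    @abstractmethod
    def batch_chunk(self, *args, **kwargs) -> Generator[dict, None, None]: ...

    def log_info(self, *args, **kwargs) -> None:
        if self.verbose:
            logger.info(*args, **kwargs)


class LineChunker(BaseChunker):
    """Hypothetical chunker: one chunk per non-empty line."""

    def chunk(self, text: str) -> list[dict]:
        chunks = [
            {"content": line, "metadata": {"line": number}}
            for number, line in enumerate(text.splitlines(), start=1)
            if line.strip()
        ]
        self.log_info("Produced %d chunks", len(chunks))
        return chunks

    def batch_chunk(self, texts: Iterable[str]) -> Generator[dict, None, None]:
        # Sequential for simplicity; a real implementation may run in parallel.
        for text in texts:
            yield from self.chunk(text)


chunks = LineChunker().chunk("alpha\n\nbeta")
batched = list(LineChunker().batch_chunk(["a\nb", "c"]))
print([c["content"] for c in chunks])  # ['alpha', 'beta']
print(len(batched))                    # 3
```

Because both `chunk` and `batch_chunk` are abstract, instantiating `BaseChunker` directly, or a subclass that omits either method, raises `TypeError`.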
