Commit 0afdcd9

Docs/sync (#32)
* docs: correct html_to_md docstring to match actual behavior
* docs: add v2.3.0 section to whats-new.md with witty tone
  - Add Termux tip to installation.md
  - Reorder migration checker to top, move closing before CLI ref
  - Add v2.3.0 section to whats-new.md
  - Sync extras descriptions between installation.md and README
* refactor: improve audit_migration.py naming and performance
  - Rename CHUNKER_CLASSES -> CHUNKER_CLASS_NAMES (clarity)
  - Rename V1_ARGS -> DEPRECATED_ARGUMENTS
  - Rename _found_any -> _has_legacy_issues
  - Rename _get_chunklet_instances -> _get_chunker_instances
  - Reduce deep nesting with helper method _is_chunker_call
  - Cache file lines once instead of re-reading per line (performance)
  - Add support for single file input
  - Filter out .venv and site-packages directories
* docs: add contextlib.closing option for batch hang fix
* docs: fix overlap-percent range and exception reference
  - Change overlap-percent range from 0-85 to 0-75 (matches code)
  - Update FileProcessingError to UnsupportedFileTypeError
* docs: add content negotiation feature to whats-new v2.3.0
  - Document JSON default and MessagePack opt-in via Accept header
  - Note browser visualizer requests MessagePack automatically
* fix: remove duplicate and broken methods in audit_migration.py
  - Remove broken _audit_imports duplicate (lines 129-135)
  - Remove duplicate _get_chunker_instances (lines 137-147)
  - Remove unused _get_chunker_instance_target (lines 149-159)
  - Keep correct implementations at lines 109-127 and 161+
  - Reduces file by 31 lines of dead code
* docs: update migration.md and installation.md to reflect Python 3.11 minimum requirement
1 parent 4b962ba commit 0afdcd9

9 files changed

Lines changed: 206 additions & 124 deletions

README.md

Lines changed: 3 additions & 3 deletions
@@ -107,7 +107,7 @@ chunklet --version
 > [!TIP]
 > <b>Termux (Android)</b>
 >
-> No rust toolchain on Termux (especially python 3.13) ? Install pydantic-core pre-built wheels first then retry installing chunklet-py:
+> No rust toolchain on Termux (especially python 3.13)? Install pydantic-core pre-built wheels first then retry installing chunklet-py:
 >
 > ```bash
 > pip install typing-extensions
@@ -121,11 +121,11 @@ That's it! You're all set to start chunking.
 
 Want to unlock more Chunklet-py superpowers? Add these optional dependencies based on what you need:
 
-* **Document Processing:** For handling `.pdf`, `.docx`, `.epub`, and other document formats:
+* **Structured Documents:** For handling `.pdf`, `.docx`, `.epub`, and other document formats:
 ```bash
 pip install "chunklet-py[structured-document]"
 ```
-* **Code Chunking:** For advanced code analysis and chunking features:
+* **Code Chunking:** For Language-agnostic code chunking features:
 ```bash
 pip install "chunklet-py[code]"
 ```

audit_migration.py

Lines changed: 125 additions & 103 deletions
@@ -10,9 +10,9 @@
 
 console = Console()
 
-CHUNKER_CLASSES = {"Chunklet", "PlainTextChunker", "DocumentChunker", "CodeChunker"}
+DEPRECATED_ARGUMENTS = {"use_cache": "Remove it.", "custom_splitters": "Remove it."}
 
-V1_ARGS = {"use_cache": "Remove it.", "custom_splitters": "Remove it."}
+CHUNKER_CLASS_NAMES = {"Chunklet", "PlainTextChunker", "DocumentChunker", "CodeChunker"}
 
 LEGACY_IMPORTS = {
     "chunklet.utils.detect_text_language": "Use 'SentenceSplitter.detected_top_language()' instead.",
@@ -36,21 +36,27 @@ class MigrationAuditor:
     """
 
     def __init__(self):
-        self._found_any = False
+        self._has_legacy_issues = False
         self._console = console
 
-    def audit(self, directory="."):
+    def audit(self, path="."):
         """
-        Public method to audit a directory for legacy v1 patterns.
+        Public method to audit a file or directory for legacy v1 patterns.
         """
-        self._found_any = False
-        directory = Path(directory)
+        self._has_legacy_issues = False
+        path = Path(path)
 
         self._print_header()
 
-        for file_path in directory.rglob("*.py"):
+        # Handle single file or directory
+        targets = [path] if path.is_file() else path.rglob("*.py")
+
+        for file_path in targets:
+            # Skip the audit script itself and virtual environments
             if file_path.name == SCRIPT_NAME:
                 continue
+            if ".venv" in file_path.parts or "site-packages" in file_path.parts:
+                continue
 
             try:
                 content = file_path.read_text(encoding="utf-8")
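The target selection introduced in this hunk can be tried in isolation. The following is a minimal sketch, not the script's actual code: `collect_targets`, `SKIPPED_DIR_NAMES`, and the default `script_name` value are hypothetical names chosen for illustration, since the diff does not show the value of `SCRIPT_NAME`.

```python
from pathlib import Path

# Hypothetical constant: directory names whose contents should not be audited.
SKIPPED_DIR_NAMES = {".venv", "site-packages"}


def collect_targets(path, script_name="audit_migration.py"):
    """Return the Python files to audit under `path` (a file or a directory).

    `script_name` stands in for the script's SCRIPT_NAME constant; its
    default here is an assumption.
    """
    path = Path(path)
    # A single file is audited directly; a directory is searched recursively.
    candidates = [path] if path.is_file() else path.rglob("*.py")
    return sorted(
        candidate
        for candidate in candidates
        if candidate.name != script_name
        and SKIPPED_DIR_NAMES.isdisjoint(candidate.parts)
    )
```

Checking `candidate.parts` instead of the path string means a file such as `my.venv_notes.py` is not skipped by accident; only a path component named exactly `.venv` or `site-packages` excludes the file.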
@@ -71,7 +77,7 @@ def _print_header(self):
         )
 
     def _print_summary(self):
-        if not self._found_any:
+        if not self._has_legacy_issues:
             self._console.print(
                 "[bold green]✓ No legacy v1 patterns found. Code is up to date![/bold green]"
             )
@@ -86,37 +92,51 @@ def _audit_file(self, file_path: Path, content: str):
         except SyntaxError:
             return
 
-        tracked_instances = self._get_chunklet_instances(tree)
+        lines = content.splitlines()
+        tracked_instances = self._get_chunker_instances(tree)
 
-        self._audit_imports(file_path, tree)
-        self._audit_class_instantiation(file_path, tree, tracked_instances)
-        self._audit_calls(file_path, tree, tracked_instances)
-        self._audit_exceptions(file_path, tree)
+        self._audit_imports(file_path, tree, lines)
+        self._audit_class_instantiation(file_path, tree, tracked_instances, lines)
+        self._audit_calls(file_path, tree, tracked_instances, lines)
+        self._audit_exceptions(file_path, tree, lines)
 
-    def _get_chunklet_instances(self, tree: ast.AST) -> set:
-        instances = set()
-        for node in ast.walk(tree):
-            if isinstance(node, ast.Assign):
-                for target in node.targets:
-                    if isinstance(target, ast.Name):
-                        if isinstance(node.value, ast.Call):
-                            if isinstance(node.value.func, ast.Name):
-                                if node.value.func.id in CHUNKER_CLASSES:
-                                    instances.add(target.id)
-        return instances
-
-    def _audit_imports(self, file_path: Path, tree: ast.AST):
+    def _get_line(self, lines: list[str], line_no: int) -> str:
+        try:
+            return lines[line_no - 1]
+        except IndexError:
+            return ""
+
+    def _is_chunker_call(self, node: ast.AST) -> bool:
+        """Check if a node is a call to a known chunker class."""
+        if not isinstance(node, ast.Call):
+            return False
+        if not isinstance(node.func, ast.Name):
+            return False
+        return node.func.id in CHUNKER_CLASS_NAMES
+
+    def _get_chunker_instances(self, tree: ast.AST) -> set:
+        """Find all chunker class instantiations in the AST."""
+        return {
+            target.id
+            for node in ast.walk(tree)
+            if isinstance(node, ast.Assign)
+            and isinstance(node.value, ast.Call)
+            and self._is_chunker_call(node.value)
+            for target in node.targets
+            if isinstance(target, ast.Name)
+        }
+
+    def _audit_imports(self, file_path: Path, tree: ast.AST, lines: list[str]):
         for node in ast.walk(tree):
             if isinstance(node, ast.ImportFrom):
                 if node.module:
                     for legacy_module, fix_msg in LEGACY_IMPORTS.items():
                         if legacy_module in node.module:
-                            self._found_any = True
-                            line_no = node.lineno
+                            self._has_legacy_issues = True
                             self._print_issue(
                                 file_path,
-                                line_no,
-                                self._get_line(file_path, line_no),
+                                node.lineno,
+                                self._get_line(lines, node.lineno),
                                 f"Import '{node.module}'. {fix_msg}",
                                 "bold red",
                             )
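The two ideas in this hunk, detecting chunker instantiations through the AST and splitting the source into lines once for later lookups, can be illustrated outside the class. This is a rough sketch under stated assumptions: `is_chunker_call`, `find_chunker_instances`, and `SAMPLE` are hypothetical standalone names, and returning a dict of variable name to source line is a simplification of what the script prints. Only `CHUNKER_CLASS_NAMES` is taken from the diff.

```python
import ast

# Class names taken from the diff above.
CHUNKER_CLASS_NAMES = {"Chunklet", "PlainTextChunker", "DocumentChunker", "CodeChunker"}


def is_chunker_call(node):
    """Return True when `node` is a direct call such as `DocumentChunker(...)`."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in CHUNKER_CLASS_NAMES
    )


def find_chunker_instances(source):
    """Map each variable assigned from a chunker call to its source line."""
    lines = source.splitlines()  # split once, reuse for every lookup
    found = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and is_chunker_call(node.value):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    # ast line numbers are 1-based, list indices are 0-based.
                    found[target.id] = lines[node.lineno - 1]
    return found


# Hypothetical input; ast.parse only parses it, so nothing is imported.
SAMPLE = '''\
from chunklet import Chunklet
chunker = Chunklet()
other = dict()
a = b = CodeChunker()
'''

if __name__ == "__main__":
    print(find_chunker_instances(SAMPLE))
```

As in the diff, only bare-name calls are matched because the check requires `node.func` to be an `ast.Name`; an attribute-style call such as `chunklet.Chunklet()` would not be tracked by this sketch.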
@@ -125,96 +145,98 @@ def _audit_imports(self, file_path: Path, tree: ast.AST):
                 for alias in node.names:
                     for legacy_module, fix_msg in LEGACY_IMPORTS.items():
                         if legacy_module.split(".")[-1] == alias.name:
-                            self._found_any = True
-                            line_no = node.lineno
+                            self._has_legacy_issues = True
                             self._print_issue(
                                 file_path,
-                                line_no,
-                                self._get_line(file_path, line_no),
+                                node.lineno,
+                                self._get_line(lines, node.lineno),
                                 f"Import '{alias.name}'. {fix_msg}",
                                 "bold red",
                             )
 
     def _audit_class_instantiation(
-        self, file_path: Path, tree: ast.AST, tracked_instances: set
+        self, file_path: Path, tree: ast.AST, tracked_instances: set, lines: list[str]
     ):
+        # Find "Chunklet" instantiations specifically (not all chunker classes)
         for node in ast.walk(tree):
-            if isinstance(node, ast.Assign):
-                for target in node.targets:
-                    if isinstance(target, ast.Name):
-                        if isinstance(node.value, ast.Call):
-                            if isinstance(node.value.func, ast.Name):
-                                if node.value.func.id == "Chunklet":
-                                    tracked_instances.add(target.id)
-                                    self._found_any = True
-                                    line_no = node.lineno
-                                    self._print_issue(
-                                        file_path,
-                                        line_no,
-                                        self._get_line(file_path, line_no),
-                                        "Rename 'Chunklet' to 'DocumentChunker'.",
-                                        "bold red",
-                                    )
-
-    def _audit_calls(self, file_path: Path, tree: ast.AST, tracked_instances: set):
-        for node in ast.walk(tree):
-            if isinstance(node, ast.Call):
-                if isinstance(node.func, ast.Attribute):
-                    if isinstance(node.func.value, ast.Name):
-                        inst_name = node.func.value.id
+            if not isinstance(node, ast.Assign):
+                continue
+            if not isinstance(node.value, ast.Call):
+                continue
+            if not isinstance(node.value.func, ast.Name):
+                continue
+            if node.value.func.id != "Chunklet":
+                continue
+            if not node.targets or not isinstance(node.targets[0], ast.Name):
+                continue
 
-                        if inst_name not in tracked_instances:
-                            continue
+            target = node.targets[0].id
+            tracked_instances.add(target)
+            self._has_legacy_issues = True
+            self._print_issue(
+                file_path,
+                node.lineno,
+                self._get_line(lines, node.lineno),
+                "Rename 'Chunklet' to 'DocumentChunker'.",
+                "bold red",
+            )
 
-                        method_name = node.func.attr
+    def _audit_calls(self, file_path: Path, tree: ast.AST, tracked_instances: set, lines: list[str]):
+        for node in ast.walk(tree):
+            if not isinstance(node, ast.Call):
+                continue
+            if not isinstance(node.func, ast.Attribute):
+                continue
+            if not isinstance(node.func.value, ast.Name):
+                continue
 
-                        if method_name in LEGACY_METHODS:
-                            style = (
-                                "bold red"
-                                if method_name in ("preview_sentences",)
-                                else "yellow"
-                            )
-                            self._found_any = True
-                            self._print_issue(
-                                file_path,
-                                node.lineno,
-                                self._get_line(file_path, node.lineno),
-                                f"'{inst_name}.{method_name}()' - {LEGACY_METHODS[method_name]}",
-                                style,
-                            )
+            inst_name = node.func.value.id
+            if inst_name not in tracked_instances:
+                continue
 
-                        for arg, fix_msg in V1_ARGS.items():
-                            for kw in node.keywords:
-                                if kw.arg == arg:
-                                    self._found_any = True
-                                    self._print_issue(
-                                        file_path,
-                                        node.lineno,
-                                        self._get_line(file_path, node.lineno),
-                                        f"Argument '{arg}' is no longer supported. {fix_msg}",
-                                        "bold red",
-                                    )
-
-    def _audit_exceptions(self, file_path: Path, tree: ast.AST):
-        for node in ast.walk(tree):
-            if isinstance(node, ast.Name):
-                for old_name, fix_msg in LEGACY_EXCEPTIONS.items():
-                    if node.id == old_name:
-                        self._found_any = True
-                        line_no = node.lineno
+            method_name = node.func.attr
+
+            if method_name in LEGACY_METHODS:
+                style = (
+                    "bold red"
+                    if method_name in ("preview_sentences",)
+                    else "yellow"
+                )
+                self._has_legacy_issues = True
+                self._print_issue(
+                    file_path,
+                    node.lineno,
+                    self._get_line(lines, node.lineno),
+                    f"'{inst_name}.{method_name}()' - {LEGACY_METHODS[method_name]}",
+                    style,
+                )
+
+            for arg, fix_msg in DEPRECATED_ARGUMENTS.items():
+                for kw in node.keywords:
+                    if kw.arg == arg:
+                        self._has_legacy_issues = True
                         self._print_issue(
                             file_path,
-                            line_no,
-                            self._get_line(file_path, line_no),
-                            f"'{old_name}' - {fix_msg}",
+                            node.lineno,
+                            self._get_line(lines, node.lineno),
+                            f"Argument '{arg}' is no longer supported. {fix_msg}",
                             "bold red",
                         )
 
-    def _get_line(self, file_path: Path, line_no: int) -> str:
-        try:
-            return file_path.read_text(encoding="utf-8").splitlines()[line_no - 1]
-        except (IndexError, IOError):
-            return ""
+    def _audit_exceptions(self, file_path: Path, tree: ast.AST, lines: list[str]):
+        for node in ast.walk(tree):
+            if not isinstance(node, ast.Name):
+                continue
+            if node.id not in LEGACY_EXCEPTIONS:
+                continue
+            self._has_legacy_issues = True
+            self._print_issue(
+                file_path,
+                node.lineno,
+                self._get_line(lines, node.lineno),
+                f"'{node.id}' - {LEGACY_EXCEPTIONS[node.id]}",
+                "bold red",
+            )
 
     def _print_issue(self, file_path, line_no, line_content, message, style):
         msg = Text()

docs/getting-started/cli.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -69,7 +69,7 @@ The `chunk` command is where the real magic happens! It's your versatile tool fo
6969
| `--max-tokens` | Maximum number of tokens per chunk. Applies to all chunking strategies. (Must be >= 12) | None |
7070
| `--max-sentences` | Maximum number of sentences per chunk. Applies to DocumentChunker. (Must be >= 1) | None |
7171
| `--max-section-breaks` | Maximum number of section breaks per chunk. Section breaks include Markdown headings (# to ######), horizontal rules (---, ***, ___), and <details> tags. Applies to DocumentChunker. (Must be >= 1) | None |
72-
| `--overlap-percent` | Percentage of overlap between chunks (0-85). Applies to DocumentChunker. | 20.0 |
72+
| `--overlap-percent` | Percentage of overlap between chunks (0-75). Applies to DocumentChunker. | 20.0 |
7373
| `--offset` | Starting sentence offset for chunking. Applies to DocumentChunker. | 0 |
7474
| `--lang` | Language of the text (e.g., 'en', 'fr', 'auto'). | auto |
7575
| `--metadata` | Include rich metadata (source, span, chunk num, etc.) in the output. If `--destination` is a directory, metadata is saved as separate `.json` files; otherwise, it's included inline in the output. | False |

docs/getting-started/installation.md

Lines changed: 11 additions & 2 deletions
@@ -3,7 +3,7 @@
 Ready to get Chunklet-py up and running? Fantastic! This guide will walk you through the installation process, making it as smooth as possible.
 
 !!! info "Requirements"
-    Chunklet-py requires **Python 3.10 or newer**. We recommend using Python 3.11+ for the best experience.
+    Chunklet-py requires **Python 3.11 or newer**. We recommend using Python 3.12+ for the best experience.
 
 !!! note "chunklet-py (aka chunklet)"
     The old `chunklet` package is no longer maintained. Use `chunklet-py` to get the latest version.
@@ -18,6 +18,15 @@ pip install chunklet-py
 chunklet --version
 ```
 
+!!! tip "Termux (Android)"
+    No rust toolchain on Termux (especially python 3.13)? Install pydantic-core pre-built wheels first then retry installing chunklet-py:
+
+    ```bash
+    pip install typing-extensions
+    pip install pydantic-core --index-url https://termux-user-repository.github.io/pypi/
+    pip install "pydantic>=2.12.4,<2.13"
+    ```
+
 And that's all there is to it! You're now ready to start using Chunklet-py.
 
 ## Optional Dependencies
## Optional Dependencies
@@ -51,7 +60,7 @@ cd chunklet-py
 pip install .[all]
 ```
 
-But why would you want to do that? The easy way is so much easier.
+But why would you want to do that? The pip way is so much easier.
 
 ## Contributing to Chunklet-py
 
docs/getting-started/programmatic/document_chunker.md

Lines changed: 1 addition & 1 deletion
@@ -411,7 +411,7 @@ for i, chunk in enumerate(chunks):
 !!! note "Special Handling for Streaming Processors"
     Some processors work differently due to their streaming nature - they yield content page by page or in blocks rather than all at once. This means they require special care:
 
-    **Streaming processors** (PDF, EPUB, DOCX, ODT): These beauties process content as they go, so they're designed for `chunk_files` method. Using them with `chunk_file` will throw a [`FileProcessingError`](../../exceptions-and-warnings.md#fileprocessingerror) since `chunk_file` expects all content upfront.
+    **Streaming processors** (PDF, EPUB, DOCX, ODT): These beauties process content as they go, so they're designed for `chunk_files` method. Using them with `chunk_file` will throw an [`UnsupportedFileTypeError`](../../exceptions-and-warnings.md#unsupportedfiletypeerror) since `chunk_file` expects all content upfront.
 
     **Regular processors** work fine with both `chunk_file` and `chunk_files` methods.
 