Releases: speedyk-005/chunklet-py
v2.3.2 — DotDict Gets Its Groove Back
What happened
v2.3.0 replaced python-box with dotdict3 for a 12x speedup, but dropped all serialization methods. .to_dict() calls in the CLI and user code would crash with AttributeError. This release fixes that — we vendored dotdict3 in-tree and added back all the Box-compatible serialization methods.
What's new
- DotDict.to_dict(): recursive conversion to plain dicts/lists
- DotDict.to_json(): JSON string or file (stdlib, zero deps)
- DotDict.to_yaml(): YAML string or file (needs pyyaml)
- DotDict.to_toml(): TOML string or file (needs toml)
- DotDict.to_msgpack(): msgpack bytes or file (needs msgpack)
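To make the recursive conversion concrete, here is a minimal sketch of what a to_dict()-style method does. MiniDotDict is a hypothetical stand-in written for this note, not the class vendored in chunklet.common.dotdict; it only assumes that nested dicts and lists should come back as plain Python containers.

```python
import json


class MiniDotDict(dict):
    """Hypothetical stand-in: a dict with attribute access (not chunklet's DotDict)."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dicts on access so chained attribute lookups work.
        return MiniDotDict(value) if isinstance(value, dict) else value

    def to_dict(self):
        """Recursively convert to plain dicts and lists."""
        def unwrap(obj):
            if isinstance(obj, dict):
                return {key: unwrap(val) for key, val in obj.items()}
            if isinstance(obj, (list, tuple)):
                return [unwrap(item) for item in obj]
            return obj
        return unwrap(self)

    def to_json(self, **kwargs):
        """Serialize through the plain-dict form using only the stdlib."""
        return json.dumps(self.to_dict(), **kwargs)


chunk = MiniDotDict({"content": "Hello.", "metadata": {"span": [0, 6]}})
plain = chunk.to_dict()
nested = MiniDotDict({"a": MiniDotDict({"b": 1})})
```

Going through the plain-dict form first is what lets the JSON path stay dependency-free: json.dumps only ever sees built-in containers.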
Performance
to_dict() is 2.3x faster than python-box's equivalent (8.44µs vs 19.23µs per op), since we're not dragging in SphinxBox, configBox, and the rest of python-box's feature creep.
Breaking changes
dotdict3 is no longer a dependency. If your code imports it directly (from dotdict3 import DotDict), change the import to: from chunklet.common.dotdict import DotDict
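For code that has to run on both sides of this change, one option is an import fallback. The sketch below is an assumption about how you might structure that, not something chunklet-py provides; import_first and load_dotdict are hypothetical helper names.

```python
import importlib


def import_first(candidates):
    """Return the first attribute that imports cleanly from (module, name) pairs."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"none of {candidates!r} could be imported")


def load_dotdict():
    # New vendored location first, then the pre-2.3.2 external package.
    return import_first([
        ("chunklet.common.dotdict", "DotDict"),
        ("dotdict3", "DotDict"),
    ])
```

Call load_dotdict() once at import time in your own module and reuse the returned class from there.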
What's gone
- External dotdict3 dependency (vendored in chunklet.common.dotdict)
- Stale "Box" references in docs (now pointing to auto-generated API docs)
Full changelog
https://github.com/speedyk-005/chunklet-py/blob/main/CHANGELOG.md#232
🚀 Chunklet-py v2.3.1 - Patch Release
Quick patch release to fix a couple of Things That Should Have Worked.
🐛 What's Fixed
Android Detection (Actually Fixed This Time)
v2.3.0 tried to detect Android with platform_system markers. Problem: Android reports as 'Linux', not 'Android' — so nobody was getting the right sentencex version. Fixed now with sys_platform + platform_machine markers.
Side effect: ARM Linux devices (Raspberry Pi, etc.) also get the legacy sentencex<=0.6.1 without Rust bindings. Temporary workaround until we figure out a better detection method.
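Environment markers are evaluated against a handful of interpreter values, which is why a platform_system check cannot single out Android. The sketch below collects the three values involved using only the standard library. The needs_legacy_build helper and its ARM machine list are illustrative assumptions, not the markers chunklet-py actually ships.

```python
import platform
import sys


def marker_values():
    """Values that the PEP 508 markers of the same name are evaluated against."""
    return {
        "platform_system": platform.system(),
        "sys_platform": sys.platform,
        "platform_machine": platform.machine(),
    }


# Hypothetical ARM machine names, chosen for illustration only.
ARM_MACHINES = {"aarch64", "arm64", "armv7l", "armv8l"}


def needs_legacy_build(values):
    """Assumed heuristic: Linux-like platform on an ARM machine."""
    on_linux_like = values["sys_platform"] in {"linux", "android"}
    return on_linux_like and values["platform_machine"] in ARM_MACHINES


print(marker_values())
```

A rule shaped like this also matches ordinary ARM Linux boards, which is the side effect described above.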
DotDict TypeError
Using DotDict() without arguments threw TypeError on dotdict3 < 1.4.2. Now using DotDict({}) for backward compatibility.
🚀 Chunklet-py v2.3.0 — smarter sentence splitting, faster visualizer
✨ What's New
- Non-Latin scripts in fallback splitter: Arabic, Chinese, Japanese, etc. now handled correctly via Unicode property escapes (\p{Lo}, \p{Lt})
- Fallback splitter preserves quotes, parens, and numbered lists: quoted text, parenthesized content, and 1. 2. 3. lists stay as single sentences instead of getting split apart (uses hash-based masking)
- Visualizer API now supports MessagePack: browser requests it automatically for ~30-50% smaller payloads; programmatic clients can opt in via Accept: application/msgpack header (JSON still default)
- ~2x faster span detection: replaced regex-based _find_span with a deterministic finder, no more backtracking on large texts
- Visualizer extra has a new shortcut "chunklet-py[viz]"
- Lazy imports for splitter libraries — faster startup
- Better markdown heading detection in DocumentChunker
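The masking idea behind the quote and parenthesis handling can be sketched as follows: protected spans are swapped for opaque placeholders, the text is split, and the placeholders are swapped back. This is an illustration under assumptions, not chunklet-py's fallback splitter; the placeholder format, the PROTECTED pattern, and the naive boundary rule are all made up for the example.

```python
import hashlib
import re

# Spans to protect: double-quoted text or parenthesized content (no nesting).
PROTECTED = re.compile(r'"[^"]*"|\([^()]*\)')
# Naive sentence boundary: whitespace that follows ., ! or ?
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_with_masking(text):
    masks = {}

    def mask(match):
        # Key each placeholder by a short hash of the span it replaces.
        digest = hashlib.sha256(match.group().encode("utf-8")).hexdigest()[:8]
        key = "\x00" + digest + "\x00"
        masks[key] = match.group()
        return key

    masked = PROTECTED.sub(mask, text)
    sentences = BOUNDARY.split(masked)

    def unmask(sentence):
        for key, original in masks.items():
            sentence = sentence.replace(key, original)
        return sentence

    return [unmask(s) for s in sentences if s]


print(split_with_masking('She said "Stop. Wait here." and left. He stayed (for now. maybe).'))
```

Because the placeholders contain no sentence punctuation, the periods inside the quoted and parenthesized spans never reach the boundary pattern.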
🔧 The Fixes
- pkg_resources crash on install: finally sorted out the setuptools dependency mess
- Custom splitter registration: no more TypeError when registering functools.partial or other callables without a __name__
- Log spam with lang='auto': stopped warning you every single time you auto-detect a language
- CodeChunker tree hierarchy: methods now appear under their class instead of "global"
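The registration fix comes down to not assuming every callable has a __name__. A minimal sketch of a defensive lookup, assuming a hypothetical register function and registry dict (neither is chunklet-py's actual API):

```python
import functools


def callable_name(func):
    """Best-effort display name for any callable, including functools.partial."""
    if isinstance(func, functools.partial):
        # partial objects carry no __name__; describe the wrapped function instead.
        return f"partial({callable_name(func.func)})"
    return getattr(func, "__name__", type(func).__name__)


def split_on(text, sep):
    return text.split(sep)


# Hypothetical registry, for illustration only.
registry = {}


def register(lang, func):
    registry[lang] = (callable_name(func), func)


pipe_splitter = functools.partial(split_on, sep="|")
register("xx", pipe_splitter)
```

Callable class instances take the same getattr path and fall back to their class name.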
🧹 Removed
- Python 3.10 support: dropped because of recurring CI multiprocessing hangs and its approaching EOL.
📦 Quick Install
pip install chunklet-py -U
🔗 Additional Information
Feedback and bug reports welcome. Thanks!
Chunklet-py v2.2.0 "The Unification Edition"
What's New?
Check out What's New for the full scoop.
✨ Quick Summary
- Unified API: Consistent method names across all chunkers (chunk_text, chunk_file, chunk_texts, chunk_files)
- PlainTextChunker merged into DocumentChunker: Handle both text and documents with one class
- SentenceSplitter rename: split() renamed to split_text(), also added split_file()
- Shorter CLI flags: -l for --lang, -h for --host, -m for --metadata, -t for --tokenizer-timeout
- Visualizer overhaul: Fullscreen mode, 3-row layout, smoother hovers
- Code chunking improvements: Fixed comment artifacts, added string protection
- More code languages: ColdFusion, VB.NET, PHP 8 attributes, Pascal support
- Dependency fixes: No more pkg_resources headaches
- Direct imports: Now you can do from chunklet import DocumentChunker without performance issues
- Test coverage: From 87% to 90.67%
Install
# Upgrade to latest
pip install chunklet-py -U
# Or install a specific version
pip install chunklet-py==2.2.0
Migration
Upgrading from v2.1.x? Here's what changed:
| Old | New |
|---|---|
| chunker.chunk() | chunker.chunk_text() or chunker.chunk_file() |
| chunker.batch_chunk() | chunker.chunk_texts() or chunker.chunk_files() |
| splitter.split() | splitter.split_text() |
The old methods still work — they'll just yell at you with a deprecation warning.
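A deprecated alias of this kind is usually a thin wrapper that warns and then delegates. The sketch below shows the general pattern with a hypothetical DemoChunker class; it is not chunklet-py's implementation, and the warning text is invented.

```python
import warnings


class DemoChunker:
    """Hypothetical class showing the alias pattern; not chunklet's implementation."""

    def chunk_text(self, text):
        return text.split()

    def chunk(self, text):
        # Old name: warn, then forward to the new method.
        warnings.warn(
            "chunk() is deprecated; use chunk_text() or chunk_file()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.chunk_text(text)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = DemoChunker().chunk("old call site")
```

To find leftover old calls in your own code, running your tests with python -W error::DeprecationWarning turns each such warning into an exception.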
Full Changelog
Everything else is in the changelog.
🚀 Chunklet-py v2.1.1 - Critical Bug Fix Release
🚨 Critical Fix
Fixed a breaking bug where the Chunk Visualizer static files (CSS, JS, HTML) were missing from the PyPI package distribution. This caused RuntimeError: Directory does not exist when running chunklet visualize.
📦 Installation
pip install --upgrade chunklet-py
📋 What's Changed
- Added proper package data configuration to include visualizer static files
- Fixed PyPI package distribution to include all necessary files
- Updated documentation and changelog
📖 Full Details
See the complete changelog for all changes.
🎯 Impact
The visualizer now works correctly after installation. All other features remain unchanged and fully functional.
🚀 Chunklet-py v2.1.0 - Release
✨ What's New?
Chunklet v2.1.0 is here, and it's bringing the heat with real-time visualization and expanded file support. Whether you're debugging decorators or chunking Excel sheets, we've got you covered.
🚀 Highlights
- Interactive Visualizer: Launch a web-based UI to tune your parameters in real-time.
- New Formats: Support added for .odt, .csv, and .xlsx.
- Legacy Love: Restored support for Python 3.9 (while staying 3.14-ready!).
🛠️ Bug Fixes & Refactors
- CodeChunker: Fixed line skipping, decorator separation, and redundant logic.
- CLI: Resolved PosixPath TypeError. Big thanks to @arnoldfranz!
- CI/CD: Fixed Coveralls 422 errors and stabilized the test matrix.
Full Changelog: View here
Install: pip install chunklet-py==2.1.0
🚀 Chunklet-py v2.0.3 - Patch Release
Overview
Version 2.0.3 is a patch release that fixes critical span detection issues and improves performance by replacing the fuzzysearch dependency with an enhanced regex-based implementation.
🐛 Fixed Issues
- Span Detection Failure: Fixed hardcoded distance limit (max_l_dist=10) in the old fuzzysearch-based _find_span method that caused spans to always return (-1, -1) for longer text portions
- Performance Issues: Resolved hanging problems during chunking operations for large documents
✨ Improvements
Enhanced Find Span Implementation
- Regex-Based Approach: Replaced fuzzysearch dependency with lightweight regex-based fuzzy matching
- Adaptive Budget Calculation: Uses len(text_portion) // 4 for proportional error tolerance
- Flexible Separator Matching: Handles newlines, Unicode separators, and punctuation between lines
- Exact Match Fast Path: Prioritizes exact string matching for better performance
- Continuation Marker Handling: Properly removes continuation markers before span search
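Two of the points above, the exact-match fast path and flexible separator matching, can be sketched with the standard library alone. This is a simplified illustration, not the library's _find_span: the function names are hypothetical, separators are reduced to whitespace, and the error budget is only computed, since applying it would need a fuzzy matcher that this sketch leaves out.

```python
import re


def error_budget(text_portion):
    """Proportional tolerance from the notes: a quarter of the portion's length."""
    return len(text_portion) // 4


def find_span(text_portion, full_text):
    """Return (start, end) of text_portion in full_text, or (-1, -1)."""
    # Fast path: exact substring match.
    start = full_text.find(text_portion)
    if start != -1:
        return start, start + len(text_portion)

    # Fallback: allow any run of whitespace between the portion's lines.
    lines = [line.strip() for line in text_portion.splitlines() if line.strip()]
    if not lines:
        return -1, -1
    pattern = r"\s+".join(re.escape(line) for line in lines)
    match = re.search(pattern, full_text)
    return match.span() if match else (-1, -1)


full = "Intro.\n\nFirst line.\n   Second line.\nEnd."
exact = find_span("First line.", full)
flexible = find_span("First line.\nSecond line.", full)
```

Trying str.find first keeps the common case cheap; the regex is only built when the portion's whitespace differs from the source text.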
Dependency Management
- Removed fuzzysearch: Eliminated external dependency, reducing package size and complexity
- Improved Reliability: More predictable behavior across different text patterns
📦 Installation
pip install chunklet-py==2.0.3
🚀 Chunklet-py v2.0.2 - Patch Release
This is a minor patch release that removes some internal debugging statements that were unintentionally left in the code.
🧹 Housekeeping
- Internal: Removed debug print statements from the _filter_sentences method in SentenceSplitter.
You can view the full details in the Changelog.
🚀 Chunklet-py v2.0.1 - Patch Release
This is a patch release that addresses a critical bug in the split command of the CLI.
🐞 Bug Fixes
- CLI Bug: Fixed a critical unpacking bug in the split command. The line intended to extract sentences and confidence from splitter.split (e.g.,
sentences, confidence = splitter.split(...)) caused either a ValueError (if splitter.split returned a number of sentences other than exactly
two) or silent, incorrect unpacking (if exactly two sentences were returned, assigning the first sentence string to sentences and the second to
confidence, leading to character-level iteration). The fix now correctly separates language detection and confidence retrieval from sentence
splitting, resolving both issues and ensuring accurate output.
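Both failure modes are easy to reproduce with plain Python unpacking. In the sketch below, split is a hypothetical stand-in that returns only a list of sentences; it is not the library's splitter.

```python
def split(text):
    """Hypothetical stand-in that returns only a list of sentences."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def buggy_unpack(text):
    # Mirrors the faulty pattern: tuple-unpacking a list of sentences.
    sentences, confidence = split(text)
    return sentences, confidence


# Exactly two sentences: unpacking "succeeds" but the values are wrong.
sentences, confidence = buggy_unpack("First one. Second one.")
# Iterating over `sentences` now walks characters, not sentences.
first_items = list(sentences)[:3]

# Any other count raises ValueError.
try:
    buggy_unpack("One. Two. Three.")
    error = None
except ValueError as exc:
    error = exc
```

The silent case is the more dangerous of the two, since nothing fails until the output is inspected.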
📑 Documentation
- Documentation for installing optional and development dependencies has been updated and clarified.
You can view the full details in the Changelog.
Chunklet-py 2.0.0 Released: Major Enhancements and New Features
We are thrilled to announce the release of Chunklet-py version 2.0.0! This is a major update that brings a host of new features, significant performance improvements, and a more intuitive user experience.
✨ What's New in Version 2.0.0?
- New Chunking Engines:
  - DocumentChunker: You can now seamlessly process various document formats including .pdf, .docx, .epub, .html, .rst, and .tex. The DocumentChunker automatically converts documents to Markdown where possible, extracts rich metadata, and provides a unified interface for all your document processing needs.
  - CodeChunker: A new language-agnostic chunker for source code has been introduced. It is designed to understand and preserve the structural integrity of your code for more meaningful chunks.
- Expanded Multilingual Support: We've significantly improved our multilingual capabilities, now offering robust sentence splitting for over 50 languages.
- Enhanced Customization:
  - Custom Document Processors: You can now create and plug in your own custom processors to handle any file type you need.
  - Custom Tokenizer Commands: The CLI now supports custom tokenizer commands, allowing for more accurate token counting with your preferred tokenizer.
- Streamlined CLI: The command-line interface has been refactored for a more user-friendly experience, with simplified flags for input (--source) and output (--destination).
- Comprehensive Documentation: Our documentation has been completely overhauled for clarity and ease of use. It now includes more examples, detailed guides for each chunker, and a new section comparing chunklet-py to other libraries.
📈 Improvements
- Performance: Batch processing has been optimized for better performance and reduced memory usage.
- Code Quality: The codebase has undergone significant refactoring for improved readability, maintainability, and security.
- Error Handling: We have introduced more specific and informative error messages to aid in debugging.
⚠️ Breaking Changes
This release introduces breaking changes, particularly in the CLI and the renaming of some core components. Please consult the Migration Guide for a smooth transition.
📚 Further Information
- Full Changelog: For a detailed list of every change, bug fix, and improvement, please see our Changelog.
- Documentation: Explore all features and usage examples on our Documentation Site.
It is available on PyPI as of now.
We're excited to see what you'll build with the new and improved chunklet-py! Your feedback is always welcome.
Full Changelog: v1.3.2...v2.0.0