Releases · speedyk-005/chunklet-py

02 Jun 22:11

speedyk-005

v2.3.2

d1a248d

v2.3.2 — DotDict Gets Its Groove Back

What happened

v2.3.0 replaced python-box with dotdict3 for a 12x speedup, but dropped all serialization methods. .to_dict() calls in the CLI and user code would crash with AttributeError. This release fixes that — we vendored dotdict3 in-tree and added back all the Box-compatible serialization methods.

What's new

DotDict.to_dict() — recursive conversion to plain dicts/lists
DotDict.to_json() — JSON string or file (stdlib, zero deps)
DotDict.to_yaml() — YAML string or file (needs pyyaml)
DotDict.to_toml() — TOML string or file (needs toml)
DotDict.to_msgpack() — msgpack bytes or file (needs msgpack)
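
To give a feel for what recursive conversion involves, here is a minimal sketch of a dot-access dict with to_dict() and to_json(). The class name MiniDotDict, its internals, and the sample field names are hypothetical. This is not the vendored chunklet.common.dotdict code, and the real method signatures may differ.

```python
import json


class MiniDotDict(dict):
    """Hypothetical stand-in for a dot-access dict (not chunklet's DotDict)."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested plain dicts so chained access (a.b.c) works.
        if isinstance(value, dict) and not isinstance(value, MiniDotDict):
            value = MiniDotDict(value)
        return value

    def to_dict(self):
        """Recursively convert to plain dicts and lists."""
        def unwrap(obj):
            if isinstance(obj, dict):
                return {k: unwrap(v) for k, v in obj.items()}
            if isinstance(obj, (list, tuple)):
                return [unwrap(v) for v in obj]
            return obj
        return unwrap(self)

    def to_json(self, **kwargs):
        """Serialize with the standard library only."""
        return json.dumps(self.to_dict(), **kwargs)


# Sample data with made-up field names.
chunk = MiniDotDict({"content": "Hello.", "metadata": {"chunk_num": 1, "span": (0, 6)}})
plain = chunk.to_dict()
```

The result of to_dict() contains only built-in dicts and lists, so it can be handed to any serializer without the wrapper type leaking through.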

Performance

to_dict() is 2.3x faster than python-box's equivalent (8.44µs vs 19.23µs per op), since we're not dragging in SphinxBox, configBox, and the rest of python-box's feature creep.

Breaking changes

dotdict3 is no longer a dependency. If your code imported it directly with `from dotdict3 import DotDict`, change that to `from chunklet.common.dotdict import DotDict`.
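
Code that has to run against both 2.3.2 and older installs could use a fallback import. The sketch below is one way to do that. The helper name import_first is hypothetical, and the two module paths are the ones named in the note above.

```python
import importlib


def import_first(candidates):
    """Return the first attribute that imports cleanly from (module, attr) pairs."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError):
            continue
    return None


# Prefer the vendored location, fall back to the old external package.
# DotDict is None if neither package is installed.
DotDict = import_first([
    ("chunklet.common.dotdict", "DotDict"),
    ("dotdict3", "DotDict"),
])
```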

What's gone

External dotdict3 dependency (vendored in chunklet.common.dotdict)
Stale "Box" references in docs (now pointing to auto-generated API docs)

Full changelog

https://github.com/speedyk-005/chunklet-py/blob/main/CHANGELOG.md#232

02 May 21:37

speedyk-005

v2.3.1

0d4f42e

🚀 Chunklet-py v2.3.1 - Patch Release

Quick patch release to fix a couple of Things That Should Have Worked.

🐛 What's Fixed

Android Detection (Actually Fixed This Time)
v2.3.0 tried to detect Android with platform_system markers. Problem: Android reports as 'Linux', not 'Android' — so nobody was getting the right sentencex version. Fixed now with sys_platform + platform_machine markers.
Side effect: ARM Linux devices (Raspberry Pi, etc.) also get the legacy sentencex<=0.6.1 without Rust bindings. Temporary workaround until we figure out a better detection method.
DotDict TypeError
Using DotDict() without arguments threw TypeError on dotdict3 < 1.4.2. Now using DotDict({}) for backward compatibility.
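
For the Android detection fix above, the marker logic can be pictured as a small predicate over sys.platform and platform.machine(). The function name and the set of ARM machine strings are assumptions made for illustration. The actual markers in the package metadata are not shown in these notes and may differ.

```python
import platform
import sys

# Assumed set of ARM machine strings; the real marker list is not given in these notes.
ARM_MACHINES = {"aarch64", "armv7l", "armv8l", "arm64"}


def wants_legacy_sentencex(sys_platform, machine):
    """Mimic a 'sys_platform + platform_machine' style check (illustrative only)."""
    # Android reports itself as Linux, so the platform name alone cannot single
    # it out; the CPU architecture is used as a proxy instead.
    return sys_platform.startswith("linux") and machine.lower() in ARM_MACHINES


# Evaluate against the interpreter running this snippet.
print(wants_legacy_sentencex(sys.platform, platform.machine()))
```

The same predicate explains the side effect: an ARM Linux board such as a Raspberry Pi is indistinguishable from an Android device under this check.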

01 May 21:58

speedyk-005

v2.3.0

0870504

🚀 Chunklet-py v2.3.0 — smarter sentence splitting, faster visualizer

✨ What's New

Non-Latin scripts in fallback splitter — Arabic, Chinese, Japanese, etc. now handled correctly via Unicode property escapes (\p{Lo}, \p{Lt})
Fallback splitter preserves quotes, parens, and numbered lists — quoted text, parenthesized content, and 1. 2. 3. lists stay as single sentences instead of getting split apart (uses hash-based masking)
Visualizer API now supports MessagePack — browser requests it automatically for ~30-50% smaller payloads; programmatic clients can opt in via Accept: application/msgpack header (JSON still default)
~2x faster span detection — replaced regex-based _find_span with a deterministic finder, no more backtracking on large texts
Visualizer extra has a new shortcut "chunklet-py[viz]"
Lazy imports for splitter libraries — faster startup
Better markdown heading detection in DocumentChunker
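
The hash-based masking mentioned in the list above can be sketched roughly as follows: protected spans are swapped for opaque placeholders before splitting and restored afterwards. The function name, the pattern for protected spans, and the naive splitter are all assumptions. This is not chunklet's fallback splitter.

```python
import hashlib
import re

# Protected spans: double-quoted text, parenthesized text, and "1." style list markers.
PROTECTED = re.compile(r'"[^"]*"|\([^()]*\)|(?<!\S)\d+\.(?=\s)')


def split_with_masking(text):
    """Naive sentence split that keeps protected spans intact."""
    vault = {}

    def mask(match):
        # A short hash of the span gives a placeholder with no sentence punctuation in it.
        digest = hashlib.sha256(match.group().encode("utf-8")).hexdigest()[:12]
        key = "MASK" + digest
        vault[key] = match.group()
        return key

    masked = PROTECTED.sub(mask, text)
    # Split after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", masked)

    def unmask(part):
        for key, original in vault.items():
            part = part.replace(key, original)
        return part

    return [unmask(p) for p in parts if p]


sentences = split_with_masking(
    'She said "Stop. Wait here." and left (at 5 p.m. sharp). 1. Done.'
)
```

Here the quoted text and the parenthesized text stay inside the first sentence, and the list marker stays attached to "Done." instead of becoming a sentence of its own.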

🔧 The Fixes

pkg_resources crash on install — finally sorted out the setuptools dependency mess
Custom splitter registration — no more TypeError when registering functools.partial or other callables without a __name__
Log spam with lang='auto' — stopped warning you every single time you auto-detect a language
CodeChunker tree hierarchy — methods now appear under their class instead of "global"
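
The custom splitter registration fix concerns callables that lack a __name__, such as functools.partial objects. One defensive way to derive a display name is sketched below. The helper callable_name and the toy splitters are hypothetical and are not taken from chunklet's registry code.

```python
import functools


def callable_name(func):
    """Best-effort display name for any callable."""
    name = getattr(func, "__name__", None)
    if name is None and isinstance(func, functools.partial):
        # partial objects have no __name__; look at the wrapped function instead.
        return callable_name(func.func)
    return name or type(func).__name__


def split_on(text, sep):
    return text.split(sep)


class Splitter:
    """Callable instance, which also has no __name__ of its own."""

    def __call__(self, text):
        return text.split(".")


by_pipe = functools.partial(split_on, sep="|")
```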

🧹 Removed

Python 3.10 support — dropped because of recurring CI multiprocessing hangs and its approaching EOL.

📦 Quick Install

pip install chunklet-py -U

🔗 Additional Information

GitHub | Docs | Changelog

Feedback and bug reports welcome. Thanks!

22 Feb 19:55

speedyk-005

v2.2.0

80dcf2d

Chunklet-py v2.2.0 "The Unification Edition"

What's New?

Check out What's New for the full scoop.

✨ Quick Summary

Unified API — Consistent method names across all chunkers (chunk_text, chunk_file, chunk_texts, chunk_files)
PlainTextChunker merged into DocumentChunker — Handle both text and documents with one class
SentenceSplitter rename — split() renamed to split_text(), also added split_file()
Shorter CLI flags — -l for --lang, -h for --host, -m for --metadata, -t for --tokenizer-timeout
Visualizer overhaul — Fullscreen mode, 3-row layout, smoother hovers
Code chunking improvements — Fixed comment artifacts, added string protection
More code languages — ColdFusion, VB.NET, PHP 8 attributes, Pascal support
Dependency fixes — No more pkg_resources headaches
Direct imports — Now you can do from chunklet import DocumentChunker without performance issues
Test coverage — From 87% to 90.67%

Install

# Upgrade to latest
pip install chunklet-py -U

# Or install a specific version
pip install chunklet-py==2.2.0

Migration

Upgrading from v2.1.x? Here's what changed:

Old	New
`chunker.chunk()`	`chunker.chunk_text()` or `chunker.chunk_file()`
`chunker.batch_chunk()`	`chunker.chunk_texts()` or `chunker.chunk_files()`
`splitter.split()`	`splitter.split_text()`

The old methods still work — they'll just yell at you with a deprecation warning.
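
A deprecation shim of this kind usually looks something like the sketch below. ToyChunker is a hypothetical class used only for illustration. The real chunkers' signatures and warning text may differ.

```python
import warnings


class ToyChunker:
    """Hypothetical chunker used only to illustrate a deprecation shim."""

    def chunk_text(self, text):
        return text.split()

    def chunk(self, text):
        # Old name: warn, then delegate to the new method.
        warnings.warn(
            "chunk() is deprecated; use chunk_text() instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.chunk_text(text)


# Capture the warning instead of printing it, to show that it fires.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = ToyChunker().chunk("a b c")
```

Running a test suite with python -W error::DeprecationWarning turns such warnings into exceptions, which is a quick way to find every remaining call to the old names.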

Full Changelog

Everything else is in the changelog.

21 Dec 17:08

speedyk-005

v2.1.1

0071643

🚀 Chunklet-py v2.1.1 - Critical Bug Fix Release

🚨 Critical Fix

Fixed a breaking bug where the Chunk Visualizer static files (CSS, JS, HTML) were missing from the PyPI package distribution. This caused RuntimeError: Directory does not exist when running chunklet visualize.

📦 Installation

pip install --upgrade chunklet-py

📋 What's Changed

Added proper package data configuration to include visualizer static files
Fixed PyPI package distribution to include all necessary files
Updated documentation and changelog

📖 Full Details

See the complete changelog for all changes.

🎯 Impact

The visualizer now works correctly after installation. All other features remain unchanged and fully functional.

20 Dec 16:43

speedyk-005

v2.1.0

81772b3

🚀 Chunklet-py v2.1.0 - Release

✨ What's New?

Chunklet v2.1.0 is here, and it's bringing the heat with real-time visualization and expanded file support. Whether you're debugging decorators or chunking Excel sheets, we've got you covered.

🚀 Highlights

Interactive Visualizer: Launch a web-based UI to tune your parameters in real-time.
New Formats: Support added for .odt, .csv, and .xlsx.
Legacy Love: Restored support for Python 3.9 (while staying 3.14-ready!).

🛠️ Bug Fixes & Refactors

CodeChunker: Fixed line skipping, decorator separation, and redundant logic.
CLI: Resolved PosixPath TypeError. Big thanks to @arnoldfranz!
CI/CD: Fixed Coveralls 422 errors and stabilized the test matrix.

Full Changelog: View here
Install: pip install chunklet-py==2.1.0

Contributors

arnoldfranz

21 Nov 22:44

speedyk-005

v2.0.3

fb7fdaa

🚀 Chunklet-py v2.0.3 - Patch Release

Overview

Version 2.0.3 is a patch release that fixes critical span detection issues and improves performance by replacing the fuzzysearch dependency with an enhanced regex-based implementation.

🐛 Fixed Issues

Span Detection Failure: Fixed hardcoded distance limit (max_l_dist=10) in the old fuzzysearch-based _find_span method that caused spans to always return (-1, -1) for longer text portions
Performance Issues: Resolved hanging problems during chunking operations for large documents

✨ Improvements

Enhanced Find Span Implementation

Regex-Based Approach: Replaced fuzzysearch dependency with lightweight regex-based fuzzy matching
Adaptive Budget Calculation: Uses len(text_portion) // 4 for proportional error tolerance
Flexible Separator Matching: Handles newlines, Unicode separators, and punctuation between lines
Exact Match Fast Path: Prioritizes exact string matching for better performance
Continuation Marker Handling: Properly removes continuation markers before span search
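
The exact-match fast path and the flexible separator matching listed above can be sketched as follows. This is an illustration under assumptions, not the library's _find_span: the function name and signature are made up, and the error budget and continuation-marker handling are left out.

```python
import re


def find_span(text_portion, full_text):
    """Return (start, end) of text_portion in full_text, or (-1, -1)."""
    # Fast path: an exact substring match needs no regex at all.
    start = full_text.find(text_portion)
    if start != -1:
        return start, start + len(text_portion)

    # Fallback: match each non-empty line literally, allowing any run of
    # non-word characters (whitespace, punctuation) between consecutive lines.
    lines = [line.strip() for line in text_portion.splitlines() if line.strip()]
    if not lines:
        return -1, -1
    pattern = r"\W*?".join(re.escape(line) for line in lines)
    match = re.search(pattern, full_text)
    return (match.start(), match.end()) if match else (-1, -1)


source = "First line.\n\n   Second line."
```

With this source, a chunk whose lines are joined by a single newline still resolves to the full span, even though the original text separates them with a blank line and indentation.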

Dependency Management

Removed fuzzysearch: Eliminated external dependency, reducing package size and complexity
Improved Reliability: More predictable behavior across different text patterns

📦 Installation

pip install chunklet-py==2.0.3

21 Nov 03:26

speedyk-005

v2.0.2

17a29c8

🚀 Chunklet-py v2.0.2 - Patch Release

This is a minor patch release that removes some internal debugging statements that were unintentionally left in the code.

🧹 Housekeeping

Internal: Removed debug print statements from the _filter_sentences method in SentenceSplitter.

You can view the full details in the Changelog.

20 Nov 17:51

speedyk-005

v2.0.1

e70bcee

🚀 Chunklet-py v2.0.1 - Patch Release

This is a patch release that addresses a critical bug in the split command of the CLI.

🐞 Bug Fixes

CLI Bug: Fixed a critical unpacking bug in the split command. The line meant to extract sentences and confidence from splitter.split (e.g., sentences, confidence = splitter.split(...)) either raised a ValueError (when splitter.split returned any number of sentences other than exactly two) or unpacked silently and incorrectly (when exactly two sentences were returned, the first sentence string was assigned to sentences and the second to confidence, which led to character-level iteration). The fix separates language detection and confidence retrieval from sentence splitting, which resolves both issues.
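
Both failure modes are easy to reproduce with a toy function that returns a plain list of sentences. The split function below is a hypothetical stand-in, not chunklet's splitter.

```python
def split(text):
    """Toy splitter that returns a plain list of sentences."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


# Exactly two sentences: unpacking succeeds, but the names hold the wrong things.
sentences, confidence = split("Hello there. Goodbye now.")
chars = list(sentences)[:5]  # iterating a string yields single characters

# Any other count fails loudly.
try:
    first, second = split("One. Two. Three.")
    error = None
except ValueError as exc:
    error = str(exc)
```

After the first call, sentences is a single string and confidence is the second sentence, which is why later code ended up iterating over characters.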

📑 Documentation

Documentation for installing optional and development dependencies has been updated and clarified.

You can view the full details in the Changelog.

20 Nov 02:21

speedyk-005

v2.0.0

6ef8cea

Chunklet-py 2.0.0 Released: Major Enhancements and New Features

We are thrilled to announce the release of Chunklet-py version 2.0.0! This is a major update that brings a host of new features, significant performance improvements, and a more intuitive user experience.

✨ What's New in Version 2.0.0?

New Chunking Engines:
- DocumentChunker: You can now seamlessly process various document formats including .pdf, .docx, .epub, .html, .rst, and .tex. The DocumentChunker automatically converts documents to Markdown where possible, extracts rich metadata, and provides a unified interface for all your document processing needs.
- CodeChunker: A new language-agnostic chunker for source code has been introduced. It is designed to understand and preserve the structural integrity of your code for more meaningful chunks.
Expanded Multilingual Support: We've significantly improved our multilingual capabilities, now offering robust sentence splitting for over 50 languages.
Enhanced Customization:
- Custom Document Processors: You can now create and plug in your own custom processors to handle any file type you need.
- Custom Tokenizer Commands: The CLI now supports custom tokenizer commands, allowing for more accurate token counting with your preferred tokenizer.
Streamlined CLI: The command-line interface has been refactored for a more user-friendly experience, with simplified flags for input (--source) and output (--destination).
Comprehensive Documentation: Our documentation has been completely overhauled for clarity and ease of use. It now includes more examples, detailed guides for each chunker, and a new section comparing chunklet-py to other libraries.

📈 Improvements

Performance: Batch processing has been optimized for better performance and reduced memory usage.
Code Quality: The codebase has undergone significant refactoring for improved readability, maintainability, and security.
Error Handling: We have introduced more specific and informative error messages to aid in debugging.

⚠️ Breaking Changes

This release introduces breaking changes, particularly in the CLI and the renaming of some core components. Please consult the Migration Guide for a smooth transition.

📚 Further Information

Full Changelog: For a detailed list of every change, bug fix, and improvement, please see our Changelog.
Documentation: Explore all features and usage examples on our Documentation Site.

It is now available on PyPI.
We're excited to see what you'll build with the new and improved chunklet-py! Your feedback is always welcome.

Full Changelog: v1.3.2...v2.0.0
