Releases: rhnfzl/SqueakyCleanText
v0.6.1: Bug fixes, functional tests, GLiNER ONNX fallback
Bug Fixes
- GLiClass API fix: Document classification (`check_classify_document=True`) was completely broken; changed `candidate_labels=` → `labels=` and updated return format parsing to `list[list[dict]]`
- Presidio GLiNER fix: `presidio_gliner` backend was completely broken; `model_path=` renamed to `model_name=` in upstream `presidio-analyzer`, added `threshold=` passthrough
- GLiNER ONNX fallback: `gliner_onnx=True` no longer crashes with `FileNotFoundError` when `model.onnx` is missing (GLiNER issue #314); gracefully falls back to PyTorch with a warning
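The fallback described in the last item can be sketched roughly as follows. This is a minimal illustration of the try/except-and-warn pattern only, not the library's code: `load_onnx_model`, `load_torch_model`, and `load_with_fallback` are hypothetical stand-ins for the real loaders.

```python
import warnings
from pathlib import Path


def load_onnx_model(model_dir: Path) -> str:
    """Hypothetical ONNX loader: fails if model.onnx is absent."""
    onnx_file = model_dir / "model.onnx"
    if not onnx_file.is_file():
        raise FileNotFoundError(f"{onnx_file} not found")
    return "onnx"


def load_torch_model(model_dir: Path) -> str:
    """Hypothetical PyTorch loader used as the fallback path."""
    return "torch"


def load_with_fallback(model_dir: Path, prefer_onnx: bool = True) -> str:
    """Try the ONNX weights first; warn and fall back to PyTorch if missing."""
    if prefer_onnx:
        try:
            return load_onnx_model(model_dir)
        except FileNotFoundError as exc:
            warnings.warn(f"ONNX weights unavailable ({exc}); falling back to PyTorch")
    return load_torch_model(model_dir)
```

Catching only `FileNotFoundError` keeps other loader failures visible instead of hiding them behind the fallback.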
New
- 69 end-to-end functional tests (`tests/test_functional.py`) covering 100% of feature surface with real model inference across 13 categories: core pipeline, language detection, NER backends (ONNX, Torch, GLiNER, ensemble, Presidio), replacement modes, document classification, synthetic replacement, PII mode, and config edge cases
- Shared test infrastructure (`tests/conftest.py`): DRY test constants
Code Quality
- Simplified code: hoisted loop invariants, flattened branches, consolidated regex constants to `constants.py`
- Uniform `AnonymizeResult` return type across all anonymization paths
- Removed dead code and redundant decorators
- Side-effect-free package detection in tests (`importlib.util.find_spec`)
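For readers unfamiliar with the last item, the standard-library call can be used roughly like this. The helper name `is_installed` is an assumption made for illustration; only `importlib.util.find_spec` comes from the release notes.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    # find_spec() consults the import system's finders; for a top-level name
    # it does not execute the package, so no import side effects occur.
    return importlib.util.find_spec(package) is not None
```

A test suite could then skip optional-backend tests with something like `pytest.mark.skipif(not is_installed("gliner"), reason="gliner not installed")`.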
Test Results
| Suite | Passed | Failed |
|---|---|---|
| Unit tests | 194 | 0 |
| Functional tests (real models) | 69 | 0 |
| Total | 263 | 0 |
Full Changelog: v0.6.0...v0.6.1
v0.6.0 — GLiNER modernization, reversible anonymization, language ISO codes
What's New in v0.6.0
Major Features
- Reversible anonymization (`replacement_mode='reversible'`): indexed placeholders `<PERSON_0>` with `AnonymizationMap` for round-trip deanonymization
- Synthetic replacement (`replacement_mode='synthetic'`): Faker-generated realistic fake values for entities
- PII mode (`ner_mode='pii'`): auto-configures GLiNER with 60+ entity labels for comprehensive PII detection
- Document classification (`check_classify_document=True`): GLiClass zero-shot pre-classification before text processing
- ProcessResult: backward-compatible 3-tuple unpacking + `.metadata` dict for reversible maps and classification results
- GLiNER ONNX (`gliner_onnx=True`): loads GLiNER with pre-built ONNX weights from HuggingFace Hub
- Bi-encoder support: auto-detects ModernBERT, caches label embeddings, dynamic context windows
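The idea behind reversible anonymization (indexed placeholders plus a map back to the originals) can be sketched as below. This is a simplified illustration under stated assumptions, not the library's `AnonymizationMap`: the function names are hypothetical, entity spans are supplied by hand as non-overlapping `(start, end, label)` tuples, and every occurrence gets its own index.

```python
from collections import defaultdict


def anonymize(text: str, entities: list) -> tuple:
    """Replace each (start, end, label) span with an indexed placeholder.

    Returns the masked text plus a placeholder -> original mapping.
    Assumes the spans do not overlap.
    """
    counters = defaultdict(int)
    mapping = {}
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        placeholder = f"<{label}_{counters[label]}>"
        counters[label] += 1
        mapping[placeholder] = text[start:end]
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last entity
    return "".join(pieces), mapping


def deanonymize(masked: str, mapping: dict) -> str:
    """Restore the original values by substituting each placeholder back."""
    for placeholder, original in mapping.items():
        masked = masked.replace(placeholder, original)
    return masked
```

Because each placeholder ends in `>`, `<PERSON_1>` can never be mistaken for a prefix of `<PERSON_10>` during restoration.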
Language Parameter Enhancement
- ISO code acceptance: `language='en'` (ISO 639-1), `language='eng'` (ISO 639-3), or `language='ENGLISH'`; all resolve to canonical Lingua name
- Tuple-of-languages: `language=('en', 'nl')` restricts auto-detection to listed languages (vs. pinning to one)
- `extra_languages` also accepts ISO codes: `extra_languages=('fr', 'pt')` resolves correctly
- All 75 Lingua languages supported via ISO code lookup
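The normalization described above amounts to a case-insensitive lookup from name or ISO code to one canonical name. The sketch below uses a tiny hand-written table of three languages purely for illustration; the table and the `resolve_language` / `resolve_languages` helpers are assumptions, not the package's API.

```python
# Tiny illustrative table: canonical name -> (ISO 639-1, ISO 639-3).
_LANGUAGES = {
    "ENGLISH": ("en", "eng"),
    "DUTCH": ("nl", "nld"),
    "FRENCH": ("fr", "fra"),
}

# Build one lookup that accepts the name or either ISO code, case-insensitively.
_LOOKUP = {}
for _name, (_iso1, _iso3) in _LANGUAGES.items():
    for _key in (_name, _iso1, _iso3):
        _LOOKUP[_key.upper()] = _name


def resolve_language(value: str) -> str:
    """Map 'en', 'eng', or 'English' to the canonical name 'ENGLISH'."""
    try:
        return _LOOKUP[value.strip().upper()]
    except KeyError:
        raise ValueError(f"Unknown language: {value!r}") from None


def resolve_languages(value) -> tuple:
    """Accept a single code/name or a tuple of them; always return a tuple."""
    if isinstance(value, str):
        return (resolve_language(value),)
    return tuple(resolve_language(v) for v in value)
```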
Improvements
- Thread-safe `TextCleanerConfig` frozen dataclass with per-instance configuration
- Parallel `process_batch()` via `ThreadPoolExecutor` (ONNX releases the GIL)
- Async `aprocess_batch()` for FastAPI/aiohttp integration
- `warmup()` method for pre-loading NER models at startup
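The batching pattern named above (a thread pool for the synchronous path, with an async wrapper on top) might look roughly like this. The worker `process` is a hypothetical stand-in for the real pipeline, and the function bodies are a sketch rather than the library's implementation.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def process(text: str) -> str:
    """Hypothetical per-document worker standing in for the real pipeline."""
    return text.strip().lower()


def process_batch(texts: list, max_workers: int = 4) -> list:
    """Run the worker across a thread pool; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, texts))


async def aprocess_batch(texts: list, max_workers: int = 4) -> list:
    """Async wrapper: off-load the blocking batch to a worker thread."""
    return await asyncio.to_thread(process_batch, texts, max_workers)
```

Threads only help when the worker releases the GIL, which is the reason the release notes give for pairing this with ONNX inference.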
Full Changelog
v0.5.2
What's changed
- Consolidated README: Combined "What's New" v0.5.0 and v0.4.5 sections into a single concise summary with only major user-facing features
- Fixed llms.txt link: Changed from relative path (`./llms.txt`) to absolute GitHub URL so it works on both GitHub and PyPI
- Added llms.txt to PyPI sidebar: Now appears as a clickable link in PyPI project URLs
- Single-source version: Version is now defined only in `sct/__init__.py` and read dynamically by `pyproject.toml` via `setuptools.dynamic` - no more dual maintenance
- Removed em-dashes: Replaced all em-dash characters with standard punctuation throughout README
Full Changelog: v0.5.1...v0.5.2
v0.5.1
What's New
Bug Fixes
- asyncio: Replace deprecated `get_event_loop()` with `get_running_loop()`; fixes `RuntimeError` in Python 3.12+ when called from FastAPI/aiohttp
- Quantize cache path: ONNX quantized models now write to `~/.cache/sct_quantized/` instead of the read-only HuggingFace Hub cache directory
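As an illustration of the asyncio fix, the sketch below fetches the loop with `asyncio.get_running_loop()` from inside a coroutine and hands a blocking call to the default executor. `blocking_clean` and `clean_async` are hypothetical names; only the two asyncio function names come from the release notes.

```python
import asyncio


def blocking_clean(text: str) -> str:
    """Hypothetical blocking call standing in for model inference."""
    return text.upper()


async def clean_async(text: str) -> str:
    # get_running_loop() returns the loop driving this coroutine and raises
    # RuntimeError if no loop is running, so it never creates one implicitly.
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_clean, text)
```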
Performance
- ONNX session sharing: Languages sharing the same model (e.g. FR/PT/IT all use `wikineural-multilingual-ner-onnx`) now reuse a single session, saving ~600 MB RAM
- Finer-grained inference locks: Per-model locks replace the coarse per-language lock, allowing true concurrent inference across different models
- Lock-free tokenization: `split_text()` no longer holds the inference lock during chunking (HF fast tokenizer is thread-safe)
- CJK/Arabic chunk safety: `_simple_chunk()` uses 2 chars/token (down from 4) to prevent context-window overflow for CJK and Arabic text
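Session sharing and per-model locks can be pictured as a registry keyed by model name rather than by language. The class below is a minimal sketch under that assumption; `SessionRegistry` and its methods are hypothetical, and a plain `object()` stands in for a real inference session.

```python
import threading


class SessionRegistry:
    """Cache one session and one lock per model name (not per language)."""

    def __init__(self, language_to_model: dict):
        self._language_to_model = language_to_model
        self._sessions = {}
        self._locks = {}
        self._registry_lock = threading.Lock()

    def _create_session(self, model_name: str) -> object:
        # Hypothetical stand-in for building an inference session.
        return object()

    def get(self, language: str) -> tuple:
        """Return the (session, lock) pair for the model behind a language."""
        model_name = self._language_to_model[language]
        with self._registry_lock:  # guards only the two dictionaries
            if model_name not in self._sessions:
                self._sessions[model_name] = self._create_session(model_name)
                self._locks[model_name] = threading.Lock()
            return self._sessions[model_name], self._locks[model_name]
```

A caller would hold the returned lock only around the inference call, so two languages backed by different models never wait on each other.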
API
- `load_language(lang)`: New public method on `GeneralNER` for stable warm-up without coupling to private `_ensure_loaded()`
- `ner_batch_size` validation: `TextCleanerConfig(ner_batch_size=0)` now raises `ValueError` immediately instead of silently producing empty results
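Fail-fast validation on a frozen dataclass is usually done in `__post_init__`. The sketch below shows the pattern on a hypothetical one-field `BatchConfig`; the class name, the default of 8, and the error message are assumptions, not details of `TextCleanerConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchConfig:
    """Hypothetical stand-in for a frozen config with one validated field."""

    ner_batch_size: int = 8  # illustrative default, not the library's

    def __post_init__(self) -> None:
        # Reading attributes is allowed on a frozen dataclass; only
        # assignment is blocked, so validation works as usual.
        if self.ner_batch_size < 1:
            raise ValueError(
                f"ner_batch_size must be >= 1, got {self.ner_batch_size}"
            )
```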
Regex
- Sentence boundary: `SENTENCE_BOUNDARY_PATTERN` upgraded to the `regex` library with an abbreviation guard; "Dr. Smith", "Mr. Jones", "U.S. Army" no longer cause false splits in NER chunking
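An abbreviation guard of this kind can be approximated with negative lookbehinds. The pattern below is an illustrative sketch, not the project's `SENTENCE_BOUNDARY_PATTERN`: it uses the standard-library `re` module (each lookbehind is fixed-width) instead of the third-party `regex` library, and its list of titles is an assumption.

```python
import re

# Split on whitespace that follows ., ! or ?, unless the period belongs to a
# known title ("Dr.", "Mr.", "Ms.", "Mrs.") or a single capital initial ("U.S.").
SENTENCE_BOUNDARY = re.compile(
    r"(?<=[.!?])"                                      # sentence-ending punctuation
    r"(?<!\bDr\.)(?<!\bMr\.)(?<!\bMs\.)(?<!\bMrs\.)"   # title guard
    r"(?<!\b[A-Z]\.)"                                  # initial guard, e.g. "S." in "U.S."
    r"\s+"
)


def split_sentences(text: str) -> list:
    """Split text into sentences, keeping guarded abbreviations intact."""
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]
```

The initial guard trades recall for safety: a sentence that really ends in a single capital letter ("Plan B. Next...") will not be split by this sketch.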
Tests
- 12 new tests covering all 9 changes
v0.4.5
v0.4.5 — Multi-Backend NER, ONNX-First Architecture, Language Extensibility
Major release with architectural overhaul since v0.3.0.
Architecture
- Frozen `TextCleanerConfig` dataclass: thread-safe, per-instance configuration replaces mutable module-level globals
- Backward-compatible: old `config.CHECK_*` module vars still work
- ONNX-first NER inference: torch-free base install (~400MB models vs ~7GB)
- Thread-parallel batch processing via `ThreadPoolExecutor` (ONNX releases the GIL)
- Migrated from `setup.py` to `pyproject.toml` (PEP 517)
Multi-Backend NER
- 5 backends: `onnx` (default), `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch`
- GLiNER zero-shot NER for custom entity types (PRODUCT, EVENT, SKILL, etc.)
- Ensemble voting across backends for improved recall
- Lazy per-language model loading — models load on first use, not all at init
- Batched chunk inference — long texts split with semantic boundaries (paragraph > sentence > clause > word)
- Language-specific + multilingual models with averaged confidence scores
- Entity key collision fix: uses `start:end` separator instead of string concatenation
- MISC entity handling: no longer silently dropped
- ONNX-quantized models hosted on HuggingFace Hub
- Configurable confidence thresholds per backend
- GPU acceleration (CUDA) for both ONNX and PyTorch backends
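To see why the entity key needed a separator, consider keys built from character offsets. The sketch below assumes keys are derived from `start` and `end` and that deduplication keeps the highest-scoring entity per span; the function names and the entity dict layout are hypothetical.

```python
def naive_key(start: int, end: int) -> str:
    """Collision-prone: digits of start and end run together."""
    return f"{start}{end}"


def safe_key(start: int, end: int) -> str:
    """Unambiguous: the ':' separator keeps the two offsets apart."""
    return f"{start}:{end}"


def deduplicate(entities: list) -> list:
    """Keep the highest-scoring entity for each (start, end) span."""
    best = {}
    for ent in entities:
        key = safe_key(ent["start"], ent["end"])
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    return list(best.values())
```

With plain concatenation, the spans (1, 23) and (12, 3) both yield `"123"`, so one entity would silently overwrite the other.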
Language Extensibility
- Add custom languages via config — no source changes needed
- `extra_languages`: extend Lingua detection to new languages
- `custom_stopwords`: per-language stopword sets
- `custom_month_names`: per-language month names for date detection
- `ner_models`: per-language ONNX model overrides
- Language-keyed model dict replaces fragile positional tuple
- See docs/ADDING_LANGUAGES.md for the full guide
Multilingual Date Detection
- Extended `DATE_REGEX` with month names in 4 languages: English, Dutch, German, Spanish
- Supports ISO 8601, DD/MM/YYYY, month-name-first, and day-first date formats
- Optional fuzzy date matching via RapidFuzz catches misspelled months (e.g. "Janury", "Feburary")
- Install with: `pip install squeakycleantext[fuzzy]`
- Enable: `TextCleanerConfig(check_fuzzy_replace_dates=True)`
- Language-aware vocabulary filtering with English fallback
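Fuzzy month matching can be illustrated without any extra dependency. The sketch below deliberately swaps RapidFuzz for the standard library's `difflib`, so it shows the idea rather than the library's scoring; `correct_month` and the 0.8 cutoff are assumptions.

```python
import difflib
from typing import Optional

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]


def correct_month(word: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest English month name, or None if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), MONTHS, n=1, cutoff=cutoff)
    return matches[0].capitalize() if matches else None
```

A high cutoff keeps ordinary words such as weekday names from being mistaken for months.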
Text Processing
- Configurable emoji removal (`check_remove_emoji`)
- Configurable bracket/brace content removal (`check_remove_bracket_content`, `check_remove_brace_content`)
- Smart case folding: preserves abbreviations (NATO, UNESCO) and camelCase (iPhone, eBay) while lowercasing regular words; stopword-aware
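The smart case folding rule can be approximated token by token. The sketch below is a simplified reading of the bullet above: `smart_casefold` is a hypothetical name, the stopword-aware part is omitted, and whitespace is normalized to single spaces as a side effect of splitting.

```python
def smart_casefold(text: str) -> str:
    """Lowercase ordinary words but keep abbreviations and camelCase tokens."""

    def fold(token: str) -> str:
        if len(token) > 1 and token.isupper():
            return token  # abbreviation such as NATO or UNESCO
        if any(ch.isupper() for ch in token[1:]):
            return token  # internal capital, e.g. iPhone or eBay
        return token.lower()

    return " ".join(fold(tok) for tok in text.split())
```

For example, `smart_casefold("The NATO Summit praised iPhone Sales")` yields `"the NATO summit praised iPhone sales"`.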
Dependencies
- `stop-words` replaces NLTK (50KB bundled vs 30MB download, no post-install step)
- Stopwords stored as sets for O(1) lookup (was lists with O(n))
- PyTorch/Transformers moved to optional `[torch]` extra
- New optional extras: `[gpu]`, `[fuzzy]`, `[gliner]`, `[gliner2]`, `[all-ner]`
Code Quality
- Python 3.11–3.13 support
- `ruff` linter (replaces flake8)
- 79+ tests with Hypothesis property-based testing and Faker
- Thread-safety verified with concurrent NER tests
- Pre-compiled punctuation and symbol regexes
- CI with coverage reporting
Bug Fixes
- Fixed ReDoS vulnerability in `ACRONYM_REGEX`
- Fixed Unicode escape bug in `URL_REGEX` and `EMAIL_REGEX`
- Fixed `ISOLATED_LETTERS_REGEX` matching commas/spaces
- Fixed `process()` to always return a consistent 3-tuple
- Fixed thread-safety (no instance mutation during processing)
- Fixed entity key collision in NER deduplication
Full Changelog: v0.3.0...v0.4.5