Releases · rhnfzl/SqueakyCleanText

28 Feb 21:18

rhnfzl

v0.6.1

e23b104

v0.6.1: Bug fixes, functional tests, GLiNER ONNX fallback (Latest)

Bug Fixes

GLiClass API fix: Document classification (check_classify_document=True) was completely broken. The call now passes labels= instead of candidate_labels=, and result parsing handles the list[list[dict]] return format
Presidio GLiNER fix: The presidio_gliner backend was completely broken because upstream presidio-analyzer renamed model_path= to model_name=. The call now uses model_name= and passes threshold= through
GLiNER ONNX fallback: gliner_onnx=True no longer crashes with FileNotFoundError when model.onnx is missing (GLiNER issue #314); it falls back to PyTorch with a warning

New

69 end-to-end functional tests (tests/test_functional.py) that run real model inference and cover 100% of the feature surface across 13 categories, including the core pipeline, language detection, NER backends (ONNX, Torch, GLiNER, ensemble, Presidio), replacement modes, document classification, synthetic replacement, PII mode, and config edge cases
Shared test infrastructure (tests/conftest.py) — DRY test constants

Code Quality

Simplified code: hoisted loop invariants, flattened branches, consolidated regex constants to constants.py
Uniform AnonymizeResult return type across all anonymization paths
Removed dead code and redundant decorators
Side-effect-free package detection in tests (importlib.util.find_spec)
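
For readers unfamiliar with the last item, importlib.util.find_spec reports whether a top-level package can be found without executing its import-time code. A minimal sketch, where has_package is a hypothetical helper name and not necessarily what the test suite calls it:

```python
import importlib.util


def has_package(name: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# "json" ships with the standard library; the second name is assumed not to exist.
print(has_package("json"))
print(has_package("sct_definitely_missing_pkg"))
```

A helper like this is typically used to skip optional-dependency tests, for example as the condition of a pytest skipif marker.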

Test Results

Suite	Passed	Failed
Unit tests	194	0
Functional tests (real models)	69	0
Total	263	0

Full Changelog: v0.6.0...v0.6.1

28 Feb 18:01

rhnfzl

v0.6.0

9480bc2

v0.6.0 — GLiNER modernization, reversible anonymization, language ISO codes

What's New in v0.6.0

Major Features

Reversible anonymization (replacement_mode='reversible'): indexed placeholders <PERSON_0> with AnonymizationMap for round-trip deanonymization
Synthetic replacement (replacement_mode='synthetic'): Faker-generated realistic fake values for entities
PII mode (ner_mode='pii'): auto-configures GLiNER with 60+ entity labels for comprehensive PII detection
Document classification (check_classify_document=True): GLiClass zero-shot pre-classification before text processing
ProcessResult: backward-compatible 3-tuple unpacking + .metadata dict for reversible maps and classification results
GLiNER ONNX (gliner_onnx=True): loads GLiNER with pre-built ONNX weights from HuggingFace Hub
Bi-encoder support: auto-detects ModernBERT, caches label embeddings, dynamic context windows
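
To illustrate the idea behind reversible anonymization, here is a self-contained sketch of indexed placeholders with a round-trip map. It does not use the package's AnonymizationMap; the anonymize and deanonymize functions, the (start, end, label) span format, and the plain dict mapping are all assumptions made for this example, and spans are assumed not to overlap.

```python
from collections import defaultdict


def anonymize(text, entities):
    """Replace (start, end, label) spans with indexed placeholders.

    Returns the anonymized text and a placeholder -> original mapping.
    """
    counters = defaultdict(int)
    mapping = {}
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        placeholder = f"<{label}_{counters[label]}>"
        counters[label] += 1
        mapping[placeholder] = text[start:end]
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last entity
    return "".join(pieces), mapping


def deanonymize(text, mapping):
    """Restore the original values by substituting each placeholder back."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


source = "Alice met Bob in Paris."
spans = [(0, 5, "PERSON"), (10, 13, "PERSON"), (17, 22, "LOCATION")]
masked, mapping = anonymize(source, spans)
print(masked)  # <PERSON_0> met <PERSON_1> in <LOCATION_0>.
assert deanonymize(masked, mapping) == source
```

In this sketch every occurrence gets a fresh index; whether the library reuses an index for repeated mentions of the same entity is not stated in these notes.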

Language Parameter Enhancement

ISO code acceptance: language='en' (ISO 639-1), language='eng' (ISO 639-3), or language='ENGLISH' — all resolve to canonical Lingua name
Tuple-of-languages: language=('en', 'nl') restricts auto-detection to listed languages (vs. pinning to one)
extra_languages also accepts ISO codes: extra_languages=('fr', 'pt') resolves correctly
All 75 Lingua languages supported via ISO code lookup
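
The resolution rules above can be pictured with a small lookup sketch. The table below covers only four languages and is purely illustrative; resolve_language and resolve_languages are hypothetical names, and the real implementation presumably derives its table from Lingua rather than hard-coding it.

```python
# Illustrative subset: ISO 639-1 and ISO 639-3 codes mapped to canonical names.
_ISO_TO_NAME = {
    "en": "ENGLISH", "eng": "ENGLISH",
    "nl": "DUTCH", "nld": "DUTCH",
    "fr": "FRENCH", "fra": "FRENCH",
    "pt": "PORTUGUESE", "por": "PORTUGUESE",
}
_NAMES = set(_ISO_TO_NAME.values())


def resolve_language(value):
    """Map an ISO code or a language name (any case) to its canonical name."""
    key = value.strip()
    if key.upper() in _NAMES:
        return key.upper()
    try:
        return _ISO_TO_NAME[key.lower()]
    except KeyError:
        raise ValueError(f"Unknown language: {value!r}") from None


def resolve_languages(value):
    """Accept one code/name or a tuple of them; always return a tuple of names."""
    if isinstance(value, str):
        return (resolve_language(value),)
    return tuple(resolve_language(v) for v in value)


print(resolve_languages("en"))
print(resolve_languages(("en", "nl")))
```

Returning a tuple in both cases keeps the single-language and restricted-detection cases on one code path.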

Improvements

Thread-safe TextCleanerConfig frozen dataclass with per-instance configuration
Parallel process_batch() via ThreadPoolExecutor (ONNX releases the GIL)
Async aprocess_batch() for FastAPI/aiohttp integration
warmup() method for pre-loading NER models at startup
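
As a rough picture of thread-parallel batch processing, the sketch below maps a per-text function over a batch with ThreadPoolExecutor. The function signature, the process_one callable, and the default worker count are assumptions for illustration; the package's own process_batch() may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(texts, process_one, max_workers=4):
    """Apply process_one to each text in parallel threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(process_one, texts))


# str.upper stands in for a real per-text cleaning function.
print(process_batch(["first text", "second text"], str.upper))
```

Threads only help when the per-text work releases the GIL, which the notes above say is the case for ONNX inference.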

Full Changelog: v0.5.2...v0.6.0

28 Feb 09:08

rhnfzl

v0.5.2

def20dc

v0.5.2

What's changed

Consolidated README: Combined "What's New" v0.5.0 and v0.4.5 sections into a single concise summary with only major user-facing features
Fixed llms.txt link: Changed from relative path (./llms.txt) to absolute GitHub URL so it works on both GitHub and PyPI
Added llms.txt to PyPI sidebar: Now appears as a clickable link in PyPI project URLs
Single-source version: Version is now defined only in sct/__init__.py and read dynamically by pyproject.toml via setuptools.dynamic - no more dual maintenance
Removed em-dashes: Replaced all em-dash characters with standard punctuation throughout README

Full Changelog: v0.5.1...v0.5.2

23 Feb 22:49

rhnfzl

v0.5.1

cc3a773

v0.5.1

What's New

Bug Fixes

asyncio: Replace deprecated get_event_loop() with get_running_loop() — fixes RuntimeError in Python 3.12+ when called from FastAPI/aiohttp
Quantize cache path: ONNX quantized models now write to ~/.cache/sct_quantized/ instead of the read-only HuggingFace Hub cache directory
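
The asyncio change follows a common pattern: inside a coroutine, obtain the loop with asyncio.get_running_loop() and hand blocking work to an executor. A minimal sketch, in which both function bodies are placeholders rather than the package's code:

```python
import asyncio
from functools import partial


def process_batch(texts):
    """Placeholder for a blocking batch call; here it only strips whitespace."""
    return [t.strip() for t in texts]


async def aprocess_batch(texts):
    """Run the blocking batch call in the default executor without blocking the loop."""
    loop = asyncio.get_running_loop()  # valid only while a loop is running
    return await loop.run_in_executor(None, partial(process_batch, texts))


print(asyncio.run(aprocess_batch([" a ", "b "])))
```

Because get_running_loop() raises if no loop is running, misuse from synchronous code fails loudly instead of silently creating a new loop.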

Performance

ONNX session sharing: Languages sharing the same model (e.g. FR/PT/IT all use wikineural-multilingual-ner-onnx) now reuse a single session — saves ~600 MB RAM
Finer-grained inference locks: Per-model locks replace the coarse per-language lock, allowing true concurrent inference across different models
Lock-free tokenization: split_text() no longer holds the inference lock during chunking (HF fast tokenizer is thread-safe)
CJK/Arabic chunk safety: _simple_chunk() uses 2 chars/token (down from 4) to prevent context-window overflow for CJK and Arabic text
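
Session sharing and per-model locks can be sketched together as a cache keyed by model name. SessionCache, its factory argument, and the English model name below are hypothetical; only the wikineural-multilingual-ner-onnx name comes from the notes above.

```python
import threading


class SessionCache:
    """Share one session per model name and guard each with its own lock."""

    def __init__(self, factory):
        self._factory = factory  # builds a session from a model name
        self._sessions = {}
        self._locks = {}
        self._registry_lock = threading.Lock()  # protects the two dicts only

    def get(self, model_name):
        """Return (session, lock) for a model, creating them on first use."""
        with self._registry_lock:
            if model_name not in self._sessions:
                self._sessions[model_name] = self._factory(model_name)
                self._locks[model_name] = threading.Lock()
            return self._sessions[model_name], self._locks[model_name]


LANG_TO_MODEL = {
    "FRENCH": "wikineural-multilingual-ner-onnx",
    "PORTUGUESE": "wikineural-multilingual-ner-onnx",
    "ITALIAN": "wikineural-multilingual-ner-onnx",
    "ENGLISH": "hypothetical-english-ner",  # assumed name, for contrast only
}

cache = SessionCache(lambda name: object())  # object() stands in for a real session
fr_session, fr_lock = cache.get(LANG_TO_MODEL["FRENCH"])
pt_session, pt_lock = cache.get(LANG_TO_MODEL["PORTUGUESE"])
en_session, en_lock = cache.get(LANG_TO_MODEL["ENGLISH"])
```

Languages that map to the same model get the same session and lock, while different models can run inference concurrently. This sketch holds the registry lock while a session is built, which is simpler but coarser than a production implementation would likely want.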

API

load_language(lang): New public method on GeneralNER for stable warm-up without coupling to private _ensure_loaded()
ner_batch_size validation: TextCleanerConfig(ner_batch_size=0) now raises ValueError immediately instead of silently producing empty results
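
Eager validation in a frozen dataclass is usually done in __post_init__. The sketch below uses a hypothetical BatchConfig stand-in with an assumed default of 8; it is not the real TextCleanerConfig.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchConfig:
    """Hypothetical stand-in for a frozen config with one validated field."""

    ner_batch_size: int = 8  # default value is an assumption for this sketch

    def __post_init__(self):
        # Reading fields is allowed on a frozen dataclass; only assignment is blocked.
        if self.ner_batch_size < 1:
            raise ValueError(
                f"ner_batch_size must be >= 1, got {self.ner_batch_size}"
            )


try:
    BatchConfig(ner_batch_size=0)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Failing at construction time surfaces the misconfiguration where it was introduced rather than as empty results later.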

Regex

Sentence boundary: SENTENCE_BOUNDARY_PATTERN upgraded to the regex library with an abbreviation guard — "Dr. Smith", "Mr. Jones", "U.S. Army" no longer cause false splits in NER chunking
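
An abbreviation guard can be approximated with negative lookbehinds. The sketch below uses the standard-library re module with a three-entry abbreviation list, so it is a simplified stand-in for the actual SENTENCE_BOUNDARY_PATTERN, which the notes say is built on the third-party regex library.

```python
import re

# Split on whitespace that follows ., ! or ?, unless a known abbreviation precedes it.
SENTENCE_BOUNDARY = re.compile(
    r"(?<!\bDr\.)(?<!\bMr\.)(?<!\bU\.S\.)"  # abbreviation guard (illustrative list)
    r"(?<=[.!?])\s+"
)


def split_sentences(text):
    """Split text into sentences without breaking after guarded abbreviations."""
    return SENTENCE_BOUNDARY.split(text)


print(split_sentences("Dr. Smith met Mr. Jones. They joined the U.S. Army."))
```

Each lookbehind in re must be fixed-width, which is why every abbreviation gets its own guard here. A sentence that genuinely ends in a guarded abbreviation will not be split by this sketch.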

Tests

12 new tests covering all 9 changes

23 Feb 13:03

rhnfzl

v0.4.5

aaec893

v0.4.5

v0.4.5 — Multi-Backend NER, ONNX-First Architecture, Language Extensibility

Major release with architectural overhaul since v0.3.0.

Architecture

Frozen TextCleanerConfig dataclass — thread-safe, per-instance configuration replaces mutable module-level globals
Backward-compatible: old config.CHECK_* module vars still work
ONNX-first NER inference — torch-free base install (~400MB models vs ~7GB)
Thread-parallel batch processing via ThreadPoolExecutor (ONNX releases the GIL)
Migrated from setup.py to pyproject.toml (PEP 517)

Multi-Backend NER

5 backends: onnx (default), torch, gliner, ensemble_onnx, ensemble_torch
GLiNER zero-shot NER for custom entity types (PRODUCT, EVENT, SKILL, etc.)
Ensemble voting across backends for improved recall
Lazy per-language model loading — models load on first use, not all at init
Batched chunk inference — long texts split with semantic boundaries (paragraph > sentence > clause > word)
Language-specific + multilingual models with averaged confidence scores
Entity key collision fix — uses start:end separator instead of string concatenation
MISC entity handling — no longer silently dropped
ONNX-quantized models hosted on HuggingFace Hub
Configurable confidence thresholds per backend
GPU acceleration (CUDA) for both ONNX and PyTorch backends
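
The entity key collision mentioned above is easy to reproduce: concatenating offsets without a separator makes distinct spans look identical. The sketch below shows the problem and a start:end key in a small deduplication helper; the entity dict layout and the dedupe function are assumptions for illustration.

```python
def naive_key(start, end):
    """Collision-prone key: offsets concatenated with no separator."""
    return f"{start}{end}"


def span_key(start, end):
    """Unambiguous key: offsets joined with a colon."""
    return f"{start}:{end}"


# Spans (1, 23) and (12, 3) collide under naive concatenation.
print(naive_key(1, 23) == naive_key(12, 3))
print(span_key(1, 23) == span_key(12, 3))


def dedupe(entities):
    """Keep the highest-scoring entity for each distinct (start, end) span."""
    best = {}
    for ent in entities:
        key = span_key(ent["start"], ent["end"])
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    return list(best.values())
```

A tuple (start, end) would work equally well as a dictionary key; the string form is shown because the notes describe a separator-based fix.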

Language Extensibility

Add custom languages via config — no source changes needed
extra_languages: extend Lingua detection to new languages
custom_stopwords: per-language stopword sets
custom_month_names: per-language month names for date detection
ner_models: per-language ONNX model overrides
Language-keyed model dict replaces fragile positional tuple
See docs/ADDING_LANGUAGES.md for the full guide

Multilingual Date Detection

Extended DATE_REGEX with month names in 4 languages: English, Dutch, German, Spanish
Supports ISO 8601, DD/MM/YYYY, month-name-first, and day-first date formats
Optional fuzzy date matching via RapidFuzz catches misspelled months (e.g. "Janury", "Feburary")
Install with: pip install squeakycleantext[fuzzy]
Enable: TextCleanerConfig(check_fuzzy_replace_dates=True)
Language-aware vocabulary filtering with English fallback
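
To show what fuzzy month matching does, here is a sketch that uses the standard-library difflib as a stand-in for RapidFuzz, so the scoring differs from the package's. The fuzzy_month helper, the English-only month list, and the 0.8 cutoff are assumptions.

```python
import difflib

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def fuzzy_month(word, cutoff=0.8):
    """Return the closest month name for a possibly misspelled word, or None."""
    matches = difflib.get_close_matches(word.lower(), MONTHS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(fuzzy_month("Janury"))
print(fuzzy_month("Feburary"))
print(fuzzy_month("hello"))
```

The cutoff trades recall for precision: a lower value catches more typos but risks treating ordinary words as months.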

Text Processing

Configurable emoji removal (check_remove_emoji)
Configurable bracket/brace content removal (check_remove_bracket_content, check_remove_brace_content)
Smart case folding — preserves abbreviations (NATO, UNESCO) and camelCase (iPhone, eBay) while lowercasing regular words; stopword-aware
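
Smart case folding can be approximated with two simple checks per word. This sketch is an assumption about the general approach, not the package's algorithm, and it leaves out the stopword-aware part; smart_casefold is a hypothetical name.

```python
def smart_casefold(text):
    """Lowercase regular words but keep abbreviations and camelCase tokens."""

    def fold(word):
        if len(word) > 1 and word.isupper():
            return word  # all-caps abbreviation such as NATO
        if any(ch.isupper() for ch in word[1:]):
            return word  # internal capital, as in iPhone or eBay
        return word.lower()

    return " ".join(fold(w) for w in text.split())


print(smart_casefold("The NATO Summit praised the iPhone and eBay"))
```

Splitting on whitespace also collapses repeated spaces, which is acceptable for a sketch but may not match the real pipeline's whitespace handling.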

Dependencies

stop-words replaces NLTK (50KB bundled vs 30MB download, no post-install step)
Stopwords stored as sets for O(1) lookup (was lists with O(n))
PyTorch/Transformers moved to optional [torch] extra
New optional extras: [gpu], [fuzzy], [gliner], [gliner2], [all-ner]

Code Quality

Python 3.11–3.13 support
ruff linter (replaces flake8)
79+ tests with Hypothesis property-based testing and Faker
Thread-safety verified with concurrent NER tests
Pre-compiled punctuation and symbol regexes
CI with coverage reporting

Bug Fixes

Fixed ReDoS vulnerability in ACRONYM_REGEX
Fixed Unicode escape bug in URL_REGEX and EMAIL_REGEX
Fixed ISOLATED_LETTERS_REGEX matching commas/spaces
Fixed process() so it always returns a consistent 3-tuple
Fixed thread-safety (no instance mutation during processing)
Fixed entity key collision in NER deduplication

Full Changelog: v0.3.0...v0.4.5
