Releases: rhnfzl/SqueakyCleanText

v0.6.1: Bug fixes, functional tests, GLiNER ONNX fallback

28 Feb 21:18
e23b104

Bug Fixes

  • GLiClass API fix: Document classification (check_classify_document=True) was completely broken. The call now passes labels= instead of candidate_labels=, and return-format parsing was updated to list[list[dict]]
  • Presidio GLiNER fix: The presidio_gliner backend was completely broken because upstream presidio-analyzer renamed model_path= to model_name=. The backend now uses the new name and passes threshold= through
  • GLiNER ONNX fallback: gliner_onnx=True no longer crashes with FileNotFoundError when model.onnx is missing (GLiNER issue #314) — gracefully falls back to PyTorch with a warning
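
The fallback described in the last item can be sketched roughly as follows. This is a minimal illustration of the try-ONNX-then-PyTorch pattern, not the project's actual code: load_onnx_model, load_torch_model, and load_model are hypothetical stand-ins and do not call GLiNER.

```python
import warnings


def load_onnx_model(model_dir: str) -> str:
    """Hypothetical ONNX loader: fails the way a missing model.onnx would."""
    raise FileNotFoundError(f"{model_dir}/model.onnx")


def load_torch_model(model_dir: str) -> str:
    """Hypothetical PyTorch loader that always succeeds."""
    return f"torch:{model_dir}"


def load_model(model_dir: str, prefer_onnx: bool = True) -> str:
    """Try ONNX first; fall back to PyTorch with a warning if the file is missing."""
    if prefer_onnx:
        try:
            return load_onnx_model(model_dir)
        except FileNotFoundError as exc:
            warnings.warn(f"ONNX weights not found ({exc}); falling back to PyTorch")
    return load_torch_model(model_dir)
```

Catching only FileNotFoundError keeps unrelated load failures visible instead of silently masking them.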

New

  • 69 end-to-end functional tests (tests/test_functional.py) covering 100% of the feature surface with real model inference across 13 categories, including core pipeline, language detection, NER backends (ONNX, Torch, GLiNER, ensemble, Presidio), replacement modes, document classification, synthetic replacement, PII mode, and config edge cases
  • Shared test infrastructure (tests/conftest.py) — DRY test constants

Code Quality

  • Simplified code: hoisted loop invariants, flattened branches, consolidated regex constants to constants.py
  • Uniform AnonymizeResult return type across all anonymization paths
  • Removed dead code and redundant decorators
  • Side-effect-free package detection in tests (importlib.util.find_spec)
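
As a rough illustration of the last item, importlib.util.find_spec can report whether a top-level package is importable without executing it. The helper name is_installed is an assumption made for this sketch, not a name from the test suite.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level `package` can be found, without importing it."""
    return importlib.util.find_spec(package) is not None
```

A test could then use it in a skip condition, for example skipping GLiNER tests when is_installed("gliner") is False. Note that for dotted names find_spec imports the parent package, so the side-effect-free property only holds for top-level names.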

Test Results

Suite                            Passed  Failed
Unit tests                       194     0
Functional tests (real models)   69      0
Total                            263     0

Full Changelog: v0.6.0...v0.6.1

v0.6.0 — GLiNER modernization, reversible anonymization, language ISO codes

28 Feb 18:01

What's New in v0.6.0

Major Features

  • Reversible anonymization (replacement_mode='reversible'): indexed placeholders <PERSON_0> with AnonymizationMap for round-trip deanonymization
  • Synthetic replacement (replacement_mode='synthetic'): Faker-generated realistic fake values for entities
  • PII mode (ner_mode='pii'): auto-configures GLiNER with 60+ entity labels for comprehensive PII detection
  • Document classification (check_classify_document=True): GLiClass zero-shot pre-classification before text processing
  • ProcessResult: backward-compatible 3-tuple unpacking + .metadata dict for reversible maps and classification results
  • GLiNER ONNX (gliner_onnx=True): loads GLiNER with pre-built ONNX weights from HuggingFace Hub
  • Bi-encoder support: auto-detects ModernBERT, caches label embeddings, dynamic context windows
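
The reversible mode listed above can be pictured with a small sketch of indexed placeholders and a round-trip map. This is an illustration under stated assumptions, not the library's AnonymizationMap: the functions anonymize and deanonymize are hypothetical, entity spans are supplied by hand and assumed not to overlap, and every occurrence gets a fresh index.

```python
import re
from collections import defaultdict


def anonymize(text: str, entities: list[tuple[int, int, str]]) -> tuple[str, dict[str, str]]:
    """Replace each (start, end, label) span with an indexed placeholder."""
    counters: dict[str, int] = defaultdict(int)
    mapping: dict[str, str] = {}
    parts: list[str] = []
    cursor = 0
    for start, end, label in sorted(entities):
        placeholder = f"<{label}_{counters[label]}>"
        counters[label] += 1
        mapping[placeholder] = text[start:end]
        parts.append(text[cursor:start])
        parts.append(placeholder)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore original values; unknown placeholders are left untouched."""
    return re.sub(r"<[A-Z]+_\d+>", lambda m: mapping.get(m.group(0), m.group(0)), text)
```

Keeping the map separate from the anonymized text is what makes the round trip possible: the text can be shared while the map stays private.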

Language Parameter Enhancement

  • ISO code acceptance: language='en' (ISO 639-1), language='eng' (ISO 639-3), or language='ENGLISH' — all resolve to canonical Lingua name
  • Tuple-of-languages: language=('en', 'nl') restricts auto-detection to listed languages (vs. pinning to one)
  • extra_languages also accepts ISO codes: extra_languages=('fr', 'pt') resolves correctly
  • All 75 Lingua languages supported via ISO code lookup
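
One way to picture the ISO code lookup described above is a single normalised table from codes and names to a canonical name. The table below is a tiny hand-written sample and the function names are hypothetical; the library's own lookup covers all Lingua languages and may be built differently.

```python
# Illustrative sample: canonical name -> (ISO 639-1, ISO 639-3).
_LANGUAGES = {
    "ENGLISH": ("en", "eng"),
    "DUTCH": ("nl", "nld"),
    "FRENCH": ("fr", "fra"),
    "PORTUGUESE": ("pt", "por"),
}

# Flatten into one case-insensitive lookup keyed by code or name.
_LOOKUP = {}
for _name, _codes in _LANGUAGES.items():
    _LOOKUP[_name.lower()] = _name
    for _code in _codes:
        _LOOKUP[_code] = _name


def resolve_language(value: str) -> str:
    """Map 'en', 'eng', or 'ENGLISH' to the canonical name 'ENGLISH'."""
    key = value.strip().lower()
    try:
        return _LOOKUP[key]
    except KeyError:
        raise ValueError(f"Unknown language: {value!r}") from None


def resolve_languages(value) -> tuple:
    """Accept a single language or a tuple of languages."""
    if isinstance(value, str):
        return (resolve_language(value),)
    return tuple(resolve_language(v) for v in value)
```

Resolving everything to one canonical form up front means the rest of the pipeline only ever has to compare canonical names.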

Improvements

  • Thread-safe TextCleanerConfig frozen dataclass with per-instance configuration
  • Parallel process_batch() via ThreadPoolExecutor (ONNX releases the GIL)
  • Async aprocess_batch() for FastAPI/aiohttp integration
  • warmup() method for pre-loading NER models at startup
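
The combination of an immutable config and thread-parallel batching can be sketched as below. CleanerConfig and Cleaner are hypothetical stand-ins for illustration and are not the library's TextCleanerConfig or its cleaner class. The toy process() is pure Python, so it does not release the GIL the way ONNX inference is said to.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class CleanerConfig:
    """Hypothetical immutable, per-instance configuration."""
    lowercase: bool = True
    max_workers: int = 4


class Cleaner:
    def __init__(self, config: CleanerConfig) -> None:
        self.config = config

    def process(self, text: str) -> str:
        """Collapse whitespace and optionally lowercase; never mutates self."""
        cleaned = " ".join(text.split())
        return cleaned.lower() if self.config.lowercase else cleaned

    def process_batch(self, texts: list[str]) -> list[str]:
        # Executor.map returns results in input order.
        with ThreadPoolExecutor(max_workers=self.config.max_workers) as pool:
            return list(pool.map(self.process, texts))
```

Because the config is frozen and process() only reads from it, worker threads can share one Cleaner without locking.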

Full Changelog: v0.5.2...v0.6.0

v0.5.2

28 Feb 09:08

What's changed

  • Consolidated README: Combined "What's New" v0.5.0 and v0.4.5 sections into a single concise summary with only major user-facing features
  • Fixed llms.txt link: Changed from relative path (./llms.txt) to absolute GitHub URL so it works on both GitHub and PyPI
  • Added llms.txt to PyPI sidebar: Now appears as a clickable link in PyPI project URLs
  • Single-source version: Version is now defined only in sct/__init__.py and read dynamically by pyproject.toml via setuptools.dynamic - no more dual maintenance
  • Removed em-dashes: Replaced all em-dash characters with standard punctuation throughout README

Full Changelog: v0.5.1...v0.5.2

v0.5.1

23 Feb 22:49

What's New

Bug Fixes

  • asyncio: Replace deprecated get_event_loop() with get_running_loop() — fixes RuntimeError in Python 3.12+ when called from FastAPI/aiohttp
  • Quantize cache path: ONNX quantized models now write to ~/.cache/sct_quantized/ instead of the read-only HuggingFace Hub cache directory
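
The asyncio fix above relies on asyncio.get_running_loop(), which returns the loop driving the current coroutine and raises RuntimeError if none is running. A minimal sketch of offloading blocking work from a coroutine follows; clean and aclean_batch are hypothetical names for this example, not the library's aprocess_batch.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def clean(text: str) -> str:
    """Placeholder for a blocking cleaning call."""
    return " ".join(text.split())


async def aclean_batch(texts: list[str]) -> list[str]:
    # Inside a coroutine there is always a running loop to fetch.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        futures = [loop.run_in_executor(pool, clean, t) for t in texts]
        # gather preserves the order of the submitted futures.
        return await asyncio.gather(*futures)
```

From synchronous code the coroutine can be driven with asyncio.run(aclean_batch([...])); inside a web framework handler it would simply be awaited.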

Performance

  • ONNX session sharing: Languages sharing the same model (e.g. FR/PT/IT all use wikineural-multilingual-ner-onnx) now reuse a single session — saves ~600 MB RAM
  • Finer-grained inference locks: Per-model locks replace the coarse per-language lock, allowing true concurrent inference across different models
  • Lock-free tokenization: split_text() no longer holds the inference lock during chunking (HF fast tokenizer is thread-safe)
  • CJK/Arabic chunk safety: _simple_chunk() uses 2 chars/token (down from 4) to prevent context-window overflow for CJK and Arabic text
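
Session sharing and per-model locks, as described in the first two items, might look roughly like this. SessionCache is a hypothetical class, the loader is any callable supplied by the caller, and "english-ner-onnx" is a made-up model name used only to show a second model.

```python
import threading


class SessionCache:
    """Share one session per model name and guard each with its own lock."""

    def __init__(self, loader) -> None:
        self._loader = loader  # callable: model name -> session object
        self._sessions: dict[str, object] = {}
        self._locks: dict[str, threading.Lock] = {}
        self._registry_lock = threading.Lock()

    def get(self, model_name: str):
        # The registry lock only protects lookup and first-time creation.
        with self._registry_lock:
            if model_name not in self._sessions:
                self._sessions[model_name] = self._loader(model_name)
                self._locks[model_name] = threading.Lock()
            return self._sessions[model_name], self._locks[model_name]


# Illustrative language -> model table; FR, PT and IT share one model.
MODEL_FOR_LANGUAGE = {
    "FR": "wikineural-multilingual-ner-onnx",
    "PT": "wikineural-multilingual-ner-onnx",
    "IT": "wikineural-multilingual-ner-onnx",
    "EN": "english-ner-onnx",  # hypothetical name
}
```

A caller would take the returned lock around inference only, so requests for different models never wait on each other. In this simplified version, loading a model holds the registry lock, which a production implementation might want to avoid.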

API

  • load_language(lang): New public method on GeneralNER for stable warm-up without coupling to private _ensure_loaded()
  • ner_batch_size validation: TextCleanerConfig(ner_batch_size=0) now raises ValueError immediately instead of silently producing empty results
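
Fail-fast validation on a frozen dataclass is commonly done in __post_init__. The sketch below assumes a hypothetical BatchConfig with a default of 8; neither the class nor the default comes from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchConfig:
    """Hypothetical config that rejects invalid batch sizes at construction."""
    ner_batch_size: int = 8  # assumed default, for illustration only

    def __post_init__(self) -> None:
        if self.ner_batch_size < 1:
            raise ValueError(
                f"ner_batch_size must be >= 1, got {self.ner_batch_size}"
            )
```

Raising at construction time surfaces the mistake where the config is written, rather than later as an unexplained empty result.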

Regex

  • Sentence boundary: SENTENCE_BOUNDARY_PATTERN upgraded to the regex library with an abbreviation guard — "Dr. Smith", "Mr. Jones", "U.S. Army" no longer cause false splits in NER chunking
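
The idea of an abbreviation guard can be illustrated with the standard-library re module. This is a swapped-in technique, not the project's SENTENCE_BOUNDARY_PATTERN (which the notes say uses the regex library): candidate boundaries are found first and then rejected when the preceding word is a known abbreviation. The abbreviation list is a small sample chosen for this example.

```python
import re

# Small illustrative list; stored lowercase and without the trailing period.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "u.s"}

# Candidate boundary: sentence-ending punctuation followed by whitespace.
_CANDIDATE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    sentences: list[str] = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        before = text[start:match.start()]
        words = before.split()
        last_word = words[-1] if words else ""
        # Skip the split when the period belongs to a known abbreviation.
        if last_word.endswith(".") and last_word[:-1].lower() in ABBREVIATIONS:
            continue
        sentences.append(before)
        start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences
```

A known limitation of this sketch is that a sentence which genuinely ends in an abbreviation such as "U.S." will not be split at that point.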

Tests

  • 12 new tests covering all 9 changes

v0.4.5

23 Feb 13:03

v0.4.5 — Multi-Backend NER, ONNX-First Architecture, Language Extensibility

A major release with an architectural overhaul since v0.3.0.

Architecture

  • Frozen TextCleanerConfig dataclass — thread-safe, per-instance configuration replaces mutable module-level globals
  • Backward-compatible: old config.CHECK_* module vars still work
  • ONNX-first NER inference — torch-free base install (~400MB models vs ~7GB)
  • Thread-parallel batch processing via ThreadPoolExecutor (ONNX releases the GIL)
  • Migrated from setup.py to pyproject.toml (PEP 517)

Multi-Backend NER

  • 5 backends: onnx (default), torch, gliner, ensemble_onnx, ensemble_torch
  • GLiNER zero-shot NER for custom entity types (PRODUCT, EVENT, SKILL, etc.)
  • Ensemble voting across backends for improved recall
  • Lazy per-language model loading — models load on first use, not all at init
  • Batched chunk inference — long texts split with semantic boundaries (paragraph > sentence > clause > word)
  • Language-specific + multilingual models with averaged confidence scores
  • Entity key collision fix — uses start:end separator instead of string concatenation
  • MISC entity handling — no longer silently dropped
  • ONNX-quantized models hosted on HuggingFace Hub
  • Configurable confidence thresholds per backend
  • GPU acceleration (CUDA) for both ONNX and PyTorch backends
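
Two items in the list above, the start:end key and averaged confidence scores, lend themselves to a short sketch. The functions here are hypothetical and only show why a separator matters and how duplicate spans from several models could be merged; the library's own deduplication may differ.

```python
from collections import defaultdict


def naive_key(start: int, end: int) -> str:
    """Plain concatenation is ambiguous: (1, 112) and (11, 12) both give '1112'."""
    return f"{start}{end}"


def span_key(start: int, end: int) -> str:
    """A separator keeps the two offsets distinct."""
    return f"{start}:{end}"


def merge_entities(predictions: list[dict]) -> list[dict]:
    """Group predictions by span and label, then average their scores."""
    grouped: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for pred in predictions:
        key = (span_key(pred["start"], pred["end"]), pred["label"])
        grouped[key].append(pred)
    merged = []
    for group in grouped.values():
        entity = dict(group[0])
        entity["score"] = sum(p["score"] for p in group) / len(group)
        merged.append(entity)
    return merged
```

Each prediction is assumed to be a dict with start, end, label, and score keys, which is a convention picked for this example.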

Language Extensibility

  • Add custom languages via config — no source changes needed
  • extra_languages: extend Lingua detection to new languages
  • custom_stopwords: per-language stopword sets
  • custom_month_names: per-language month names for date detection
  • ner_models: per-language ONNX model overrides
  • Language-keyed model dict replaces fragile positional tuple
  • See docs/ADDING_LANGUAGES.md for the full guide

Multilingual Date Detection

  • Extended DATE_REGEX with month names in 4 languages: English, Dutch, German, Spanish
  • Supports ISO 8601, DD/MM/YYYY, month-name-first, and day-first date formats
  • Optional fuzzy date matching via RapidFuzz catches misspelled months (e.g. "Janury", "Feburary")
  • Install with: pip install squeakycleantext[fuzzy]
  • Enable: TextCleanerConfig(check_fuzzy_replace_dates=True)
  • Language-aware vocabulary filtering with English fallback
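
Fuzzy month matching can be approximated with the standard library. The sketch below uses difflib in place of RapidFuzz, so scores and thresholds will not match the library's behaviour; fuzzy_month and the 0.8 cutoff are assumptions made for this example.

```python
import difflib
from typing import Optional

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]


def fuzzy_month(word: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest English month name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), MONTHS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A fairly high cutoff helps avoid treating ordinary words as months, which matters most for short names such as "may".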

Text Processing

  • Configurable emoji removal (check_remove_emoji)
  • Configurable bracket/brace content removal (check_remove_bracket_content, check_remove_brace_content)
  • Smart case folding — preserves abbreviations (NATO, UNESCO) and camelCase (iPhone, eBay) while lowercasing regular words; stopword-aware
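
The smart case folding rule can be sketched with simple token checks. This is a simplified illustration rather than the library's implementation: smart_casefold is a hypothetical function, it ignores the stopword-aware part mentioned above, and it collapses runs of whitespace.

```python
def smart_casefold(text: str) -> str:
    """Lowercase ordinary words but keep acronyms and camelCase tokens."""
    folded = []
    for token in text.split():
        core = token.strip(".,;:!?\"'()")
        # Acronym: more than one character and fully uppercase (NATO, UNESCO).
        is_acronym = len(core) > 1 and core.isupper()
        # camelCase here: starts lowercase and has an uppercase letter later (iPhone).
        is_camel = core[:1].islower() and any(c.isupper() for c in core[1:])
        folded.append(token if (is_acronym or is_camel) else token.lower())
    return " ".join(folded)
```

For example, smart_casefold("The NATO Summit praised the iPhone and eBay.") keeps NATO, iPhone, and eBay intact while lowercasing the remaining words.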

Dependencies

  • stop-words replaces NLTK (50KB bundled vs 30MB download, no post-install step)
  • Stopwords stored as sets for O(1) lookup (previously lists with O(n) lookup)
  • PyTorch/Transformers moved to optional [torch] extra
  • New optional extras: [gpu], [fuzzy], [gliner], [gliner2], [all-ner]

Code Quality

  • Python 3.11–3.13 support
  • ruff linter (replaces flake8)
  • 79+ tests with Hypothesis property-based testing and Faker
  • Thread-safety verified with concurrent NER tests
  • Pre-compiled punctuation and symbol regexes
  • CI with coverage reporting

Bug Fixes

  • Fixed ReDoS vulnerability in ACRONYM_REGEX
  • Fixed Unicode escape bug in URL_REGEX and EMAIL_REGEX
  • Fixed ISOLATED_LETTERS_REGEX matching commas/spaces
  • Fixed process() so it always returns a consistent 3-tuple
  • Fixed thread-safety (no instance mutation during processing)
  • Fixed entity key collision in NER deduplication

Full Changelog: v0.3.0...v0.4.5