Commit def20dc

Bump version to 0.5.2: consolidate README, fix llms.txt link, single-source version
- Combine What's New v0.5.0 and v0.4.5 into single concise section
- Replace all em-dashes with context-appropriate alternatives
- Fix llms.txt link to absolute GitHub URL (was broken on PyPI)
- Add llms.txt to pyproject.toml project URLs for PyPI sidebar
- Single-source version via setuptools dynamic version from sct.__version__
1 parent cc3a773 commit def20dc

3 files changed

Lines changed: 30 additions & 65 deletions

README.md

Lines changed: 24 additions & 63 deletions
@@ -11,11 +11,11 @@
 A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.
 </div>
 
-> **Using an AI coding assistant?** This repo includes an [`llms.txt`](./llms.txt) with the full API surface, config reference, and Q&A optimised for Claude, Cursor, Copilot, and ChatGPT.
+> **Using an AI coding assistant?** This repo includes an [`llms.txt`](https://github.com/rhnfzl/SqueakyCleanText/blob/main/llms.txt) with the full API surface, config reference, and Q&A - optimised for Claude, Cursor, Copilot, and ChatGPT.
 
 In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.
 
-SqueakyCleanText simplifies the process by automatically addressing common text issues removing PII, anonymizing named entities (persons, organisations, locations), and ensuring your data is clean and well-structured for language models and classical ML pipelines with minimal effort on your part.
+SqueakyCleanText simplifies the process by automatically addressing common text issues - removing PII, anonymizing named entities (persons, organisations, locations), and ensuring your data is clean and well-structured for language models and classical ML pipelines with minimal effort on your part.
 
 ### Key Features
 
@@ -88,7 +88,7 @@ SqueakyCleanText simplifies the process by automatically addressing common text
 pip install SqueakyCleanText
 ```
 
-The base install uses **ONNX Runtime** for NER inference no PyTorch or Transformers required.
+The base install uses **ONNX Runtime** for NER inference - no PyTorch or Transformers required.
 
 ### Optional Extras
 
@@ -163,7 +163,7 @@ cfg = TextCleanerConfig(
     gliner_labels=('person', 'organization', 'location', 'product', 'event'),
     gliner_label_map={
         'person': 'PER', 'organization': 'ORG', 'location': 'LOC',
-        # 'product' and 'event' are unmapped they become <PRODUCT>, <EVENT> tokens
+        # 'product' and 'event' are unmapped - they become <PRODUCT>, <EVENT> tokens
     },
     gliner_threshold=0.4,
 )
@@ -253,7 +253,7 @@ SqueakyCleanText supports five NER backends, selectable via the `ner_backend` co
 
 | Backend | Description | Dependencies | Best for |
 |---------|-------------|-------------|----------|
-| `onnx` (default) | ONNX Runtime inference with quantized XLM-RoBERTa models | Base install | Production fast, torch-free |
+| `onnx` (default) | ONNX Runtime inference with quantized XLM-RoBERTa models | Base install | Production: fast, torch-free |
 | `torch` | PyTorch/Transformers pipeline with full XLM-RoBERTa models | `[torch]` extra | Compatibility with existing PyTorch workflows |
 | `gliner` | GLiNER zero-shot NER with custom entity labels | `[gliner]` or `[gliner2]` extra | Custom entity types (PRODUCT, SKILL, EVENT, etc.) |
 | `ensemble_onnx` | ONNX + GLiNER ensemble voting | `[gliner]` extra | Maximum recall with custom entities |
@@ -438,69 +438,30 @@ Input Text
 (lm_text, stat_text, language)
 ```
 
-Each step is toggled by a `TextCleanerConfig` field. The pipeline is built once at initialization disabled steps are skipped entirely (zero overhead).
+Each step is toggled by a `TextCleanerConfig` field. The pipeline is built once at initialization; disabled steps are skipped entirely (zero overhead).
 
-## What's New in v0.5.0
+## What's New
 
-Quality, performance, and API improvements:
+**v0.5.x**
+- `aprocess_batch()`: async batch processing for FastAPI / aiohttp integrations
+- `warmup(languages)`: pre-load NER models at startup to eliminate first-request latency
+- `custom_pipeline_steps`: attach arbitrary `(text: str) -> str` callables after the built-in pipeline
+- French, Portuguese, and Italian NER support via a shared multilingual ONNX session
+- Improved NER sentence boundary detection with abbreviation guard
 
-**Async & API**
-- `aprocess_batch()` — async batch processing for FastAPI / aiohttp (uses `get_running_loop`, Python 3.12+ compatible)
-- `warmup(languages)` — public method to pre-load NER models at startup
-- `custom_pipeline_steps` config field — plug in arbitrary `(text: str) -> str` callables after the built-in pipeline
-
-**Language Support**
-- French, Portuguese, Italian now supported out of the box via the multilingual ONNX model
-- ONNX sessions are shared across same-model languages (FR/PT/IT → one session, ~600 MB saved)
-
-**Performance & Thread Safety**
-- Per-model inference locks replace the coarse per-language lock — true concurrent inference across different language models
-- `split_text()` is now lock-free (HF fast tokenizer is thread-safe)
-- Conservative `chars/token` ratio (2×) in `_simple_chunk` prevents context-window overflow for CJK and Arabic texts
-
-**Correctness**
-- `SENTENCE_BOUNDARY_PATTERN` upgraded to the `regex` library with an abbreviation guard — "Dr. Smith", "Mr. Jones", "U.S. Army" no longer cause false splits during NER chunking
-- `ner_batch_size=0` and `ner_batch_size=-1` now raise `ValueError` immediately instead of silently producing empty results
-- Quantized ONNX models are cached to `~/.cache/sct_quantized/` instead of the read-only HuggingFace Hub cache directory
-
----- 
-
-## What's New in v0.4.5
-
-Major release with architectural overhaul since v0.3.0:
-
-**Architecture**
-- Frozen `TextCleanerConfig` dataclass replaces global mutable config (thread-safe, per-instance)
-- ONNX-first NER inference — torch-free base install (~400MB models vs ~7GB)
-- Thread-parallel batch processing via `ThreadPoolExecutor` (ONNX releases the GIL)
-
-**NER**
-- 5 backends: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch`
+**v0.4.5**
+- Frozen `TextCleanerConfig` dataclass: immutable, thread-safe, per-instance configuration
+- ONNX-first NER inference: torch-free base install (~400 MB models vs ~7 GB)
+- Thread-parallel batch processing via `ThreadPoolExecutor`
+- Five NER backends: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch`
 - GLiNER zero-shot NER for custom entity types (PRODUCT, EVENT, SKILL, etc.)
 - Ensemble voting across backends for improved recall
-- Lazy per-language model loading (only loads models when needed)
-- Language-keyed model dict replaces fragile positional tuple
-- ONNX-quantized models hosted on [HuggingFace Hub](https://huggingface.co/rhnfzl)
-
-**Text Processing**
-- Multilingual date detection (ISO 8601, European formats, month names in EN/NL/DE/ES)
-- Fuzzy date matching for misspelled months (via rapidfuzz, empirically calibrated threshold)
-- Configurable emoji removal
-- Configurable bracket/brace content removal
-- Smart case folding (preserves NER replacement tokens)
-- Custom stopwords and month names per language
-
-**Dependencies**
-- `stop-words` package replaces NLTK (50KB bundled vs 30MB download)
-- PyTorch/Transformers moved to optional `[torch]` extra
-- New optional extras: `[gpu]`, `[fuzzy]`, `[gliner]`, `[gliner2]`, `[all-ner]`
-- Migrated from `setup.py` to `pyproject.toml` (PEP 517)
-
-**Quality**
-- Python 3.11–3.13 support
-- `ruff` linter (replaces flake8)
-- hypothesis-based property testing with pytest-timeout
-- Collision-safe NER entity keys
+- Lazy per-language model loading
+- Multilingual date detection and fuzzy date matching
+- Configurable emoji removal, bracket/brace content removal, and smart case folding
+- `stop-words` replaces NLTK (50 KB bundled vs 30 MB download)
+- PyTorch and Transformers moved to optional extras
+- Migrated to `pyproject.toml` (PEP 517), Python 3.11-3.13, ruff linter
 
 ## Contributing
 
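
The `custom_pipeline_steps` and `aprocess_batch()` entries in the README hunk above can be pictured with a small standalone sketch. It does not call SqueakyCleanText: `apply_steps`, `process_batch_sketch`, `aprocess_batch_sketch`, and `collapse_spaces` are hypothetical names, and the library's real implementation may differ. The only details taken from the diff are that custom steps are `(text: str) -> str` callables run after the built-in pipeline, and that the async variant uses `get_running_loop`.

```python
import asyncio
from typing import Callable, List, Sequence

# A custom step is any callable that takes text and returns text.
Step = Callable[[str], str]


def apply_steps(text: str, steps: Sequence[Step]) -> str:
    """Run each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


def process_batch_sketch(texts: Sequence[str], steps: Sequence[Step]) -> List[str]:
    """Synchronous stand-in for a batch call (hypothetical, not the sct API)."""
    return [apply_steps(text, steps) for text in texts]


async def aprocess_batch_sketch(texts: Sequence[str], steps: Sequence[Step]) -> List[str]:
    """Hand the blocking batch call to the default thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, process_batch_sketch, texts, steps)


def collapse_spaces(text: str) -> str:
    """Example custom step: squeeze runs of whitespace into single spaces."""
    return " ".join(text.split())


if __name__ == "__main__":
    steps = [collapse_spaces, str.lower]
    print(asyncio.run(aprocess_batch_sketch(["Hello   World", " A  B "], steps)))
```

Running the blocking work in an executor is one common way to expose a synchronous pipeline to FastAPI or aiohttp handlers; whether the library does exactly this is an assumption.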

pyproject.toml

Lines changed: 5 additions & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "SqueakyCleanText"
-version = "0.5.1"
+dynamic = ["version"]
 description = "Text preprocessing & PII anonymization pipeline for NLP/ML: ONNX NER ensemble, language detection, stopword removal, and configurable token replacement."
 readme = "README.md"
 license = {text = "MIT"}
@@ -51,6 +51,7 @@ Homepage = "https://github.com/rhnfzl/SqueakyCleanText"
 Repository = "https://github.com/rhnfzl/SqueakyCleanText"
 Issues = "https://github.com/rhnfzl/SqueakyCleanText/issues"
 Changelog = "https://github.com/rhnfzl/SqueakyCleanText/releases"
+"llms.txt" = "https://github.com/rhnfzl/SqueakyCleanText/blob/main/llms.txt"
 
 [project.optional-dependencies]
 gpu = [
@@ -88,6 +89,9 @@ test = [
     "pytest-cov>=4.1",
 ]
 
+[tool.setuptools.dynamic]
+version = {attr = "sct.__version__"}
+
 [tool.setuptools.packages.find]
 include = ["sct*"]
 exclude = ["tests*"]
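
With `dynamic = ["version"]` and `version = {attr = "sct.__version__"}`, the version string lives only in `sct/__init__.py`. As far as the setuptools documentation describes it, an `attr` directive is first resolved by statically analysing the module source, with an import as fallback. The sketch below imitates that static lookup in plain Python to show why a simple literal assignment is enough; `read_static_attr` is a hypothetical helper, not a setuptools function, and `INIT_SOURCE` is a copy of the file contents shown in this commit.

```python
import ast


def read_static_attr(source: str, name: str = "__version__") -> object:
    """Return the literal assigned to `name` at module level without importing anything."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    # literal_eval raises ValueError if the value is not a plain literal
                    return ast.literal_eval(node.value)
    raise LookupError(f"{name} is not assigned at module level")


# Contents of sct/__init__.py after this commit, as shown in the diff.
INIT_SOURCE = '''
from sct.config import TextCleanerConfig
from sct.sct import TextCleaner

__version__ = "0.5.2"
__all__ = ["TextCleaner", "TextCleanerConfig"]
'''

if __name__ == "__main__":
    print(read_static_attr(INIT_SOURCE))
```

Because the lookup never executes the imports at the top of the module, a literal `__version__` can in principle be read at build time even when the package's runtime dependencies are not installed.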

sct/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -3,5 +3,5 @@
 from sct.config import TextCleanerConfig
 from sct.sct import TextCleaner
 
-__version__ = "0.5.1"
+__version__ = "0.5.2"
 __all__ = ["TextCleaner", "TextCleanerConfig"]
