|
11 | 11 | A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks. |
12 | 12 | </div> |
13 | 13 |
|
14 | | -> **Using an AI coding assistant?** This repo includes an [`llms.txt`](./llms.txt) with the full API surface, config reference, and Q&A — optimised for Claude, Cursor, Copilot, and ChatGPT. |
| 14 | +> **Using an AI coding assistant?** This repo includes an [`llms.txt`](https://github.com/rhnfzl/SqueakyCleanText/blob/main/llms.txt) with the full API surface, config reference, and Q&A - optimised for Claude, Cursor, Copilot, and ChatGPT. |
15 | 15 |
|
16 | 16 | In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models. |
17 | 17 |
|
18 | | -SqueakyCleanText simplifies the process by automatically addressing common text issues — removing PII, anonymizing named entities (persons, organisations, locations), and ensuring your data is clean and well-structured for language models and classical ML pipelines with minimal effort on your part. |
| 18 | +SqueakyCleanText simplifies the process by automatically addressing common text issues - removing PII, anonymizing named entities (persons, organisations, locations), and ensuring your data is clean and well-structured for language models and classical ML pipelines with minimal effort on your part. |
19 | 19 |
|
20 | 20 | ### Key Features |
21 | 21 |
|
@@ -88,7 +88,7 @@ SqueakyCleanText simplifies the process by automatically addressing common text |
88 | 88 | pip install SqueakyCleanText |
89 | 89 | ``` |
90 | 90 |
|
91 | | -The base install uses **ONNX Runtime** for NER inference — no PyTorch or Transformers required. |
| 91 | +The base install uses **ONNX Runtime** for NER inference - no PyTorch or Transformers required. |
92 | 92 |
|
93 | 93 | ### Optional Extras |
94 | 94 |
|
@@ -163,7 +163,7 @@ cfg = TextCleanerConfig( |
163 | 163 | gliner_labels=('person', 'organization', 'location', 'product', 'event'), |
164 | 164 | gliner_label_map={ |
165 | 165 | 'person': 'PER', 'organization': 'ORG', 'location': 'LOC', |
166 | | - # 'product' and 'event' are unmapped — they become <PRODUCT>, <EVENT> tokens |
| 166 | + # 'product' and 'event' are unmapped - they become <PRODUCT>, <EVENT> tokens |
167 | 167 | }, |
168 | 168 | gliner_threshold=0.4, |
169 | 169 | ) |
@@ -253,7 +253,7 @@ SqueakyCleanText supports five NER backends, selectable via the `ner_backend` co |
253 | 253 |
|
254 | 254 | | Backend | Description | Dependencies | Best for | |
255 | 255 | |---------|-------------|-------------|----------| |
256 | | -| `onnx` (default) | ONNX Runtime inference with quantized XLM-RoBERTa models | Base install | Production — fast, torch-free | |
| 256 | +| `onnx` (default) | ONNX Runtime inference with quantized XLM-RoBERTa models | Base install | Production: fast, torch-free | |
257 | 257 | | `torch` | PyTorch/Transformers pipeline with full XLM-RoBERTa models | `[torch]` extra | Compatibility with existing PyTorch workflows | |
258 | 258 | | `gliner` | GLiNER zero-shot NER with custom entity labels | `[gliner]` or `[gliner2]` extra | Custom entity types (PRODUCT, SKILL, EVENT, etc.) | |
259 | 259 | | `ensemble_onnx` | ONNX + GLiNER ensemble voting | `[gliner]` extra | Maximum recall with custom entities | |
@@ -438,69 +438,30 @@ Input Text |
438 | 438 | (lm_text, stat_text, language) |
439 | 439 | ``` |
440 | 440 |
|
441 | | -Each step is toggled by a `TextCleanerConfig` field. The pipeline is built once at initialization — disabled steps are skipped entirely (zero overhead). |
| 441 | +Each step is toggled by a `TextCleanerConfig` field. The pipeline is built once at initialization; disabled steps are skipped entirely (zero overhead). |
442 | 442 |
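The "built once, disabled steps skipped" behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation: `SketchConfig`, `SketchCleaner`, and the three step functions are hypothetical names invented for illustration, and the real `TextCleanerConfig` fields may differ.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

Step = Callable[[str], str]


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical toggles, not the real TextCleanerConfig fields.
    strip_html: bool = True
    collapse_whitespace: bool = True
    lowercase: bool = False


def _strip_html(text: str) -> str:
    # Replace anything that looks like a tag with a space.
    return re.sub(r"<[^>]+>", " ", text)


def _collapse_whitespace(text: str) -> str:
    # Squeeze runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def _lowercase(text: str) -> str:
    return text.lower()


class SketchCleaner:
    def __init__(self, cfg: SketchConfig) -> None:
        # The toggles are read once here; disabled steps never enter the list,
        # so process() performs no per-call flag checks.
        candidates = [
            (cfg.strip_html, _strip_html),
            (cfg.collapse_whitespace, _collapse_whitespace),
            (cfg.lowercase, _lowercase),
        ]
        self._steps: List[Step] = [fn for enabled, fn in candidates if enabled]

    def process(self, text: str) -> str:
        for step in self._steps:
            text = step(text)
        return text
```

With this shape, the cost of a disabled step is paid only at construction time, which is one plausible reading of "zero overhead" in the sentence above.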
|
443 | | -## What's New in v0.5.0 |
| 443 | +## What's New |
444 | 444 |
|
445 | | -Quality, performance, and API improvements: |
| 445 | +**v0.5.x** |
| 446 | +- `aprocess_batch()`: async batch processing for FastAPI / aiohttp integrations |
| 447 | +- `warmup(languages)`: pre-load NER models at startup to eliminate first-request latency |
| 448 | +- `custom_pipeline_steps`: attach arbitrary `(text: str) -> str` callables after the built-in pipeline |
| 449 | +- French, Portuguese, and Italian NER support via a shared multilingual ONNX session |
| 450 | +- Improved NER sentence boundary detection with abbreviation guard |
446 | 451 |
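As a rough illustration of how an async batch entry point and `(text: str) -> str` custom steps can fit together, the sketch below offloads a synchronous batch function to the default executor via `asyncio.get_running_loop()`. It does not call SqueakyCleanText: `clean_batch`, `aclean_batch`, and `redact_digits` are hypothetical stand-ins, and the library's actual `aprocess_batch()` signature is not shown in this diff.

```python
import asyncio
from functools import partial
from typing import Callable, List, Sequence

Step = Callable[[str], str]


def clean_batch(texts: Sequence[str], steps: Sequence[Step] = ()) -> List[str]:
    """Hypothetical synchronous cleaner: normalise whitespace, then run custom steps."""
    results = []
    for text in texts:
        cleaned = " ".join(text.split())
        # Custom steps run after the built-in work, in the order given.
        for step in steps:
            cleaned = step(cleaned)
        results.append(cleaned)
    return results


async def aclean_batch(texts: Sequence[str], steps: Sequence[Step] = ()) -> List[str]:
    """Run the blocking batch in the default thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, partial(clean_batch, texts, steps))


def redact_digits(text: str) -> str:
    # Example custom step: mask every digit.
    return "".join("#" if ch.isdigit() else ch for ch in text)
```

Inside a FastAPI or aiohttp handler, a coroutine of this kind would be awaited directly; from a script it can be driven with `asyncio.run(aclean_batch([...], steps=(redact_digits,)))`.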
|
447 | | -**Async & API** |
448 | | -- `aprocess_batch()` — async batch processing for FastAPI / aiohttp (uses `get_running_loop`, Python 3.12+ compatible) |
449 | | -- `warmup(languages)` — public method to pre-load NER models at startup |
450 | | -- `custom_pipeline_steps` config field — plug in arbitrary `(text: str) -> str` callables after the built-in pipeline |
451 | | - |
452 | | -**Language Support** |
453 | | -- French, Portuguese, Italian now supported out of the box via the multilingual ONNX model |
454 | | -- ONNX sessions are shared across same-model languages (FR/PT/IT → one session, ~600 MB saved) |
455 | | - |
456 | | -**Performance & Thread Safety** |
457 | | -- Per-model inference locks replace the coarse per-language lock — true concurrent inference across different language models |
458 | | -- `split_text()` is now lock-free (HF fast tokenizer is thread-safe) |
459 | | -- Conservative `chars/token` ratio (2×) in `_simple_chunk` prevents context-window overflow for CJK and Arabic texts |
460 | | - |
461 | | -**Correctness** |
462 | | -- `SENTENCE_BOUNDARY_PATTERN` upgraded to the `regex` library with an abbreviation guard — "Dr. Smith", "Mr. Jones", "U.S. Army" no longer cause false splits during NER chunking |
463 | | -- `ner_batch_size=0` and `ner_batch_size=-1` now raise `ValueError` immediately instead of silently producing empty results |
464 | | -- Quantized ONNX models are cached to `~/.cache/sct_quantized/` instead of the read-only HuggingFace Hub cache directory |
465 | | - |
466 | | ---- |
467 | | - |
468 | | -## What's New in v0.4.5 |
469 | | - |
470 | | -Major release with architectural overhaul since v0.3.0: |
471 | | - |
472 | | -**Architecture** |
473 | | -- Frozen `TextCleanerConfig` dataclass replaces global mutable config (thread-safe, per-instance) |
474 | | -- ONNX-first NER inference — torch-free base install (~400MB models vs ~7GB) |
475 | | -- Thread-parallel batch processing via `ThreadPoolExecutor` (ONNX releases the GIL) |
476 | | - |
477 | | -**NER** |
478 | | -- 5 backends: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch` |
| 452 | +**v0.4.5** |
| 453 | +- Frozen `TextCleanerConfig` dataclass: immutable, thread-safe, per-instance configuration |
| 454 | +- ONNX-first NER inference: torch-free base install (~400 MB models vs ~7 GB) |
| 455 | +- Thread-parallel batch processing via `ThreadPoolExecutor` |
| 456 | +- Five NER backends: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch` |
479 | 457 | - GLiNER zero-shot NER for custom entity types (PRODUCT, EVENT, SKILL, etc.) |
480 | 458 | - Ensemble voting across backends for improved recall |
481 | | -- Lazy per-language model loading (only loads models when needed) |
482 | | -- Language-keyed model dict replaces fragile positional tuple |
483 | | -- ONNX-quantized models hosted on [HuggingFace Hub](https://huggingface.co/rhnfzl) |
484 | | - |
485 | | -**Text Processing** |
486 | | -- Multilingual date detection (ISO 8601, European formats, month names in EN/NL/DE/ES) |
487 | | -- Fuzzy date matching for misspelled months (via rapidfuzz, empirically calibrated threshold) |
488 | | -- Configurable emoji removal |
489 | | -- Configurable bracket/brace content removal |
490 | | -- Smart case folding (preserves NER replacement tokens) |
491 | | -- Custom stopwords and month names per language |
492 | | - |
493 | | -**Dependencies** |
494 | | -- `stop-words` package replaces NLTK (50KB bundled vs 30MB download) |
495 | | -- PyTorch/Transformers moved to optional `[torch]` extra |
496 | | -- New optional extras: `[gpu]`, `[fuzzy]`, `[gliner]`, `[gliner2]`, `[all-ner]` |
497 | | -- Migrated from `setup.py` to `pyproject.toml` (PEP 517) |
498 | | - |
499 | | -**Quality** |
500 | | -- Python 3.11–3.13 support |
501 | | -- `ruff` linter (replaces flake8) |
502 | | -- hypothesis-based property testing with pytest-timeout |
503 | | -- Collision-safe NER entity keys |
| 459 | +- Lazy per-language model loading |
| 460 | +- Multilingual date detection and fuzzy date matching |
| 461 | +- Configurable emoji removal, bracket/brace content removal, and smart case folding |
| 462 | +- `stop-words` replaces NLTK (50 KB bundled vs 30 MB download) |
| 463 | +- PyTorch and Transformers moved to optional extras |
| 464 | +- Migrated to `pyproject.toml` (PEP 517), Python 3.11-3.13, ruff linter |
504 | 465 |
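The thread-safety benefit of a frozen configuration dataclass can be seen in a small standalone sketch. `FrozenSketchConfig` is a hypothetical class used only for illustration; it borrows the `ner_backend` and `gliner_threshold` field names that appear in this README, but it is not the real `TextCleanerConfig`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenSketchConfig:
    # Hypothetical subset of fields; the real config has more.
    ner_backend: str = "onnx"
    gliner_threshold: float = 0.4


base = FrozenSketchConfig()

# Assigning to a field of a frozen dataclass raises instead of mutating
# state that other threads or cleaner instances may be sharing.
try:
    base.ner_backend = "torch"
    mutated = True
except FrozenInstanceError:
    mutated = False

# A variant is created by copying with overrides; the original is untouched.
variant = replace(base, ner_backend="gliner")
```

Because each instance is immutable, a per-instance configuration of this kind can be handed to worker threads without locking.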
|
505 | 466 | ## Contributing |
506 | 467 |
|
|