
Commit 32a25b9

thegdsks and claude committed
docs: restructure documentation with marketing-focused README
- Add new docs/ folder with detailed documentation:
  - getting-started.md: Installation and basic usage
  - api-reference.md: Complete API documentation with tables
  - framework-examples.md: React, Vue, Angular, Express, Next.js
  - advanced-features.md: Leetspeak, Unicode, ML, caching
  - ML-GUIDE.md: TensorFlow.js integration guide
- Rewrite main README with marketing focus:
  - Add comprehensive badges (npm, pypi, CI, bundle size, stars)
  - Add performance benchmark comparison table
  - Add feature comparison vs competitors
  - Add ASCII pipeline diagram and Mermaid architecture
  - Add Star History chart
  - Add local testing interface section (port 4000)
- Update package READMEs to link to docs
- Add og-image.png for social sharing
- Add CHANGELOG.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent e42a83b commit 32a25b9

File tree

17 files changed (+3356, -1463 lines)


CHANGELOG.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [3.1.0] - 2025-12-30

### Changed
- Removed legacy release workflow in favor of streamlined CI/CD
- Version bump for npm publishing improvements

## [3.0.0] - 2025-12-30

### Added

#### Leetspeak Detection
- New `detectLeetspeak` option to catch obfuscated profanity like `f4ck`, `@ss`, `$h!t`
- Three intensity levels: `basic`, `moderate`, `aggressive`
- Detects spaced characters (`f u c k`) and repeated characters (`fuuuuck`)

#### Unicode Normalization
- New `normalizeUnicode` option (enabled by default)
- Detects Cyrillic/Greek lookalikes (e.g., `fυck` with Greek upsilon)
- Handles zero-width characters, full-width characters, and homoglyphs
- Two-pass normalization to prevent the Scunthorpe problem (false positives)

#### Result Caching
- New `cacheResults` option for 800x performance improvement on repeated checks
- LRU eviction with configurable `maxCacheSize` (default: 1000)
- Cache management methods: `getCacheSize()`, `clearCache()`

#### ML Integration (Optional)
- TensorFlow.js-powered toxicity detection via the `glin-profanity/ml` module
- `ToxicityDetector` class for standalone ML analysis
- `HybridFilter` class combining rule-based and ML detection
- Detects: toxicity, insults, threats, identity attacks, obscene content, severe toxicity
- Configurable threshold and combination modes

#### New Languages
- Added Dutch language support
- Fixed the Turkish dictionary

### Changed
- Improved Filter class with configuration export/import
- Enhanced performance benchmarks
- Better TypeScript type definitions

### Fixed
- Scunthorpe problem (false positives such as "Scunthorpe" and "assassin")
- Repeated-character handling in edge cases
- User `ignoreWords` now properly merge with the global whitelist

## [2.3.7] - Previous Release

See git history for changes prior to v3.0.0.
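For illustration, the Unicode normalization described in the changelog above could be sketched as follows. This is a hedged sketch only: the `HOMOGLYPHS` table and `normalizeForMatching` name are illustrative assumptions, not the library's internals.

```typescript
// Sketch of homoglyph-aware normalization (illustrative, not library code).
const HOMOGLYPHS: Record<string, string> = {
  '\u03C5': 'u', // Greek upsilon (υ)
  '\u0430': 'a', // Cyrillic a
  '\u043E': 'o', // Cyrillic o
};

function normalizeForMatching(text: string): string {
  return text
    .normalize('NFKC')                          // fold full-width/compatibility forms
    .replace(/[\u200B-\u200D\uFEFF]/g, '')      // strip zero-width characters
    .replace(/./gu, ch => HOMOGLYPHS[ch] ?? ch) // map common lookalikes to ASCII
    .toLowerCase();
}

console.log(normalizeForMatching('f\u03C5ck')); // Greek upsilon folded to ASCII 'u'
```

A second matching pass against a whitelist would then suppress false positives such as "Scunthorpe", as the changelog's two-pass note suggests.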

README.md

Lines changed: 233 additions & 542 deletions
Large diffs are not rendered by default.

docs/ML-GUIDE.md

Lines changed: 273 additions & 0 deletions
@@ -0,0 +1,273 @@
# ML-Based Toxicity Detection Guide

This guide covers the optional Machine Learning integration in glin-profanity v3.0+.

## Overview

The ML module provides TensorFlow.js-powered toxicity detection for context-aware content filtering beyond simple keyword matching.

**Key Benefits:**
- Detects subtle toxicity, insults, and threats that keywords miss
- Context-aware analysis (understands meaning, not just words)
- Configurable confidence thresholds
- Works alongside rule-based filtering for comprehensive coverage

## Installation

The ML module requires optional peer dependencies:

```bash
npm install @tensorflow/tfjs @tensorflow-models/toxicity
```

## Usage

### Standalone ToxicityDetector

For ML-only toxicity analysis:

```typescript
import { ToxicityDetector } from 'glin-profanity/ml';

const detector = new ToxicityDetector({
  threshold: 0.9 // Confidence threshold (0-1)
});

// Load the model (downloads ~5MB on first use)
await detector.loadModel();

// Analyze text
const result = await detector.analyze('you are terrible');

console.log(result.isToxic);           // true/false
console.log(result.predictions);       // Array of category predictions
console.log(result.matchedCategories); // ['insult', 'toxicity']
```

### HybridFilter (Rules + ML)

Combines rule-based profanity detection with ML analysis:

```typescript
import { HybridFilter } from 'glin-profanity/ml';

const filter = new HybridFilter({
  // Rule-based options
  languages: ['english'],
  detectLeetspeak: true,
  normalizeUnicode: true,

  // ML options
  enableML: true,
  mlThreshold: 0.85,
  combinationMode: 'or', // 'or' | 'and' | 'ml-override' | 'rules-first'
});

// Initialize (loads the ML model)
await filter.initialize();

// Async hybrid check (rules + ML)
const result = await filter.checkProfanityAsync('text to analyze');

console.log(result.containsProfanity); // Rule-based result
console.log(result.isToxic);           // ML result
console.log(result.mlResult);          // Full ML analysis

// Sync rule-based check (fast, no ML)
filter.isProfane('badword'); // true
```
## Combination Modes

The `combinationMode` option controls how rule-based and ML results combine:

| Mode | Description |
|------|-------------|
| `'or'` | Flag if EITHER rules OR ML detect issues (default) |
| `'and'` | Flag only if BOTH rules AND ML agree |
| `'ml-override'` | ML result takes precedence over rules |
| `'rules-first'` | Use ML only if rules find nothing |
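As a mental model (not the library's actual implementation), the four modes could be expressed like this, with the ML check as a lazy thunk so `rules-first` can skip the expensive analysis entirely:

```typescript
// Illustrative model of the combination modes from the table above.
// The function name and signature are assumptions, not glin-profanity internals.
type CombinationMode = 'or' | 'and' | 'ml-override' | 'rules-first';

function combineVerdicts(
  rulesFlagged: boolean,
  mlCheck: () => boolean, // lazy, so modes can skip the slow ML pass
  mode: CombinationMode,
): boolean {
  switch (mode) {
    case 'or':          return rulesFlagged || mlCheck();          // either flags it
    case 'and':         return rulesFlagged && mlCheck();          // both must agree
    case 'ml-override': return mlCheck();                          // ML has the final say
    case 'rules-first': return rulesFlagged ? true : mlCheck();    // ML only as fallback
  }
}

console.log(combineVerdicts(true, () => false, 'and')); // false
```

Note that `'rules-first'` yields the same verdict as `'or'`; the difference is cost, since the ML pass only runs when the rules find nothing.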
## ML Categories

The toxicity model detects these categories:

| Category | Description |
|----------|-------------|
| `toxicity` | General toxic content |
| `severe_toxicity` | Highly toxic content |
| `insult` | Personal insults and attacks |
| `threat` | Threatening language |
| `identity_attack` | Identity-based hate speech |
| `obscene` | Obscene/vulgar content |
| `sexual_explicit` | Sexually explicit content |

## Performance Considerations

### First Load
- Model downloads ~5MB from TensorFlow Hub
- Takes 2-5 seconds depending on connection
- Browser caches model files for subsequent loads

### Analysis Speed
- ML analysis: ~500ms-2s per text
- Rule-based: ~0.04ms per text
- Use rule-based checks for real-time (typing) validation
- Use ML for submit/post validation
### Offline Usage

The model requires an internet connection for the first download. For offline apps:

**Option 1: Browser Cache**
```javascript
// Model files are cached by the browser after the first load,
// so subsequent page loads work offline.
```

**Option 2: Service Worker**
```javascript
// Cache model files with a service worker
self.addEventListener('fetch', event => {
  if (event.request.url.includes('tensorflow')) {
    event.respondWith(
      // Serve from cache; fall back to the network on a cache miss
      caches.match(event.request).then(cached => cached ?? fetch(event.request))
    );
  }
});
```

**Option 3: IndexedDB (TensorFlow.js native)**
```typescript
// Save the model after loading it
await model.save('indexeddb://toxicity-model');

// Load it from IndexedDB later
const model = await tf.loadGraphModel('indexeddb://toxicity-model');
```

## Best Practices

### 1. Use Appropriate Thresholds

```typescript
// Stricter (fewer false positives, may miss subtle toxicity)
mlThreshold: 0.95

// Balanced (recommended)
mlThreshold: 0.85

// Lenient (catches more, more false positives)
mlThreshold: 0.7
```

### 2. Combine with Rules

```typescript
// Best coverage: use both
const filter = new HybridFilter({
  languages: ['english'],
  detectLeetspeak: true,
  enableML: true,
  combinationMode: 'or',
});
```

### 3. Handle Loading States

```typescript
const [isReady, setIsReady] = useState(false);

useEffect(() => {
  filter.initialize().then(() => setIsReady(true));
}, []);

// Show a loading state while the model loads
if (!isReady) return <LoadingSpinner />;
```

### 4. Graceful Fallback

```typescript
try {
  await filter.initialize();
} catch (err) {
  console.warn('ML unavailable, using rules only');
  // The filter still works with rule-based detection
}
```

## API Reference

### ToxicityDetector

```typescript
interface MLDetectorConfig {
  threshold?: number;       // Default: 0.9
  labels?: ToxicityLabel[]; // Which categories to detect
}

class ToxicityDetector {
  constructor(config?: MLDetectorConfig);
  loadModel(): Promise<void>;
  analyze(text: string): Promise<MLAnalysisResult>;
  isModelLoaded(): boolean;
}
```

### HybridFilter

```typescript
interface HybridFilterConfig extends FilterConfig {
  enableML?: boolean;
  mlThreshold?: number;
  combinationMode?: 'or' | 'and' | 'ml-override' | 'rules-first';
}

class HybridFilter extends Filter {
  constructor(config?: HybridFilterConfig);
  initialize(): Promise<void>;
  checkProfanityAsync(text: string): Promise<HybridAnalysisResult>;
}
```

### Result Types

```typescript
interface MLAnalysisResult {
  isToxic: boolean;
  predictions: ToxicityPrediction[];
  matchedCategories: ToxicityLabel[];
  processingTime: number;
}

interface HybridAnalysisResult extends CheckProfanityResult {
  isToxic: boolean;
  mlResult?: MLAnalysisResult;
  confidence: number;
}
```

## Troubleshooting

### CORS Errors on Localhost

TensorFlow Hub may block requests from localhost. Solutions:
1. Deploy to a real domain for testing
2. Use a proxy server
3. Pre-download and host the model files locally

### "No backend found" Error

Ensure TensorFlow.js is imported before the toxicity model:

```typescript
// Correct order
import '@tensorflow/tfjs';
import * as toxicity from '@tensorflow-models/toxicity';
```

### Model Too Large

The toxicity model is ~5MB. Alternatives:
- Use rule-based detection only for size-sensitive apps
- Load the model on demand (not at app start)
- Consider server-side ML for web apps
