Commit cff5383

feat: v3.1 performance enhancements and bug fixes (#105)
* ⚡ Lazy load profanity dictionaries for faster startup

  Refactored `DictionaryLoader` to load language dictionaries on demand. This significantly reduces import time and memory usage when only specific languages are needed.

  - Extracted language file mapping to `LANGUAGE_FILES` constant.
  - Removed eager loading in `__init__`.
  - Implemented `_load_dictionary` for lazy loading.
  - Updated `get_words` and `get_all_words` to use lazy loading.
  - Added `tests/test_dictionary_lazy.py` to verify lazy loading behavior.

* Optimize checkPhraseContext to use matchWord for targeted phrase matching

  - Modified `checkPhraseContext` in `packages/js/src/nlp/contextAnalyzer.ts` to filter phrases by `matchWord`.
  - Added test case `packages/js/tests/context-optimization.test.ts` to verify the fix.
  - This prevents unrelated positive phrases (e.g., "the bomb") from whitelisting other profanities (e.g., "shit").

* feat(context-analyzer): use contextWords for detailed reasoning

  - Update generateReason to include found positive/negative indicators in the return string.
  - Remove unused variable lint suppression.

* feat: cache compiled regexes in Filter class for performance optimization

* feat: implement matchWord support for domain-specific filtering in ContextAnalyzer

  Updates `ContextAnalyzer.isDomainWhitelisted` to use the `matchWord` argument. Introduces `GAMING_ACCEPTABLE_WORDS` to restrict whitelisting in gaming contexts to only specific acceptable words (e.g. 'kill', 'shoot', 'badass'), rather than whitelisting all profanity when gaming terms are present. Adds regression tests verifying the fix.

* Optimize regex compilation in Filter class using caching

  This change introduces a cache for compiled regex patterns in the `Filter` class. Previously, `_get_regex` would re-escape and re-compile regex patterns for every word checked, even if the word had been processed before. This optimization stores the compiled regex in `self._regex_cache` keyed by the word, avoiding redundant computations.

  Performance Benchmark (50 iterations):
  - is_profane: ~14.11ms -> ~11.60ms (~17.8% improvement)
  - check_profanity: ~14.44ms -> ~11.45ms (~20.7% improvement)

* fix: add .npmrc with legacy-peer-deps=true to fix CI build

* ci: fix npm ci dependency conflict by using legacy peer deps

  Updates the JS CI workflow to use `npm ci --legacy-peer-deps` to bypass conflict between @tensorflow/tfjs (4.x) and @tensorflow-models/toxicity (1.2.2).

* Optimize checkPhraseContext to use matchWord and fix CI dependencies

  - Modified `checkPhraseContext` in `packages/js/src/nlp/contextAnalyzer.ts` to filter phrases by `matchWord`.
  - Added test case `packages/js/tests/context-optimization.test.ts` to verify the fix.
  - Added `overrides` to root `package.json` to resolve `@tensorflow-models/toxicity` peer dependency conflicts causing CI failures.
  - Updated `package-lock.json` to reflect resolved dependencies.

* feat(context-analyzer): use contextWords for detailed reasoning

  - Update generateReason to include found positive/negative indicators in the return string.
  - Remove unused variable lint suppression.
  - Fix CI dependency conflict by adding overrides for @tensorflow/tfjs packages.

* fix(python): bundle dictionaries in package for pip install

  Fixes #70

  - Copy language dictionaries into glin_profanity/data/dictionaries/
  - Update dictionary loader to use bundled path with fallback
  - Update pyproject.toml to include dictionary files in wheel
  - Dictionaries now work when installed via pip

* fix: apply copilot review suggestions

  - Fix unreachable NEGATIVE_PHRASES branch in contextAnalyzer.ts (phrase.includes(matchWord) was always false for prefix phrases)
  - Make test assertions explicit in repro_issue.test.ts
  - Use dynamic language count in test_dictionary_lazy.py

* fix: address CodeRabbit review suggestions

  - pyproject.toml: Use force-include instead of shared-data for bundling
  - contextAnalyzer.ts: Normalize domain whitelist entries to lowercase
  - dutch.json: Remove trailing 'g' artifacts from 20+ words
  - globalWhitelist.json: Remove duplicate "Analytics" entry
  - italian.json: Fix encoding artifact in "fare una" entry
  - japanese.json: Remove generic words causing false positives (嫌い, 女の子)
  - norwegian.json: Rename from Norwegian.json for consistency
  - spanish.json: Fix typo "sesinato" → "asesinato"
  - turkish.json: Remove false positives (allah, ana, coca cola, cola)

---------

Co-authored-by: thegdsks <39922405+thegdsks@users.noreply.github.com>
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
1 parent f33a86c commit cff5383

43 files changed

Lines changed: 4575 additions & 477 deletions

.github/workflows/ci-js.yml

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ jobs:
           cache-dependency-path: package-lock.json

       - name: 🔧 Install dependencies
-        run: npm ci
+        run: npm ci --legacy-peer-deps

       - name: 📋 Lint & Type Check
         working-directory: ${{ env.PACKAGE_DIR }}

.npmrc

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+legacy-peer-deps=true

package-lock.json

Lines changed: 564 additions & 376 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 10 additions & 0 deletions
@@ -61,6 +61,16 @@
     "react": "^18.3.1",
     "react-dom": "^18.3.1"
   },
+  "overrides": {
+    "@tensorflow-models/toxicity": {
+      "@tensorflow/tfjs-core": "^4.22.0",
+      "@tensorflow/tfjs-converter": "^4.22.0"
+    },
+    "@tensorflow/tfjs-core": "^4.22.0",
+    "@tensorflow/tfjs-converter": "^4.22.0",
+    "@tensorflow/tfjs-backend-cpu": "^4.22.0",
+    "@tensorflow/tfjs-backend-webgl": "^4.22.0"
+  },
   "devDependencies": {
     "@babel/core": "^7.25.2",
     "@babel/preset-env": "^7.25.3",

packages/js/src/filters/Filter.ts

Lines changed: 11 additions & 1 deletion
@@ -47,6 +47,7 @@ class Filter {
   private cacheResults: boolean;
   private maxCacheSize: number;
   private cache: Map<string, CheckProfanityResult>;
+  private regexCache: Map<string, RegExp>;

   /**
    * Creates a new Filter instance with the specified configuration.
@@ -113,6 +114,7 @@ class Filter {
     this.cacheResults = config?.cacheResults ?? false;
     this.maxCacheSize = config?.maxCacheSize ?? 1000;
     this.cache = new Map();
+    this.regexCache = new Map();

     // Build word dictionary
     let words: string[] = [];
@@ -202,6 +204,7 @@ class Filter {
    */
   public clearCache(): void {
     this.cache.clear();
+    this.regexCache.clear();
   }

   /**
@@ -292,10 +295,17 @@ class Filter {
   }

   private getRegex(word: string): RegExp {
+    if (this.regexCache.has(word)) {
+      const regex = this.regexCache.get(word)!;
+      regex.lastIndex = 0;
+      return regex;
+    }
     const flags = this.caseSensitive ? 'g' : 'gi';
     const escapedWord = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
     const boundary = this.wordBoundaries ? '\\b' : '';
-    return new RegExp(`${boundary}${escapedWord}${boundary}`, flags);
+    const regex = new RegExp(`${boundary}${escapedWord}${boundary}`, flags);
+    this.regexCache.set(word, regex);
+    return regex;
   }

   private isFuzzyToleranceMatch(word: string, text: string): boolean {
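
The `lastIndex = 0` reset in the cached branch of `getRegex` matters because a `RegExp` with the `g` flag is stateful: `test` and `exec` resume from `lastIndex`, so a reused instance can miss a match that a fresh one would find. The following is a minimal standalone sketch of that caching pattern, not the library's code; `RegexCacheSketch` and its members are hypothetical names, and it assumes the case and boundary options stay fixed for the life of the cache, since the key is the word alone.

```typescript
// Hypothetical standalone cache; class and method names are illustrative only.
class RegexCacheSketch {
  private readonly cache = new Map<string, RegExp>();
  private readonly caseSensitive: boolean;
  private readonly wordBoundaries: boolean;

  constructor(caseSensitive = false, wordBoundaries = true) {
    this.caseSensitive = caseSensitive;
    this.wordBoundaries = wordBoundaries;
  }

  get(word: string): RegExp {
    const cached = this.cache.get(word);
    if (cached !== undefined) {
      // Global regexes remember where the last match ended; rewind before reuse.
      cached.lastIndex = 0;
      return cached;
    }
    const flags = this.caseSensitive ? 'g' : 'gi';
    // Escape regex metacharacters so the word is matched literally.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    const boundary = this.wordBoundaries ? '\\b' : '';
    const regex = new RegExp(`${boundary}${escaped}${boundary}`, flags);
    this.cache.set(word, regex);
    return regex;
  }

  clear(): void {
    this.cache.clear();
  }
}

const regexCache = new RegexCacheSketch();
const firstRegex = regexCache.get('damn');
const hitOnce = firstRegex.test('damn it');    // match found; lastIndex is now 4
const secondRegex = regexCache.get('damn');    // same instance, lastIndex rewound to 0
const hitTwice = secondRegex.test('damn it');  // matches again thanks to the reset
```

Without the rewind, the second `test` call would start searching at index 4 and report no match, which is the kind of intermittent false negative a shared global regex can introduce.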

packages/js/src/nlp/contextAnalyzer.ts

Lines changed: 34 additions & 13 deletions
@@ -41,6 +41,16 @@ const GAMING_POSITIVE = new Set([
   'build', 'loadout', 'strategy', 'tactic', 'play', 'move', 'combo'
 ]);

+// Words that are acceptable in gaming contexts but might be flagged otherwise
+const GAMING_ACCEPTABLE_WORDS = new Set([
+  'kill', 'killer', 'killed', 'killing',
+  'shoot', 'shot', 'shooting',
+  'die', 'dying', 'died', 'dead', 'death',
+  'badass', 'sick', 'insane', 'crazy', 'mad', 'beast', 'savage',
+  'suck', 'sucks',
+  'wtf', 'omg', 'hell', 'damn', 'crap'
+]);
+
 // Common positive phrases that might contain flagged words
 const POSITIVE_PHRASES = new Map([
   ['the bomb', 0.9], // "this movie is the bomb"
@@ -69,7 +79,9 @@ export class ContextAnalyzer {
   constructor(config: ContextConfig) {
     this.contextWindow = config.contextWindow;
     this.language = config.language;
-    this.domainWhitelists = new Set(config.domainWhitelists || []);
+    this.domainWhitelists = new Set(
+      (config.domainWhitelists || []).map(word => word.toLowerCase())
+    );
   }

   /**
@@ -122,12 +134,10 @@ export class ContextAnalyzer {
     };
   }

-  // eslint-disable-next-line @typescript-eslint/no-unused-vars
   private checkPhraseContext(contextText: string, matchWord: string): ContextAnalysisResult | null {
-    // TODO: Use matchWord for more specific phrase matching in the future
     // Check positive phrases
     for (const [phrase, score] of POSITIVE_PHRASES.entries()) {
-      if (contextText.includes(phrase)) {
+      if (phrase.includes(matchWord) && contextText.includes(phrase)) {
         return {
           contextScore: score,
           reason: `Positive phrase detected: "${phrase}"`,
@@ -136,7 +146,7 @@ export class ContextAnalyzer {
       }
     }

-    // Check negative phrases
+    // Check negative phrases (prefixes like "you are" that introduce profanity)
     for (const [phrase, score] of NEGATIVE_PHRASES.entries()) {
       if (contextText.includes(phrase)) {
         return {
@@ -150,25 +160,36 @@ export class ContextAnalyzer {
     return null;
   }

-  // eslint-disable-next-line @typescript-eslint/no-unused-vars
   private isDomainWhitelisted(contextWords: string[], matchWord: string): boolean {
-    // TODO: Use matchWord for domain-specific filtering in the future
+    const normalizedMatchWord = matchWord.toLowerCase();
+
     // Check if any domain whitelist words are present
     for (const word of contextWords) {
-      if (this.domainWhitelists.has(word) || GAMING_POSITIVE.has(word)) {
+      // Check user-defined domain whitelists (permissive)
+      if (this.domainWhitelists.has(word)) {
         return true;
       }
+
+      // Check internal gaming whitelist (restrictive)
+      if (GAMING_POSITIVE.has(word)) {
+        if (GAMING_ACCEPTABLE_WORDS.has(normalizedMatchWord)) {
+          return true;
+        }
+      }
     }
     return false;
   }

-  // eslint-disable-next-line @typescript-eslint/no-unused-vars
   private generateReason(score: number, contextWords: string[]): string {
-    // TODO: Use contextWords for more detailed reasoning in the future
+    const foundPositive = Array.from(new Set(contextWords.filter(word => POSITIVE_INDICATORS.has(word))));
+    const foundNegative = Array.from(new Set(contextWords.filter(word => NEGATIVE_INDICATORS.has(word))));
+
     if (score >= 0.7) {
-      return 'Positive context detected - likely not profanity';
+      const details = foundPositive.length > 0 ? ` (found: ${foundPositive.join(', ')})` : '';
+      return `Positive context detected${details} - likely not profanity`;
     } else if (score <= 0.3) {
-      return 'Negative context detected - likely profanity';
+      const details = foundNegative.length > 0 ? ` (found: ${foundNegative.join(', ')})` : '';
+      return `Negative context detected${details} - likely profanity`;
     } else {
       return 'Neutral context - uncertain classification';
     }
@@ -253,7 +274,7 @@ export class ContextAnalyzer {
    * Updates the domain whitelist for this analyzer instance
    */
   updateDomainWhitelist(newWhitelist: string[]): void {
-    this.domainWhitelists = new Set(newWhitelist);
+    this.domainWhitelists = new Set(newWhitelist.map(word => word.toLowerCase()));
   }

   /**
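
To make the effect of the new `generateReason` concrete, here is a small standalone sketch of the same idea: collect the context words that appear in an indicator set, de-duplicate them, and append them to the reason string. The contents of `POSITIVE_INDICATORS` and `NEGATIVE_INDICATORS` are not shown in this diff, so the two sets below are placeholder assumptions, and `uniqueMatches` and `describeContext` are hypothetical names.

```typescript
// Placeholder indicator sets: the real POSITIVE_INDICATORS / NEGATIVE_INDICATORS
// are not visible in this diff, so these entries are assumptions.
const POSITIVE_INDICATORS_SKETCH = new Set<string>(['love', 'great', 'awesome']);
const NEGATIVE_INDICATORS_SKETCH = new Set<string>(['hate', 'stupid', 'you']);

// Returns the context words found in `indicators`, de-duplicated,
// in order of first appearance.
function uniqueMatches(words: string[], indicators: Set<string>): string[] {
  return Array.from(new Set(words.filter(word => indicators.has(word))));
}

// Mirrors the thresholds in the diff: >= 0.7 positive, <= 0.3 negative.
function describeContext(score: number, contextWords: string[]): string {
  if (score >= 0.7) {
    const found = uniqueMatches(contextWords, POSITIVE_INDICATORS_SKETCH);
    const details = found.length > 0 ? ` (found: ${found.join(', ')})` : '';
    return `Positive context detected${details} - likely not profanity`;
  }
  if (score <= 0.3) {
    const found = uniqueMatches(contextWords, NEGATIVE_INDICATORS_SKETCH);
    const details = found.length > 0 ? ` (found: ${found.join(', ')})` : '';
    return `Negative context detected${details} - likely profanity`;
  }
  return 'Neutral context - uncertain classification';
}

const positiveReason = describeContext(0.9, ['love', 'this', 'love', 'great']);
// "Positive context detected (found: love, great) - likely not profanity"
const bareNegativeReason = describeContext(0.1, ['table']);
// "Negative context detected - likely profanity"
```

When no indicator words are present, the output falls back to the original fixed strings, so callers that compared against the old messages only see a difference when indicators were actually found.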
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+import { Filter } from '../src/filters/Filter';
+
+describe('Context Optimization', () => {
+  let filter: Filter;
+
+  beforeEach(() => {
+    filter = new Filter({
+      enableContextAware: true,
+      languages: ['english'],
+    });
+  });
+
+  it('should NOT whitelist profanity based on unrelated positive phrases', () => {
+    // "the bomb" is a positive phrase for "bomb".
+    // "shit" is a profanity.
+    // If "the bomb" is present, it shouldn't whitelist "shit".
+    const text = 'The bomb exploded and shit happened';
+
+    const result = filter.checkProfanity(text);
+
+    // Should be flagged because "shit" is profanity and "the bomb" is irrelevant to "shit"
+    expect(result.containsProfanity).toBe(true);
+    expect(result.profaneWords).toContain('shit');
+  });
+
+  it('should still whitelist relevant positive phrases', () => {
+    const text = 'This movie is the bomb';
+    const result = filter.checkProfanity(text);
+
+    // Should NOT be flagged because "the bomb" is a whitelisted phrase for "bomb"
+    expect(result.containsProfanity).toBe(false);
+  });
+});
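
The behavior these two tests pin down can be isolated in a few lines. The sketch below shows the positive-phrase check from `checkPhraseContext` on its own, under the assumption that both `contextText` and `matchWord` arrive already lowercased; `POSITIVE_PHRASES_SKETCH` holds only the one entry visible in this diff, and `positivePhraseScore` is a hypothetical helper name.

```typescript
// Only 'the bomb' is visible in the diff; the real map has more entries.
const POSITIVE_PHRASES_SKETCH = new Map<string, number>([
  ['the bomb', 0.9],
]);

// Returns the score of a positive phrase only when that phrase both
// contains the matched word and occurs in the surrounding text.
// Assumes contextText and matchWord are already lowercased.
function positivePhraseScore(contextText: string, matchWord: string): number | null {
  for (const [phrase, score] of POSITIVE_PHRASES_SKETCH) {
    if (phrase.includes(matchWord) && contextText.includes(phrase)) {
      return score;
    }
  }
  return null;
}

// "the bomb" is present but says nothing about "shit", so no score is returned.
const unrelatedScore = positivePhraseScore('the bomb exploded and shit happened', 'shit');
// "bomb" is part of "the bomb", so the phrase applies.
const relatedScore = positivePhraseScore('this movie is the bomb', 'bomb');
```

Note that `phrase.includes(matchWord)` is a substring test, so in this sketch a very short matched word would also pair with any phrase that happens to contain those letters.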
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+import { Filter } from '../src/filters/Filter';
+
+describe('Domain Specific Whitelisting', () => {
+  let filter: Filter;
+
+  beforeEach(() => {
+    filter = new Filter({
+      enableContextAware: true,
+      contextWindow: 3,
+      confidenceThreshold: 0.7,
+      languages: ['english'],
+      logProfanity: false,
+    });
+  });
+
+  it('verifies baseline: profane word without context is flagged', () => {
+    const result = filter.checkProfanity('You are a bitch');
+    expect(result.containsProfanity).toBe(true);
+  });
+
+  it('correctly flags profanity even if gaming context is present (Fixed Behavior)', () => {
+    // "bitch" is profane.
+    // "player" is in GAMING_POSITIVE.
+    // "bitch" is NOT in GAMING_ACCEPTABLE_WORDS.
+    // So it should remain profane.
+    // Before the fix, "player" would cause isDomainWhitelisted to return true.
+    // After the fix, isDomainWhitelisted returns false.
+    // Sentiment analysis: "you" (negative) vs "player" (positive).
+    const text = 'You bitch player';
+    const result = filter.checkProfanity(text);
+
+    expect(result.containsProfanity).toBe(true);
+    expect(result.matches).toBeDefined();
+    expect(result.matches!.length).toBeGreaterThan(0);
+    // It should NOT be whitelisted
+    expect(result.matches![0].isWhitelisted).toBe(false);
+  });
+
+  it('whitelists acceptable gaming words in gaming context', () => {
+    const text = 'This game sucks'; // "sucks" is in GAMING_ACCEPTABLE_WORDS
+    const result = filter.checkProfanity(text);
+    expect(result.containsProfanity).toBe(false);
+    // Wait, if "sucks" is whitelisted, containsProfanity might be false OR true but isWhitelisted=true?
+    // checkProfanity implementation:
+    // if (contextResult.isWhitelisted) { continue; }
+    // So if whitelisted, it is NOT added to matches/profaneWords.
+    // So containsProfanity should be false.
+
+    const text2 = 'You are a badass player'; // "badass" is in GAMING_ACCEPTABLE_WORDS
+    const result2 = filter.checkProfanity(text2);
+    expect(result2.containsProfanity).toBe(false);
+  });
+});
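
The two-tier rule exercised by these tests can be sketched independently of `Filter`: a user-supplied domain whitelist is permissive (any matched word passes when a whitelisted context word is present), while the built-in gaming context only excuses words from an explicit allow-list. The sets below are abbreviated stand-ins for `GAMING_POSITIVE` and `GAMING_ACCEPTABLE_WORDS`, and `isWhitelistedSketch` is a hypothetical free function rather than the class method.

```typescript
// Abbreviated stand-ins for GAMING_POSITIVE and GAMING_ACCEPTABLE_WORDS.
const GAMING_CONTEXT_SKETCH = new Set<string>(['player', 'strategy', 'combo']);
const GAMING_ACCEPTABLE_SKETCH = new Set<string>(['kill', 'badass', 'sucks']);

function isWhitelistedSketch(
  contextWords: string[],
  matchWord: string,
  userWhitelist: Set<string>,
): boolean {
  const normalizedMatchWord = matchWord.toLowerCase();
  for (const word of contextWords) {
    // Permissive tier: a user-defined domain word excuses any matched word.
    if (userWhitelist.has(word)) {
      return true;
    }
    // Restrictive tier: gaming context only excuses allow-listed words.
    if (GAMING_CONTEXT_SKETCH.has(word) && GAMING_ACCEPTABLE_SKETCH.has(normalizedMatchWord)) {
      return true;
    }
  }
  return false;
}

const noUserWords = new Set<string>();
// Gaming context is present, but "bitch" is not allow-listed: stays flagged.
const insultInGame = isWhitelistedSketch(['you', 'player'], 'bitch', noUserWords);
// "badass" is allow-listed and "player" supplies the gaming context.
const praiseInGame = isWhitelistedSketch(['you', 'are', 'a', 'player'], 'Badass', noUserWords);
// A user-defined domain word whitelists regardless of the matched word.
const userDomain = isWhitelistedSketch(['quote'], 'bitch', new Set<string>(['quote']));
```

The asymmetry is deliberate in this sketch: user whitelists express an explicit opt-in by the integrator, whereas the built-in gaming vocabulary is a heuristic and is therefore kept narrow.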
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+{
+  "words": [
+    "drittsekk",
+    "faen i helvete",
+    "fitte",
+    "jævla",
+    "kuk",
+    "kukene",
+    "kuker",
+    "nigger",
+    "pikk",
+    "sotrør",
+    "ståpikk",
+    "ståpikkene",
+    "ståpikker"
+  ]
+}
