Problem:
routeToOCR()claimed to "try both engines and merge results" for mixed language- Actually just routed to Arabic engine
mergeOCRResults()helper was never invoked- Documentation claim of "result merging / combined outputs" was false
Fix: Implemented actual parallel processing and merging
Before:
} else {
// Mixed content
logger.debug('Mixed language detected - using Arabic OCR for better coverage');
return await this.processArabicOCR(pageImage);
}After:
} else {
// Mixed content - try both engines and merge results
logger.debug('Mixed language detected - processing with both engines');
const [defaultResult, arabicResult] = await Promise.all([
this.processDefaultOCR(pageImage),
this.processArabicOCR(pageImage),
]);
return await this.mergeOCRResults(defaultResult, arabicResult);
}Result:
- ✅ Mixed language now processes with BOTH engines in parallel
- ✅
mergeOCRResults()actually called - ✅ Documentation claim now accurate
Problem:
normalize()method never callednormalizeArabicCharacters()- Alef variants, Teh Marbuta, Hamza normalization not applied
- Documentation claimed "complete normalization" but was incomplete
Fix: Added Arabic character normalization to pipeline
Before (pipeline):
Step 1: normalizeUnicode()
Step 2: normalizeWhitespace()
Step 3: normalizeDigits()
Step 4: removeDiacritics() // ← Arabic chars not normalized!
Step 5: trim()After (pipeline):
Step 1: normalizeUnicode()
Step 2: normalizeWhitespace()
Step 3: normalizeDigits()
Step 4: normalizeArabicCharacters() // ← NEW! Applied for Arabic/mixed
Step 5: removeDiacritics()
Step 6: trim()Result:
- ✅ Alef variants (أإآ) → ا
- ✅ Teh Marbuta (ة) → ه
- ✅ Hamza variants (ؤئ) → وي
- ✅ Pipeline now truly complete
Test Added:
it('should apply Arabic character normalization in pipeline', () => {
const text = 'أإآمدرسة'; // Alef variants + Teh Marbuta
const result = normalizer.normalize(text, 'arabic', false);
expect(result).toBe('ااامدرسه'); // ✅ Normalized
});Problem:
containsArabic()andcontainsEnglish()used global regexes with.test()- Global flag makes regex stateful across calls
- Risk in long-lived contexts (workers, services)
Fix: Create new regex instances for each call
Before:
const UNICODE_RANGES = {
arabic: /[...]/g, // Global flag - stateful!
english: /[...]/g,
};
containsArabic(text: string): boolean {
return UNICODE_RANGES.arabic.test(text); // Stateful!
}After:
const UNICODE_PATTERNS = {
arabic: '[...]', // Pattern string
english: '[...]',
};
containsArabic(text: string): boolean {
return new RegExp(UNICODE_PATTERNS.arabic).test(text); // New instance!
}Changes Applied:
- ✅
containsArabic()- Creates new regex per call - ✅
containsEnglish()- Creates new regex per call - ✅
detectLanguageFromText()- Creates new regex per call - ✅
getLanguageStatistics()- Creates new regex per call
Result: No regex state preserved between calls
-
✅
src/services/ocr-router.service.ts- Implemented parallel processing for mixed language
- Actually calls
mergeOCRResults()
-
✅
src/services/text-normalizer.service.ts- Added
normalizeArabicCharacters()to pipeline - Applied for Arabic and mixed languages
- Added
-
✅
src/services/language-detector.service.ts- Converted global regexes to pattern strings
- Create new instances per call
-
✅
src/services/text-normalizer.service.test.ts- Added test for Arabic character normalization in pipeline
$ npm run build
✅ Clean compilation
$ npm test
Test Suites: 7 passed, 7 total
Tests: 87 passed, 87 total (+1 new test) ✅| Issue | Priority | Status | Impact |
|---|---|---|---|
| Mixed-language not merging | Medium | ✅ Fixed | Both engines now used |
| Arabic chars not normalized | Medium | ✅ Fixed | Complete pipeline |
| Stateful regex risk | Open Q | ✅ Fixed | Thread-safe |
All documentation claims now accurate. Pipeline is truly complete!