This guide explains how to set up and configure the OCR service for offline use.
If you have internet access, the OCR service will work automatically:
npm install
npm run dev
On the first OCR request, Tesseract.js automatically downloads the language training data (~4MB) from the internet. After that, it works offline.
For production environments or offline use, pre-download the language training data:
npm run setup:tessdata
This downloads language training files to public/tessdata/:
- eng.traineddata.gz (~4MB) - English language data
- Add more languages as needed
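Before relying on offline mode, it can help to confirm that the expected files are actually present. The following is a minimal sketch, not part of the project: the function name `missingLanguageFiles` is hypothetical, and it assumes the `<lang>.traineddata.gz` naming under public/tessdata/ described above. The existence check is passed in as a parameter so the logic stays independent of the file system.

```typescript
// Hypothetical helper: lists the language files that are expected but absent.
type Exists = (path: string) => boolean;

function missingLanguageFiles(
  langs: readonly string[],
  tessdataDir: string,
  exists: Exists,
): string[] {
  return langs
    .map((lang) => `${tessdataDir}/${lang}.traineddata.gz`)
    .filter((file) => !exists(file));
}

// Demo with an in-memory stand-in for the file system.
const filesOnDisk = new Set(["public/tessdata/eng.traineddata.gz"]);
const missing = missingLanguageFiles(
  ["eng", "fra"],
  "public/tessdata",
  (path) => filesOnDisk.has(path),
);
console.log(missing);
```

In a Node.js script, `existsSync` from `node:fs` could be passed as the third argument instead of the in-memory check.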
The setup script automatically creates lib/tessdata-config.ts. Import and use it in your OCR service:
// lib/simple-ocr-service.ts
import { createWorker } from 'tesseract.js';
import { TESSDATA_CONFIG } from './tessdata-config';
// In getWorker() method:
this.worker = await createWorker(language, 1, {
  langPath: TESSDATA_CONFIG.langPath, // Use local files
  logger: (m) => {
    // Log progress only while text recognition is running
    if (m.status === 'recognizing text') {
      logger.debug(`OCR Progress: ${Math.round(m.progress * 100)}%`);
    }
  }
});
Test that OCR works without internet:
- Disable internet connection
- Start the application
- Upload a document for OCR processing
- Verify it works correctly
Currently configured: English (eng)
- Edit scripts/setup-tessdata.mjs:
  const LANGUAGES = ['eng', 'fra', 'deu', 'spa']; // Add language codes
- Run setup again:
  npm run setup:tessdata
- Use the language in your OCR request:
  curl -X POST -F "file=@document.pdf" -F "language=fra" \
    http://localhost:3000/api/simple-ocr
Common language codes:
- eng - English
- fra - French
- deu - German
- spa - Spanish
- ita - Italian
- por - Portuguese
- rus - Russian
- chi_sim - Chinese (Simplified)
- jpn - Japanese
- ara - Arabic
Full list of supported languages
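A request for a language whose training data was never downloaded will fail when working offline, so it can be useful to validate the `language` field up front. The sketch below is an assumption about how such a check might look, not the service's actual code: `resolveLanguage` and `DEFAULT_LANGUAGE` are hypothetical names, and the fallback to 'eng' mirrors the default documented for the API.

```typescript
// Hypothetical validation of the `language` request field.
const DEFAULT_LANGUAGE = "eng";

function resolveLanguage(
  requested: string | undefined,
  installed: readonly string[],
): string {
  const code = (requested ?? "").trim().toLowerCase();
  // No language supplied: fall back to the documented default.
  if (code === "") return DEFAULT_LANGUAGE;
  if (!installed.includes(code)) {
    throw new Error(
      `Language "${code}" is not installed. Add it to LANGUAGES and run npm run setup:tessdata.`,
    );
  }
  return code;
}

const installedLanguages = ["eng", "fra"];
console.log(resolveLanguage("fra", installedLanguages));
console.log(resolveLanguage(undefined, installedLanguages));
```

Rejecting unknown codes early gives the caller a clear error message instead of a download failure later in the OCR pipeline.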
Issue: Serverless functions may not persist downloaded files between invocations.
Solution 1 - Bundle Language Data:
- Run npm run setup:tessdata before deployment
- Commit public/tessdata/ to your repository
- Configure OCR service to use local files
Solution 2 - Use CDN (Default):
- Accept that the first request after each cold start downloads the language data
- Subsequent requests in the same container are fast
Add to your Dockerfile:
# Download language data during build
RUN npm run setup:tessdata
# Or manually download
RUN mkdir -p public/tessdata && \
    cd public/tessdata && \
    wget https://tessdata.projectnaptha.com/4.0.0/eng.traineddata.gz
Run once during deployment:
npm run setup:tessdata
Language files persist on disk and work offline.
Configure OCR behavior in .env.local:
# OCR processing timeout (milliseconds)
OCR_TIMEOUT=600000
# Maximum file size (bytes)
MAX_FILE_SIZE=52428800
# Enable confidence tracking
ENABLE_CONFIDENCE_TRACKING=true
# Minimum acceptable confidence (0-100)
MIN_CONFIDENCE=70
Advanced configuration in lib/simple-ocr-service.ts:
await createWorker(language, 1, {
  langPath: '/tessdata',              // Local language files
  cacheMethod: 'write',               // Cache strategy
  logger: (m) => { /* ... */ },       // Progress logging
  errorHandler: (err) => { /* */ },   // Error handling
});
Cause: Tesseract.js is trying to download language data, but there is no internet connection.
Solution:
- Run npm run setup:tessdata with internet
- Configure OCR service to use local files (see Step 2 above)
Causes:
- Large file size
- Complex document layout
- Multiple languages
Solutions:
- Reduce image resolution (600 DPI max recommended)
- Pre-process images (deskew, denoise)
- Use specific language instead of auto-detect
- Enable preprocessing options:
curl -X POST -F "file=@doc.pdf" \
  -F "enhanceContrast=true" \
  -F "deskew=true" \
  http://localhost:3000/api/simple-ocr
Solutions:
- Ensure good quality input (300+ DPI)
- Use correct language setting
- Enable image preprocessing
- Check for correct orientation
- Avoid handwritten or stylized fonts
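To notice low-accuracy results programmatically, a caller can compare the `confidence` value returned by the API with a threshold such as MIN_CONFIDENCE. How the service itself applies MIN_CONFIDENCE is not described in this guide, so the following is only a client-side sketch; `parseMinConfidence` and `needsReview` are hypothetical names, and the fallback of 70 mirrors the example value in .env.local.

```typescript
// Hypothetical client-side check of OCR confidence against a threshold.
interface OcrResult {
  text: string;
  confidence: number; // 0-100, as in the API response
}

// Parse a threshold from an environment-style string, clamped to 0-100.
function parseMinConfidence(raw: string | undefined, fallback = 70): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) return fallback;
  return Math.min(100, Math.max(0, value));
}

// True when the result falls below the threshold and deserves a second look.
function needsReview(result: OcrResult, minConfidence: number): boolean {
  return result.confidence < minConfidence;
}

const threshold = parseMinConfidence("70");
console.log(needsReview({ text: "Extracted text content...", confidence: 95.5 }, threshold));
console.log(needsReview({ text: "Blurry scan", confidence: 42 }, threshold));
```

Results flagged this way are candidates for the remedies listed above, such as rescanning at a higher resolution or enabling preprocessing.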
Process a document with OCR.
Request:
curl -X POST http://localhost:3000/api/simple-ocr \
  -F "file=@document.pdf" \
  -F "language=eng" \
  -F "deskew=true" \
  -F "enhanceContrast=true"
Parameters:
- file (required) - PDF or image file
- language (optional) - Language code (default: 'eng')
- deskew (optional) - Auto-rotate text (default: false)
- enhanceContrast (optional) - Enhance image contrast (default: false)
- removeNoise (optional) - Remove image noise (default: false)
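The optional parameters above can be captured in a typed options object when calling the endpoint from code. This is an illustrative sketch only: `OcrOptions` and `toFormFields` are hypothetical names, and the defaults are copied from the parameter list. The `file` field is left out because it has to be attached separately as the upload itself.

```typescript
// Hypothetical typed view of the optional form fields.
interface OcrOptions {
  language?: string;
  deskew?: boolean;
  enhanceContrast?: boolean;
  removeNoise?: boolean;
}

// Convert options to string form fields, applying the documented defaults.
function toFormFields(options: OcrOptions = {}): Record<string, string> {
  return {
    language: options.language ?? "eng",
    deskew: String(options.deskew ?? false),
    enhanceContrast: String(options.enhanceContrast ?? false),
    removeNoise: String(options.removeNoise ?? false),
  };
}

const fields = toFormFields({ language: "fra", deskew: true });
console.log(fields);
```

Each entry can then be appended to a multipart form alongside the file, matching the `-F` flags in the curl request.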
Response:
{
  "success": true,
  "text": "Extracted text content...",
  "confidence": 95.5,
  "processingTime": 2341,
  "pageCount": 1,
  "message": "OCR processing completed successfully"
}
The OCR service reuses the Tesseract worker across requests for better performance.
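This guide does not show how the service implements worker reuse. One common pattern, sketched here under that caveat, is to cache the promise returned by an expensive async factory so that concurrent and later callers share a single instance. The helper name `lazySingleton` is hypothetical, and the factory below is a stand-in for a call such as `createWorker`.

```typescript
// Hypothetical helper: runs the factory at most once and shares its result.
function lazySingleton<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached !== undefined) return cached;
    const pending = factory();
    cached = pending;
    // If initialisation fails, forget the promise so the next call retries.
    pending.catch(() => {
      if (cached === pending) cached = undefined;
    });
    return pending;
  };
}

// Stand-in for an expensive resource such as a Tesseract worker.
let created = 0;
const getResource = lazySingleton(async () => {
  created += 1;
  return { id: created };
});

const first = getResource();
const second = getResource();
console.log(first === second); // both calls share the same promise
```

Caching the promise rather than the resolved value means two requests arriving during startup do not each create their own worker.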
Enable preprocessing for better accuracy and speed:
- deskew: Automatically correct image rotation
- enhanceContrast: Improve text visibility
- removeNoise: Clean up scan artifacts
Always specify the correct language instead of using auto-detection:
-F "language=eng" # Much faster than auto-detect
- Use 300 DPI for scanned documents
- Convert color images to grayscale
- Ensure good lighting and contrast
- Tesseract.js Documentation
- Tesseract OCR Best Practices
- Language Training Data
- Project README
- Testing Report
Last Updated: November 14, 2025