Extracts voter data from Tamil Nadu electoral roll PDFs (English + Tamil pairs) into structured CSV files using local OCR. Completely offline, zero API cost, and achieves 99.87% cell-level accuracy across 1,118 validated records.
| Factor | OCR (this tool) | LLM-based |
|---|---|---|
| Cost | $0 (runs locally) | API costs per page (~$0.01-0.05/page) |
| Accuracy | 99.87% cell accuracy | Comparable, but varies by model |
| Speed | ~60s per page pair | Depends on API rate limits |
| Scalability | Run 1000s of pairs overnight with multi-worker | Limited by API quotas and cost |
| Privacy | All data stays local | Data sent to external API |
| Offline | Works without internet | Requires internet |
With 4 workers, the tool processes ~11,700 page pairs overnight. Accuracy was achieved through 5 phases of targeted improvements including empty cell detection, multi-strategy EPIC/serial voting, confidence-aware Tamil matching, and consecutive-run serial anchoring.
PDF --> PyMuPDF (extract image) --> OpenCV (detect grid, crop cells)
--> Empty cell filter (ink density) --> Tesseract OCR (per cell)
--> Multi-signal validation --> Regex parsing --> Merge EN+TA --> CSV
| Component | Library | Purpose |
|---|---|---|
| PDF image extraction | PyMuPDF (fitz) | Extract embedded PNG from each single-page PDF |
| Grid detection | OpenCV | Morphological ops to find the 3-column x 10-row grid |
| Empty cell detection | OpenCV | Ink density analysis to skip empty cells before OCR |
| Image preprocessing | OpenCV | CLAHE, denoising, adaptive threshold, 4x upscale |
| OCR | Tesseract 5.4+ (pytesseract) | Text recognition (PSM 6, OEM 1) |
| Field parsing | Python re | Regex extraction with fuzzy label matching |
| Tamil matching | EPIC ID (confidence-aware) + serial + position | Match Tamil page to English page |
```
start.bat
```

Opens a browser-based dashboard at http://localhost:7000 with guided setup, one-click workflow execution, and real-time progress monitoring. No CLI knowledge required.
The CLI approach (below) remains fully supported and unchanged — the web UI is additive only.
```
setup.bat
```

This checks for and installs Python, Tesseract, Tamil language data, and Python dependencies.
Download from python.org. Check "Add Python to PATH" during installation.
```
winget install UB-Mannheim.TesseractOCR
```

This installs to C:\Program Files\Tesseract-OCR\. Verify:

```
"C:\Program Files\Tesseract-OCR\tesseract.exe" --version
# Expected: tesseract v5.4.0.xxxxx
```

Important: During Tesseract installation, check "Additional language data" and "Additional script data" to include Tamil support automatically.
Download tam.traineddata and copy to Tesseract's tessdata folder:
```
# Open Command Prompt as Administrator
copy %USERPROFILE%\Downloads\tam.traineddata "C:\Program Files\Tesseract-OCR\tessdata\tam.traineddata"
```

Verify Tamil is available:

```
"C:\Program Files\Tesseract-OCR\tesseract.exe" --list-langs
# Should list: eng, osd, tam
```

```
pip install -r requirements.txt
```

Verify:

```
python -c "import fitz, cv2, pytesseract, numpy, PIL; print('All packages OK')"
```

Place your downloaded electoral roll PDFs in the following structure:

```
Input/ER_Downloads/AC-xxx/
    english/    <-- English PDF files (e.g., 2026-EROLLGEN-...-ENG-1-WI.pdf)
    tamil/      <-- Tamil PDF files (e.g., 2026-EROLLGEN-...-TAM-1-WI.pdf)
```
Replace AC-xxx with your Assembly Constituency number (e.g., AC-184, AC-188).
```
python split_pdfs.py --ac AC-188
# Or run interactively:
python split_pdfs.py
```

This splits each multi-page PDF into individual page files. All pages are split; non-data pages (metadata, summary, maps) are auto-detected and skipped during extraction in Step 3. Output goes to Input/split_files/AC-188/{english,tamil}/.
```
# Validate on a single page pair first (recommended)
python extract_ocr.py AC-188 --validate

# Process one AC with 4 parallel workers
python extract_ocr.py AC-188 --workers 4

# Process specific part(s), useful when splitting work across people
python extract_ocr.py AC-188 --part 101
python extract_ocr.py AC-188 --part 50-100
python extract_ocr.py AC-188 --part 1,5,10-20

# Process first 100 pairs only (useful for testing)
python extract_ocr.py AC-188 --limit 100 --workers 4

# Process all ACs (can run overnight for large datasets)
python extract_ocr.py --all --workers 4

# Run interactively (prompts for AC number):
python extract_ocr.py
```

Output CSVs (one per page) are saved to output/split_files/AC-188/.
```
python merge_outputs.py --ac AC-188
# Or run interactively:
python merge_outputs.py
```

This merges page-level CSVs back into part-level and AC-level files. Output goes to output/merged_files/parts/AC-188/ (per-part) and output/merged_files/ac/AC-188.csv (entire constituency).
Important: The merge script does a full rewrite of each part CSV, not an incremental append. Once a part is merged, it is marked as done in a checkpoint and skipped on subsequent runs. If you extract additional pages for a part that was already merged (e.g., extracted 20 pages, merged, then extracted the remaining 26), you must use --force to re-merge and pick up the new pages:
```
python merge_outputs.py --ac AC-188 --force
```

For best results, complete all extraction for a part before merging.
```
bash check-progress.sh
```

```
ER_OCR/
├── extract_ocr.py          # Main OCR extraction script
├── split_pdfs.py           # Split multi-page PDFs into individual pages
├── merge_outputs.py        # Merge page CSVs into part-level and AC-level CSVs
├── analyze_quality.py      # Quality analysis and accuracy reporting
├── check-progress.sh       # Progress monitoring script
├── setup.bat               # Automated dependency setup (Windows)
├── start.bat               # Web UI launcher (Windows); run this to start the UI
├── start.sh                # Web UI launcher (Linux/macOS)
├── requirements.txt        # Python dependencies (OCR + web)
├── web/                    # Web UI (FastAPI backend + browser frontend)
│   ├── app.py              # FastAPI application entry point
│   ├── api/                # API route modules
│   ├── core/               # Job manager, dep checker, queue manager
│   └── static/             # HTML, CSS, JS (served at http://localhost:7000)
├── Input/
│   ├── ER_Downloads/
│   │   └── AC-xxx/         # Original downloaded PDFs
│   │       ├── english/
│   │       └── tamil/
│   └── split_files/
│       └── AC-xxx/         # Split page PDFs + checkpoint
│           ├── english/
│           ├── tamil/
│           └── checkpoint.json
├── output/
│   ├── split_files/
│   │   └── AC-xxx/         # Page-level CSVs (one per page)
│   └── merged_files/
│       ├── parts/
│       │   └── AC-xxx/     # Part-level merged CSVs
│       └── ac/
│           └── AC-xxx.csv  # AC-level merged CSV (entire constituency)
└── logs/                   # Per-run log files and JSON summaries
```
14 columns, UTF-8 with BOM encoding (for Excel compatibility):
| Column | Source | Example |
|---|---|---|
| AC No | Filename | 184 |
| Part No | Filename | 33 |
| Serial No | English OCR | 211 |
| EPIC ID | English OCR | RVJ1612993 |
| Name (English) | English OCR | Kavitha |
| Name (Tamil) | Tamil OCR | கவிதா |
| Relation Name (English) | English OCR | Murugesan |
| Relation Name (Tamil) | Tamil OCR | முருகேசன் |
| Relation Type | English OCR | Father |
| House No | English OCR | '1-192 |
| Age | English OCR | 30 |
| Gender | English OCR | Female |
| DOB | (always blank) | |
| ContactNo | (always blank) | |

- House numbers are prefixed with `'` to prevent Excel auto-formatting
- Records are sorted ascending by Serial No within each file
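The output conventions above (UTF-8 with BOM, apostrophe-prefixed house numbers, rows sorted by Serial No) can be sketched as follows. This is an illustrative sketch, not the tool's actual writer: the function names `excel_safe` and `write_page_csv` are hypothetical, and only the column names come from the table above.

```python
import csv


def excel_safe(value):
    """Prefix a house number with an apostrophe so Excel keeps it as text."""
    if value and not value.startswith("'"):
        return "'" + value
    return value


def write_page_csv(path, fieldnames, records):
    """Write records sorted by Serial No as UTF-8 with BOM (hypothetical helper)."""
    ordered = sorted(records, key=lambda r: int(r["Serial No"]))
    # "utf-8-sig" makes Python emit the BOM that Excel uses to detect UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for record in ordered:
            row = {**record, "House No": excel_safe(record.get("House No", ""))}
            writer.writerow(row)
```

Without the BOM, Excel on Windows typically opens the file in a legacy code page and the Tamil columns appear garbled, which is why the encoding choice matters here.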
When running with --cross-check or --validate, two additional columns are appended:
| Column | Values | Description |
|---|---|---|
| Cross_Check | OK / REVIEW | REVIEW means at least one field disagreed between English and Tamil cells |
| Cross_Check_Notes | text | Semicolon-separated list of mismatches (e.g., EPIC mismatch EN=WXJ1234567 TA=WXJ1234568; House mismatch EN=3-5 TA=3-6) |
Cross-check columns are never written in normal production runs — they only appear when explicitly requested.
```
start.bat        # Windows: auto-detects a free port, opens at http://localhost:7000
bash start.sh    # Linux/macOS
```

The server runs locally on loopback only (127.0.0.1), so it is not exposed to the network.
The UI uses a sidebar navigation on desktop (collapses to a bottom bar on mobile):
| Section | Purpose |
|---|---|
| Setup (gear icon) | Collapsible panel — check Tesseract, Tamil tessdata, and Python packages. Install missing deps with one click. |
| Workflow | Select an AC (or create a new one with +), configure options, run individual steps or the full pipeline. System resources panel shows RAM-aware worker recommendation and disk space warning. |
| Live Logs | Real-time streaming output from any running job. ETA estimator, colour-coded lines, kill button. |
| Data | Browse all ACs — download merged CSVs, view extraction progress, check file validation. |
| History | Browse and download past log files and run summary JSONs. |
Add multiple ACs to the queue from the Workflow tab and click Start Queue. The tool runs split → extract → merge for each AC sequentially. Queue state is saved to web/queue_state.json — a server restart will resume from where it left off. A browser notification fires when each AC completes.
The system resources panel reads your CPU core count and available RAM, then recommends a safe worker count (rule: min(cores-1, available_RAM_GB / 0.5)). The slider turns yellow above the recommended value and red when risky. A disk space warning appears if the estimated output size for the selected AC approaches your free disk space.
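The recommendation rule can be written out as a small function. This is a sketch of the stated rule only; flooring the RAM term and never returning less than one worker are assumptions, and `recommend_workers` is a hypothetical name.

```python
import math


def recommend_workers(cpu_cores, available_ram_gb, ram_per_worker_gb=0.5):
    """Apply min(cores - 1, available_RAM_GB / 0.5), floored, with a floor of 1."""
    by_cpu = cpu_cores - 1
    by_ram = math.floor(available_ram_gb / ram_per_worker_gb)
    return max(1, min(by_cpu, by_ram))


# An 8-core machine with plenty of RAM is CPU-bound: 7 workers.
# The same machine with only 1.6 GB free is RAM-bound: 3 workers.
print(recommend_workers(8, 16.0), recommend_workers(8, 1.6))
```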
The Extract step in the Workflow tab exposes the full CLI interface:
| Option | UI Control | Description |
|---|---|---|
| --part | Text input | Process specific part number or range (e.g., "101" or "50-100") |
| --page | Text input | Page number, range, or list within a part (e.g., "4", "1-10", "1,5,10-20"). Requires --part |
| --limit | Number input | Max pairs to process (0 = all) |
| --cross-check | Checkbox | Cross-validate EN vs TA cells, adds 2 extra columns |
| --reset | Checkbox (yellow) | Clear checkpoint for specified part; shows confirmation dialog |
Two utility buttons are also available:
- Dry Run: runs `--dry-run` to show pending file pairs without processing
- Validate Page: runs `--validate` with the current `--part`/`--page` settings
Each pipeline step has a collapsible ℹ info button explaining what it does and where output goes.
Click the green + button next to the AC dropdown to create a new AC input directory. Enter the AC number in AC-xxx format (e.g., AC-188) and the tool creates Input/ER_Downloads/AC-xxx/{english,tamil}/ ready for PDF files.
The Test 1 Page button runs extract_ocr.py --validate on the selected AC and renders the extracted records as an inline table — useful for spot-checking OCR quality before committing to a full run.
The Setup tab can install missing components:
- Python packages: runs `pip install -r requirements.txt`
- Tesseract OCR: runs `winget install UB-Mannheim.TesseractOCR` (Windows)
- Tamil tessdata: downloads `tam.traineddata` directly from the Tesseract GitHub repository. If `C:\Program Files\Tesseract-OCR\tessdata\` requires admin rights, it falls back to a project-local `tessdata/` folder and writes `TESSDATA_PREFIX` to `.env` (picked up automatically by `start.bat`).
Default port is 7000. If 7000 is in use, start.bat automatically tries 7001–7009. Port 8000 is intentionally avoided as Windows (Hyper-V / WSL) commonly reserves it.
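The port probing described above is done by `start.bat`; the same idea can be sketched in Python for illustration. The function name `find_free_port` is hypothetical, and binding a throwaway socket is one common way to test availability, not necessarily what the launcher does.

```python
import socket


def find_free_port(host="127.0.0.1", start=7000, end=7009):
    """Return the first port in [start, end] that can be bound, else None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken or reserved, try the next one
            return port
    return None
```

The probe socket is closed before the function returns, so the caller is free to start the real server on the returned port.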
python split_pdfs.py [options]
Options:
--ac AC-xxx Assembly Constituency (e.g., AC-188). Prompts if omitted.
--force Overwrite existing split files
All pages are split (no metadata pages skipped). Non-data pages are auto-detected and skipped during extraction by extract_ocr.py.
python extract_ocr.py [directory] [options]
Positional:
directory AC directory (e.g., AC-188). Prompts if omitted.
Options:
--all Process all discovered AC directories
--validate Process only 1 pair, print detailed output (auto-enables --cross-check)
--dry-run List pending pairs without processing
--reset Reset checkpoint and output for a directory
--part PARTS Filter by part number: single (101), range (50-100), or mixed (1,5,10-20)
--page PAGES Page number, range, or list (e.g., 4, 1-10, 1,5,10-20). Requires --part.
--workers N Number of parallel workers (default: 4)
--limit N Process only N pairs, then stop
--cross-check Cross-validate EPIC ID, House No, and serial between English and Tamil cells.
Appends Cross_Check and Cross_Check_Notes columns to CSV output.
Part filtering examples:
python extract_ocr.py AC-188 --part 101 # Only Part 101
python extract_ocr.py AC-188 --part 50-100 # Parts 50 through 100
python extract_ocr.py AC-188 --part 1,5,10-20 # Parts 1, 5, and 10-20
python extract_ocr.py AC-188 --reset --part 101 # Reset only Part 101
python extract_ocr.py AC-188 --reset --part 50-100 # Reset Parts 50-100
python extract_ocr.py AC-188 --dry-run --part 101 # Preview Part 101 pending pairs
Validation examples:
python extract_ocr.py AC-188 --validate # Validate first pending pair
python extract_ocr.py AC-188 --part 3 --page 4 --validate # Validate specific page
python extract_ocr.py AC-188 --part 3 --page 1-10 # Process pages 1 through 10
python extract_ocr.py AC-188 --part 3 --page 1,5,10-20 # Specific pages and ranges
python extract_ocr.py AC-188 --part 3 --page 4 --validate --cross-check # With cross-check
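The `--part` and `--page` values share one syntax: single numbers, ranges, and comma-separated mixes. A minimal parser for that syntax might look like the sketch below; `parse_number_spec` is a hypothetical name and the error handling is an assumption, not the script's actual behaviour.

```python
def parse_number_spec(spec):
    """Expand a spec such as "1,5,10-20" into a sorted list of integers."""
    numbers = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            low_text, high_text = token.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"invalid range: {token}")
            numbers.update(range(low, high + 1))  # ranges are inclusive
        else:
            numbers.add(int(token))
    return sorted(numbers)
```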
python merge_outputs.py [options]
Options:
--ac AC-xxx Assembly Constituency (e.g., AC-188). Prompts if omitted.
--force Re-merge all parts from scratch (required if new pages were
extracted after a previous merge)
Produces both part-level CSVs (output/merged_files/parts/AC-xxx/) and a single AC-level CSV (output/merged_files/ac/AC-xxx.csv). Each merge does a full rewrite — it does not append. Without --force, already-merged parts are skipped. The AC-level file is always regenerated from current part CSVs.
python analyze_quality.py [options]
Options:
--ac AC-xxx Analyze specific AC (default: all available)
split_pdfs.py splits all pages from multi-page PDFs into individual page files. No pages are skipped during splitting — non-data pages (metadata, summary, maps, legends) are auto-detected and skipped during extraction.
Before running OCR on each cell, ink density is analyzed. Cells with less than 2% ink coverage (after excluding grid line borders) are skipped. This eliminates phantom records from empty cells on partial pages (pages with fewer than 30 entries), with zero OCR cost.
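The ink density check can be illustrated with NumPy on a grayscale cell image. Only the 2% threshold comes from the text above; the 5% border margin, the darkness cutoff of 128, and the function names are assumptions made for this sketch.

```python
import numpy as np


def ink_ratio(cell, border_frac=0.05, dark_threshold=128):
    """Fraction of dark pixels in a grayscale cell, ignoring a border margin."""
    height, width = cell.shape
    dy, dx = int(height * border_frac), int(width * border_frac)
    # Trim the margin so grid lines around the cell do not count as ink.
    inner = cell[dy:height - dy, dx:width - dx]
    return float((inner < dark_threshold).mean())


def is_empty_cell(cell, min_ink=0.02):
    """True when less than 2% of the inner area is ink."""
    return ink_ratio(cell) < min_ink
```

Because the check is a single array comparison, it costs far less than an OCR pass, which is what makes it worthwhile to run on every cell.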
Each PDF page contains a 3-column x 10-row grid of voter entries (max 30 per page). OpenCV detects:
- Horizontal lines using morphological opening with a wide kernel
- Column boundaries by finding vertical gaps in pixel density
- Fallback: If detected columns span <85% of page width, falls back to proportional `[2%, 34%, 66%, 98%]`
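The column fallback can be sketched as a small decision function. The 85% threshold and the proportional positions come from the list above; requiring exactly four detected boundaries and the name `column_bounds` are assumptions.

```python
def column_bounds(detected, page_width, min_span=0.85,
                  fallback_fracs=(0.02, 0.34, 0.66, 0.98)):
    """Return 4 x-coordinates bounding the 3 columns.

    `detected` is a sorted list of boundary x-coordinates from grid detection.
    """
    if len(detected) == 4 and detected[-1] - detected[0] >= min_span * page_width:
        return list(detected)
    # Detection missed a column or was too narrow: split the page proportionally.
    return [round(frac * page_width) for frac in fallback_fracs]
```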
Each non-empty cell is cropped, upscaled 4x (Lanczos), preprocessed, then passed to Tesseract:
- English cells: `--psm 4 --oem 1` with `lang=eng`
- Tamil cells: `--psm 6 --oem 1` with `lang=tam+eng` (bottom 15% cropped to avoid label contamination), with retry using alternative preprocessing (Otsu, less aggressive crop) when initial result is poor
Records must have at least 2 valid signals (name, EPIC ID, serial number, age+gender, house number) to be accepted. This prevents noise from empty or partially-filled cells from creating phantom records.
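The two-signal rule can be sketched as follows. The five signals are the ones listed above; the record keys, the EPIC pattern (three letters followed by seven digits, matching the example `RVJ1612993`), and the function names are assumptions for illustration.

```python
import re

# Assumed EPIC shape: 3 uppercase letters followed by 7 digits.
EPIC_PATTERN = re.compile(r"^[A-Z]{3}\d{7}$")


def count_signals(record):
    """Count how many of the five validation signals a parsed cell provides."""
    signals = [
        bool(record.get("name", "").strip()),
        bool(EPIC_PATTERN.match(record.get("epic", ""))),
        str(record.get("serial", "")).isdigit(),
        # Age and gender together count as a single signal.
        str(record.get("age", "")).isdigit() and bool(record.get("gender", "").strip()),
        bool(record.get("house_no", "").strip()),
    ]
    return sum(signals)


def is_valid_record(record, min_signals=2):
    return count_signals(record) >= min_signals
```

A cell that yields only a stray name-like token or only a house-number fragment is rejected, while a record with a readable name and EPIC ID is kept even if other fields failed to parse.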
- Multi-strategy voting: Serial numbers extracted using 3 threshold strategies (fixed 150, Otsu, fixed 120) with majority voting
- Cross-validation: Targeted serial always cross-validates the primary OCR result
- Consecutive-run anchoring: Stray serial filter uses the longest consecutive run as anchor instead of median, preventing a single misread from cascading into incorrect corrections
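A simplified sketch of majority voting and consecutive-run anchoring is shown below. It assumes serial numbers on a page increase by one per cell, and it rewrites every cell relative to the anchor, which is cruder than a filter that only touches strays. All function names are hypothetical.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the value at least two strategies agree on, else None."""
    votes = Counter(c for c in candidates if c is not None)
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count >= 2 else None


def longest_consecutive_run(serials):
    """Return (start_index, length) of the longest +1 run, skipping None."""
    best_start, best_len, start = 0, 0, 0
    for i, value in enumerate(serials):
        if value is None:
            start = i + 1
            continue
        if i > start and value != serials[i - 1] + 1:
            start = i
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return best_start, best_len


def anchor_serials(serials):
    """Re-derive every serial from the longest consecutive run."""
    start, length = longest_consecutive_run(serials)
    if length == 0:
        return list(serials)
    base = serials[start] - start  # the serial that cell 0 should hold
    return [base + i for i in range(len(serials))]
```

With the input `[501, 502, 503, 504, 605, 506, 507]`, the run 501 to 504 is the anchor and the misread 605 is corrected to 505. A median-based anchor can be pulled off by a cluster of misreads, whereas a long consecutive run is unlikely to arise by accident.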
- Multi-strategy voting: 3 preprocessing strategies (CLAHE, Otsu, sharpen) with confidence scoring
- Consensus detection: If 2+ strategies agree, confidence is boosted
- Confidence-aware Tamil matching: Low-confidence EPICs (<70) skip EPIC-based Tamil matching in favor of position-based matching
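The EPIC voting and the confidence gate can be sketched as below. The threshold of 70 comes from the list above; the size of the consensus boost (10 points), the 0 to 100 confidence scale, and the function names are assumptions.

```python
def vote_epic(readings, consensus_boost=10.0):
    """Pick the best EPIC from (text, confidence) readings, one per strategy."""
    grouped = {}
    for text, confidence in readings:
        if text:
            grouped.setdefault(text, []).append(confidence)
    best_text, best_conf = None, 0.0
    for text, confidences in grouped.items():
        score = max(confidences)
        if len(confidences) >= 2:  # two or more strategies agree
            score = min(100.0, score + consensus_boost)
        if score > best_conf:
            best_text, best_conf = text, score
    return best_text, best_conf


def tamil_match_mode(confidence, threshold=70):
    """Low-confidence EPICs fall back to position-based Tamil matching."""
    return "epic" if confidence >= threshold else "position"
```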
Regex patterns with fuzzy label matching handle common OCR errors:
- `"Fatner Name"` --> `"Father Name"`, `"Gerder"` --> `"Gender"`
- EPIC ID `O` --> `0` correction in digit positions
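Both corrections can be illustrated with the standard library. This sketch uses `difflib` for the fuzzy step, which is a stand-in and not necessarily how the tool matches labels; the label list, the 0.8 cutoff, and the assumed EPIC layout (3 letters, then 7 digits) are hypothetical.

```python
import difflib
import re

# Hypothetical label vocabulary for illustration.
KNOWN_LABELS = ["Name", "Father Name", "Husband Name", "Mother Name",
                "House Number", "Age", "Gender"]


def normalize_label(raw, cutoff=0.8):
    """Map a misread label such as "Fatner Name" to its closest known label."""
    matches = difflib.get_close_matches(raw.strip(), KNOWN_LABELS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def fix_epic(raw):
    """Replace the letter O with the digit 0, but only in the digit positions."""
    text = re.sub(r"\s+", "", raw).upper()
    if len(text) != 10:
        return text  # unexpected length, leave untouched
    return text[:3] + text[3:].replace("O", "0")
```

Restricting the substitution to the last seven characters matters because `O` is a legitimate letter in the three-character prefix.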
- Gender labels ("ஆண்", "பெண்") and noise words are filtered from Tamil name output
- Minimum 3 Tamil characters required (rejects single-char OCR fragments)
- Retry with alternative preprocessing (Otsu threshold, less aggressive crop) when initial result is poor
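The name filter can be sketched as below. The two gender labels and the three-character minimum come from the list above; counting Unicode code points in the Tamil block (U+0B80 to U+0BFF) as "characters" and the function name are assumptions, and the real noise word list is larger.

```python
import re

TAMIL_CHAR = re.compile(r"[\u0B80-\u0BFF]")
# Only the two gender labels named in the text; the actual list has more entries.
NOISE_WORDS = {"ஆண்", "பெண்"}


def clean_tamil_name(raw, min_chars=3):
    """Drop noise words, then reject results with too few Tamil characters."""
    words = [word for word in raw.split() if word not in NOISE_WORDS]
    name = " ".join(words)
    if len(TAMIL_CHAR.findall(name)) < min_chars:
        return ""
    return name
```

An empty return value is the signal a caller could use to trigger the retry with alternative preprocessing.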
- Extract EPIC IDs from English page
- Try Tamil pages at same number, +/-1 first (fast path)
- Fall back to scanning all Tamil pages for the part
- EPIC-based matching (Pass 1): Only used when EPIC confidence >= 70
- Serial-based matching (Pass 2): Fills gaps from Pass 1
- Position-based matching (Pass 3): Cell index fallback
- Cross-validation: Low-confidence EPIC matches are verified against position-based results
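The three matching passes can be sketched over two lists of parsed cells. The pass order and the confidence threshold follow the list above; the dictionary keys, the function name, and the omission of the final cross-validation step are simplifications for this example.

```python
def match_tamil_cells(english, tamil, min_conf=70):
    """Return {english_index: tamil_index} using EPIC, serial, then position."""
    matches, used = {}, set()

    def assign(en_index, ta_index):
        matches[en_index] = ta_index
        used.add(ta_index)

    # Pass 1: EPIC ID, only when the English EPIC is trusted.
    by_epic = {t["epic"]: j for j, t in enumerate(tamil) if t.get("epic")}
    for i, cell in enumerate(english):
        j = by_epic.get(cell.get("epic"))
        if cell.get("epic_conf", 0) >= min_conf and j is not None and j not in used:
            assign(i, j)

    # Pass 2: serial number fills the gaps left by pass 1.
    by_serial = {t["serial"]: j for j, t in enumerate(tamil)
                 if t.get("serial") is not None}
    for i, cell in enumerate(english):
        j = by_serial.get(cell.get("serial"))
        if i not in matches and j is not None and j not in used:
            assign(i, j)

    # Pass 3: same cell index as a last resort.
    for i in range(len(english)):
        if i not in matches and i < len(tamil) and i not in used:
            assign(i, i)
    return matches
```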
After each page pair, the filename is saved to checkpoint.json inside the AC directory. Processing can be stopped and resumed at any time.
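A resumable checkpoint of this kind can be sketched with a small JSON file. The `{"done": [...]}` layout and the function names are assumptions, not the actual format of `checkpoint.json`, and the sketch does not handle several workers writing at once.

```python
import json
import os
from pathlib import Path


def load_checkpoint(path):
    """Return the set of filenames already processed (empty if no checkpoint)."""
    checkpoint = Path(path)
    if not checkpoint.exists():
        return set()
    data = json.loads(checkpoint.read_text(encoding="utf-8"))
    return set(data.get("done", []))


def mark_done(path, filename):
    """Record one finished page pair, replacing the file atomically."""
    done = load_checkpoint(path)
    done.add(filename)
    checkpoint = Path(path)
    temp = checkpoint.with_suffix(".tmp")
    temp.write_text(json.dumps({"done": sorted(done)}, indent=2), encoding="utf-8")
    os.replace(temp, checkpoint)  # avoids a half-written file if interrupted


def pending(filenames, path):
    """Filter out files that the checkpoint already lists."""
    done = load_checkpoint(path)
    return [name for name in filenames if name not in done]
```

Writing to a temporary file and renaming it means that stopping the process mid-write leaves the previous checkpoint intact.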
Validated on AC-166 Part 1 (46 page pairs, 1,118 voter records including 10 partial pages):
| Metric | Value |
|---|---|
| Overall cell accuracy | 99.87% |
| Record completeness | 98.39% |
| Serial number accuracy | 100% |
| EPIC ID fill rate | 99.9% |
| Name (English) | 99.8% |
| Name (Tamil) | 99.8% |
| Malformed EPIC IDs | 0 |
v1.1 improvements over v1.0:
- Eliminated phantom records on partial pages (pages with <30 entries)
- Fixed serial number misreads caused by stray filter cascading (e.g., 505 misread as 605)
- Improved Tamil name quality: no more gender labels or single-char fragments as names
- Reduced EPIC digit misreads via multi-strategy confidence voting
| Metric | Value |
|---|---|
| Per page pair (sequential) | ~60 seconds |
| Per AC (~720 pairs, 4 workers) | ~3-4 hours |
| All ACs (~11,700 pairs, 4 workers) | ~6-8 hours |
| Cost | $0 (all local) |
| RAM usage (4 workers) | ~800 MB |
Note: v1.1 is slightly slower per page than v1.0 due to multi-strategy EPIC/serial voting. The additional OCR passes only trigger when needed (low confidence or missing fields).
Each extraction run generates:
- Log file: `logs/extract_{AC}_{timestamp}.log` (per-file processing details)
- Summary JSON: `logs/extract_{AC}_{timestamp}_summary.json` (total records, warnings, errors, duration)
| Issue | Solution |
|---|---|
| setup.bat fails to install Tesseract | winget may not be available. Install manually from UB-Mannheim, then re-run setup.bat |
| Tesseract OCR not found | Verify C:\Program Files\Tesseract-OCR\tesseract.exe exists |
| No module named 'fitz' | Run pip install pymupdf |
| Tamil names are empty | Verify tam appears in tesseract --list-langs |
| Grid detection failed | Falls back to proportional splitting automatically |
| Processing is slow | Reduce --workers if RAM is limited; increase for faster CPUs |
| Permission denied on tessdata | Use the Web UI Setup tab (auto-falls back to project-local folder), or copy with an admin terminal |
| Unicode errors on Windows | The script handles UTF-8 encoding automatically |
| Merge missing new pages | Use python merge_outputs.py --ac AC-xxx --force to re-merge |
| Web UI: can't reach localhost:7000 | Port 7000 may be in use; start.bat tries 7000–7009 automatically |
| Web UI: fastapi not found | Run pip install fastapi uvicorn aiofiles psutil or let start.bat install them |
| Web UI: job shows error immediately | Check the Live Logs tab; the subprocess exit message explains the cause |
- OCR accuracy ceiling at 115 DPI: Source PDFs are rasterized at 115 DPI, limiting OCR precision for some characters
- EPIC ID misreads: Occasional character confusion (e.g., `RVI` vs `RVJ`), inherent to OCR at this resolution and mitigated by multi-strategy voting in v1.1
- Tamil name quality on partial pages: A few cells on partial pages may have missing or mismatched Tamil names due to garbled OCR at 115 DPI
- Processing speed: ~60s per page pair due to Tesseract OCR overhead and multi-strategy voting; parallelized with `--workers`
- Empty folder guidance: When an AC has no English or Tamil PDFs, a helpful message now directs users to download from voters.eci.gov.in and shows the exact save path (e.g., `Input/ER_Downloads/AC-200/english/`)
- File path summary after jobs: After any job completes, errors, or is killed, the Live Logs terminal appends a summary showing the input and output file paths for that step; the summary is also shown when switching to a completed job in the dropdown
- Web UI (`start.bat` / `start.sh`): Browser-based dashboard at `http://localhost:7000`
- Setup tab: dependency checker for Python packages, Tesseract binary, and Tamil tessdata, with one-click installation and live progress streaming
- Workflow tab: AC selector, step-by-step pipeline (Split → Extract → Merge → Analyze) with all CLI options exposed as UI controls
- Create new AC: `+` button next to AC dropdown creates `Input/ER_Downloads/AC-xxx/{english,tamil}/` directories from the UI
- Full Extract CLI parity: `--part`, `--page`, `--limit`, `--cross-check`, `--reset` exposed as UI controls; plus standalone Dry Run and Validate Page buttons
- Count mismatch guidance: When English/Tamil PDF counts differ, warning links to voters.eci.gov.in for re-download
- Page range support: `--page` now accepts ranges and lists (e.g., `1-10`, `1,5,10-20`), the same syntax as `--part`
- Sidebar navigation: Workflow, Live Logs, Data, History as vertical sidebar (desktop) / bottom bar (mobile); Setup as collapsible panel via gear icon
- Collapsible step info: Each pipeline step has an `ℹ` button explaining what it does and where output goes
- Reset confirmation: `--reset` checkbox shows a confirm dialog before clearing checkpoint data
- System resources panel: RAM-aware worker count recommendation, disk space warning with estimated output size
- Input file validator: checks English/Tamil PDF count match before running
- Quick Validate / Preview: "Test 1 Page" button with inline extracted-records table
- Live Logs tab: real-time SSE log streaming, colour-coded output, ETA estimator, kill button
- Overnight Queue: queue multiple ACs for sequential unattended processing; state persisted to `web/queue_state.json`
- Browser notifications: fired on job/queue completion
- Data tab: AC overview table, CSV download, per-AC progress
- History tab: log file browser and run summary viewer
- Dark mode default with toggle; port auto-detection (7000–7009)
- CLI workflow unchanged: all existing commands work exactly as before
- Cross-validation layer (`--cross-check`): After extraction, re-examines Tamil cells to independently verify EPIC ID, House No, and serial number against the English cell values. Mismatches are flagged as `REVIEW` in two new optional CSV columns (`Cross_Check`, `Cross_Check_Notes`)
- House No cross-check: Dedicated English-language OCR pass on Tamil cells to extract and compare house numbers, parallel to the existing EPIC ID cross-check
- Targeted page validation (`--page N`): Allows `--validate` to target a specific page number within a part, bypassing the checkpoint so any page (including already-processed ones) can be re-inspected
- Zero overhead in production: Cross-check columns are never written unless `--cross-check` or `--validate` is specified; default 14-column CSV format is unchanged
- AC-level merge: `merge_outputs.py` now automatically produces a single CSV per constituency alongside part-level files
- Output directory restructure: Merged output moved from `output/merged/` to `output/merged_files/parts/` and `output/merged_files/ac/`
- Split all pages: `split_pdfs.py` no longer skips metadata pages; non-data pages are auto-detected during extraction
- Empty cell detection: Pre-OCR ink density analysis eliminates phantom records on partial pages
- Multi-signal record validation: Requires 2+ valid fields to accept a record
- Serial number multi-strategy voting: 3 threshold strategies with majority voting and cross-validation
- Consecutive-run serial anchor: Stray filter uses longest consecutive run instead of median, preventing misread cascades
- Trailing empty row trim: Safety net removes noise records from bottom of partial pages
- Tamil name quality: Minimum 3 Tamil characters, gender/noise label rejection, expanded noise word list
- Tamil OCR retry: Alternative preprocessing (Otsu, less aggressive crop) when initial Tamil result is poor
- EPIC confidence scoring: Multi-strategy voting with per-word confidence from Tesseract
- Confidence-aware Tamil matching: Low-confidence EPICs skip to position-based matching; cross-validation for borderline cases
- Merge documentation: Clarified that merge does full rewrite, not append; `--force` required for re-merge
- Initial release with 4 phases of OCR improvements
- 99.30% cell-level accuracy across 6,510 validated records
This project is licensed under the MIT License.