This repository was archived by the owner on Dec 10, 2025. It is now read-only.
The IEBC PDF for the 2025 by-elections contains multiple tables for candidates running for different positions (Senator, Member of Parliament, Member of County Assembly). Currently, we do not have an automated way to extract candidate information into structured JSON compatible with our database schema (candidates, positions, parties).
This issue tracks the work to implement a candidate data extraction module that will:
Read the relevant PDF pages.
Normalize table columns across tables with different structures.
Handle missing or inconsistent data (e.g., party symbols, candidate photos, gender not included).
Produce a clean JSON output compatible with the database schema.
Problems / Challenges
Different table structures per position
| Position | Columns |
| --- | --- |
| Senator | Only county info |
| MP | County + Constituency info |
| MCA | County + Constituency + Ward (CAW) info |
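Because each position's table carries a different set of columns, one way to unify them is to map every raw header onto a canonical name before any further processing. The sketch below is a minimal illustration, not the project's implementation: the raw header variants in `COLUMN_ALIASES` and the canonical names (`county`, `constituency`, `ward`, and so on) are assumptions, since the exact headers in the IEBC PDF are not listed here.

```python
import re

# Hypothetical raw-header variants mapped to canonical names.
# The real IEBC headers may differ and should be checked against the PDF.
COLUMN_ALIASES = {
    "county": "county",
    "county name": "county",
    "constituency": "constituency",
    "constituency name": "constituency",
    "caw": "ward",
    "caw name": "ward",
    "party name": "party_name",
    "party abbreviation": "party_abbreviation",
    "symbol": "symbol",
}


def normalize_header(raw: str) -> str:
    """Collapse whitespace, lower-case, and map a header to its canonical name."""
    key = re.sub(r"\s+", " ", raw).strip().lower()
    # Unknown headers fall back to a snake_case version of themselves.
    return COLUMN_ALIASES.get(key, key.replace(" ", "_"))


def normalize_row(headers: list[str], cells: list[str]) -> dict[str, str]:
    """Pair each cell with its canonical column name."""
    return {normalize_header(h): c for h, c in zip(headers, cells)}
```

Keeping the alias table as plain data means a new header variant found in a later PDF only needs one extra dictionary entry.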
Positions not explicitly in the tables
We must infer the position from table structure (presence of CAW → MCA, Constituency → MP, County only → Senator).
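The inference rule above can be sketched as a short function that checks for the most specific column first. The function name `infer_position_type` and the substring matching on header text are assumptions made for illustration; the rule itself (CAW → MCA, Constituency → MP, County only → Senator) is the one stated in this issue.

```python
def infer_position_type(columns) -> str:
    """Infer the position from which location columns a table contains."""
    cols = {c.strip().lower() for c in columns}
    # Check the most specific level first: a ward (CAW) column implies MCA.
    if any("caw" in c or "ward" in c for c in cols):
        return "MCA"
    if any("constituency" in c for c in cols):
        return "MP"
    if any("county" in c for c in cols):
        return "Senator"
    raise ValueError(f"Cannot infer position from columns: {sorted(cols)}")
```

For example, `infer_position_type(["County", "Constituency"])` returns `"MP"`, while a table with only a `County` column returns `"Senator"`. The order of the checks matters, because MCA tables also contain county and constituency columns.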
Missing fields
gender → unknown
photo → use Symbol column if present, else None
party_id → inferred from Party Name / Party Abbreviation
voting_station → not included, links to stations table
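A minimal sketch of how these defaults might be applied to a single normalized row is shown below. It assumes rows are dictionaries with keys such as `symbol`, `party_name` and `party_abbreviation`, and that known parties are available as a list of dictionaries with `id`, `name` and `abbreviation`; both shapes are hypothetical and would need to match the actual schema.

```python
from typing import Optional


def infer_party_id(row: dict, parties: list[dict]) -> Optional[int]:
    """Match a row to a known party by abbreviation or name (case-insensitive)."""
    abbreviation = (row.get("party_abbreviation") or "").strip().upper()
    name = (row.get("party_name") or "").strip().upper()
    for party in parties:
        if abbreviation and party["abbreviation"].upper() == abbreviation:
            return party["id"]
        if name and party["name"].upper() == name:
            return party["id"]
    return None  # no match; leave for manual review


def fill_missing_fields(row: dict, parties: list[dict]) -> dict:
    """Return a copy of the row with defaults for fields the PDF does not provide."""
    return {
        **row,
        "gender": "unknown",                 # not present in the PDF
        "photo": row.get("symbol") or None,  # fall back to the Symbol column
        "party_id": infer_party_id(row, parties),
        "voting_station": None,              # linked later from stations data
    }
```

Returning `None` for an unmatched party, rather than raising, keeps the extraction running and leaves unmatched rows easy to find afterwards.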
Messy inline spacing / PDF artifacts
e.g., "FORD- KENY A" → "FORD-KENYA"
e.g., multiple spaces in candidate names
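Both kinds of artifact can be handled with small regular-expression helpers, sketched below. The helper names are hypothetical, and `clean_party_abbreviation` assumes that a party abbreviation never contains a legitimate space; that assumption would not hold for full party names, which should only have their spacing collapsed.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of whitespace to a single space, e.g. in candidate names."""
    return re.sub(r"\s+", " ", text).strip()


def clean_party_abbreviation(text: str) -> str:
    """Remove all whitespace; assumes abbreviations contain no real spaces."""
    return re.sub(r"\s+", "", text)
```

With these helpers, `clean_party_abbreviation("FORD- KENY A")` returns `"FORD-KENYA"`, and `collapse_spaces` reduces repeated spaces inside a name to one.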
Proposed Solution
Create a new extraction service: candidate_data_extractor.py
Normalize Camelot-extracted tables and unify column names.
Infer position_type based on table columns.
Clean inline spaces in candidate names and party fields.
Produce JSON output ready to populate: candidates, positions, parties
Sample JSON fields per candidate:
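As a purely illustrative sketch of the output shape, the snippet below serializes the three collections into one JSON document. The fields `gender`, `photo`, `party_id`, `voting_station` and `position_type` come from this issue; `name`, `county`, the placeholder values and the top-level layout are assumptions and may not match the actual database schema.

```python
import json


def to_json(candidates: list[dict], positions: list[dict], parties: list[dict]) -> str:
    """Serialize the three collections into a single JSON document."""
    payload = {"candidates": candidates, "positions": positions, "parties": parties}
    # ensure_ascii=False keeps any non-ASCII characters in names readable.
    return json.dumps(payload, indent=2, ensure_ascii=False)


# Hypothetical record with placeholder values, for illustration only.
example_candidate = {
    "name": "JANE DOE",
    "position_type": "MCA",
    "county": "EXAMPLE COUNTY",
    "gender": "unknown",
    "photo": None,
    "party_id": 1,
    "voting_station": None,
}
```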
Acceptance Criteria
Candidate tables are read from PDF and normalized.
Position type is correctly inferred for all candidate tables.
All candidate names, party names, and other text fields are cleaned.
Output is written as JSON and is ready to be ingested into the database.
Module can be called via generate.py --extraction_type candidate_data.
Dependencies
This module depends on the polling station extraction module, since voting_station links to that data.
JSON from polling data extraction must exist before populating candidates.voting_station.
Notes
Hardcoding positions is acceptable for now due to inconsistent table formatting.
Candidate extraction should be separate from polling data extraction for modularity.
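The `generate.py --extraction_type candidate_data` entry point and the dependency on polling station output could be wired together roughly as follows. This is a sketch under assumptions: the `--polling_json` flag, its default path and the `polling_data` choice are hypothetical, and the actual extraction call is left as a comment.

```python
import argparse
import sys
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="generate.py")
    parser.add_argument(
        "--extraction_type",
        required=True,
        choices=["candidate_data", "polling_data"],  # "polling_data" is assumed
    )
    # Hypothetical flag pointing at the polling station extraction output.
    parser.add_argument("--polling_json", type=Path, default=Path("polling_stations.json"))
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    if args.extraction_type == "candidate_data":
        # Candidate records link to stations, so that output must exist first.
        if not args.polling_json.exists():
            print(f"Missing polling data: {args.polling_json}", file=sys.stderr)
            return 1
        # The candidate extraction itself would be invoked here.
    return 0
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard and its return value passed to `sys.exit`. Keeping the dependency check in the entry point leaves the candidate extractor itself independent of the polling module, in line with the modularity note above.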