Issue: Extract Candidate Data from IEBC PDF into JSON

<html>
The IEBC PDF for the 2025 by-elections contains multiple tables for candidates running for different positions (Senator, Member of Parliament, Member of County Assembly). Currently, we do not have an automated way to extract candidate information into structured JSON compatible with our database schema (<code inline="">candidates</code>, <code inline="">positions</code>, <code inline="">parties</code>).
This issue tracks the work to implement a candidate data extraction module that will:
<ol>
<li>
Read the relevant PDF pages.
</li>
<li>
Normalize table columns across tables with different structures.
</li>
<li>
Handle missing or inconsistent data (e.g., party symbols, candidate photos, or gender missing from the tables).
</li>
<li>
Produce a clean JSON output compatible with the database schema.
</li>
</ol>
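As a rough illustration of step 2, the header row of each extracted table could be mapped onto one shared set of field names before any further processing. This is only a sketch: the alias table below is an assumption built from the column names mentioned in this issue (<code inline="">Symbol</code>, <code inline="">Party Name</code>, <code inline="">Party Abbreviation</code>), and the helper names <code inline="">normalize_header</code> and <code inline="">rows_to_records</code> are hypothetical.

```python
import re

# Assumed aliases from PDF header text to output field names.
# Only the three headers named in this issue are mapped explicitly.
HEADER_ALIASES = {
    "party name": "party_name",
    "party abbreviation": "party_code",
    "symbol": "photo",
}


def normalize_header(header: str) -> str:
    """Collapse whitespace, lower-case, then apply an alias or snake_case."""
    key = re.sub(r"\s+", " ", header).strip().lower()
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def rows_to_records(header, rows):
    """Turn a header row plus data rows into a list of dicts."""
    keys = [normalize_header(h) for h in header]
    return [dict(zip(keys, row)) for row in rows]
```

Headers that are not in the alias table fall back to snake_case, so <code inline="">County Code</code> becomes <code inline="">county_code</code> without needing an explicit entry.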
<hr>
<h3>Problems / Challenges</h3>
<ol>
<li>
Different table structures per position
</li>
</ol>

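<table>
<thead>
<tr><th>Position</th><th>Columns</th></tr>
</thead>
<tbody>
<tr><td>Senator</td><td>County info only</td></tr>
<tr><td>MP</td><td>County + Constituency info</td></tr>
<tr><td>MCA</td><td>County + Constituency + Ward (CAW) info</td></tr>
</tbody>
</table>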


<ol start="2">
<li>
Position is not stated explicitly in the tables
<ul>
<li>
We must infer the position from table structure (presence of CAW → MCA, Constituency → MP, County only → Senator).
</li>
</ul>
</li>
<li>
Missing fields
<ul>
<li>
<code inline="">gender</code> → unknown
</li>
<li>
<code inline="">photo</code> → use <code inline="">Symbol</code> column if present, else <code inline="">None</code>
</li>
<li>
<code inline="">party_id</code> → inferred from <code inline="">Party Name</code> / <code inline="">Party Abbreviation</code>
</li>
<li>
<code inline="">voting_station</code> → not included, links to <code inline="">stations</code> table
</li>
</ul>
</li>
<li>
Messy inline spacing / PDF artifacts
<ul>
<li>
e.g., <code inline="">"FORD- KENY A"</code> → <code inline="">"FORD-KENYA"</code>
</li>
<li>
e.g., multiple spaces in candidate names
</li>
</ul>
</li>
</ol>
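The inference rule in item 2 can be sketched as a check on the column names, testing the most specific level first. This is a minimal sketch: the substring matches (<code inline="">caw</code>, <code inline="">ward</code>, <code inline="">constituency</code>, <code inline="">county</code>) and the function name <code inline="">infer_position_type</code> are assumptions, since the exact header text in the PDF is not listed in this issue.

```python
from typing import Iterable


def infer_position_type(columns: Iterable[str]) -> str:
    """Infer the position from which location columns a table contains."""
    # Lower-case and strip so "CAW Name" and " caw name " compare equal.
    names = [str(c).strip().lower() for c in columns]

    # Check the most specific level first: an MCA table also has
    # county and constituency columns.
    if any("caw" in n or "ward" in n for n in names):
        return "Member of County Assembly"
    if any("constituency" in n for n in names):
        return "Member of Parliament"
    if any("county" in n for n in names):
        return "Senator"
    raise ValueError(f"cannot infer position from columns: {names}")
```

Raising on an unrecognized layout, rather than guessing, keeps a badly extracted table from being silently labelled with the wrong position.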
<hr>
<h3>Proposed Solution</h3>
<ul>
<li>
Create a new extraction service: <code inline="">candidate_data_extractor.py</code>
</li>
<li>
Normalize Camelot-extracted tables and unify column names.
</li>
<li>
Infer <code inline="">position_type</code> based on table columns.
</li>
<li>
Clean inline spaces in candidate names and party fields.
</li>
<li>
Produce JSON output ready to populate:
<ul>
<li>
<code inline="">candidates</code>
</li>
<li>
<code inline="">positions</code>
</li>
<li>
<code inline="">parties</code>
</li>
</ul>
</li>
</ul>
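The cleaning step could look roughly like the sketch below. Collapsing repeated whitespace is enough for candidate names, but an artifact such as <code inline="">"FORD- KENY A"</code> has a space inside a word, so the sketch compares a punctuation-free key against a list of known party names. The <code inline="">KNOWN_PARTIES</code> list and both function names are hypothetical; in practice the canonical names would presumably come from the <code inline="">parties</code> table.

```python
import re

# Hypothetical canonical names; assumed to be loaded from the parties table.
KNOWN_PARTIES = ["FORD-KENYA"]


def collapse_spaces(text: str) -> str:
    """Replace any run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def _match_key(text: str) -> str:
    """Upper-case and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def clean_party_name(raw: str, known=KNOWN_PARTIES) -> str:
    """Return the canonical party name if one matches, else a tidied string."""
    key = _match_key(raw)
    for name in known:
        if _match_key(name) == key:
            return name
    # Fallback for unknown parties: tidy whitespace and hyphen spacing only.
    return re.sub(r"\s*-\s*", "-", collapse_spaces(raw))
```

With these definitions, <code inline="">clean_party_name("FORD- KENY A")</code> returns <code inline="">"FORD-KENYA"</code>, and <code inline="">collapse_spaces("John   Doe")</code> returns <code inline="">"John Doe"</code>.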
Sample JSON fields per candidate:
<pre><code class="language-json">{
 "name": "John Doe",
 "gender": "unknown",
 "photo": null,
 "position_type": "Member of Parliament",
 "party_name": "FORD-Kenya",
 "party_code": "FORD",
 "county_code": "030",
 "county_name": "Ngaina",
 "constituency_code": "157",
 "constituency_name": "Ngaina East",
 "ward_code": "",
 "ward_name": ""
}
</code></pre>
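One possible way to produce records of this shape is to start from a template containing every field in the sample and then apply the missing-field rules (<code inline="">gender</code> defaults to <code inline="">"unknown"</code>, an empty <code inline="">photo</code> becomes <code inline="">null</code>). This is a sketch under those assumptions; <code inline="">CANDIDATE_FIELDS</code> and <code inline="">build_candidate</code> are hypothetical names, and the input row is assumed to already use the output field names.

```python
import json

# Field order copied from the sample JSON above.
CANDIDATE_FIELDS = (
    "name", "gender", "photo", "position_type",
    "party_name", "party_code",
    "county_code", "county_name",
    "constituency_code", "constituency_name",
    "ward_code", "ward_name",
)


def build_candidate(row: dict, position_type: str) -> dict:
    """Build one candidate record; absent location fields stay as ""."""
    record = {field: "" for field in CANDIDATE_FIELDS}
    # Copy only known fields so stray table columns do not leak into the output.
    record.update({k: v for k, v in row.items() if k in record})
    record["position_type"] = position_type
    record["gender"] = row.get("gender") or "unknown"
    record["photo"] = row.get("photo") or None  # serialized as null
    return record


if __name__ == "__main__":
    row = {"name": "John Doe", "party_name": "FORD-Kenya", "party_code": "FORD"}
    print(json.dumps(build_candidate(row, "Member of Parliament"), indent=1))
```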
<hr>
<h3>Acceptance Criteria</h3>
<ol>
<li>
Candidate tables are read from PDF and normalized.
</li>
<li>
Position type is correctly inferred for all candidate tables.
</li>
<li>
All candidate names, party names, and other text fields are cleaned.
</li>
<li>
Output is written as JSON and is ready to be ingested into the database.
</li>
<li>
Module can be called via <code inline="">generate.py --extraction_type candidate_data</code>.
</li>
</ol>
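Criterion 5 could be wired up along the following lines. Only the <code inline="">--extraction_type candidate_data</code> flag comes from this issue; the registry, the placeholder extractor, and the shape of its return value are assumptions, and the real <code inline="">generate.py</code> may be organized differently.

```python
import argparse


def run_candidate_extraction() -> dict:
    """Placeholder for the real extractor; returns an empty result."""
    return {"extraction_type": "candidate_data", "records": []}


# Hypothetical registry; other extraction types would be added here.
EXTRACTORS = {"candidate_data": run_candidate_extraction}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="generate.py")
    parser.add_argument(
        "--extraction_type",
        required=True,
        choices=sorted(EXTRACTORS),
        help="which extraction module to run",
    )
    return parser


def main(argv=None) -> dict:
    """Parse arguments (sys.argv when argv is None) and run the extractor."""
    args = build_parser().parse_args(argv)
    return EXTRACTORS[args.extraction_type]()
```

Calling <code inline="">main(["--extraction_type", "candidate_data"])</code> exercises the same path as the command line, which makes the entry point easy to test.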
<hr>
<h3>Dependencies</h3>
<ul>
<li>
This module depends on the polling station extraction module, since <code inline="">voting_station</code> links to that data.
</li>
<li>
JSON from polling data extraction must exist before populating <code inline="">candidates.voting_station</code>.
</li>
</ul>
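The ordering constraint above could be enforced with an explicit check before any linking happens. This is a sketch only: the function name <code inline="">load_polling_stations</code> is hypothetical, and the location and structure of the polling data JSON are not specified in this issue.

```python
import json
from pathlib import Path


def load_polling_stations(path):
    """Load the polling data JSON, failing early if it has not been generated."""
    json_path = Path(path)
    if not json_path.exists():
        raise FileNotFoundError(
            f"{json_path} not found; run the polling station extraction first"
        )
    with json_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```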
<hr>
<h3>Notes</h3>
<ul>
<li>
Hardcoding positions is acceptable for now due to inconsistent table formatting.
</li>
<li>
Candidate extraction should be separate from polling data extraction for modularity.
</li>
</ul>
<blockquote>
This issue is linked to the <code inline="">candidate-data-extraction</code> feature branch.
</blockquote>
</body>
</html>
