This repository was archived by the owner on Dec 10, 2025. It is now read-only.

Issue: Extract Candidate Data from IEBC PDF into JSON #11

Description

@koleshjr

The IEBC PDF for the 2025 by-elections contains multiple tables for candidates running for different positions (Senator, Member of Parliament, Member of County Assembly). Currently, we do not have an automated way to extract candidate information into structured JSON compatible with our database schema (candidates, positions, parties).

This issue tracks the work to implement a candidate data extraction module that will:

  1. Read the relevant PDF pages.

  2. Normalize table columns across tables with different structures.

  3. Handle missing or inconsistent data (e.g., party symbols, candidate photos, gender not included).

  4. Produce a clean JSON output compatible with the database schema.


Problems / Challenges

  1. Different table structures per position

     Position   Columns
     Senator    County info only
     MP         County + Constituency info
     MCA        County + Constituency + Ward (CAW) info

  2. Positions not explicitly in the tables

    • We must infer the position from table structure (presence of CAW → MCA, Constituency → MP, County only → Senator).

  3. Missing fields

    • gender → not in the PDF, set to "unknown"

    • photo → use Symbol column if present, else None

    • party_id → inferred from Party Name / Party Abbreviation

    • voting_station → not in the PDF, links to the stations table

  4. Messy inline spacing / PDF artifacts

    • e.g., "FORD- KENY A" → "FORD-KENYA"

    • e.g., multiple spaces in candidate names
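
The spacing problems above could be handled in two passes, sketched here under some assumptions. Generic whitespace cleanup covers doubled spaces and stray spaces around hyphens, but it cannot repair a word that the PDF split in the middle ("KENY A"), so the sketch also matches party names against a list of known names. `KNOWN_PARTIES`, `clean_inline_spaces` and `canonical_party` are hypothetical names, and the list contents are illustrative only.

```python
import re

# Hypothetical lookup of canonical party names; in practice this would
# probably come from the parties table rather than a hardcoded list.
KNOWN_PARTIES = ["FORD-KENYA"]


def clean_inline_spaces(text: str) -> str:
    """Collapse runs of whitespace and drop spaces around hyphens."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s*-\s*", "-", text)


def _match_key(text: str) -> str:
    """Comparison key: uppercase letters and digits only."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def canonical_party(raw: str, known=KNOWN_PARTIES) -> str:
    """Return the known party whose key matches, else the cleaned raw text."""
    key = _match_key(raw)
    for name in known:
        if _match_key(name) == key:
            return name
    return clean_inline_spaces(raw)
```

With this sketch, `canonical_party("FORD- KENY A")` resolves to the known name, while candidate names would only go through `clean_inline_spaces`, since there is no reference list to match them against.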


Proposed Solution

  • Create a new extraction service: candidate_data_extractor.py

  • Normalize Camelot-extracted tables and unify column names.

  • Infer position_type based on table columns.

  • Clean inline spaces in candidate names and party fields.

  • Produce JSON output ready to populate:

    • candidates

    • positions

    • parties
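
As a rough illustration of the column unification and position inference steps, the sketch below works on plain lists of header strings (Camelot exposes each extracted table as a pandas DataFrame through `table.df`, so the headers would be taken from there). The raw header spellings in `COLUMN_ALIASES` are assumptions, not the actual headers in the IEBC PDF, and both function names are hypothetical.

```python
# Assumed raw header spellings mapped to unified column names.
# The real PDF headers may differ and would need to be checked.
COLUMN_ALIASES = {
    "county code": "county_code",
    "county name": "county_name",
    "constituency code": "constituency_code",
    "constituency name": "constituency_name",
    "caw code": "ward_code",
    "caw name": "ward_name",
    "party name": "party_name",
    "party abbreviation": "party_code",
}


def normalize_columns(headers):
    """Lowercase, collapse spacing, and map headers to unified names."""
    unified = []
    for header in headers:
        key = " ".join(header.lower().split())
        unified.append(COLUMN_ALIASES.get(key, key.replace(" ", "_")))
    return unified


def infer_position_type(columns):
    """Infer the position from which location columns are present."""
    cols = set(columns)
    if {"ward_code", "ward_name"} & cols:
        return "Member of County Assembly"
    if {"constituency_code", "constituency_name"} & cols:
        return "Member of Parliament"
    if {"county_code", "county_name"} & cols:
        return "Senator"
    raise ValueError(f"Cannot infer position from columns: {sorted(cols)}")
```

The checks run from most specific to least specific because an MCA table also carries county and constituency columns.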

Sample JSON fields per candidate:

{
  "name": "John Doe",
  "gender": "unknown",
  "photo": null,
  "position_type": "Member of Parliament",
  "party_name": "FORD-Kenya",
  "party_code": "FORD",
  "county_code": "030",
  "county_name": "Ngaina",
  "constituency_code": "157",
  "constituency_name": "Ngaina East",
  "ward_code": "",
  "ward_name": ""
}
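
One way to guarantee that every record carries the same keys as the sample above is to start from a dictionary of defaults and overlay whatever the table row provides. This is a minimal sketch; `CANDIDATE_DEFAULTS`, `build_candidate` and `to_json` are hypothetical names, and the defaults mirror the sample (gender "unknown", photo `None`, empty strings elsewhere).

```python
import json

# Defaults taken from the sample record: missing gender becomes
# "unknown", missing photo becomes None (serialized as null).
CANDIDATE_DEFAULTS = {
    "name": "",
    "gender": "unknown",
    "photo": None,
    "position_type": "",
    "party_name": "",
    "party_code": "",
    "county_code": "",
    "county_name": "",
    "constituency_code": "",
    "constituency_name": "",
    "ward_code": "",
    "ward_name": "",
}


def build_candidate(row: dict) -> dict:
    """Overlay non-empty row values on the defaults; ignore unknown keys."""
    record = dict(CANDIDATE_DEFAULTS)
    for field in record:
        value = row.get(field)
        if value not in (None, ""):
            record[field] = value
    return record


def to_json(candidates: list) -> str:
    """Serialize records, keeping non-ASCII names readable."""
    return json.dumps(candidates, indent=2, ensure_ascii=False)
```

Codes such as `"030"` stay as strings here so that leading zeros survive serialization.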

Acceptance Criteria

  1. Candidate tables are read from PDF and normalized.

  2. Position type is correctly inferred for all candidate tables.

  3. All candidate names, party names, and other text fields are cleaned.

  4. Output is written as JSON and is ready to be ingested into the database.

  5. Module can be called via generate.py --extraction_type candidate_data.
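
Criterion 5 could be wired up with `argparse` roughly as follows. Only the `--extraction_type candidate_data` flag comes from this issue; the `--pdf` option, the `EXTRACTORS` registry and the placeholder `extract_candidate_data` function are assumptions made for the sketch, and the existing `generate.py` may be organized differently.

```python
import argparse


def extract_candidate_data(pdf_path):
    """Hypothetical placeholder for the candidate extraction service."""
    return []


# Hypothetical registry: extraction type -> callable taking a PDF path.
EXTRACTORS = {
    "candidate_data": extract_candidate_data,
}


def build_parser():
    parser = argparse.ArgumentParser(description="Run a PDF extraction.")
    parser.add_argument(
        "--extraction_type",
        required=True,
        choices=sorted(EXTRACTORS),
        help="Which extraction module to run.",
    )
    # Assumed option; the real script may locate the PDF another way.
    parser.add_argument("--pdf", default=None, help="Path to the IEBC PDF.")
    return parser


def main(argv=None):
    """Parse arguments and dispatch to the selected extractor."""
    args = build_parser().parse_args(argv)
    return EXTRACTORS[args.extraction_type](args.pdf)
```

In `generate.py`, `main()` would typically be called under an `if __name__ == "__main__":` guard. Keeping the extractors in a registry lets other extraction types be added without touching the argument parsing.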


Dependencies

  • This module depends on the polling station extraction module, since voting_station links to that data.

  • JSON from polling data extraction must exist before populating candidates.voting_station.
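
The ordering dependency above could be enforced with an explicit check before any `voting_station` links are populated. This is a sketch only: `load_polling_stations` is a hypothetical helper, and the location and structure of the polling station JSON are not specified in this issue.

```python
import json
from pathlib import Path


def load_polling_stations(path):
    """Load polling station JSON, failing early if it has not been generated."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Polling station JSON not found at {path}; "
            "run the polling station extraction first."
        )
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

Failing early with a clear message is meant to avoid writing candidate records whose `voting_station` references point at data that does not exist yet.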


Notes

  • Hardcoding positions is acceptable for now due to inconsistent table formatting.

  • Candidate extraction should be separate from polling data extraction for modularity.

This issue is linked to the candidate-data-extraction feature branch.
