| 1 | +# openelections-data-nh — Claude project notes |
| 2 | + |
| 3 | +## What this repo does |
| 4 | + |
| 5 | +Pre-processes New Hampshire election results into the OpenElections CSV |
| 6 | +format. Output CSVs are ingested by the OpenElections [processing |
| 7 | +pipeline](http://docs.openelections.net/guide/) (note: that domain may |
| 8 | +not resolve; sister-state repos in `github.com/openelections/` are the |
| 9 | +practical reference for current output conventions). |
| 10 | + |
| 11 | +## Layout at a glance |
| 12 | + |
| 13 | +| Path | What's there | |
| 14 | +|---|---| |
| 15 | +| [oe_nh/](oe_nh/) | Modern parser framework (this is where new work goes) | |
| 16 | +| [raw/`<year>`/`<election>`/](raw/) | Committed source `.xls` / `.xlsx` files from sos.nh.gov | |
| 17 | +| [`<year>`/](2024/) | Output CSVs (also: pre-existing 2000-2020 CSVs from earlier contributors) | |
| 18 | +| [scripts/fetch-raw.md](scripts/fetch-raw.md) | Manual procedure for downloading new SoS files | |
| 19 | +| [scripts/run-data-tests.sh](scripts/run-data-tests.sh) | Run the four OpenElections data tests locally | |
| 20 | +| [2012/code/](2012/code/), [2014/](2014/), [2016/parser.py](2016/parser.py) | Pre-existing one-off scrapers — historical, not actively maintained | |
| 21 | +| [tests/](tests/) | Unit and Hypothesis property tests for the new framework | |
| 22 | + |
| 23 | +## How to run the parser |
| 24 | + |
| 25 | +```bash |
| 26 | +uv sync --all-groups # install deps |
| 27 | +uv run pytest # run the test suite |
| 28 | +uv run python -m oe_nh.cli --year 2024 --election general --office president |
| 29 | +scripts/run-data-tests.sh # validate produced CSVs against OE tests |
| 30 | +``` |
| 31 | + |
| 32 | +## Architecture in 30 seconds |
| 33 | + |
| 34 | +`oe_nh/cli.py` is the orchestrator. It loads year-specific job registries |
| 35 | +(`oe_nh/jobs/nh_<year>.py`), finds the matching `Job` for the requested |
| 36 | +election/office, runs `parse_workbook()` for each raw file the Job |
| 37 | +references, and writes the result CSV via `oe_nh/writer.py`. |
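The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `oe_nh/cli.py`: the `Job` fields, the `JOBS_2024` registry, the file paths, and the `find_job` / `run` helpers are all hypothetical names chosen for the example, and `parse_workbook` / `write_csv` are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical stand-in for the framework's Job: one office in one election."""
    election: str
    office: str
    raw_files: tuple[str, ...]


# Hypothetical registry, shaped like what a year-specific jobs module might export.
# The file names and extensions here are assumptions for illustration only.
JOBS_2024 = [
    Job("general", "president", ("raw/2024/general/president.xlsx",)),
    Job("general", "governor", ("raw/2024/general/governor.xlsx",)),
]


def find_job(jobs, election, office):
    """Return the single Job matching the requested election and office."""
    matches = [j for j in jobs if j.election == election and j.office == office]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one job for {election}/{office}, found {len(matches)}")
    return matches[0]


def run(jobs, election, office, parse_workbook, write_csv):
    """Parse every raw file the Job references, then hand all rows to the writer."""
    job = find_job(jobs, election, office)
    rows = []
    for path in job.raw_files:
        rows.extend(parse_workbook(path))
    write_csv(rows)
    return rows
```

Passing the parser and writer in as arguments is only a convenience for the sketch; it makes the orchestration step easy to exercise in a test without touching real workbooks.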
| 38 | + |
| 39 | +The interesting code is in [oe_nh/parser.py](oe_nh/parser.py). Two shapes |
| 40 | +to know about: |
| 41 | + |
| 42 | +1. **Single-sheet workbook** (Congressional CD1/CD2): towns down column 0, |
| 43 | + candidates in `header_row`, vote matrix below. Set `header_row` and |
| 44 | + optionally `lookup_county_from_town=True` if the file has no county |
| 45 | + column. |
| 46 | + |
| 47 | +2. **Multi-sheet workbook with section scanning** (President, Governor, |
| 48 | + US Senate): `multi_sheet=True` enables row-by-row scanning for |
| 49 | + county-name section headers. The same code path covers 2024 |
| 50 | + (1 section per sheet, 11 sheets) and 2022 (multiple sections per |
| 51 | + sheet, with Summary+Belknap stacked on sheet 0 and Strafford+Sullivan |
| 52 | + stacked on the last sheet). |
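As a rough illustration of how the two shapes differ in configuration, here is a sketch of a config object carrying the knobs named in these notes (`header_row`, `lookup_county_from_town`, `multi_sheet`, `skip_sheet_markers`, `skip_town_values`). The dataclass itself, its defaults other than the `skip_town_values` set, and the concrete values (`header_row=2`, the `"Summary"` marker) are assumptions made for the example; the real `ParserConfig` may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParserConfig:
    """Hypothetical sketch of the per-office config knobs."""
    header_row: int = 0                      # row index holding candidate names
    lookup_county_from_town: bool = False    # derive county when the file has no county column
    multi_sheet: bool = False                # scan rows for county-name section headers
    skip_sheet_markers: frozenset = frozenset()
    skip_town_values: frozenset = frozenset({"TOTALS", "Totals", "Total"})


# Shape 1: single-sheet Congressional workbook (header_row value is illustrative).
congressional = ParserConfig(header_row=2, lookup_county_from_town=True)

# Shape 2: multi-sheet workbook with section scanning (marker text is illustrative).
president = ParserConfig(multi_sheet=True, skip_sheet_markers=frozenset({"Summary"}))
```

Keeping the config frozen and using `frozenset` defaults avoids the shared-mutable-default pitfall when many Jobs reuse the same template.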
| 53 | + |
| 54 | +A row is a section header iff cell 0 (after stripping `" County"`) is |
| 55 | +a known NH county AND cell 1 is a non-numeric candidate label. The |
| 56 | +second check distinguishes a section header from a Summary block's |
| 57 | +data row that happens to start with a county name. |
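The header rule above can be expressed as a small predicate. This is a sketch under stated assumptions rather than the code in `oe_nh/parser.py`: the function names are hypothetical, rows are assumed to arrive as lists of cell values, and treating an empty cell 1 as "not a label" is an extra assumption not stated in these notes.

```python
NH_COUNTIES = frozenset({
    "Belknap", "Carroll", "Cheshire", "Coos", "Grafton",
    "Hillsborough", "Merrimack", "Rockingham", "Strafford", "Sullivan",
})


def _is_numeric(cell):
    """True for numbers and for strings that parse as numbers (e.g. '12,345')."""
    if isinstance(cell, bool):
        return False
    if isinstance(cell, (int, float)):
        return True
    if isinstance(cell, str):
        try:
            float(cell.strip().replace(",", ""))
        except ValueError:
            return False
        return True
    return False


def is_section_header(row):
    """Cell 0 names a known county AND cell 1 is a non-numeric label."""
    if len(row) < 2 or not isinstance(row[0], str):
        return False
    name = row[0].strip()
    if name.endswith(" County"):
        name = name[: -len(" County")]
    if name not in NH_COUNTIES:
        return False
    label = row[1]
    # Assumption: a blank second cell does not count as a candidate label.
    return isinstance(label, str) and label.strip() != "" and not _is_numeric(label)
```

With this shape, `["Belknap County", "Candidate A"]` is a header, while a Summary data row such as `["Belknap", 12345]` is not, because its second cell is numeric.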
| 58 | + |
| 59 | +Edge cases (State House's multi-district-per-file shape) are designed |
| 60 | +to become `Parser` subclasses; none exist yet. |
| 61 | + |
| 62 | +## What's covered, what's deferred |
| 63 | + |
| 64 | +**Shipped and merged upstream:** |
| 65 | + |
| 66 | +- `claude/uv-setup`: pyproject.toml, uv.lock, Py2 print fixes, scripts/run-data-tests.sh |
| 67 | +- `claude/narrow-exceptions`: tightened bare except blocks in 2012 scrapers |
| 68 | + |
| 69 | +**Pushed, PR open (PR #2 on tclancy's fork to upstream):** |
| 70 | + |
| 71 | +- `claude/nh-rewrite`: framework + 2022/2024 General CSVs for Pres, Gov, |
| 72 | + US Senate, Congressional. Maintainer feedback: "looks pretty |
| 73 | + reasonable; how hard to extend to state-level races?" |
| 74 | + |
| 75 | +**In progress (next session):** |
| 76 | + |
| 77 | +- **State-level races for 2022 + 2024**: Executive Council, State |
| 78 | + Senate, State House. Raw files dropped under `raw/<year>/general/` |
| 79 | + but NOT yet renamed to the convention (still have `<year>-ge-` |
| 80 | + prefixes and `_N` revision suffixes from SoS). Discovery so far: |
| 81 | + - Exec Council: ONE file containing all 5 districts (multi-section |
| 82 | + by district inside, like our county sections) |
| 83 | + - State Senate: ONE file containing all 24 districts (same shape) |
| 84 | + - State House: 10 files, one per county. Each contains multiple |
| 85 | + districts (multi-member, varied) — district markers look like |
| 86 | + "Belknap 2 (4)" meaning district 2, 4-seat. |
| 87 | + - Likely implementable as a `district_marker_pattern` knob on the |
| 88 | + existing section scanner — not a separate subclass — since the |
| 89 | + iteration shape is the same as our existing county sections, just |
| 90 | + keyed on district name. |
| 91 | + - One open question Tom flagged about the naming convention: should files be |
| 92 | + `state-house-belknap.xls` (matches `state-senate.xls`) or |
| 93 | + `house-belknap.xls` (matches SoS's internal naming)? |
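To make the State House marker format concrete, here is a hedged sketch of how a marker like "Belknap 2 (4)" could be parsed. The regex, the `DistrictMarker` tuple, and `parse_district_marker` are hypothetical; the proposed `district_marker_pattern` knob does not exist yet, and real SoS files may contain marker variants this pattern does not cover.

```python
import re
from typing import NamedTuple, Optional

# Assumed marker shape: "<County> <district number> (<seat count>)".
DISTRICT_MARKER = re.compile(
    r"^\s*(?P<county>[A-Za-z]+)\s+(?P<district>\d+)\s*\((?P<seats>\d+)\)\s*$"
)


class DistrictMarker(NamedTuple):
    county: str
    district: str   # kept as text, matching the CSV `district` column
    seats: int


def parse_district_marker(cell) -> Optional[DistrictMarker]:
    """Return the parsed marker, or None if the cell is not a district marker."""
    if not isinstance(cell, str):
        return None
    match = DISTRICT_MARKER.match(cell)
    if match is None:
        return None
    return DistrictMarker(match["county"], match["district"], int(match["seats"]))
```

For example, `parse_district_marker("Belknap 2 (4)")` yields county `"Belknap"`, district `"2"`, and 4 seats, while ordinary town rows and `TOTALS` rows return `None`.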
| 94 | + |
| 95 | +**Deferred (longer-term):** |
| 96 | + |
| 97 | +- Primaries (Presidential Primary 2024, State Primary 2022 + 2024) — |
| 98 | + per the original spec but lower priority than General |
| 99 | +- Pre-existing 2014/2016/2018/2020 CSV data quality issues (triaged |
| 100 | + but not fixed). Examples: `2016/parser.py` (line 92 and similar) |
| 101 | + emits `None` for county in statewide sections, and the 2018 precinct |
| 102 | + file has duplicated Scattering rows. |
| 103 | +- Smarter auto-discovery: per-office `ParserConfig` templates so a |
| 104 | + Congressional Job becomes a one-liner (currently each Job lists files |
| 105 | + explicitly because each office needs different config knobs) |
| 106 | + |
| 107 | +## Conventions worth knowing |
| 108 | + |
| 109 | +- **Branch names:** `claude/<topic>` per Tom's global CLAUDE.md |
| 110 | +- **Raw file naming:** `<office-slug>.xls[x]` (single statewide file) |
| 111 | + or `<office-slug>-<location>.xls[x]` (county slug or district digits). |
| 112 | + See [scripts/fetch-raw.md](scripts/fetch-raw.md). SoS files are |
| 113 | + manually renamed when committed (the SoS site has unstable Drupal |
| 114 | + `_N` revision suffixes). |
| 115 | +- **Output CSV schema:** `[county, precinct, office, district, party, |
| 116 | + candidate, votes]` — matches 2018-2020 files in this repo and is |
| 117 | + consistent with the modern OpenElections direction. |
| 118 | +- **The OE data tests are git-pinned in CI:** see |
| 119 | + `.github/workflows/data_tests.yml`. The local runner reads the pin |
| 120 | + from that YAML file, so the two can't drift. |
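The output schema above maps directly onto `csv.DictWriter` from the standard library. The sketch below is an illustration only, not the code in `oe_nh/writer.py`; the `write_results` helper is hypothetical and the sample row uses made-up values.

```python
import csv
import io

# Column order taken from the schema listed above.
FIELDNAMES = ["county", "precinct", "office", "district", "party", "candidate", "votes"]


def write_results(rows, out):
    """Write result dicts to a text stream using the seven-column schema."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    # DictWriter raises ValueError if a row has a key outside FIELDNAMES,
    # which catches schema typos early; missing keys are written as "".
    writer.writerows(rows)


# Illustrative row with made-up values.
buffer = io.StringIO()
write_results(
    [{"county": "Belknap", "precinct": "Alton", "office": "President",
      "district": "", "party": "", "candidate": "Candidate A", "votes": 100}],
    buffer,
)
```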
| 121 | + |
| 122 | +## Surprises and lore |
| 123 | + |
| 124 | +- **NH SoS publishes mixed `.xls` and `.xlsx` in the same election** — |
| 125 | + WorkbookReader sniffs magic bytes rather than trusting the extension. |
| 126 | +- **Sheet 0 of multi-sheet workbooks is the county summary** (gets |
| 127 | + silently skipped via `skip_sheet_markers` config). |
| 128 | +- **Each county sheet ends with a `TOTALS` row** that would otherwise be |
| 129 | + treated as a precinct. Default `skip_town_values` includes |
| 130 | + `{TOTALS, Totals, Total}`. |
| 131 | +- **Two unincorporated Coos townships have legacy abbreviated names** |
| 132 | + in `town_to_county.py` ('At. & Gil. Academy Grant', |
| 133 | + 'Thompson & Meserve's Pur.'); 2024 SoS uses slightly different |
| 134 | + abbreviations. Aliases in `PRECINCT_ALIASES` map between them. |
| 135 | +- **"us-senator" vs "us-senate"** — SoS files use the former; we use |
| 136 | + the latter as the canonical office slug. |
| 137 | +- **2018 + 2020 precinct CSVs combine all offices into one file**. |
| 138 | + Our work uses one CSV per office (matching 2012 + modern sister-state |
| 139 | + conventions). Both styles are reasonable; the framework can emit |
| 140 | + either by tweaking the orchestrator. |
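The magic-byte sniffing mentioned in the first bullet can be sketched as below. This is not the actual `WorkbookReader` code; the function names are hypothetical. The two signatures are the standard ones: legacy `.xls` files are OLE2 compound documents, and `.xlsx` files are ZIP archives.

```python
# OLE2 compound-document signature (legacy .xls) and ZIP local-file header (.xlsx).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
ZIP_MAGIC = b"PK\x03\x04"


def sniff_workbook_format(head: bytes) -> str:
    """Classify a workbook by its leading bytes, ignoring the file extension."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        # Note: any ZIP archive matches here, not only real .xlsx workbooks.
        return "xlsx"
    raise ValueError("unrecognized workbook signature")


def sniff_path(path) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as fh:
        return sniff_workbook_format(fh.read(8))
```

A reader built on this would choose its backend from the sniffed format, so a file saved with the wrong extension still opens with the right library.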
| 141 | + |
| 142 | +## Where to look first |
| 143 | + |
| 144 | +If the OpenElections maintainers respond to the [in-flight PR](https://github.com/tclancy/openelections-data-nh/pull/new/claude/nh-rewrite), |
| 145 | +their feedback determines what's next. Otherwise the highest-value |
| 146 | +follow-ups are probably: |
| 147 | + |
| 148 | +1. **Add Executive Council** (5 statewide districts, simple shape). |
| 149 | +2. **Build State House support** (subclass or `district_marker_pattern` knob): biggest expansion of the |
| 150 | + framework, unlocks all the down-ballot data. |
| 151 | +3. **Combine outputs into per-election precinct.csv files** if upstream |
| 152 | + prefers that to per-office files. |
| 153 | + |
| 154 | +## Related branches (local-only, not pushed) |
| 155 | + |
| 156 | +- `claude/nh-rewrite-design` — original design doc from the brainstorm |
| 157 | + at the start of the session. Lives at |
| 158 | + `docs/superpowers/specs/2026-05-23-nh-parser-rewrite-design.md` on |
| 159 | + that branch only. Useful for "why was this decided" archeology. |