Commit f0c6c3d

Merge pull request #42 from tclancy/claude/nh-shape-autodiscover
Automate Parsing and Add State-Level Races
2 parents 4aaeb41 + bed77bb commit f0c6c3d

48 files changed

Lines changed: 12749 additions & 533 deletions

.claude/CLAUDE.md

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
# openelections-data-nh: Claude project notes

## What this repo does

Pre-processes New Hampshire election results into the OpenElections CSV
format. Output CSVs are ingested by the OpenElections [processing
pipeline](http://docs.openelections.net/guide/) (note: that domain may
not resolve; sister-state repos in `github.com/openelections/` are the
practical reference for current output conventions).

## Layout at a glance

| Path | What's there |
|---|---|
| [oe_nh/](oe_nh/) | Modern parser framework (this is where new work goes) |
| [raw/`<year>`/`<election>`/](raw/) | Committed source `.xls` / `.xlsx` files from sos.nh.gov |
| [`<year>`/](2024/) | Output CSVs (also: pre-existing 2000-2020 CSVs from earlier contributors) |
| [scripts/fetch-raw.md](scripts/fetch-raw.md) | Manual procedure for downloading new SoS files |
| [scripts/run-data-tests.sh](scripts/run-data-tests.sh) | Run the four OpenElections data tests locally |
| [2012/code/](2012/code/), [2014/](2014/), [2016/parser.py](2016/parser.py) | Pre-existing one-off scrapers (historical, not actively maintained) |
| [tests/](tests/) | Unit + Hypothesis property tests for the new framework |

## How to run the parser

```bash
uv sync --all-groups          # install deps
uv run pytest                 # run the test suite
uv run python -m oe_nh.cli --year 2024 --election general --office president
scripts/run-data-tests.sh     # validate produced CSVs against OE tests
```

## Architecture in 30 seconds

`oe_nh/cli.py` is the orchestrator. It loads year-specific job registries
(`oe_nh/jobs/nh_<year>.py`), finds the matching `Job` for the requested
election/office, runs `parse_workbook()` for each raw file the Job
references, and writes the result CSV via `oe_nh/writer.py`.

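The orchestration flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `oe_nh/cli.py`: the `Job` fields, `find_job`, and `run_job` are hypothetical names, and `parse_workbook` is passed in as a plain callable because its real signature is not shown in these notes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Job:
    """Hypothetical stand-in for the framework's Job; real fields may differ."""
    election: str
    office: str
    raw_files: tuple


def find_job(registry: Iterable[Job], election: str, office: str) -> Job:
    """Return the first registered job matching the requested election/office."""
    for job in registry:
        if (job.election, job.office) == (election, office):
            return job
    raise LookupError(f"no job registered for {election}/{office}")


def run_job(job: Job, parse_workbook: Callable[[str], list]) -> list:
    """Parse every raw file the job references and concatenate the rows."""
    rows = []
    for path in job.raw_files:
        rows.extend(parse_workbook(path))
    return rows
```

Passing the parser in as an argument is only a convenience for the sketch; it makes the flow testable without real workbooks.
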
The interesting code is in [oe_nh/parser.py](oe_nh/parser.py). Two shapes
to know about:

1. **Single-sheet workbook** (Congressional CD1/CD2): towns down column 0,
   candidates in `header_row`, vote matrix below. Set `header_row` and
   optionally `lookup_county_from_town=True` if the file has no county
   column.

2. **Multi-sheet workbook with section scanning** (President, Governor,
   US Senate): `multi_sheet=True` enables row-by-row scanning for
   county-name section headers. The same code path covers 2024
   (1 section per sheet, 11 sheets) and 2022 (multiple sections per
   sheet, with Summary+Belknap stacked on sheet 0 and Strafford+Sullivan
   stacked on the last sheet).

   A row is a section header if and only if cell 0 (after stripping
   `" County"`) is a known NH county AND cell 1 is a non-numeric
   candidate label. The second check distinguishes a section header
   from a Summary block's data row that happens to start with a county
   name.

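The section-header rule above can be expressed as a small predicate. This is a sketch under assumptions, not the implementation in `oe_nh/parser.py`: the names `NH_COUNTIES` and `is_section_header` are hypothetical, rows are assumed to arrive as plain lists of cell values, and treating an empty second cell as "not a header" is a choice made here rather than something the notes specify.

```python
import re

# The ten New Hampshire counties.
NH_COUNTIES = frozenset({
    "Belknap", "Carroll", "Cheshire", "Coos", "Grafton",
    "Hillsborough", "Merrimack", "Rockingham", "Strafford", "Sullivan",
})

# Matches vote counts such as "1234", "1,234" or "1234.0".
_NUMERIC = re.compile(r"[\d,]+(\.\d+)?")


def _is_numeric(cell) -> bool:
    """True for numeric cell values and for strings that look like vote counts."""
    if isinstance(cell, (int, float)):
        return True
    return bool(_NUMERIC.fullmatch(str(cell).strip()))


def is_section_header(row) -> bool:
    """Cell 0 names a county AND cell 1 is a non-numeric candidate label."""
    if len(row) < 2 or row[0] is None or row[1] is None:
        return False
    county = str(row[0]).strip().removesuffix(" County")
    label = str(row[1]).strip()
    if not label:  # assumption: an empty cell is not a candidate label
        return False
    return county in NH_COUNTIES and not _is_numeric(row[1])
```

The numeric check is what keeps a Summary data row such as `["Belknap", 12345]` from being mistaken for the start of a county section.
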
Edge cases (State House's multi-district-per-file shape) are designed
to become `Parser` subclasses; none exist yet.

## What's covered, what's deferred

**Shipped and merged upstream:**

- `claude/uv-setup`: pyproject.toml, uv.lock, Py2 print fixes, scripts/run-data-tests.sh
- `claude/narrow-exceptions`: tightened bare except blocks in 2012 scrapers

**Pushed, PR open (PR #2 on tclancy's fork to upstream):**

- `claude/nh-rewrite`: framework + 2022/2024 General CSVs for Pres, Gov,
  US Senate, Congressional. Maintainer feedback: "looks pretty
  reasonable; how hard to extend to state-level races?"

**In progress (next session):**

- **State-level races for 2022 + 2024**: Executive Council, State
  Senate, State House. Raw files dropped under `raw/<year>/general/`
  but NOT yet renamed to the convention (still have `<year>-ge-`
  prefixes and `_N` revision suffixes from SoS). Discovery so far:
  - Exec Council: ONE file containing all 5 districts (multi-section
    by district inside, like our county sections)
  - State Senate: ONE file containing all 24 districts (same shape)
  - State House: 10 files, one per county. Each contains multiple
    districts (multi-member, varied); district markers look like
    "Belknap 2 (4)", meaning district 2, 4-seat.
  - Likely implementable as a `district_marker_pattern` knob on the
    existing section scanner (not a separate subclass), since the
    iteration shape is the same as our existing county sections, just
    keyed on district name.
  - One open question Tom flagged: by convention, should files be named
    `state-house-belknap.xls` (matches `state-senate.xls`) or
    `house-belknap.xls` (matches SoS's internal naming)?

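A `district_marker_pattern` knob for markers like "Belknap 2 (4)" might look like the regular expression below. This is only a sketch: the sole marker spelling known from these notes is that one example, so the pattern, the `DistrictMarker` type, and `parse_district_marker` are all hypothetical and would need checking against the real State House files.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: "<County> <district number> (<seat count>)", e.g. "Belknap 2 (4)".
DISTRICT_MARKER_PATTERN = re.compile(
    r"^\s*(?P<county>[A-Za-z]+)\s+(?P<district>\d+)\s*\((?P<seats>\d+)\)\s*$"
)


class DistrictMarker(NamedTuple):
    county: str
    district: int
    seats: int


def parse_district_marker(cell) -> Optional[DistrictMarker]:
    """Return the parsed marker, or None if the cell is not a district marker."""
    match = DISTRICT_MARKER_PATTERN.match(str(cell))
    if match is None:
        return None
    return DistrictMarker(
        county=match["county"],
        district=int(match["district"]),
        seats=int(match["seats"]),
    )
```

Returning `None` for non-matching cells lets a row scanner treat every other row as ordinary data, which mirrors how the county-section scan is described.
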
**Deferred (longer-term):**

- Primaries (Presidential Primary 2024, State Primary 2022 + 2024):
  per the original spec but lower priority than General
- Pre-existing 2014/2016/2018/2020 CSV data quality issues (triaged
  but not fixed: `2016/parser.py` line 92 etc. emit `None` for county
  in statewide sections; 2018 precinct file has duplicated Scattering
  rows; etc.)
- Smarter auto-discovery: per-office `ParserConfig` templates so a
  Congressional Job becomes a one-liner (currently each Job lists files
  explicitly because each office needs different config knobs)

## Conventions worth knowing

- **Branch names:** `claude/<topic>` per Tom's global CLAUDE.md
- **Raw file naming:** `<office-slug>.xls[x]` (single statewide file)
  or `<office-slug>-<location>.xls[x]` (county slug or district digits).
  See [scripts/fetch-raw.md](scripts/fetch-raw.md). SoS files are
  manually renamed when committed (the SoS site has unstable Drupal
  `_N` revision suffixes).
- **Output CSV schema:** `[county, precinct, office, district, party,
  candidate, votes]`. This matches 2018-2020 files in this repo and is
  consistent with the modern OpenElections direction.
- **The OE data tests are git-pinned in CI:** see
  `.github/workflows/data_tests.yml`. Local runner reads the pin out
  of that yaml so they can't drift.

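To make the output schema concrete, here is a minimal sketch of writing rows in that column order with the standard library. It is not the code in `oe_nh/writer.py`; `OE_COLUMNS` and `rows_to_csv` are hypothetical names, and the sample row in the usage note is a placeholder, not real data.

```python
import csv
import io

# Column order taken from the schema listed above.
OE_COLUMNS = ["county", "precinct", "office", "district",
              "party", "candidate", "votes"]


def rows_to_csv(rows) -> str:
    """Serialize dict rows to CSV text with a header in OE_COLUMNS order.

    Missing keys become empty cells; unexpected keys raise ValueError,
    so a misspelled column name fails loudly instead of being dropped.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=OE_COLUMNS,
        extrasaction="raise",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

For example, `rows_to_csv([{"county": "Belknap", "precinct": "Alton", "office": "President", "candidate": "Placeholder Name", "votes": 0}])` yields a header line followed by one record with empty `district` and `party` cells.
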
## Surprises and lore

- **NH SoS publishes mixed `.xls` and `.xlsx` in the same election.**
  `WorkbookReader` sniffs magic bytes rather than trusting the extension.
- **Sheet 0 of multi-sheet workbooks is the county summary** (gets
  silently skipped via `skip_sheet_markers` config).
- **Each county sheet ends with a `TOTALS` row** that would otherwise be
  treated as a precinct. Default `skip_town_values` includes
  `{TOTALS, Totals, Total}`.
- **Two unincorporated Coos townships have legacy abbreviated names**
  in `town_to_county.py` ('At. & Gil. Academy Grant',
  'Thompson & Meserve's Pur.'); 2024 SoS uses slightly different
  abbreviations. Aliases in `PRECINCT_ALIASES` map between them.
- **"us-senator" vs "us-senate":** SoS files use the former; we use
  the latter as the canonical office slug.
- **2018 + 2020 precinct CSVs combine all offices into one file**.
  Our work uses one CSV per office (matching 2012 + modern sister-state
  conventions). Both styles are reasonable; the framework can emit
  either by tweaking the orchestrator.

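The magic-byte sniffing mentioned in the first bullet can be illustrated as below. This is a sketch of the general technique, not the actual `WorkbookReader` code: the function names are hypothetical. The signatures themselves are standard: legacy `.xls` files are OLE2 compound documents, and `.xlsx` files are ZIP archives.

```python
# OLE2 compound-document signature (legacy .xls).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# ZIP local-file-header signature (.xlsx is a ZIP container).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_workbook_format(header: bytes) -> str:
    """Classify a workbook by its leading bytes, ignoring the file extension."""
    if header.startswith(OLE2_MAGIC):
        return "xls"
    if header.startswith(ZIP_MAGIC):
        return "xlsx"
    raise ValueError("not an OLE2 or ZIP workbook")


def sniff_path(path) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as handle:
        return sniff_workbook_format(handle.read(len(OLE2_MAGIC)))
```

A ZIP signature only shows that the file is some ZIP container, so a stricter reader might also confirm the archive holds workbook parts before handing it to an `.xlsx` parser.
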
## Where to look first

If the OpenElections maintainers respond to the [in-flight PR](https://github.com/tclancy/openelections-data-nh/pull/new/claude/nh-rewrite),
their feedback determines what's next. Otherwise the highest-value
follow-ups are probably:

1. **Add Executive Council** (5 statewide districts, simple shape).
2. **Build the State House subclass**: biggest expansion of the
   framework, unlocks all the down-ballot data.
3. **Combine outputs into per-election precinct.csv files** if upstream
   prefers that to per-office files.

## Related branches (local-only, not pushed)

- `claude/nh-rewrite-design`: original design doc from the brainstorm
  at the start of the session. Lives at
  `docs/superpowers/specs/2026-05-23-nh-parser-rewrite-design.md` on
  that branch only. Useful for "why was this decided" archeology.

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -4,3 +4,8 @@ __pycache__/

 # Cloned by scripts/run-data-tests.sh, pinned to the version used in CI.
 .data_tests/
+
+# Tom's global ~/.gitignore excludes .claude/* by default, but we want
+# project-specific Claude notes tracked here so they survive clones.
+!.claude/
+!.claude/**
