Version 1.3 — adopted 2026-04-22, revised 2026-04-23 (Leeds drift spot-check:
187 drifted Tier-1 cells + 12 broken Tier-4 URLs + stale personnel), revised
2026-04-24 (Bradford screenshot regression: only 3 of 22 councils had shipped
screenshot evidence to main). This document is the canonical contract for
how data lands on CivAccount. Older planning docs (NORTH-STAR-STANDARD.md,
DATA-PIPELINE.md, DATA-YEAR-POLICY.md, COUNCIL-AUDIT-PLAYBOOK.md,
PROVENANCE-INTEGRITY-PLAN.md) are superseded and moved to docs/archive/.
Any conflict between this doc and older text — this doc wins.
v1.2 additions:
- Non-negotiable #6: Zero drift against reference datasets. (§2)
- Tier rules strengthened: verbatim rule applies to Tiers 1-4, not just Tier-3 PDFs. (§3)
- New §15: Continuous drift prevention — quarterly re-verification mandatory.
UK councils already publish how they spend public money. They publish it on their own websites. They publish it in inconsistent formats — PDFs, HTML tables, Socrata open-data portals, XLS spreadsheets, 360Giving CSVs, committee papers. A resident who wants to know "what does my council spend on bins / who does it pay / how high is my bill" has the information in theory but not in practice. They'd need to know which council, which URL, which format, and how to reconcile different pages to each other.
CivAccount does exactly one thing: goes to the council's publications, presents the numbers clearly in one place, and puts a link next to every number so the reader can verify it themselves against the council's own document.
We are a presentation layer. We are not a primary source. We do not compute. We do not estimate. We do not fill gaps with inference. If a council publishes a figure, we show it with a link. If a council doesn't publish a figure, we don't make one up — we leave the space empty and label it honestly.
This mission is the forcing function behind every decision below. If any decision seems to make CivAccount more comprehensive at the cost of verifiability, the decision is wrong.
1. Every rendered value must be independently verifiable by any member of the public. No internal-only sources, no trust-us layer, no "we checked, trust us" values.
2. Primary source = UK government publication. GOV.UK, ONS, DEFRA, DfT, Ofsted, LGBCE, or a council's own `.gov.uk` domain. News, aggregators, Wikipedia, and third parties can only cross-reference, never supply the value itself.
3. Every rendered value must appear verbatim in a linkable public document. No peer averages ("typical metropolitan district CE salary £X"), no year-on-year deltas ("+£191 from last year"), no per-capita comparators ("+£113 per resident vs average"), no 5-year arithmetic. Even when the inputs are individually sourced, the derived value doesn't exist on any council's website, so it fails the bar. The only exception is calculations mandated by statute whose outputs every council publishes verbatim (e.g. Local Government Finance Act 1992 s.5 band ratios: every council lists all 8 band values on its own website). Adopted 2026-04-22 after Bradford UX audit.
4. Date is structural, not decorative. Every value has a year label visible to the reader within one interaction. Mixed vintages across a single card are expected and correct: we label each value with its own year rather than pretend they share one.
5. If a value can't meet the bar above, it does not render. The UI's `DataValidationNotice` / card-hiding pattern handles this gracefully. Stripping is always preferable to fabrication.
6. Zero drift against reference datasets. ← NEW 2026-04-23. Every rendered value that mirrors a national reference CSV (parsed-population, parsed-area-band-d, RA Part 1/2) must match that CSV's current row exactly. Every Tier-4 live-page URL must resolve to HTTP 200. Every Tier-4 personnel name must match the council's currently-listed official. Verification is continuous (quarterly at minimum), not one-time. Councils that fall out of alignment drop out of `STRICT_COUNCILS` immediately.
7. Every council ships 1:1 screenshot evidence. ← NEW 2026-04-24. Every council in `STRICT_COUNCILS` declares at least one `page_image_url` in its `field_sources`. Every referenced PNG exists on disk under `src/data/councils/pdfs/council-pdfs/<slug>/images/`. Every `excerpt` is verbatim-present in the archived source (whitespace + unicode canonicalisation is OK, paraphrase isn't). The live popover is the contract: click a value → see the page in the document that contains it. No screenshot = no trust.
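The verbatim-excerpt test in the screenshot-evidence rule can be sketched roughly as follows. The names `canonicalise` and `excerptIsVerbatim` are hypothetical, and NFKC normalisation plus whitespace collapsing is only one assumed reading of "whitespace + unicode canonicalisation"; the project's own validator may normalise differently.

```typescript
// Hypothetical helper: ignore only cosmetic differences
// (Unicode compatibility forms and runs of whitespace).
function canonicalise(text: string): string {
  return text
    .normalize("NFKC")     // e.g. non-breaking space becomes an ordinary space
    .replace(/\s+/g, " ")  // collapse whitespace runs, including line breaks
    .trim();
}

// True only if the excerpt occurs word-for-word in the archived source text.
// An empty excerpt never counts as evidence.
function excerptIsVerbatim(excerpt: string, sourceText: string): boolean {
  const needle = canonicalise(excerpt);
  return needle.length > 0 && canonicalise(sourceText).includes(needle);
}
```

Under this sketch a line break or non-breaking space inside the source does not cause a failure, but any reworded excerpt does.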
Every rendered value is assigned a tier. Lower number = higher quality.
- Format: CSV / ODS direct download
- Verification: `source-truth` validator compares our rendered value to the exact source cell on every CI run
- Fingerprint: `parsed_csv_sha256` in `scripts/validate/source-manifest.json`
- Examples: MHCLG RA Part 1/2, MHCLG Council Tax live tables, ONS population estimates, DEFRA ENV18, DfT RDC/RDL, Ofsted inspection data, LGBCE electoral data, MHCLG CoR capital expenditure
- Format: CSV / JSON / Socrata API / 360Giving
- Fingerprint: sha256 on the archived file in `pdfs/spending-csvs/<slug>/` or `pdfs/360giving/` with `_meta.json`
- Verification: aggregate derivations re-computed from the raw file on each run
- Examples: Camden Socrata (`opendata.camden.gov.uk`), 360Giving grant registers, Bradford DataHub
- Format: PDF with extracted text (pdfplumber / pdftotext)
- Fingerprint: sha256 recorded in `_meta.json`, referenced in `field_sources[k].sha256_at_access`
- Verification: manual pdftotext extraction; excerpt quoted in data-file comment
- Examples: Bradford Pay Policy 2025-26 (page 11 deep-link), Bradford SoA, Bradford MTFS
- Format: URL that works in a human browser but cannot be fetched by our scripts (Cloudflare, bot-blocks) or HTML pages with no downloadable document form
- Fingerprint: unavailable; `archive_exempt: "cloudflare_blocked" | "bot_blocked" | "no_document_form" | "live_page"` flag recorded
- Verification: manual extraction at a recorded `accessed` date; ideally confirmed by a secondary source
- Examples: Camden CE salary page, Kent Cabinet live page
- Format: we point readers at the primary; the secondary (news / LGC / Wikipedia talk page / academic paper) is the confirmation source we used
- Fingerprint: the secondary source URL, date, and quoted figure go into `field_sources[k].cross_check_ref`
- Verification: we never rely on Tier 5 alone for the rendered value. We use the secondary to confirm what the bot-blocked primary says. If there's no primary, the field is stripped.
- Examples: Kent CE salary £223,979 (primary PDF Cloudflare-blocked; Kent Online confirmed the figure quoted from Kent's disclosure)
- LLM-generated / model-inferred figures
- Wikipedia as a primary source (tertiary only, as a pointer)
- Trade aggregators (TaxPayersAlliance, Glassdoor, TPA Rich List)
- News items that don't cite a primary UK-government source
- "Estimates based on size" / interpolation / model-derived values
- Prior CivAccount values that can't be back-traced to a primary document
These render with "Source: Calculated" or "Source: Comparison" labels and are all banned: they don't satisfy principle #3 (every value appears verbatim in a council publication):
- Year-on-year change deltas — "Up 9.3% from last year (+£191.00)". Even though this year's and last year's Band D are each Tier 1 sourced, the delta isn't in any publication. Strip. (The individual year-by-year Band D history card remains — each bar is sourced.)
- Multi-year change deltas — "+£450 over 5 years (+25.1%)". Same reason. Strip.
- Peer-group averages — "Typical metropolitan district CE salary £192,127", "Avg for districts £12,307". CivAccount computes these across all councils of a type. Strip.
- Peer-group comparators — "Compared to average: -£40.89", "+£113 per resident". Strip.
- Per-capita derivations — "£X per resident" budgets. Inputs (budget + population) are Tier 1 but the ratio isn't published. Strip.
- Combined ranking scores — "Above median on 5 of 7 metrics". Strip.
Only permitted derivation: statutory calculations whose outputs every council publishes verbatim. Current permitted set:
- Tax bands A-H from Band D (Local Government Finance Act 1992 s.5: 6/9, 7/9, 8/9, 11/9, 13/9, 15/9, 18/9 ratios; every billing authority publishes all 8 band values on its own council-tax page). Provenance label = `published`, not `calculated`, because the output figures ARE in every council's publication.
If an apparent derivation doesn't meet this "every council publishes the output verbatim" test, it's stripped. When in doubt, strip.
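As an illustration of the statutory exception, the band ratios can be used as a cross-check on the figures a council publishes, never as a replacement for them. This is a minimal sketch: `expectedBandPence`, `mismatchedBands` and the one-penny tolerance are assumptions made for the example, not part of the existing pipeline.

```typescript
// Band multipliers relative to Band D, expressed in ninths (D itself is 9/9).
const BAND_NINTHS = { A: 6, B: 7, C: 8, D: 9, E: 11, F: 13, G: 15, H: 18 } as const;
type Band = keyof typeof BAND_NINTHS;
const BANDS: Band[] = ["A", "B", "C", "D", "E", "F", "G", "H"];

// Work in whole pence to avoid floating-point drift in the ratio arithmetic.
function expectedBandPence(bandDPounds: number, band: Band): number {
  const bandDPence = Math.round(bandDPounds * 100);
  return Math.round((bandDPence * BAND_NINTHS[band]) / 9);
}

// Returns the bands whose published figure differs from the statutory ratio
// by more than `tolerancePence` (assumed default: 1p, to allow for rounding).
function mismatchedBands(
  bandDPounds: number,
  published: Record<Band, number>,
  tolerancePence = 1,
): Band[] {
  return BANDS.filter(
    (band) =>
      Math.abs(expectedBandPence(bandDPounds, band) - Math.round(published[band] * 100)) >
      tolerancePence,
  );
}
```

A non-empty result would be a reason to re-read the council's page, not a licence to render the computed figure.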
Every entry in Council.detailed.field_sources[k] must carry:
```typescript
{
  url: string;                    // Direct URL to primary source (preferred: direct PDF/CSV link)
  title: string;                  // Human-readable document title + page/section quote
  accessed: string;               // ISO date we last verified the value (YYYY-MM-DD)
  data_year: string;              // Fiscal / reporting year: "YYYY-YY" | "mid-YYYY" | "current"
  tier: 1 | 2 | 3 | 4 | 5;        // Source quality tier per section 3
  extraction_method:              // How the value was obtained from the source
    | 'csv_row'                   // → which row, which column in source-manifest
    | 'pdf_page'                  // → which page, what quoted text
    | 'aggregate'                 // → which filter, which aggregation function
    | 'socrata_query'             // → which dataset, which query
    | 'manual_read';              // → for Tier 4/5, recorded with accessed date
  // Tier-specific extras:
  sha256_at_access?: string;      // Required for tier ≤ 3
  archive_exempt?:                // Required for tier 4
    | 'cloudflare_blocked'
    | 'bot_blocked'
    | 'no_document_form'
    | 'live_page';
  cross_check_ref?: {             // Required for tier 5
    url: string;
    title: string;
    quoted_figure: string;
    access_date: string;
  };
  wayback_url?: string;           // Auto-snapshot on ingest (Memento protocol, RFC 7089)
  page?: number;                  // For pdf_page extraction
  excerpt?: string;               // Verbatim quote of the line containing the value
  page_image_url?: string;        // Pre-generated PDF-page PNG (Tier 3 only); see §6 Phase 1b + §8
  csv_row_excerpt?: {             // Pre-extracted CSV row (Tier 1/2 only); see §8
    headers: string[];
    row: string[];
    highlight_column: string;
  };
}
```

Any field missing `tier` or `extraction_method` fails the CI validator. Any field that doesn't meet its tier's fingerprint requirement fails the CI validator.
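The two failure rules above could be checked along these lines. This is a sketch only: `FieldSourceLike` and `validateFieldSource` are hypothetical names, the interface is a cut-down subset of the schema, and requiring a 64-character lowercase hex digest is an assumption about how `sha256_at_access` is stored.

```typescript
// Cut-down subset of the schema: only the fields this check reads.
interface FieldSourceLike {
  tier?: 1 | 2 | 3 | 4 | 5;
  extraction_method?: string;
  sha256_at_access?: string;
  archive_exempt?: string;
  cross_check_ref?: { url: string; title: string; quoted_figure: string; access_date: string };
}

// Returns a list of human-readable errors; an empty list means the entry passes.
function validateFieldSource(key: string, fs: FieldSourceLike): string[] {
  const errors: string[] = [];
  if (fs.tier === undefined) errors.push(`${key}: missing tier`);
  if (!fs.extraction_method) errors.push(`${key}: missing extraction_method`);

  // Tier-specific fingerprint requirements, per the schema comments.
  if (fs.tier !== undefined && fs.tier <= 3 && !/^[0-9a-f]{64}$/.test(fs.sha256_at_access ?? "")) {
    errors.push(`${key}: tier ${fs.tier} requires sha256_at_access`);
  }
  if (fs.tier === 4 && !fs.archive_exempt) {
    errors.push(`${key}: tier 4 requires archive_exempt`);
  }
  if (fs.tier === 5 && !fs.cross_check_ref) {
    errors.push(`${key}: tier 5 requires cross_check_ref`);
  }
  return errors;
}
```

Returning every error rather than stopping at the first keeps a CI log useful when a single entry has several problems.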
Follows ISO 8601 + UK fiscal-year convention:
- `"2025-26"` → 1 April 2025 to 31 March 2026 (UK fiscal year)
- `"mid-2024"` → ONS mid-year estimate as of June 2024
- `"current"` → live page with no stable fiscal year (e.g. Cabinet listings)
- Anything else → validator error
- Source document's own stated period is authoritative. Not access date, not current fiscal year, not publication date. Whatever the document's cover page / front matter states.
- Every displayed value carries its year within one interaction. Popover / badge / inline label.
- Multiple numbers near each other each carry their own year unless the surrounding prose explicitly states all numbers share a year.
- Historical series preferred where a council publishes one. Band D already has 5-year history (`band_d_2021` through `band_d_2025`); extend the pattern to other fields where history is published: budgets, reserves, capital, councillor allowance scheme.
- 2026-27 data is held back from UI until this verification pipeline is fully operational. MHCLG published 2026-27 Band D in March 2026; we have the values (`band_d_2026`) but the UI does not surface them until sign-off.
- When a council doesn't publish historical data, show current only with an explicit "history not available" note. Don't backfill from inference.
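The three accepted `data_year` shapes described earlier in this section lend themselves to a small well-formedness check. The sketch below uses a hypothetical function name, and the extra rule that the two-digit suffix must be the year after the four-digit prefix is an assumption layered on top of the stated formats.

```typescript
// Hypothetical well-formedness check for data_year.
function isValidDataYear(value: string): boolean {
  if (value === "current") return true;           // live page, no stable year
  if (/^mid-\d{4}$/.test(value)) return true;     // ONS mid-year estimate
  const match = /^(\d{4})-(\d{2})$/.exec(value);  // fiscal year, e.g. "2025-26"
  if (!match) return false;
  const startYear = Number(match[1]);
  const endSuffix = Number(match[2]);
  // Assumed rule: the suffix is the following year, so "2025-27" is rejected.
  return endSuffix === (startYear + 1) % 100;
}
```

For example, `isValidDataYear("2025-26")` and `isValidDataYear("mid-2024")` pass, while a bare `"2025"` does not.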
The same numbered steps run on every council. They're idempotent, scripted where possible, manual where necessary.
Find every document the council publishes that might contain relevant data.
Checklist of URL patterns to probe on every council site:
- `<council>.gov.uk/finances-and-spending/` (or variants: `/finance`, `/budgets`, `/financial-information`)
- `<council>.gov.uk/statement-of-accounts/`
- `<council>.gov.uk/pay-policy-statement/` or `/chief-officer-pay/`
- `<council>.gov.uk/councillors-allowances/` or `/members-allowances/`
- `<council>.gov.uk/mtfs/` or `/medium-term-financial-strategy/`
- `<council>.gov.uk/cabinet/` or `/portfolio-holders/`
- `democracy.<council>.gov.uk/documents/` (moderngov-style PDF archive)
- `opendata.<council>.gov.uk/` (Socrata / CKAN / ArcGIS)
- `datahub.<council>.gov.uk/` (some councils)
- `<council>.gov.uk/spending-over-500/` or `/invoices-over-250/`
- 360Giving registry for the council
- `data.gov.uk` datasets published by the council
Output: src/data/councils/pdfs/council-pdfs/<slug>/inventory.json listing every candidate URL with an initial tier guess.
For each inventory URL, download the file to src/data/councils/pdfs/council-pdfs/<slug>/ with:
- Filename that reflects document + year (e.g. `pay-policy-2025-26.pdf`)
- `_meta.json` sibling file carrying: source_url, publisher, document_type, fiscal_year, fetched (ISO), sha256, licence
Where the URL is bot-blocked (Tier 4), skip archive and record archive_exempt — the URL itself is still recorded in inventory, just not fetched.
Every successfully-archived URL gets a Wayback Machine snapshot via SavePageNow API. The Wayback URL goes into the _meta.json.
At archive time, generate lightweight visual evidence for each value so any reader, on a phone, can verify any number without running scripts.
For Tier 3 archived PDFs:
- At extraction time (Phase 2), we know which page each value came from.
- Render that page to PNG using `pdftoppm -png -r 150 -f <page> -l <page>`.
- Store the PNG as a public static asset in `src/data/councils/pdfs/council-pdfs/<slug>/images/` (served by Next at `/archive/<slug>/images/...`).
- Record the resulting URL in `field_sources[k].page_image_url`.
- One PNG per page-per-field (pages shared by multiple values get cached).
For Tier 1 GOV.UK CSVs:
- No PNG — screenshots of spreadsheets are unreadable.
- Instead, emit a structured `csv_row_excerpt` containing headers + the council's row + the column to highlight.
- UI renders this as an inline mini-table in the popover.
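One way the `csv_row_excerpt` could be produced from an already-parsed CSV is sketched below. `buildCsvRowExcerpt` and its parameters are hypothetical; the sketch assumes the file has been parsed into a header array plus string rows and that the council is identified by an exact match in one key column.

```typescript
interface CsvRowExcerpt {
  headers: string[];
  row: string[];
  highlight_column: string;
}

// Returns null rather than guessing when the council's row is missing or
// ambiguous, or when either named column does not exist.
function buildCsvRowExcerpt(
  headers: string[],
  rows: string[][],
  keyColumn: string,
  keyValue: string,
  highlightColumn: string,
): CsvRowExcerpt | null {
  const keyIndex = headers.indexOf(keyColumn);
  if (keyIndex === -1 || headers.indexOf(highlightColumn) === -1) return null;

  const matches = rows.filter((row) => row[keyIndex] === keyValue);
  const only = matches[0];
  if (matches.length !== 1 || only === undefined) return null;

  // Copy the arrays so later edits to the parsed CSV cannot alter the evidence.
  return { headers: headers.slice(), row: only.slice(), highlight_column: highlightColumn };
}
```

Refusing to emit an excerpt on zero or multiple matches mirrors the document's "strip rather than fabricate" rule.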
For Tier 2 council open-data (Socrata / 360Giving):
- Either inline mini-table (same as Tier 1), or link directly to the platform's filtered view if the platform renders well (Socrata's native UI shows the row).
For Tier 4 (bot-blocked):
- Pre-generation not possible. Rely on Wayback snapshot URL + live-URL-works-in-browser caveat.
For Tier 5 (secondary-confirmed):
- Not applicable — no primary file to screenshot.
Storage budget: ~200 KB per PNG × ~10 fields per council × 317 councils ≈ ~600 MB at full rollout. Acceptable. Deduplication via content-hashed filenames possible later if it becomes large.
Why this is worth the storage: the single biggest blocker to non-technical spot-checks is "I'd have to open a 6 MB PDF and find the right page." Pre-generated page PNGs reduce that to "tap the value, see the page." Immediately usable by anyone, not just developers.
Per document type:
- PDF → `pdftotext -layout`, sometimes with page-range. Key figures located by regex / text search. Quote the line containing the value in the data-file comment.
- CSV / XLSX → parse, navigate to the council's row, extract the named column. Record which row index + column header.
- Socrata JSON → API query, aggregate if needed. Record the query.
- HTML table → where accessible, parse with cheerio or equivalent. For bot-blocked: manual read.
Output: src/data/councils/pdfs/council-pdfs/<slug>/extracted-values.json with every value tagged by source file + location + extraction method.
For each extracted value:
- If the value also appears in a Tier 1 CSV (e.g. CE salary appears in a council PDF and we happen to have it from an MHCLG aggregated dataset) — cross-check they agree.
- For large financial figures, run Benford's Law first-digit test on the aggregate distribution of all values per council. Flag any council whose Benford conformance is > 1.96σ from expected for human spot-check.
- Year-on-year outlier: compare to prior-year value if available. > 30% movement is flagged for human confirmation (might be real, usually worth a second look).
- Sum consistency: budget category sum = total_service; allowances detail sum ≈ total_allowances_cost ± 5%; precept shares sum = total Band D.
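The year-on-year and sum-consistency checks above reduce to two small predicates. This is a sketch under assumptions: the function names are hypothetical, and treating a move away from a zero prior-year value as an outlier is a choice made for the example.

```typescript
// Flags a year-on-year movement larger than `threshold` (0.3 = 30%).
function isYoyOutlier(current: number, prior: number, threshold = 0.3): boolean {
  if (prior === 0) return current !== 0; // assumed: any move from zero needs a look
  return Math.abs(current - prior) / Math.abs(prior) > threshold;
}

// True when the parts add up to the total within a relative tolerance.
// relTolerance = 0 demands an exact match; 0.05 allows ±5%.
function sumsAgree(parts: number[], total: number, relTolerance = 0): boolean {
  const sum = parts.reduce((acc, value) => acc + value, 0);
  return Math.abs(sum - total) <= Math.abs(total) * relTolerance;
}
```

For exact-match sums it is safer to pass whole pounds or pence, since fractional values can fail an exact comparison through floating-point rounding alone.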
Only values that survived Phases 1-3 land in the council's data file. Each one carries the full field_sources[k] schema from section 4.
Values that failed:
- Not published by council → field is absent. UI handles via card-hiding or "not published" label.
- Published but extraction ambiguous → flag in `<COUNCIL>-AUDIT.md` under "Known gaps"; field absent until resolved.
- Published but fails cross-check (Benford outlier / sum inconsistency / YoY outlier) → field absent; investigate separately.
Run the validator suite:
- `source-truth` (Tier 1 values match source cell)
- `audit-north-star` (structural completeness)
- `field-source-years` (data_year present + well-formed)
- `tier-classification` (tier declared + extraction_method declared)
- `benford` (first-digit distribution)
- `sum-consistency` (cross-field sums)
- `yoy-outlier` (year-on-year sanity)
- `link-check` (no silent 404s)
All must pass. Any failure → field stripped or fix re-attempted; not waved through.
Added 2026-04-22 after Bradford audit revealed that structural validators can pass while the UI still renders unwrapped numbers. Structural compliance is necessary but not sufficient.
Load /council/<slug> in a real browser. Walk every text node. Any numeric value not inside a [role="button"][aria-label^="Source:"] ancestor is a violation.
Script: node scripts/council-research/ux-audit.mjs --council=<Name>
For each violation:
- If it's a real data point → wrap it in `<SourceAnnotation>` in the rendering component
- If it's decorative / repeat / label → leave (and make sure the sweep filter catches it as non-data)
- If it's unsourceable → strip the underlying data field; UI either omits the card or shows `DataValidationNotice`
Re-run until 0 violations. No exceptions.
Visual checks (human must do):
- Tap first hero value → popover opens → source URL is specific (not a landing page) → tier badge visible
- Tap a Tier 3 value with `page_image_url` → thumbnail loads → lightbox shows exactly the right PDF page
- Tap "Open source" on a Tier 3 PDF link → browser jumps to the right page (via `#page=N`)
- Supplier drill-down → helper copy names the specific supplier/recipient to search for
This is the single most important gate. A council that passes Phases 0-5 but fails Phase 5b still has unverified data rendering to the public — and that breaks the whole mission.
Write <COUNCIL>-AUDIT.md in the data repo, including a Datasheet for Datasets-style section (Gebru et al., ACM 2021 standard):
- Motivation (why does this council's dataset exist — link to mission)
- Composition (what fields are populated, what are absent, why)
- Collection process (sources, dates, methods)
- Preprocessing (extraction scripts used, any cleaning)
- Uses (who is this for)
- Distribution (OGL v3, how readers access)
- Maintenance (when re-verified, by whom)
Plus a per-field register mapping field → source document → page → sha256 → last-verified → tier → extraction method.
One PR per council: data file + archived files + _meta.json + extracted-values.json + <COUNCIL>-AUDIT.md. CI runs all validators. Human reviews. Merge.
Every field_sources entry is compatible with the W3C PROV data model (PROV-DM / PROV-O, 2013 standard used by research data infrastructure worldwide):
- Entity: the source document (PDF, CSV, HTML page) identified by URL + sha256
- Activity: the extraction step (`pdf_page` / `csv_row` / etc.) identified by extraction_method + accessed date
- Agent: the human + script that performed extraction (commit author + script path + script version)
- Relationships:
  - rendered value `wasDerivedFrom` source document
  - extraction activity `wasAssociatedWith` agent
  - rendered value `wasGeneratedBy` extraction activity
This gives us a machine-readable lineage graph. Research groups ingesting our dataset can use existing PROV-aware tooling without transforming it.
At render time, the SourceAnnotation popover can render a plain-English lineage sentence: "The Bradford 2025-26 Chief Executive salary £217,479 was extracted from page 11 of Bradford's Pay Policy Statement 2025-26 (sha256 545774…) by Claude under commit 4130b06 on 2026-04-21, verified against published .gov.uk URL."
The one UI rule: any reader, on any device, must be able to verify any rendered value with at most two taps.
A SourceAnnotation popover opens, showing:
- Plain-English lineage sentence (§7)
- Tier badge — "GOV.UK bulk" / "Council open-data" / "Archived PDF" / "Council PDF (bot-blocked)" / "Secondary-confirmed"
- Data year badge — e.g. "2025-26"
- Visual evidence appropriate to the tier:
- Tier 1 / 2: inline mini-table with the CSV row + column highlighted
- Tier 3: thumbnail of the pre-generated page PNG — tap to expand full-screen
- Tier 4: "Open source" link → live council URL + "Open archived copy" link → Wayback snapshot
- Tier 5: primary URL + secondary confirmation URL side by side with quoted figure
- sha256 fingerprint prefix (first 12 chars): proof the archived file is tamper-evident
- "Open source" button: opens the primary `.gov.uk` document in a new tab
- "Open local archive" button (Tier ≤ 3): opens our immutable archived copy
- Last-verified date (ISO, human-readable)
- "Report incorrect data" footer link → pre-filled feedback form
The non-dev spot-check path is: tap value → see page image + source URL. One tap. No terminal. No scripts. No PDF download on mobile data.
Aggregate view of every rendered field for a council:
- Table sorted by tier (best evidence first)
- Each row has the same evidence as the popover: mini-table / page PNG thumbnail / source link
- FAIR self-assessment JSON-LD block (§9)
- Datasheet-for-Datasets summary (§6 Phase 6)
- "Download all archives as ZIP" for researchers
Every correction ever made, visible to the public (§14).
- Card headers carry year: "Who the council pays — 2024-25"
- Mixed-vintage cards: each number carries its own year inline
- Hero paragraphs that combine multiple numbers must either (a) share a year across all numbers or (b) each number gets its own year inline
Following the methodology in Mark Nigrini's Journal of Accountancy 2022 paper and his 2012 Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection:
Benford's Law says the first digit of naturally-occurring financial values follows:
| First digit | Expected % |
|---|---|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
Fabricated / manipulated financial data tends to deviate — fabricators distribute digits more uniformly than reality does.
Application: new validator scripts/validate/validators/benford.mjs runs the first-digit test across every numeric value we render per council (minimum N=30 values — below which the test is unreliable per Nigrini). Per-council z-score logged; councils with z > 1.96 flagged for human spot-check.
This wouldn't be a gate — Benford false-positives are real, especially with small samples. It's a trip-wire: "this council's numbers look statistically unusual; audit them." Today's Camden figures likely would have tripped this.
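A first-digit trip-wire along these lines might look like the sketch below. It uses the per-digit z-statistic with a continuity correction that is commonly used in Benford testing; taking the largest per-digit z as the council's score is an assumption, as are the function names and the result shape.

```typescript
// Expected proportion of first digit d under Benford's Law: log10(1 + 1/d).
function benfordExpected(digit: number): number {
  return Math.log10(1 + 1 / digit);
}

// Leading significant digit of a finite, non-zero number; null otherwise.
function firstDigit(value: number): number | null {
  const abs = Math.abs(value);
  if (!Number.isFinite(abs) || abs === 0) return null;
  return Number(abs.toExponential().charAt(0)); // "2.17479e+5" -> 2
}

interface BenfordResult {
  n: number;
  maxZ: number;
  flagged: boolean;
}

// Returns null when there are fewer than minN usable values.
function benfordCheck(values: number[], minN = 30, zCritical = 1.96): BenfordResult | null {
  const counts = new Array<number>(10).fill(0);
  let n = 0;
  for (const value of values) {
    const digit = firstDigit(value);
    if (digit !== null) {
      counts[digit] = (counts[digit] ?? 0) + 1;
      n += 1;
    }
  }
  if (n < minN) return null;

  let maxZ = 0;
  for (let digit = 1; digit <= 9; digit++) {
    const expected = benfordExpected(digit);
    const observed = (counts[digit] ?? 0) / n;
    const gap = Math.abs(observed - expected);
    const correction = 1 / (2 * n); // applied only when smaller than the gap
    const standardError = Math.sqrt((expected * (1 - expected)) / n);
    const z = (gap > correction ? gap - correction : gap) / standardError;
    maxZ = Math.max(maxZ, z);
  }
  return { n, maxZ, flagged: maxZ > zCritical };
}
```

Because nine digits are tested at once, a council will cross 1.96 by chance fairly often, which fits the trip-wire framing: a flag prompts a human look and nothing more.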
Per Wilkinson et al., Scientific Data (2016), the most-cited modern data-quality standard:
- Findable: every council has a persistent URI (`civaccount.co.uk/data/council/<slug>/`) + rich JSON-LD metadata + listed in the `sitemap.xml`.
- Accessible: all data publicly served over HTTPS, no auth required; archive files available at stable content-addressed URLs.
- Interoperable: JSON-LD schema.org Dataset markup; CSV export endpoint per council; PROV-compatible provenance.
- Reusable: Open Government Licence v3.0 (matches the councils' own licence — we inherit); clear citation guidance.
A FAIR self-assessment JSON-LD block lives on every council's /provenance page.
Every rendered council page embeds:
- `schema:WebPage` (existing)
- `schema:Dataset` (the council's derived data)
- `schema:Organization` (the council itself, linked to GOV.UK URI)
- `schema:hasPart` relationships linking dataset to source documents
Target: ★★★★★ for the derived dataset.
- ★ on the web — done (civaccount.co.uk is live)
- ★★ structured data — done (TypeScript data files, JSON API)
- ★★★ non-proprietary — done (JSON, CSV export)
- ★★★★ URIs identify things — to add (each field has a persistent URI)
- ★★★★★ linked data — to add (link council URI to MHCLG URI, ONS URI, LGBCE URI)
Following the Alan Turing Institute's Turing Way handbook for reproducible research:
Per council, a manifests/<slug>.json reproducibility manifest lists:
- Every source URL + sha256 at fetch time + fetch date
- Every extraction script version (commit SHA)
- Every validator that ran + version + outcome
- The final rendered values
Anyone can run npm run reproduce -- --council=Bradford which:
- Re-fetches each source URL via Wayback if primary is down
- Compares fetched sha256 to stored sha256
- Re-runs extraction scripts
- Compares extracted values to current data file
- Exits 0 if identical; exits 1 with a diff if not
This is the scientific-reproducibility floor: anyone in the world should be able to reproduce our numbers.
Downloaded source files are stored content-addressed — the filename on disk is the sha256 hash:
src/data/councils/pdfs/by-hash/
├── 54/57/54577433...e8b4.pdf
├── 2a/72/2a72b676...981f.pdf
...
Human-readable pointers live alongside:
src/data/councils/pdfs/council-pdfs/<slug>/
├── pay-policy-2025-26.pdf → symlink / pointer to by-hash/54/57/...
├── pay-policy-2025-26_meta.json
├── statement-of-accounts-2024-25.pdf → symlink / pointer to by-hash/af/b8/...
...
Benefit: if pay-policy-2025-26.pdf is ever replaced (accidentally or maliciously) the pointer breaks and CI fails. The original file (under its hash) is always preserved. Git-LFS or plain git depending on file sizes.
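The directory layout shown above (first two hex characters, next two, then the full hash as filename) can be derived from a digest as in this sketch. `byHashPath` is a hypothetical helper; it assumes the digest is already computed and stored as 64 lowercase hex characters.

```typescript
// Builds the content-addressed location for a file from its sha256 digest.
function byHashPath(sha256Hex: string, extension: string): string {
  if (!/^[0-9a-f]{64}$/.test(sha256Hex)) {
    throw new Error("expected a 64-character lowercase hex sha256 digest");
  }
  const ext = extension.replace(/^\./, ""); // accept "pdf" or ".pdf"
  const shard1 = sha256Hex.slice(0, 2);
  const shard2 = sha256Hex.slice(2, 4);
  return `src/data/councils/pdfs/by-hash/${shard1}/${shard2}/${sha256Hex}.${ext}`;
}
```

A CI check could then recompute each archived file's digest and confirm it still lives at the path this function returns.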
Every source URL on ingest is auto-snapshotted to Internet Archive via the SavePageNow API. The Wayback URL is stored in field_sources[k].wayback_url alongside the live URL.
On the /provenance page, readers see both:
- "Source (live): council.gov.uk/pay-policy/"
- "Source (archived 2026-04-22): web.archive.org/web/20260422…/…"
Means readers can always verify the value even if the council has since edited or deleted the document.
Visible on the site at /corrections:
- Every correction that changed a published value
- Date of correction
- Old value → new value
- Why (e.g. "source-truth validator found year-drift against MHCLG RA 2025-26")
- Verifying source link
- Commit SHA
Today's Camden £118.5m → £10m correction is correction #001. Bradford's CE name fix is earlier. Publishing these visibly is a trust-through-fallibility signal — Poynter / IFCN journalism standard.
Per Sculley et al., Hidden Technical Debt in Machine Learning Systems (Google, NeurIPS 2015), silent data dependencies are the single biggest source of data-pipeline rot.
data-dependencies.json (machine-readable) declares:
```json
{
  "sources": {
    "area-council-tax": {
      "downstream_fields": ["council_tax.band_d_2025", ...],
      "last_updated": "2026-04-21",
      "consumers": ["source-truth validator", "UnifiedDashboard hero"]
    },
    ...
  }
}
```

Change a source → validator knows exactly which fields depend → CI fails until consumers are reviewed. Prevents the "oh we forgot to update X" class of error.
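Given a parsed `data-dependencies.json`, the "which fields depend" lookup could be as simple as the sketch below. The names `impactOfChange`, `ChangeImpact` and `undeclaredSources` are hypothetical, and how CI detects that a source has changed is outside the sketch.

```typescript
interface SourceDependency {
  downstream_fields: string[];
  last_updated: string;
  consumers: string[];
}

interface ChangeImpact {
  fields: string[];
  consumers: string[];
  undeclaredSources: string[]; // changed sources with no entry in the file
}

// Appends items to target, skipping any that are already present.
function addUnique(target: string[], items: string[]): void {
  for (const item of items) {
    if (target.indexOf(item) === -1) target.push(item);
  }
}

// Collects every downstream field and consumer affected by the changed sources.
function impactOfChange(
  sources: Record<string, SourceDependency>,
  changedSources: string[],
): ChangeImpact {
  const impact: ChangeImpact = { fields: [], consumers: [], undeclaredSources: [] };
  for (const name of changedSources) {
    const dependency = sources[name];
    if (dependency === undefined) {
      addUnique(impact.undeclaredSources, [name]);
      continue;
    }
    addUnique(impact.fields, dependency.downstream_fields);
    addUnique(impact.consumers, dependency.consumers);
  }
  return impact;
}
```

Reporting undeclared sources separately surfaces exactly the silent dependency the file is meant to prevent.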
These run on every commit. Any failure blocks merge.
| Validator | Checks | Status |
|---|---|---|
| `source-truth` | Tier 1 values exact-match source CSV cells | ✅ existing |
| `field-source-years` | Every field_sources entry has well-formed data_year | ✅ existing |
| `audit-north-star` | 5-criterion structural gate on reference councils | ✅ existing |
| `north-star-gate` | Regression gate: reference councils must stay at 0/5 | ✅ existing |
| `tier-classification` | Every value declares tier + extraction_method | 🔴 NEW |
| `forbidden-source-scan` | No URL points at a forbidden domain list | 🔴 NEW |
| `benford` | Per-council first-digit distribution within 1.96σ | 🔴 NEW |
| `sum-consistency` | Cross-field sum checks | ✅ partial, extend |
| `yoy-outlier` | Year-on-year change within plausible bounds | 🔴 NEW |
| `last-verified-freshness` | No value's last_verified older than 180 days | 🔴 NEW |
| `link-check` | HTTP 200 on every field_sources URL, no silent 404s | ✅ existing |
| `reproducibility` | `npm run reproduce` exits 0 per council | 🔴 NEW |
| `content-addressed-archive` | Every archived file matches its recorded sha256 | 🔴 NEW |
Seven new validators to build. Three existing validators to extend.
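As one example from the table, the 180-day freshness rule comes down to date arithmetic on ISO dates. This sketch assumes the date being checked is the `accessed` field from the schema and that "today" is passed in explicitly so the check is deterministic; both function names are hypothetical.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between two YYYY-MM-DD dates, both interpreted as UTC midnight.
function daysSince(isoDate: string, today: string): number {
  const then = Date.parse(`${isoDate}T00:00:00Z`);
  const now = Date.parse(`${today}T00:00:00Z`);
  if (Number.isNaN(then) || Number.isNaN(now)) {
    throw new Error(`unparseable date: ${isoDate} or ${today}`);
  }
  return Math.floor((now - then) / MS_PER_DAY);
}

// True when the value was last verified more than maxAgeDays ago.
function isStale(lastVerified: string, today: string, maxAgeDays = 180): boolean {
  return daysSince(lastVerified, today) > maxAgeDays;
}
```

Pinning both dates to UTC avoids off-by-one results around daylight-saving changes.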
I (Claude) have no persistent memory across sessions. The methodology must be resumable from scratch by reading:
- This document: `NORTH-STAR.md`
- Per-council status: `scripts/council-research/status/<slug>.json` (per-council progress: Phase 0 ✓ / Phase 1 ✓ / Phase 2 ⏳ …)
- Per-council audit: `docs/<COUNCIL>-AUDIT.md` (what's been verified, open gaps)
- Global progress: `docs/PROGRESS.md` (which councils done, which in progress, which blocked)
Any future session reads these four sources and knows exactly where to pick up. No oral history, no assumed context, no "ask the previous agent."
Already installed / in repo:
- `pdftotext` (poppler): PDF text extraction ✓
- `pdftoppm` (poppler): PDF page → PNG rendering ✓
- `sha256sum`: fingerprinting ✓
- Node.js + repo scripts ✓
To build (as part of foundation scaffolding):
- `scripts/council-research/01-inventory.mjs`: URL pattern probing
- `scripts/council-research/02-archive.mjs`: fetch + sha256 + _meta + Wayback snapshot
- `scripts/council-research/03-extract-pdf.mjs`: pdftotext wrapper + structured JSON
- `scripts/council-research/04-extract-csv.mjs`: csv/xlsx extractor
- `scripts/council-research/05-populate.mjs`: extracted JSON → data-file diff
- `scripts/council-research/06-audit-evidence.mjs`: on-demand PDF → PNG for spot-checks
- `scripts/council-research/lib/{fetch,pdf,sha256,meta,wayback,prov}.mjs`: helpers
- `scripts/validate/validators/{tier-classification,forbidden-source-scan,benford,yoy-outlier,last-verified-freshness,reproducibility,content-addressed-archive}.mjs`: new validators
A council is North-Star done when:
- ✓ All archivable documents fetched to `pdfs/council-pdfs/<slug>/` with sha256 + `_meta.json` + Wayback URL
- ✓ Every rendered field has a `field_sources` entry meeting section 4 schema
- ✓ All 13 CI validators pass (0 errors, warnings reviewed)
- ✓ `ux-audit.mjs --council=<Name>` reports 0/0 violations: both 0 unwrapped numeric values AND 0 derived/comparator values (Phase 5b, expanded 2026-04-22 after Bradford audit)
- ✓ Expected data-level strips applied per COUNCIL-ROLLOUT-PLAYBOOK.md Phase 3 checklist (performance_kpis, service_outcomes.housing, service_outcomes.population_served, service_spending, etc., unless explicitly page-level sourced)
- ✓ `<COUNCIL>-AUDIT.md` written with Datasheet for Datasets structure (section 6 Phase 6)
- ✓ `manifests/<slug>.json` reproducibility manifest committed
- ✓ `npm run reproduce -- --council=<slug>` exits 0
- ✓ `status/<slug>.json` marked `"done": true`
- ✓ Any fields that can't meet the bar are absent (stripped), not approximated
- ✓ Every single rendered value appears verbatim in a linkable public document: no peer averages, no YoY deltas, no per-capita ratios, no multi-year change callouts; only the statutory tax-bands exception is permitted (principle #3)
See COUNCIL-ROLLOUT-PLAYBOOK.md for the step-by-step operational guide.
Three reference councils (Bradford, Camden, Kent) must be North-Star done before we scale to any other council. After that, new councils are added one at a time, each passing the same bar.
- LLM-invented numbers (the root cause of Camden's £118.5m fabrication; will be forbidden by `forbidden-source-scan` validator)
- Uniform schema that forces every council to have every field (optional schema; stripping is fine)
- "Looks approximately right" tolerances on values known to be exact in source
- Rendering a blanket-year claim across mixed-vintage data
- Back-filling missing history via interpolation
- Trust-us claims without a verifiable click-through to a primary source
Essential (cited throughout this doc):
1. Wilkinson et al. (2016), Scientific Data. The FAIR Guiding Principles for scientific data management and stewardship. https://www.nature.com/articles/sdata201618
2. W3C (2013). PROV-DM: The PROV Data Model. https://www.w3.org/TR/prov-dm/
3. Gebru et al. (2021), Communications of the ACM. Datasheets for Datasets. https://dl.acm.org/doi/10.1145/3458723
4. Nigrini (2012), Wiley. Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection.
5. Berners-Lee (2006, updated 2010). 5-Star Open Data. https://5stardata.info/
6. Sculley et al. (2015), NeurIPS. Hidden Technical Debt in Machine Learning Systems. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
7. The Alan Turing Institute. The Turing Way. https://the-turing-way.netlify.app/
8. Van de Sompel et al. (2013), RFC 7089. Memento: Time-Based Access to Resource States. https://datatracker.ietf.org/doc/html/rfc7089

UK-specific:

9. Open Data Institute. Open Data Certificate (ODI bronze/silver/gold/platinum levels). https://certificates.theodi.org/
10. mySociety (2023). Unlocking the Value of Fragmented Public Data. https://www.mysociety.org/2023/02/21/unlocking-the-value-of-fragmented-public-data/
11. Centre for Public Data. Publications on local-government data fragmentation. https://www.centreforpublicdata.org/publications
12. Institute for Fiscal Studies. Local-government finance methodology notes. https://ifs.org.uk/topics/local-government-finance
These are the floor. Anyone picking up this project should skim all of them before making methodology decisions.
Problem statement: Between 2026-04-21 and 2026-04-23 we rolled out 22 councils
to North-Star-complete state. A routine spot-check of Leeds (ordered by the
project owner on 2026-04-23) found that most declared-complete councils had
drifted against their Tier-1 source datasets and Tier-4 live pages. The
structural validators (audit-north-star.mjs, ux-audit.mjs, tier-classification.mjs,
source-truth.mjs on a limited field subset) had all passed even while 187
Tier-1 cells rendered values that did not match the current parsed CSVs, 12
Tier-4 URLs had linkrot, and 2 CE names were months out of date.
Root cause: structural validators check form (every entry has sha256, every value has SourceAnnotation, etc.). They do not check correctness against current reference data. Drift accumulates silently.
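The difference between checking form and checking correctness can be shown with a minimal sketch. Everything in it is an assumption made for illustration: the entry shape (`field`, `value`, `sha256`, `source_url`), the function names, and the sample figures are hypothetical and do not reproduce the logic of `audit-north-star.mjs` or `audit-tier1-drift.mjs`.

```javascript
// Hypothetical entry shape: { field, value, sha256, source_url }.

// Structural check: is the provenance metadata present and well-formed?
function isStructurallyValid(entry) {
  return (
    typeof entry.sha256 === "string" &&
    /^[0-9a-f]{64}$/.test(entry.sha256) &&
    typeof entry.source_url === "string" &&
    entry.source_url.startsWith("https://")
  );
}

// Correctness check: does the rendered value still equal the value in the
// current reference data (for example, a freshly parsed CSV)?
function matchesReference(entry, referenceByField) {
  return (
    Object.prototype.hasOwnProperty.call(referenceByField, entry.field) &&
    referenceByField[entry.field] === entry.value
  );
}

// Made-up sample: the entry is perfectly formed but its value is stale.
const sampleEntry = {
  field: "population",
  value: 812000,
  sha256: "a".repeat(64),
  source_url: "https://example.org/parsed-population.csv",
};
const sampleReference = { population: 815500 };

console.log(isStructurallyValid(sampleEntry));               // true
console.log(matchesReference(sampleEntry, sampleReference)); // false
```

An entry like this passes every form check while rendering a stale number, which is the failure mode the spot-check exposed.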
Standing policy going forward:
| Drift source | Audit tool | Target cadence | Pass bar |
|---|---|---|---|
| `src/data/population.ts` vs `parsed-population.csv` | `audit-tier1-drift.mjs` | quarterly + every rollout | 0 cells drifted |
| `council_tax.band_d_YYYY` vs `parsed-area-band-d.csv` | `audit-tier1-drift.mjs` | same | 0 cells |
| `budget.*` vs `RA_Part1_LA_Data.csv` | `audit-tier1-drift.mjs` | same | 0 cells |
| `budget.total_service` vs sum of categories | `validate.mjs` cross-field | every validator run | match within £1k |
| `budget.net_current` vs RA `netcurrtot` | `audit-tier1-drift.mjs` | same | match within £1k |
| Tier-4 URL rot (404/403/5xx) | `link-check-tier4.mjs` | monthly | 0 broken (HEAD-403 bot-blocks documented) |
| CE / Leader / councillor count personnel drift | manual spot-check + WebSearch | quarterly | current as of published leadership page |
| Tier-3 archived PDF superseded by newer version | `compare-checksums.mjs` | annually | sha256 unchanged OR refresh triggered |
| Missing / non-verbatim `page_image_url` (← added 2026-04-24) | `screenshot-parity.mjs` | every rollout + quarterly | ≥1 screenshot per council, excerpt verbatim in archive |
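Two of the pass bars in the table are simple enough to sketch: the exact-match rule ("0 cells drifted") and the tolerance rule ("match within £1k"). This is an illustrative sketch only. The function names, the flat key/value shape of the data, the category names, and the assumption that amounts are expressed in pounds are all hypothetical; the real checks live in `audit-tier1-drift.mjs` and `validate.mjs`.

```javascript
// Assumption: amounts are in pounds. If the source data is in £ thousands,
// the tolerance would need to be 1 instead.
const TOLERANCE_POUNDS = 1000;

function withinTolerance(a, b, tolerance = TOLERANCE_POUNDS) {
  return Math.abs(a - b) <= tolerance;
}

// Tolerance rule: budget.total_service should match the sum of its categories.
function checkTotalService(budget) {
  const sum = Object.values(budget.categories).reduce((acc, v) => acc + v, 0);
  return { sum, declared: budget.total_service, ok: withinTolerance(sum, budget.total_service) };
}

// Exact-match rule: every reference cell must equal the rendered cell.
// Returns the keys of the cells that drifted; the pass bar is an empty list.
function findDriftedCells(rendered, reference) {
  return Object.keys(reference).filter((key) => rendered[key] !== reference[key]);
}

// Made-up figures for illustration.
const sampleBudget = {
  total_service: 1500500,
  categories: { education: 900000, adult_social_care: 600000 },
};
console.log(checkTotalService(sampleBudget).ok); // true (difference is £500)

const renderedBandD = { band_d_2024: 1800.5, band_d_2025: 1890.1 };
const referenceBandD = { band_d_2024: 1800.5, band_d_2025: 1895.25 };
console.log(findDriftedCells(renderedBandD, referenceBandD)); // [ 'band_d_2025' ]
```

Keeping the two rules separate matters: a tolerance is only appropriate where the table grants one, and every other Tier-1 cell is compared for exact equality.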
A council is North-Star complete if and only if all 5 structural gates
(north-star 0/5, ux-audit 0/0, validator 0 errors, live-site-reality 3/3,
screenshot-parity ✓) pass simultaneously on the current reference
CSVs. If the quarterly audit detects drift, the council's `status/<slug>.json`
is flipped to `north_star_complete: false` and the council is removed from
`STRICT_COUNCILS` until the fix PR lands. No exceptions.
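The completion rule above amounts to a conjunction of gate results plus a demotion step, which a short sketch makes concrete. The gate identifiers, the shape of the results object, and both function names are assumptions for illustration; only `north_star_complete` and the idea of a strict-council list come from this document, and re-adding a council once it passes again is also an assumption.

```javascript
// Hypothetical identifiers for the five structural gates named above.
const STRUCTURAL_GATES = [
  "north-star",
  "ux-audit",
  "validator",
  "live-site-reality",
  "screenshot-parity",
];

// gateResults: { [gateName]: boolean }
// driftedCells: number of drifted cells reported by the latest drift audit.
function evaluateCouncil(gateResults, driftedCells) {
  const failedGates = STRUCTURAL_GATES.filter((gate) => gateResults[gate] !== true);
  return {
    north_star_complete: failedGates.length === 0 && driftedCells === 0,
    failedGates,
    driftedCells,
  };
}

// Returns a new strict list: a council is removed as soon as it is not
// complete, and (by assumption) re-added once it passes again.
function updateStrictCouncils(strictCouncils, slug, status) {
  if (!status.north_star_complete) {
    return strictCouncils.filter((s) => s !== slug);
  }
  return strictCouncils.includes(slug) ? strictCouncils : [...strictCouncils, slug];
}

// Made-up example: every structural gate passes, but the audit found drift.
const allPass = Object.fromEntries(STRUCTURAL_GATES.map((gate) => [gate, true]));
const demoStatus = evaluateCouncil(allPass, 187);
console.log(demoStatus.north_star_complete);                                  // false
console.log(updateStrictCouncils(["leeds", "camden"], "leeds", demoStatus)); // [ 'camden' ]
```

A missing gate result is treated as a failure here, so a council cannot be marked complete simply because a gate was never run.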
- `/rollout-council` skill: runs Phases 3.5, 3.6, 5c, 5d as mandatory gates.
- `/audit-council` skill: lightweight quarterly re-verification across 7 gates.
- `/refresh-data` skill (new): pulls fresh parsed CSVs from GOV.UK and re-runs `audit-tier1-drift.mjs` across all strict councils.
This document was adopted 2026-04-22, revised 2026-04-23 (v1.2 drift-prevention additions). Any change to its principles (section 2) requires explicit sign-off from the project owner. Changes to process/tooling (sections 3-20, 23) can be made by commit with a clear commit message, and may be reverted at the owner's request.