version-requirements

claude-code
v2.1.0+

version-last-verified

2026-04-06

measurement-claims

claim	source	date	revalidate
H-CONFIG-01 confidence advanced from 3.0/5 to 4.7/5 over 3 days through 5 explicit revalidation events	Direct analysis — third-brain H-CONFIG-01-evidence.md confidence history	2026-04-06	2026-10-06

claim	source	date	revalidate
H-NDR-FEDERATION-01 confidence at 4.6/5 after 4 milestones (M1-M4) each with production-scale benchmarks	Direct analysis — third-brain H-NDR-FEDERATION-01-evidence.md	2026-04-06	2026-10-06

status

PRODUCTION

last-verified

2026-04-06

evidence-tier

applies-to-signals

revalidation-trigger

model-version-migration

audit-always-fetch

generated-docs-no-drift-gate

revalidate-by

2026-10-22

Evidence-Based Revalidation: Hypothesis-Driven Confidence Tracking

Evidence Tier: A — Direct observation of hypothesis validation across third-brain and spoke repos

Purpose

This document analyzes the pattern of re-running tests, benchmarks, and validations before making claims or presenting results — specifically the hypothesis confidence scoring system used across the portfolio. The key insight: confidence scores are not opinions; they are audit trails tied to specific technical events, with explicit statements of what remains unvalidated.

The Revalidation Pattern

Hypothesis → Milestones → Confidence → Demo

Hypothesis formulated (confidence 0.5-1.0)
    │
    ▼
Milestone 1: Initial validation (confidence → 2.0-3.0)
    │
    ▼
Milestone N: Production-scale benchmark (confidence → 4.0-4.5)
    │
    ▼
Revalidation event: Re-run benchmarks with current code (confidence → 4.5-5.0)
    │
    ▼
Demo/presentation: Confidence scores cited as evidence with remaining gaps explicit

Confidence as Audit Trail

Each confidence increment is tied to a specific technical event and explicitly states remaining gaps:

Confidence	What It Means	Required Evidence
1.0-2.0	Hypothesis plausible, no implementation	Literature review, competitor analysis
2.0-3.0	Initial validation against known data	Ground truth detection, POC working
3.0-4.0	Multi-milestone validation	Production-scale benchmarks passing
4.0-4.5	All dimensions validated	Automated end-to-end pipeline, LLM integration
4.5-5.0	External validation pending	Peer review, customer validation, blind assessment

Case Study: H-CONFIG-01 Confidence Progression

Date	Confidence	Event	Remaining Gap
2026-04-04	3.0/5	Hoosier ground truth: 12/12 known issues detected	Fleet Manager API access unknown
2026-04-04	3.5/5	Blocker resolved: health-inventory already collects config via LogScale	POC Phases 2-4 not yet run
2026-04-04	4.0/5	POC complete: adapter + engine + remediation validated	LLM remediation untested
2026-04-04	4.5/5	LLM remediation: 5 findings generated successfully (433.5s, Gemma 4 31B)	Suricata telemetry extraction missing
2026-04-06	4.7/5	Suricata YAML extraction: all 5 dimensions fully automated	PS engineer blind review pending

Critical property: Each row explicitly states what was NOT validated ("Remaining Gap"). A confidence score of 4.7/5 does not mean "almost certain" — it means "all automated dimensions validated, external peer review outstanding."

Case Study: H-NDR-FEDERATION-01

Milestone	Confidence	Evidence	Method
M1: OCSF Pipeline	3.5/5	20M events, 74 fields, 25 compliance checks	Production-scale data generation + validation
M2: Federation Benchmark	4.0/5	15/15 queries pass, all < 10s	15-query benchmark suite (`d2_benchmark_suite.py`)
M3: WAN Bandwidth	4.3/5	93-99.9% reduction across all scenarios	Centralized vs. federated transfer measurement
M4: Competitive Parity	4.6/5	Path to match ExtraHop in 3.5-6.5 weeks	Competitive analysis + OCSF mapping assessment

Revalidation Before Demo

The AI Stakeholder Forum Meeting 2 demo script shows how revalidation works in practice. Before presenting results, each deliverable is re-verified:

TME MCP: 33 patterns, 26 playbooks, 67 tests — re-run test suite before demo
Config Assessment: Confidence 4.7/5, all 5 dimensions — re-run against current fleet data
MNDR + TME Integration: 3-phase enrichment — validate Inspector + Investigator + TME Playbook all responding
Federated Query: 15/15 queries — re-run benchmark suite against current data

Why revalidate: Code changes between validation and presentation can break previously-passing results. A benchmark that passed on 2026-04-04 may fail on 2026-04-07 if a dependency updated. The revalidation step catches regressions before they become false claims in a presentation.

Scheduled Revalidation

Cross-repo dependency monitoring (from third-brain scheduled_tasks.json) runs weekday mornings:

Checks corelight-inspector for upstream changes
Alerts if tool signatures or schemas changed
Prevents silent integration failures between revalidation events

This is continuous revalidation — not triggered by milestones, but by the passage of time and upstream changes.

Pre-Commit Drift Gates: Revalidation as a Commit Hook

Both patterns above — revalidate dates and scheduled cron checks — trust a human to act later. A drift gate moves the check to commit time and fails the commit when a generated or derived value no longer matches its cited source. It closes the "stale claim cited in a presentation" failure this document already names, mechanically rather than by discipline.

One production instance (a research knowledge base) runs three modes off a single generator:

A --check-staged gate comparing each generated claim to its source, deadband 0, blocking the commit on any divergence.
A --divergence report (not a gate) flagging a retired value still presented as current on reader-facing "product" surfaces.
A --worklist <CLAIM_ID> feedforward listing every doc that cites a claim, so a source change propagates — the intra-repo complement to the cross-repo cascade in Cross-Project Synchronization.

The detail that matters for Claude Code specifically: git does not version .git/hooks/, so the gate needs a tracked hook source plus an install-hooks.sh — the same coordination problem Harness Engineering raises for hooks-as-enforcement (re-run the installer after editing the hook).

One finding from validating this gate against a gold set: keep the deterministic oracle above any LLM "gate skill" verdict, and adding more skill rounds did not help. That is a Tier-B production confirmation of the ablation evidence in Harness Engineering ("verifiers hurt, self-evolution helps") and the self-evaluation-rationalization caveat in Confidence Scoring — let the deterministic check decide, not a stacked panel of model judges.

Evidence tier: B — single production project. Karpathy's "coding is the ideal self-improvement loop because verification is built-in" is the nearest external anchor, but it is about the agent's own loop, not the project-as-loop framing, so don't stretch it. The scored work-selection surface this same project runs (ranking a backlog by a weighting formula) is deliberately not generalized here — it is project-specific prioritization that dates with the project, not a portable Claude Code discipline.

Case Study: Model Migration as Revalidation Trigger (Opus 4.6 → 4.7, April 2026)

The Opus 4.7 release on 2026-04-16 is a canonical revalidation trigger that does not fit the usual milestone pattern. No code changed; no benchmark ran; the only change was the underlying model's prompt-interpretation behavior. Yet claims validated on 4.6 can no longer be cited without re-verification.

Why This Is a Revalidation Event

The Anthropic migration guide states: "Claude Opus 4.7 interprets prompts more literally and explicitly than Claude Opus 4.6... It will not silently generalize an instruction from one item to another." This means any prompt, skill, or CLAUDE.md instruction that depended on 4.6's willingness to infer intent may now produce a silent no-op on 4.7 — the output looks plausible, but the actual work was never performed.

Silent no-ops are the worst class of regression: no exception, no failed assertion, no visible symptom. Only revalidation catches them.

Revalidation Table (this repo's own claims)

Claim	4.6 status	4.7 revalidation approach	Gap
CLAUDE.md instructions followed ~80% of the time	Validated (Boris Cherny, March 2026)	Re-run sample on 4.7 with same CLAUDE.md; measure whether literal interpretation raises or lowers rate	Not yet measured
References in CLAUDE.md trigger file reads	Advisory loading worked on 4.6	Mechanical enforcement required on 4.7 (PreToolUse hook, explicit Read step)	See progressive-disclosure
Implicit subagent dispatch ("execute in parallel") spawns subagents	Validated on 4.6	Migration guide confirms 4.7 spawns fewer by default — explicit dispatch required	See Model Migration Anti-Patterns
16 occurrences of "Opus 4.5/4.6" across analysis/	Current as of prior release	Each needs either revalidation on 4.7 or explicit historical framing	Tracked in PLAN.md

Pattern Generalization

Any model release is a revalidation trigger for:

Prompt-sensitivity claims — literal vs. inferred interpretation can shift
Subagent-dispatch claims — default spawning behavior can shift
Verbosity claims — response-length calibration can shift
Tool-use frequency claims — default tool-call behavior can shift

Unlike code-driven revalidation (a benchmark script you can re-run), model-driven revalidation requires side-by-side behavior comparison: run the same prompt on the old and new version, diff outputs, flag divergences. This is a new revalidation method worth formalizing.

Integration with Evidence Tiers

The revalidation pattern connects to the Evidence Tiers system:

Evidence Tier	Revalidation Requirement
Tier A (primary observation)	Re-run benchmark/test with current code before citing
Tier B (expert practitioner)	Verify claim still holds for current version
Tier C (industry report)	Check publication date; flag if > 6 months old
Tier D (opinion/anecdote)	Do not cite without corroborating evidence

The measurement-claims frontmatter in each analysis document includes revalidate dates — explicit expiration timestamps for claims. This is revalidation built into the document format.

Anti-Patterns

Anti-Pattern	Symptom	Fix
Citing stale benchmarks	"15/15 pass" from last month, but code changed since	Re-run benchmark suite before citing; include revalidation date
Confidence without remaining gaps	"4.7/5 confidence" with no mention of what's unvalidated	Every confidence score must state what remains
Demo without revalidation	Presenter discovers failures live	Revalidation step in demo prep checklist
One-time validation	"It worked in March" as permanent proof	Scheduled revalidation or `revalidate` dates in frontmatter
Revalidation as honor-system	`revalidate-by` dates pass silently; stale generated values ship	Pre-commit drift gate comparing generated docs to source; feedforward worklist so a source edit propagates
Model version drift	Claim validated on 4.6 cited after 4.7 release	Re-verify on new model; prompt behavior changes silently

Sources

Tier A (Direct Production Observation)

H-CONFIG-01 confidence history (April 2026) — 5 revalidation events over 3 days, 3.0→4.7/5
H-NDR-FEDERATION-01 milestone validation (April 2026) — 4 milestones with production-scale benchmarks
AI Stakeholder Forum Meeting 2 demo script (April 2026) — 4 deliverables with pre-demo revalidation

Related Analysis

Evidence Tiers — Dual-tier classification system for claims
Confidence Scoring — Assessment framework for research hypotheses
Automated Config Assessment — H-CONFIG-01 as primary revalidation case study
Federated Query Architecture — H-NDR-FEDERATION-01 as milestone revalidation case study
Model Migration Anti-Patterns — Opus 4.6 → 4.7 as a revalidation trigger; six prompt anti-patterns to audit on each release

Last updated: April 2026

Related (from graph)

analysis/automated-config-assessment.md [EXTRACTED (1.00) ×2] — references
analysis/confidence-scoring.md [EXTRACTED (1.00)] — references
analysis/session-quality-tools.md [EXTRACTED (1.00)] — references
AUDIT-CONTEXT.md [EXTRACTED (1.00)] — references
INDEX.md [EXTRACTED (1.00)] — references
analysis/behavioral-insights.md [EXTRACTED (1.00)] — references
analysis/agent-principles.md [EXTRACTED (1.00)] — conceptually_related_to

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evidence-Based Revalidation: Hypothesis-Driven Confidence Tracking

Purpose

The Revalidation Pattern

Hypothesis → Milestones → Confidence → Demo

Confidence as Audit Trail

Case Study: H-CONFIG-01 Confidence Progression

Case Study: H-NDR-FEDERATION-01

Revalidation Before Demo

Scheduled Revalidation

Pre-Commit Drift Gates: Revalidation as a Commit Hook

Case Study: Model Migration as Revalidation Trigger (Opus 4.6 → 4.7, April 2026)

Why This Is a Revalidation Event

Revalidation Table (this repo's own claims)

Pattern Generalization

Integration with Evidence Tiers

Anti-Patterns

Sources

Tier A (Direct Production Observation)

Related Analysis

Related (from graph)

FilesExpand file tree

evidence-based-revalidation.md

Latest commit

History

evidence-based-revalidation.md

File metadata and controls

Evidence-Based Revalidation: Hypothesis-Driven Confidence Tracking

Purpose

The Revalidation Pattern

Hypothesis → Milestones → Confidence → Demo

Confidence as Audit Trail

Case Study: H-CONFIG-01 Confidence Progression

Case Study: H-NDR-FEDERATION-01

Revalidation Before Demo

Scheduled Revalidation

Pre-Commit Drift Gates: Revalidation as a Commit Hook

Case Study: Model Migration as Revalidation Trigger (Opus 4.6 → 4.7, April 2026)

Why This Is a Revalidation Event

Revalidation Table (this repo's own claims)

Pattern Generalization

Integration with Evidence Tiers

Anti-Patterns

Sources

Tier A (Direct Production Observation)

Related Analysis

Related (from graph)