| Field | Value |
|---|---|
| version-requirements | |
| version-last-verified | 2026-04-06 |
| measurement-claims | |
| status | PRODUCTION |
| last-verified | 2026-04-06 |
| evidence-tier | A |
| applies-to-signals | |
| revalidate-by | 2026-10-22 |

Evidence Tier: A — Direct observation of hypothesis validation across third-brain and spoke repos
This document analyzes the pattern of re-running tests, benchmarks, and validations before making claims or presenting results — specifically the hypothesis confidence scoring system used across the portfolio. The key insight: confidence scores are not opinions; they are audit trails tied to specific technical events, with explicit statements of what remains unvalidated.
Hypothesis formulated (confidence 0.5-1.0)
│
▼
Milestone 1: Initial validation (confidence → 2.0-3.0)
│
▼
Milestone N: Production-scale benchmark (confidence → 4.0-4.5)
│
▼
Revalidation event: Re-run benchmarks with current code (confidence → 4.5-5.0)
│
▼
Demo/presentation: Confidence scores cited as evidence with remaining gaps explicit
Each confidence increment is tied to a specific technical event and explicitly states remaining gaps:
| Confidence | What It Means | Required Evidence |
|---|---|---|
| 1.0-2.0 | Hypothesis plausible, no implementation | Literature review, competitor analysis |
| 2.0-3.0 | Initial validation against known data | Ground truth detection, POC working |
| 3.0-4.0 | Multi-milestone validation | Production-scale benchmarks passing |
| 4.0-4.5 | All dimensions validated | Automated end-to-end pipeline, LLM integration |
| 4.5-5.0 | External validation pending | Peer review, customer validation, blind assessment |
| Date | Confidence | Event | Remaining Gap |
|---|---|---|---|
| 2026-04-04 | 3.0/5 | Hoosier ground truth: 12/12 known issues detected | Fleet Manager API access unknown |
| 2026-04-04 | 3.5/5 | Blocker resolved: health-inventory already collects config via LogScale | POC Phases 2-4 not yet run |
| 2026-04-04 | 4.0/5 | POC complete: adapter + engine + remediation validated | LLM remediation untested |
| 2026-04-04 | 4.5/5 | LLM remediation: 5 findings generated successfully (433.5s, Gemma 4 31B) | Suricata telemetry extraction missing |
| 2026-04-06 | 4.7/5 | Suricata YAML extraction: all 5 dimensions fully automated | PS engineer blind review pending |
Critical property: Each row explicitly states what was NOT validated ("Remaining Gap"). A confidence score of 4.7/5 does not mean "almost certain" — it means "all automated dimensions validated, external peer review outstanding."
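The "no score without a stated gap" rule can be enforced structurally rather than by convention. A minimal sketch, assuming a hypothetical `ConfidenceEvent` record (the class and its field names are illustrative and are not taken from the portfolio's tooling):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConfidenceEvent:
    """One row of a confidence history (hypothetical schema)."""
    when: date
    score: float          # on the 0-5 scale used in the tables above
    event: str            # the technical event that justified the change
    remaining_gap: str    # what is still NOT validated

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 5.0:
            raise ValueError(f"score {self.score} is outside 0-5")
        if not self.event.strip():
            raise ValueError("a score must cite a technical event")
        if not self.remaining_gap.strip():
            raise ValueError("a score must state what remains unvalidated")

    def cite(self) -> str:
        """Render the score together with its evidence and its gap."""
        return f"{self.score}/5 ({self.event}; gap: {self.remaining_gap})"


# Final row of the confidence history table above.
latest = ConfidenceEvent(
    when=date(2026, 4, 6),
    score=4.7,
    event="Suricata YAML extraction: all 5 dimensions fully automated",
    remaining_gap="PS engineer blind review pending",
)
print(latest.cite())
```

Because construction fails when `remaining_gap` is blank, a bare "4.7/5" cannot enter the history in the first place.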
| Milestone | Confidence | Evidence | Method |
|---|---|---|---|
| M1: OCSF Pipeline | 3.5/5 | 20M events, 74 fields, 25 compliance checks | Production-scale data generation + validation |
| M2: Federation Benchmark | 4.0/5 | 15/15 queries pass, all < 10s | 15-query benchmark suite (d2_benchmark_suite.py) |
| M3: WAN Bandwidth | 4.3/5 | 93-99.9% reduction across all scenarios | Centralized vs. federated transfer measurement |
| M4: Competitive Parity | 4.6/5 | Path to match ExtraHop in 3.5-6.5 weeks | Competitive analysis + OCSF mapping assessment |
The AI Stakeholder Forum Meeting 2 demo script shows how revalidation works in practice. Before presenting results, each deliverable is re-verified:
- TME MCP: 33 patterns, 26 playbooks, 67 tests — re-run test suite before demo
- Config Assessment: Confidence 4.7/5, all 5 dimensions — re-run against current fleet data
- MNDR + TME Integration: 3-phase enrichment — validate Inspector + Investigator + TME Playbook all responding
- Federated Query: 15/15 queries — re-run benchmark suite against current data
Why revalidate: Code changes between validation and presentation can break previously-passing results. A benchmark that passed on 2026-04-04 may fail on 2026-04-07 if a dependency updated. The revalidation step catches regressions before they become false claims in a presentation.
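The pre-demo step can be sketched as a small runner that re-executes the evidence behind each claim and lists the claims that no longer hold. This is an illustration under stated assumptions: `Deliverable`, `revalidate`, and the `recheck` callables are hypothetical names, and the real checks (test suite, benchmark suite, service pings) would replace the stand-in lambdas.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Deliverable:
    name: str
    claim: str                     # what the presentation will assert
    recheck: Callable[[], bool]    # re-runs the evidence against current code


def revalidate(deliverables: list[Deliverable]) -> list[str]:
    """Re-run every check; return the names whose claims no longer hold."""
    failed = []
    for d in deliverables:
        try:
            ok = d.recheck()
        except Exception:
            ok = False  # a check that crashes cannot support a claim
        if not ok:
            failed.append(d.name)
    return failed


# Stand-in checks for illustration only; these are not real results.
demo = [
    Deliverable("deliverable-a", "suite passes", lambda: True),
    Deliverable("deliverable-b", "service responds", lambda: False),
]
print(revalidate(demo))
```

Treating an exception as a failure is deliberate: a check that cannot run provides no evidence, so the claim should not be presented.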
Cross-repo dependency monitoring (from third-brain scheduled_tasks.json) runs weekday mornings:
- Checks corelight-inspector for upstream changes
- Alerts if tool signatures or schemas changed
- Prevents silent integration failures between revalidation events
This is continuous revalidation — not triggered by milestones, but by the passage of time and upstream changes.
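One way to detect "tool signatures or schemas changed" is to fingerprint the upstream schemas and compare against the last recorded state. The sketch below is an assumption about how such a check could work, not a description of the scheduled task itself; the function names and the shape of the schema dictionaries are hypothetical.

```python
import hashlib
import json


def fingerprint(tool_schemas: dict) -> str:
    """Stable hash of tool schemas; key order does not affect the result."""
    canonical = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tools(previous: dict, current: dict) -> list[str]:
    """Names of tools that were added, removed, or whose schema changed."""
    names = set(previous) | set(current)
    return sorted(n for n in names if previous.get(n) != current.get(n))


# Hypothetical schemas recorded on two different mornings.
yesterday = {"search": {"params": ["query"]}, "fetch": {"params": ["id"]}}
today = {"search": {"params": ["query", "limit"]}, "fetch": {"params": ["id"]}}

if fingerprint(yesterday) != fingerprint(today):
    print("upstream changed:", changed_tools(yesterday, today))
```

The fingerprint gives a cheap equality test for the daily run; `changed_tools` is only needed when the fingerprints differ and an alert has to name what moved.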
Both patterns above — revalidate dates and scheduled cron checks — trust a human to act later. A drift gate moves the check to commit time and fails the commit when a generated or derived value no longer matches its cited source. It closes the "stale claim cited in a presentation" failure this document already names, mechanically rather than by discipline.
One production instance (a research knowledge base) runs three modes off a single generator:
- A `--check-staged` gate comparing each generated claim to its source, deadband 0, blocking the commit on any divergence.
- A `--divergence` report (not a gate) flagging a retired value still presented as current on reader-facing "product" surfaces.
- A `--worklist <CLAIM_ID>` feedforward listing every doc that cites a claim, so a source change propagates. This is the intra-repo complement to the cross-repo cascade in Cross-Project Synchronization.
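The gate and worklist modes listed above can be sketched as follows. This is a minimal illustration, not the production generator: it assumes claims are numeric values keyed by claim ID, and the function names and data shapes are hypothetical.

```python
import sys


def check_staged(generated: dict[str, float], source: dict[str, float],
                 deadband: float = 0.0) -> list[str]:
    """Claim IDs whose generated value diverges from source beyond the deadband."""
    drifted = []
    for claim_id, value in generated.items():
        if claim_id not in source:
            drifted.append(claim_id)  # a cited claim with no source is drift
        elif abs(value - source[claim_id]) > deadband:
            drifted.append(claim_id)
    return sorted(drifted)


def worklist(claim_id: str, citations: dict[str, set[str]]) -> list[str]:
    """Every doc citing claim_id (citations maps doc path to claim IDs)."""
    return sorted(doc for doc, ids in citations.items() if claim_id in ids)


def gate(generated: dict[str, float], source: dict[str, float]) -> int:
    """Exit status for a pre-commit hook: non-zero blocks the commit."""
    drifted = check_staged(generated, source)
    for claim_id in drifted:
        print(f"drift: {claim_id}", file=sys.stderr)
    return 1 if drifted else 0
```

With a deadband of 0, any difference at all fails the gate, which matches the "blocking the commit on any divergence" behavior. A hook script would call `sys.exit(gate(generated, source))` after loading both sides.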
The detail that matters for Claude Code specifically: Git does not version `.git/hooks/`, so the gate needs a tracked hook source plus an `install-hooks.sh`. This is the same coordination problem Harness Engineering raises for hooks-as-enforcement (re-run the installer after editing the hook).
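The installer's job is small enough to sketch. The source uses a shell script; the version below is a Python stand-in kept in the same language as the other examples in this document, and the tracked hook path, function names, and hook name are assumptions.

```python
import shutil
import stat
from pathlib import Path


def install_hook(repo_root: Path, tracked_hook: Path,
                 name: str = "pre-commit") -> Path:
    """Copy a tracked hook script into .git/hooks/ and mark it executable."""
    target = repo_root / ".git" / "hooks" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(tracked_hook, target)
    mode = target.stat().st_mode
    target.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return target


def hook_is_current(repo_root: Path, tracked_hook: Path,
                    name: str = "pre-commit") -> bool:
    """True when the installed hook matches the tracked source byte for byte."""
    target = repo_root / ".git" / "hooks" / name
    return target.exists() and target.read_bytes() == tracked_hook.read_bytes()
```

`hook_is_current` addresses the coordination problem directly: after the tracked hook is edited, it returns `False` until the installer is re-run, so a stale installed hook can be detected instead of silently enforcing old rules.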
One finding from validating this gate against a gold set: the deterministic oracle should outrank any LLM "gate skill" verdict, and adding more skill rounds did not improve results. That is a Tier-B production confirmation of the ablation evidence in Harness Engineering ("verifiers hurt, self-evolution helps") and of the self-evaluation-rationalization caveat in Confidence Scoring: let the deterministic check decide, not a stacked panel of model judges.
Evidence tier: B — single production project. Karpathy's "coding is the ideal self-improvement loop because verification is built-in" is the nearest external anchor, but it is about the agent's own loop, not the project-as-loop framing, so don't stretch it. The scored work-selection surface this same project runs (ranking a backlog by a weighting formula) is deliberately not generalized here — it is project-specific prioritization that dates with the project, not a portable Claude Code discipline.
The Opus 4.7 release on 2026-04-16 is a canonical revalidation trigger that does not fit the usual milestone pattern. No code changed; no benchmark ran; the only change was the underlying model's prompt-interpretation behavior. Yet claims validated on 4.6 can no longer be cited without re-verification.
The Anthropic migration guide states: "Claude Opus 4.7 interprets prompts more literally and explicitly than Claude Opus 4.6... It will not silently generalize an instruction from one item to another." This means any prompt, skill, or CLAUDE.md instruction that depended on 4.6's willingness to infer intent may now produce a silent no-op on 4.7 — the output looks plausible, but the actual work was never performed.
Silent no-ops are the worst class of regression: no exception, no failed assertion, no visible symptom. Only revalidation catches them.
| Claim | 4.6 status | 4.7 revalidation approach | Gap |
|---|---|---|---|
| CLAUDE.md instructions followed ~80% of the time | Validated (Boris Cherny, March 2026) | Re-run sample on 4.7 with same CLAUDE.md; measure whether literal interpretation raises or lowers rate | Not yet measured |
| References in CLAUDE.md trigger file reads | Advisory loading worked on 4.6 | Mechanical enforcement required on 4.7 (PreToolUse hook, explicit Read step) | See progressive-disclosure |
| Implicit subagent dispatch ("execute in parallel") spawns subagents | Validated on 4.6 | Migration guide confirms 4.7 spawns fewer by default — explicit dispatch required | See Model Migration Anti-Patterns |
| 16 occurrences of "Opus 4.5/4.6" across analysis/ | Current as of prior release | Each needs either revalidation on 4.7 or explicit historical framing | Tracked in PLAN.md |
Any model release is a revalidation trigger for:
- Prompt-sensitivity claims — literal vs. inferred interpretation can shift
- Subagent-dispatch claims — default spawning behavior can shift
- Verbosity claims — response-length calibration can shift
- Tool-use frequency claims — default tool-call behavior can shift
Unlike code-driven revalidation (a benchmark script you can re-run), model-driven revalidation requires side-by-side behavior comparison: run the same prompt on the old and new version, diff outputs, flag divergences. This is a new revalidation method worth formalizing.
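The side-by-side method can be sketched as below. The model callables are placeholders for whatever client invokes each version (no real API is assumed), and the similarity threshold is an arbitrary illustrative default.

```python
import difflib
from typing import Callable


def compare_models(prompts: list[str],
                   old_model: Callable[[str], str],
                   new_model: Callable[[str], str],
                   threshold: float = 0.9) -> list[dict]:
    """Run each prompt on both versions; report prompts whose outputs diverge."""
    divergences = []
    for prompt in prompts:
        old_out, new_out = old_model(prompt), new_model(prompt)
        ratio = difflib.SequenceMatcher(None, old_out, new_out).ratio()
        if ratio < threshold:
            diff = "\n".join(difflib.unified_diff(
                old_out.splitlines(), new_out.splitlines(),
                fromfile="old", tofile="new", lineterm=""))
            divergences.append(
                {"prompt": prompt, "similarity": ratio, "diff": diff})
    return divergences
```

Textual similarity is a crude proxy. Model output varies between runs even on one version, and a silent no-op can produce text that looks similar while the work was never done, so flagged prompts are candidates for manual review or for a task-specific assertion (for example, checking that an expected file was actually read or written).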
The revalidation pattern connects to the Evidence Tiers system:
| Evidence Tier | Revalidation Requirement |
|---|---|
| Tier A (primary observation) | Re-run benchmark/test with current code before citing |
| Tier B (expert practitioner) | Verify claim still holds for current version |
| Tier C (industry report) | Check publication date; flag if > 6 months old |
| Tier D (opinion/anecdote) | Do not cite without corroborating evidence |
The measurement-claims frontmatter in each analysis document includes revalidate dates — explicit expiration timestamps for claims. This is revalidation built into the document format.
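The expiry rules above reduce to two date comparisons. A minimal sketch, assuming the frontmatter date has already been parsed into a `date`; the function names are hypothetical, and 183 days is an assumed stand-in for "6 months".

```python
from datetime import date, timedelta


def is_expired(revalidate_by: date, today: date) -> bool:
    """A claim is stale once its revalidate-by date has passed."""
    return today > revalidate_by


def tier_c_is_old(published: date, today: date,
                  max_age_days: int = 183) -> bool:
    """Flag Tier C sources older than roughly six months."""
    return today - published > timedelta(days=max_age_days)


# revalidate-by value from this document's frontmatter.
deadline = date(2026, 10, 22)
print(is_expired(deadline, date(2026, 10, 22)))  # False: still valid on the day
print(is_expired(deadline, date(2026, 10, 23)))  # True: one day past
```

Passing `today` explicitly rather than calling `date.today()` inside the functions keeps the check deterministic and easy to test.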
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Citing stale benchmarks | "15/15 pass" from last month, but code changed since | Re-run benchmark suite before citing; include revalidation date |
| Confidence without remaining gaps | "4.7/5 confidence" with no mention of what's unvalidated | Every confidence score must state what remains |
| Demo without revalidation | Presenter discovers failures live | Revalidation step in demo prep checklist |
| One-time validation | "It worked in March" as permanent proof | Scheduled revalidation or revalidate dates in frontmatter |
| Revalidation as honor-system | `revalidate-by` dates pass silently; stale generated values ship | Pre-commit drift gate comparing generated docs to source; feedforward worklist so a source edit propagates |
| Model version drift | Claim validated on 4.6 cited after 4.7 release | Re-verify on new model; prompt behavior changes silently |
- H-CONFIG-01 confidence history (April 2026) — 5 revalidation events over 3 days, 3.0→4.7/5
- H-NDR-FEDERATION-01 milestone validation (April 2026) — 4 milestones with production-scale benchmarks
- AI Stakeholder Forum Meeting 2 demo script (April 2026) — 4 deliverables with pre-demo revalidation
- Evidence Tiers — Dual-tier classification system for claims
- Confidence Scoring — Assessment framework for research hypotheses
- Automated Config Assessment — H-CONFIG-01 as primary revalidation case study
- Federated Query Architecture — H-NDR-FEDERATION-01 as milestone revalidation case study
- Model Migration Anti-Patterns — Opus 4.6 → 4.7 as a revalidation trigger; six prompt anti-patterns to audit on each release
Last updated: April 2026