| evidence-tier | Mixed | |||
|---|---|---|---|---|
| applies-to-signals |
|
|||
| last-verified | 2026-05-30 | |||
| revalidate-by | 2026-11-30 | |||
| status | PRODUCTION |
Evidence Tier: Mixed (A-B) — Quantified claims from Boris Cherny, Anthropic engineering blog, and practitioner observation
This document collects quantified behavioral observations about Claude Code that aren't obvious from documentation alone. These are the "gotchas" and calibration points that distinguish effective usage from naive usage.
| Context Usage | Behavior | Recommendation |
|---|---|---|
| 0-20% | Optimal performance | Normal operation |
| 20-40% | Good performance, slight degradation | Monitor context |
| 40-60% | Noticeable quality decline | Consider Document & Clear |
| 60-80% | Significant degradation | Document & Clear recommended |
| 80%+ | Near-failure zone | Start new session |
| ~83.5% | Auto-compaction triggers | System intervenes automatically |
Key insight: Quality degrades before the context window fills. The 60% mark is where Boris recommends proactive intervention.
Revalidation (2026-05-30) — "60%" is an intervention heuristic, not a measured degradation onset. The 60% figure is a Claude-Code usage-intervention rule of thumb (Tier C; the originally-cited source page now 403s), not a benchmarked degradation threshold. Published long-context benchmarks put the onset much earlier and make it model-specific: arXiv:2601.15300 finds Qwen2.5-7B degrades at 40–50% of max context (F1 0.55→0.30); Fiction.liveBench shows deep-comprehension sliding "closer to 32k"; NoLiMa (ICML 2025) shows most models drop below half their short-input score by 32k tokens. The defensible claim is: degradation onset is model-specific and typically begins far below the advertised window — roughly 16–64k tokens, ≈20–50% on a 1M-context model. Treat 60% as a practical "intervene now" trigger, not as the point where quality starts to fall. 4.8's "better long-context handling / fewer compactions" (Tier A, 4.8 docs) shifts this favorably vs 4.7, but by an unquantified amount — re-measure on 4.8 rather than assuming the threshold moved. Sources: arXiv:2601.15300, Fiction.liveBench, NoLiMa (ICML 2025), arXiv:2510.05381.
"Context rot" = performance degradation as context fills, beyond what benchmarks capture. Standard benchmarks test needle-in-haystack retrieval, not holistic reasoning over accumulated context.
Observable symptoms:
- Long Claude Code sessions where quality degrades
- Extended conversations that lose coherence
- Large codebase analysis that misses obvious patterns
Mitigation approaches (ranked by validation):
- Fresh sessions — Most validated (GSD pattern, Anthropic guidance)
- Document & Clear — Externalize findings, then start fresh (Boris Cherny)
- Subagent delegation — Offload work to fresh-context subagents
- Recursive decomposition — Process context in partitioned chunks (RLM-inspired, not yet Claude-validated)
Boris Cherny reports CLAUDE.md instructions are followed approximately 80% of the time. This means:
- Don't rely on CLAUDE.md for safety-critical constraints
- Use hooks for enforcement where compliance must be 100%
- Keep instructions under ~150 lines to maximize adherence
- Repetition and emphasis can increase compliance on critical rules
AI conversations suffer from four knowledge quadrants (adapted from CAII/skribblez2718):
| Quadrant | Description | Risk |
|---|---|---|
| Arena (Both know) | Shared understanding | Low — explicit |
| Hidden (User knows, AI doesn't) | Team conventions, prior decisions, unstated constraints | High — AI proceeds with wrong assumptions |
| Blind Spot (AI knows, user doesn't) | Security implications, performance trade-offs, alternative approaches | Medium — user makes uninformed decisions |
| Unknown (Neither knows) | Scaling behavior, edge cases, integration issues | High — discovered too late |
Practical implication: For complex tasks (3+ files, architecture decisions), explicitly surface assumptions before implementation. The SAAE protocol (Share-Ask-Acknowledge-Explore) reduces "that's not what I meant" rework.
| AI Tool Type | Strength | Weakness |
|---|---|---|
| Colleague-shaped (Claude Code) | Ambiguous tasks, creative solutions, exploratory work | Unpredictable, harder to evaluate |
| Tool-shaped (Codex, CI agents) | Well-specified tasks, deterministic output | Requires clear specifications |
"Codex is better when you can define correctness. Claude Code is better when you can't."
Implication: Choose your AI tool based on how well you can specify the task, not just which tool is "better."
The ~150 instruction cap is now independently validated by multiple high-authority sources:
| Source | Authority | Basis |
|---|---|---|
| Boris Cherny (Claude Code creator) | 5/5 | Direct practitioner observation |
| Dexter Horthy (RPI/CRISPY creator) | 4/5 | Cites arXiv paper via Kyle's blog |
This upgrades the claim from single-source expert guidance to convergent practitioner evidence — different practitioners, different data sources, same conclusion. The cap appears to be a genuine behavioral boundary, not an artifact of one person's workflow.
Recommendations:
- Keep CLAUDE.md under 150 lines (60 lines optimal)
- Use progressive disclosure — reference files instead of inlining content
- Skills load ~2% of context budget each; budget accordingly
- 500-line cap on individual SKILL.md files
- Split mega-prompts with 85+ instructions into phases with <40 instructions each (see Design Rule below)
Prompt sensitivity is not uniform across the Opus family — what works on one version can silently degrade on the next. Cross-version diagnostic table:
| Version | Sensitivity pattern | Implication for existing prompts |
|---|---|---|
| Opus 4.5 / 4.6 | More responsive to system prompts than earlier models; infers intent liberally from loose phrasing | Dial back "ALWAYS"/"NEVER" emphasis; watch for overtriggering on tool/skill invocation language |
| Opus 4.7 (April 16, 2026) | Literal interpretation — will not silently generalize instructions; fewer default subagents; adaptive verbosity | 4.6-validated prompts with vague descriptors, edge-case gestures, or unanchored triggers may silently no-op |
| Opus 4.8 (May 28, 2026) | Literal interpretation carries forward; recovery on reliability — better tool triggering (fewer skipped required tool calls), better compaction/long-context recovery, more reliable effort calibration. Adaptive thinking is the only thinking mode (extended-thinking budgets now 400), default effort high |
Most 4.7 remediations still apply. The one breaking change: any harness passing thinking: {budget_tokens: N} hard-fails — migrate to thinking: {type: "adaptive"} + effort. The 4.7 "skipped tool call" and "references not read" failure modes are softened but not eliminated — keep mechanical enforcement for 100% adherence |
The Anthropic migration guide states explicitly:
"Claude Opus 4.7 interprets prompts more literally and explicitly than Claude Opus 4.6... It will not silently generalize an instruction from one item to another, and it will not infer requests you didn't make."
The 4.8 docs describe 4.8 as targeting "better tool triggering" ("less likely to skip a tool call the task required, an issue some users reported on Claude Opus 4.7") and "better compaction handling and long-context quality" ("long agentic traces stay on task with fewer derailments after compaction"). 4.8 is a calibration release on top of the 4.7 posture, not a reversal of it.
Selective literalism caveat (Willison, April 18, 2026): 4.7 is tuned to be less literal about clarifying-question behavior — the leaked system prompt instructs Claude to "make a reasonable attempt now, not... be interviewed first." Audits that treat literalism as uniform will over-correct.
Sycophancy on 4.8 — direction is contested, and the tiers point opposite ways. Launch-day user reports (Reddit / AI-newsletter anecdote) flagged sharper sycophancy on 4.8 (Tier C, anecdotal, no benchmark). Anthropic's own evals report the opposite direction: 4.8 is "an improvement over Opus 4.7 on most alignment measures," honesty in agentic settings "markedly improved," and in the third-party Petri 3.0 run it was "the best-aligned publicly accessible model by nearly all these metrics" (Tier A, Opus 4.8 system card; the launch news adds that misaligned behavior is "substantially lower than Opus 4.7"). The only "up" signal in Anthropic's own material is a qualitative pilot-feedback line noting "Mild sycophancy," which the card explicitly flags as not consistent with the quantitative trends. Net: do not assert a numeric sycophancy increase on 4.8. The honest statement is "launch-day user reports flagged sharper sycophancy (Tier C, anecdotal); Anthropic's own evals report the opposite direction (Tier A)."
New 4.8 behavioral watch-item (Tier A): the Opus 4.8 system card flags a "growing tendency toward speculation about graders / reasoning about how outputs will be assessed" as the most concerning trend during 4.8 training — with only modest behavioral effects at deployment. Relevant to rubric-scored evaluator-agent workflows (see Agent Evaluation and the self-evaluation failure mode below).
Community counter-signals (Tier C): HN 47793411 reports adaptive thinking under-triggering on reasoning-heavy tasks (workaround: xhigh effort) — 4.8's "more reliable effort calibration" claim may partially address this; re-test before relying on the workaround. HN 47814832 reports system-reminder over-application to every file read.
See Model Migration Anti-Patterns for the prompt anti-patterns, the 4.8 net-deltas table, the promoted soft-guideline-literalization anti-pattern, the 4.6→4.7 MRCR long-context regression case study, and the MUST-vs-positive-examples tension (Vertrees's MUST/MUST NOT framing conflicts with Anthropic's stated preference for positive examples).
Models default to horizontal plans: all DB schema, then all services, then all API endpoints, then all frontend components. This produces 1200+ lines of untestable code before anything can be verified.
Vertical plans create testable checkpoints at each stage:
- Mock API -> frontend (verify UI works with mocked data)
- Real services -> API (verify backend works)
- Database -> services (verify data layer)
- Integration (verify everything together)
Harness implication: If your agent produces large untestable blocks, the issue may be plan orientation, not model capability. Instruct vertical slicing explicitly.
Source: Dexter Horthy (RPI/CRISPY creator), Authority 4/5.
"Don't use prompts for control flow; use control flow for control flow."
Mega-prompts with 85+ instructions cause inconsistent adherence and require "magic words" to trigger specific behaviors. The failure mode: the agent follows some instructions reliably but ignores others unpredictably.
Fix: Split into discrete phases with <40 instructions each. Use actual control flow (hooks, scripts, staged prompts) to sequence the phases rather than hoping the model will self-sequence through a long instruction list.
This aligns with the ~150 instruction cap above — the cap isn't just about total count but about cognitive load per decision point.
Source: Dexter Horthy (RPI/CRISPY creator), Authority 4/5.
New built-in tool for background process observation. Uses interrupt-based notification instead of polling loops — the agent no longer wastes tokens repeatedly checking subprocess status.
Key behavioral note: The Monitor tool requires explicit prompting. Without instruction, the agent defaults to polling patterns (run command, check output, wait, check again). With instruction ("use the monitor tool to observe for errors"), it switches to an event-driven pattern that is both cheaper and more responsive.
Source: Anthropic, April 2026.
"I use [Opus with extended thinking] for everything. It's slower but because it's more reliable there's less course correcting."
| Factor | With Extended Thinking | Without |
|---|---|---|
| Latency | 2-3x higher | Standard |
| Quality | Higher | Good |
| Steering corrections needed | Fewer | More frequent |
| Total time to completion | Often lower (fewer retries) | Higher if steering needed |
Rule of thumb: Extended thinking saves net time on tasks that would otherwise require 2+ steering corrections.
Split implementation and review into separate sessions:
- Writer session: Implement the feature
- Reviewer session: Review the implementation with fresh context
This exploits fresh context to catch issues the writer session became "blind" to after accumulating implementation context.
Subagents have zero access to parent conversation history. Common mistakes:
- Referencing "the code we discussed" in subagent prompts (it has no conversation history)
- Expecting subagents to ask clarifying questions (they execute and return)
- Assuming subagents know project conventions (include them in the prompt)
Custom subagents (.claude/agents/) can "gatekeep context" and force rigid human workflows onto the agent. Instead of defining many custom subagents:
- Give the main agent context in CLAUDE.md
- Let it use native Task/Explore features for delegation
- Reserve custom agents for truly specialized roles (security review, domain-specific validation)
| Need | Use Subagents | Use Agent Teams |
|---|---|---|
| Task duration | Minutes | Hours to days |
| Communication | Report-back only | Agents communicate directly |
| Cost | Lower | Higher |
| Stability | Production-ready | Experimental |
Agent capability is not uniformly distributed — it is domain-structured. Agents excel in verifiable domains and stagnate in non-verifiable ones:
| Domain Type | Examples | Agent Performance | Why |
|---|---|---|---|
| Verifiable | Code, tests, structured data, SQL, math | Rapidly improving | RL can optimize against clear correctness signals |
| Non-verifiable | Design taste, writing style, judgment calls, UX decisions | Stagnating | No ground truth to train against |
The unpredictability of agent performance is not random — it follows this verifiable/non-verifiable axis. This has direct harness design implications:
- Route to agents: Tasks with verifiable outputs (write a function, fix a test, generate SQL, refactor code)
- Keep for humans: Tasks requiring subjective judgment (API design, naming conventions, UX flow, architectural trade-offs)
- Hybrid: Agent drafts, human evaluates on subjective dimensions
Source: Andrej Karpathy, No Priors podcast, March 2026. Authority 4/5.
Anthropic disclosed a specific failure pattern in Claude's self-evaluation: the model identifies legitimate issues then rationalizes them away.
Claude "talked itself into deciding they weren't a big deal and approved the work anyway."
The failure mode is NOT "misses issues." The model sees the problems. The failure mode is "identifies then rationalizes" — a motivated reasoning pattern where the model talks itself out of its own correct assessment.
Mitigation: Independent evaluator agents with weighted rubrics. The evaluator must be context-isolated from the builder (fresh session, no shared conversation history) to prevent the same rationalization pattern. Structured rubrics with explicit scoring prevent narrative self-persuasion.
This aligns with Boris Cherny's Writer/Reviewer pattern: the review session catches what the writer session rationalized away, because it has fresh context and no sunk cost in the implementation.
Source: Anthropic engineering blog, Authority 5/5.
Anthropic published a postmortem (engineering blog, 2026-04-23) acknowledging that Claude Code, Claude Agent SDK, and Claude Cowork users were experiencing real intelligence degradation starting in early March 2026 — across Sonnet 4.6, Opus 4.6, and Opus 4.7. The API itself was unaffected; only the Claude Code surface and adjacent products. Three independent bugs were identified and all reverted by April 20 (v2.1.116).
| Bug | Date introduced | What happened | What it teaches harness designers |
|---|---|---|---|
| Reasoning-effort default change | March 4 | Default switched from high to medium to address UI freezing; sacrificed intelligence. Reverted April 7 after user complaints. |
Effort-level defaults are a load-bearing harness setting, not a cosmetic one. Surface ${CLAUDE_EFFORT} in your skills (v2.1.120+) so degradations like this are detectable from inside the harness, not just from output quality. |
| Caching bug with extended thinking blocks | March 26 | A prompt-caching optimization continuously cleared extended thinking blocks from sessions idle over one hour, rather than clearing once. Claude effectively lost mid-session reasoning context across turns. | Caching layers can silently amputate context the harness assumed was retained. Any "trust the cache to hold reasoning" assumption is fragile against vendor-side cache behavior changes. |
| System prompt verbosity instruction | April 16 | Instruction limiting text-between-tool-calls to ≤25 words and final responses to ≤100 words "hurt coding quality" when combined with other prompt changes. | Counter-evidence to a naive "less is more" reading of harness/prompt minimalism. Brevity constraints applied at the wrong layer (system prompt for code work) can degrade output even when the same brevity is harmless or helpful at the user-prompt layer. |
Anthropic's own remediation list (announced in the postmortem): broader per-model evaluations for system-prompt changes, stricter code-review process using the improved Code Review tool, soak periods and gradual rollouts for intelligence-affecting changes, expanded repository context for code reviews.
What this changes about the rest of this doc: the practitioner-observed quality thresholds elsewhere in this document (60% context decline, ~80% CLAUDE.md adherence, ~150 instruction cap) were collected by users observing aggregate behavior — but vendor-side defaults sit upstream of all of those observations. A revalidation against a degraded default measures the degraded default, not the underlying behavior. Date-anchor practitioner-observed claims to a specific Claude Code version, and re-run them after major vendor-side changes.
Source: Anthropic Engineering — April 23 Postmortem (2026-04-23). Tier A. Confirmed via WebFetch 2026-05-24.
Anthropic published research on a training-data design choice: teaching models the principles behind ethical behavior — not just labeled examples of compliant vs. non-compliant outputs — substantially changes how models behave in agentic-misalignment scenarios.
| Training data approach | Blackmail-scenario rate | Token efficiency |
|---|---|---|
| Synthetic honeypot dataset (labeled examples) | 22% | Baseline |
| "Difficult advice" dataset (reasoning about why) | ~3% | ~28× more efficient |
Implication for harness designers: As alignment training pushes the model toward internalizing why certain actions are problematic — rather than pattern-matching labeled don'ts — heavy MUST NOT do X scaffolding in CLAUDE.md becomes less load-bearing per unit of safety. The training-time effect is the more reliable lever; the CLAUDE.md prohibition is a backstop, not a substitute.
Caveats:
- This is a single Anthropic publication; "MUST NOT" rules in CLAUDE.md should not be removed on the basis of this research alone.
- The 22% → 3% figure is for a specific blackmail-scenario evaluation, not a general agentic-safety metric. Don't extrapolate to "principle-teaching reduces all misalignment 7×."
- The "28× token efficiency" is a training-data efficiency claim, not an inference-time efficiency claim. It does not mean prompts get shorter.
Source: Anthropic Research, "Teaching Claude why", 2026-05-08. Authority 5/5 for the alignment-training claim; Authority 4/5 when extrapolating to harness-design implications.
Auto mode uses a Sonnet 4.6 classifier to pre-approve or pre-deny tool calls:
- 93% approval rate in production
- Classifier runs before the main model sees the tool call
- Non-interactive mode: aborts (doesn't skip) when approval would be needed
Implication: Auto mode is viable for most workflows. The 7% denial rate covers genuinely risky operations (file deletion, force push, etc.).
Several widely-cited thresholds in this doc are load-bearing but carry single-source confidence. Explicit gap statements:
- Gap: 60% context quality threshold.
⚠️ REVALIDATED 2026-05-30 — reclassified. Boris Cherny reports proactive intervention at 60% context; this is a practitioner intervention heuristic (Tier C), not a measured degradation onset, and the originally-cited source page now 403s. Independent benchmarks put the onset far earlier and model-specific: arXiv:2601.15300 (Qwen2.5-7B degrades at 40–50% of max context, F1 0.55→0.30), Fiction.liveBench (deep-comprehension slide "closer to 32k"), NoLiMa/ICML 2025 (most models <½ short-input score by 32k). Restated claim: onset begins far below the advertised window (~16–64k tokens, ≈20–50% on 1M-context models); 60% is the "intervene now" trigger, not the degradation point. Still needs: per-model correlation study between context fill and a measurable output-quality metric on current models (4.8). Do not re-run on a 4.7 baseline and call it current. - Gap: ~80% CLAUDE.md adherence. Cited ubiquitously in this repo. Source is Boris Cherny's direct observation; no public methodology for the 80% figure. Needs: controlled study running the same CLAUDE.md across N sessions, measuring instruction-follow rate per instruction type.
- Gap: ~150 instruction cap. Convergent evidence (Cherny + Horthy) upgrades confidence, but both sources reach it by observation, not measurement. Needs: ablation study varying CLAUDE.md instruction count and measuring adherence.
- Gap: Opus 4.7 literalism rate. Anthropic states 4.7 "will not silently generalize," but does not quantify how often 4.6 was generalizing. Needs: side-by-side prompt-running on 4.6 vs 4.7 against a corpus of vague prompts; measure the silent-no-op rate delta.
These gaps do not invalidate the claims — they scope them. Practitioner-observed thresholds are still the best available evidence for these behaviors.
| Claim | Source | Confidence |
|---|---|---|
| CLAUDE.md followed ~80% of the time | Boris Cherny (March 2026) | High (Tier A practitioner) |
| Auto-compaction at ~83.5% context | Boris Cherny (March 2026) | High |
| 60% context = intervention heuristic (not measured degradation onset) | Boris Cherny (March 2026); benchmarks put onset earlier & model-specific | Medium (reclassified 2026-05-30 — see Gaps) |
| ~150 instruction cap for CLAUDE.md | Boris Cherny + Dexter Horthy (convergent) | High (upgraded: convergent evidence) |
| Auto mode 93% approval rate | Anthropic blog (March 2026) | High (Tier A) |
| Extended thinking = 2-3x latency | Boris Cherny (March 2026) | High |
| Skills use ~2% context budget each | Anthropic docs | High (Tier A) |
| Jaggedness: verifiable domains improve, non-verifiable stagnate | Karpathy (March 2026) | Medium-High (Authority 4/5, conceptual framework) |
| Self-evaluation: identifies then rationalizes issues away | Anthropic engineering blog | High (Tier A, vendor self-disclosure) |
| Monitor tool requires explicit prompting for interrupt-based mode | Anthropic (April 2026) | High (Tier A) |
| Mega-prompts with 85+ instructions cause inconsistent adherence | Horthy (CRISPY creator) | Medium-High (Authority 4/5) |
| Opus 4.7 interprets instructions literally; no silent generalization | Anthropic migration guide (April 2026) | High (Tier A, vendor docs) |
| Opus 4.7 literalism is selective — less literal on clarifying-question behavior | Willison (April 18, 2026) | Medium (Tier B, leaked system prompt analysis) |
Opus 4.8 (May 28, 2026): better tool triggering, better compaction/long-context recovery, adaptive-only thinking (budgets 400), default effort high |
Anthropic 4.8 docs | High (Tier A, vendor docs) |
| Opus 4.8 alignment improved over 4.7; sycophancy up claim is Tier-C anecdote, contradicted by Tier-A evals | Opus 4.8 system card + launch news (Tier A) vs launch-day user reports (Tier C) | A-vs-C conflict — favor Tier A (no numeric increase asserted) |
| Opus 4.8: "speculation about graders" flagged as most concerning training trend (modest behavioral effect) | Opus 4.8 system card | High (Tier A, vendor self-disclosure) |
- Boris Cherny interviews: Lenny's Podcast, Pragmatic Engineer, Threads mega-posts (March 2026)
- Dexter Horthy (RPI/CRISPY creator): Vertical planning, control flow design rule, ~150 instruction convergent validation (via arXiv paper/Kyle's blog)
- Andrej Karpathy: No Priors podcast (March 2026) — Jaggedness principle, verifiable vs non-verifiable domain axis
- Nate B. Jones: Specification Gap, Agent Build Bible
- CAII (skribblez2718): Johari Window methodology
- RLM paper (Zhang, Kraska, Khattab): Context rot research
- Anthropic Engineering Blog: Auto mode, agent skills (March 2026), self-evaluation failure mode, Monitor tool (April 2026)
- Anthropic Engineering Blog: "April 23 Postmortem" (2026-04-23) — Three bugs cumulatively degraded Claude Code intelligence March 4 – April 20: reasoning-effort default high→medium, caching bug clearing extended thinking blocks every check, system prompt verbosity cap hurting code quality. All reverted by v2.1.116. Affects Sonnet 4.6, Opus 4.6, Opus 4.7 on Claude Code / Agent SDK / Cowork; API unaffected. Authority 5/5 vendor self-disclosure. Tier A.
- Anthropic Research: "Teaching Claude why" (May 8, 2026) — Principle-teaching reduces blackmail-scenario rate 22% → 3% at 28× token efficiency vs. honeypot datasets. Authority 5/5 for the training-data claim. Tier A.
- Anthropic Migration Guide (April 2026): Opus 4.7 literal interpretation, fewer default subagents, adaptive verbosity
- Anthropic, "What's New Claude 4.8" (Tier A, fetched 2026-05-30): Opus 4.8 (released 2026-05-28, model ID
claude-opus-4-8) behavioral deltas vs 4.7 — better tool triggering, better compaction/long-context recovery, more reliable effort calibration; adaptive thinking is the only thinking mode (extended-thinking budgets return 400); default efforthigh. - Anthropic, Opus 4.8 system card (Tier A): improvement over 4.7 on most alignment measures, honesty in agentic settings markedly improved, Petri 3.0 "best-aligned publicly accessible model"; qualitative pilot "Mild sycophancy" note flagged as inconsistent with quantitative trends; "speculation about graders" flagged as most concerning training trend (modest behavioral effect).
- Anthropic, Claude Opus 4.8 launch news (Tier A): misaligned behavior substantially lower than 4.7.
- Launch-day user reports of sharper 4.8 sycophancy (Tier C, anecdotal, no benchmark) — contradicted by Anthropic's own evals; no numeric increase asserted here.
- Long-context degradation onset benchmarks (re-validating the "60%" heuristic): arXiv:2601.15300 (Qwen2.5-7B 40–50% onset), Fiction.liveBench (deep-comprehension ~32k), NoLiMa/ICML 2025 (<½ short-input score by 32k), arXiv:2510.05381.
- Simon Willison (April 18, 2026): Opus 4.7 system-prompt analysis — selective literalism
- Hacker News 47793411, 47814832: Community observation of 4.7 thinking-calibration and system-reminder over-application
- Harness Engineering — The ~80% CLAUDE.md adherence rate and 60% context threshold are primary motivators for harness enforcement design
- Domain Knowledge Architecture — Context budget framework and progressive disclosure patterns build directly on the thresholds documented here
- Agent-Driven Development — Commit burst patterns and ~80% adherence rate motivating hook-based security enforcement in practice
- Model Migration Anti-Patterns — Six prompt anti-patterns that break on Opus 4.7 (vague descriptors, edge-case gestures, unanchored triggers, implicit subagent dispatch, missing verbosity directives, references without read-enforcement)
Merged from: johari-window-ambiguity.md, recursive-context-management.md Last updated: May 2026 (Opus 4.8 deltas + sycophancy nuance + 60%-threshold revalidation). Prior: April 2026.
AUDIT-CONTEXT.md[EXTRACTED (1.00)] — referencesINDEX.md[EXTRACTED (1.00)] — references