Behavioral Insights: How Claude Code Actually Works

Evidence Tier: Mixed (A-C). Quantified claims from Boris Cherny, the Anthropic engineering blog, and practitioner observation; Tier C items are flagged inline.

Purpose

This document collects quantified behavioral observations about Claude Code that aren't obvious from documentation alone. These are the "gotchas" and calibration points that distinguish effective usage from naive usage.

Context Window Behavior

Capacity Thresholds (Boris Cherny, March 2026)

Context Usage	Behavior	Recommendation
0-20%	Optimal performance	Normal operation
20-40%	Good performance, slight degradation	Monitor context
40-60%	Noticeable quality decline	Consider Document & Clear
60-80%	Significant degradation	Document & Clear recommended
80%+	Near-failure zone	Start new session
~83.5%	Auto-compaction triggers	System intervenes automatically

Key insight: Quality degrades well before the context window fills. Boris Cherny recommends proactive intervention at the 60% mark.
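
The table above can be read as a simple lookup. The sketch below is illustrative only: the function names are hypothetical, the band edges are copied from the table, and treating each upper bound as exclusive is an assumption, since the table's ranges share their endpoints.

```python
# Bands copied from the capacity table; upper bounds are treated as exclusive
# (an assumption, because the table's ranges overlap at 20/40/60/80).
BANDS = [
    (20.0, "Optimal performance", "Normal operation"),
    (40.0, "Good performance, slight degradation", "Monitor context"),
    (60.0, "Noticeable quality decline", "Consider Document & Clear"),
    (80.0, "Significant degradation", "Document & Clear recommended"),
]
AUTO_COMPACT_PCT = 83.5  # approximate auto-compaction trigger from the table


def recommend(context_pct: float) -> tuple[str, str]:
    """Return (behavior, recommendation) for a context-usage percentage."""
    if not 0.0 <= context_pct <= 100.0:
        raise ValueError("context_pct must be between 0 and 100")
    for upper, behavior, action in BANDS:
        if context_pct < upper:
            return behavior, action
    return "Near-failure zone", "Start new session"


def auto_compaction_expected(context_pct: float) -> bool:
    """True once usage reaches the reported auto-compaction trigger."""
    return context_pct >= AUTO_COMPACT_PCT
```

A status-line script or hook could call `recommend` with whatever usage figure the harness exposes; how that figure is obtained is outside this sketch.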

Revalidation (2026-05-30): "60%" is an intervention heuristic, not a measured degradation onset. The 60% figure is a Claude Code usage rule of thumb (Tier C; the originally cited source page now returns 403), not a benchmarked degradation threshold.

Published long-context benchmarks put the onset much earlier and make it model-specific: arXiv:2601.15300 finds Qwen2.5-7B degrades at 40–50% of max context (F1 0.55→0.30); Fiction.liveBench shows deep comprehension sliding "closer to 32k"; NoLiMa (ICML 2025) shows most models drop below half their short-input score by 32k tokens. The defensible claim is that degradation onset is model-specific and typically begins far below the advertised window, at roughly 16–64k tokens. That is about 12–50% of a 128k window but only about 2–6% of a 1M window. Treat 60% as a practical "intervene now" trigger, not as the point where quality starts to fall.

Opus 4.8's "better long-context handling / fewer compactions" (Tier A, 4.8 docs) shifts this favorably versus 4.7, but by an unquantified amount; re-measure on 4.8 rather than assuming the threshold moved. Sources: arXiv:2601.15300, Fiction.liveBench, NoLiMa (ICML 2025), arXiv:2510.05381.

Context Rot (RLM Research, Zhang et al.)

"Context rot" = performance degradation as context fills, beyond what benchmarks capture. Standard benchmarks test needle-in-haystack retrieval, not holistic reasoning over accumulated context.

Observable symptoms:

Long Claude Code sessions where quality degrades
Extended conversations that lose coherence
Large codebase analysis that misses obvious patterns

Mitigation approaches (ranked by validation):

Fresh sessions — Most validated (GSD pattern, Anthropic guidance)
Document & Clear — Externalize findings, then start fresh (Boris Cherny)
Subagent delegation — Offload work to fresh-context subagents
Recursive decomposition — Process context in partitioned chunks (RLM-inspired, not yet Claude-validated)

CLAUDE.md Adherence (~80%)

Boris Cherny reports CLAUDE.md instructions are followed approximately 80% of the time. This means:

Don't rely on CLAUDE.md for safety-critical constraints
Use hooks for enforcement where compliance must be 100%
Keep instructions under ~150 lines to maximize adherence
Repetition and emphasis can increase compliance on critical rules
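
As one way to move a critical rule out of CLAUDE.md and into enforcement, the sketch below shows the decision logic of a PreToolUse-style hook. It assumes the hook receives a JSON payload on stdin with `tool_name` and `tool_input.command`, and that exit code 2 blocks the call and returns stderr to the model; check both against the hooks documentation for your Claude Code version. The deny-list patterns are hypothetical.

```python
import json
import re
import sys

# Hypothetical deny-list; replace with the rules your project must enforce.
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bgit\s+push\b.*--force",
]


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, message). Exit code 2 is assumed to block the call."""
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            return 2, f"Blocked by hook: command matches {pattern!r}"
    return 0, ""


def main(stream=sys.stdin) -> int:
    """Read one hook payload, report any block reason on stderr."""
    code, message = decide(json.load(stream))
    if message:
        print(message, file=sys.stderr)
    return code

# In the hook script itself, finish with: sys.exit(main())
```

Unlike a CLAUDE.md instruction, this check runs on every matching tool call regardless of whether the model remembered the rule.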

Ambiguity and Assumptions

The Johari Window Problem

Knowledge in an AI conversation falls into four quadrants (adapted from CAII/skribblez2718):

Quadrant	Description	Risk
Arena (Both know)	Shared understanding	Low — explicit
Hidden (User knows, AI doesn't)	Team conventions, prior decisions, unstated constraints	High — AI proceeds with wrong assumptions
Blind Spot (AI knows, user doesn't)	Security implications, performance trade-offs, alternative approaches	Medium — user makes uninformed decisions
Unknown (Neither knows)	Scaling behavior, edge cases, integration issues	High — discovered too late

Practical implication: For complex tasks (3+ files, architecture decisions), explicitly surface assumptions before implementation. The SAAE protocol (Share-Ask-Acknowledge-Explore) reduces "that's not what I meant" rework.

Specification Gap (Nate B. Jones, January 2026)

AI Tool Type	Strength	Weakness
Colleague-shaped (Claude Code)	Ambiguous tasks, creative solutions, exploratory work	Unpredictable, harder to evaluate
Tool-shaped (Codex, CI agents)	Well-specified tasks, deterministic output	Requires clear specifications

"Codex is better when you can define correctness. Claude Code is better when you can't."

Implication: Choose your AI tool based on how well you can specify the task, not just which tool is "better."

Instruction Processing

~150 Instruction Cap (Convergent Evidence)

The ~150 instruction cap is now supported by two independent high-authority sources:

Source	Authority	Basis
Boris Cherny (Claude Code creator)	5/5	Direct practitioner observation
Dexter Horthy (RPI/CRISPY creator)	4/5	Cites arXiv paper via Kyle's blog

This upgrades the claim from single-source expert guidance to convergent practitioner evidence — different practitioners, different data sources, same conclusion. The cap appears to be a genuine behavioral boundary, not an artifact of one person's workflow.

Recommendations:

Keep CLAUDE.md under 150 lines (60 lines optimal)
Use progressive disclosure — reference files instead of inlining content
Skills load ~2% of context budget each; budget accordingly
500-line cap on individual SKILL.md files
Split mega-prompts with 85+ instructions into phases with <40 instructions each (see Design Rule below)
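
The line budgets above are easy to check mechanically. This is a minimal sketch, assuming that only non-blank lines count toward the budget (the source does not say); the function names are hypothetical.

```python
OPTIMAL_LINES = 60  # "60 lines optimal" from the recommendations above
MAX_LINES = 150     # "under 150 lines" from the recommendations above


def count_content_lines(text: str) -> int:
    """Count non-blank lines; excluding blank lines is an assumption."""
    return sum(1 for line in text.splitlines() if line.strip())


def check_claude_md(text: str) -> str:
    """Classify a CLAUDE.md body against the line budgets."""
    n = count_content_lines(text)
    if n >= MAX_LINES:
        return f"over budget: {n} lines (limit {MAX_LINES})"
    if n > OPTIMAL_LINES:
        return f"acceptable: {n} lines (optimal is {OPTIMAL_LINES} or fewer)"
    return f"ok: {n} lines"
```

It could be run in CI as `check_claude_md(pathlib.Path("CLAUDE.md").read_text())`, with the same approach applied to SKILL.md files against the 500-line cap.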

Prompt Sensitivity Across Model Versions

Prompt sensitivity is not uniform across the Opus family — what works on one version can silently degrade on the next. Cross-version diagnostic table:

Version	Sensitivity pattern	Implication for existing prompts
Opus 4.5 / 4.6	More responsive to system prompts than earlier models; infers intent liberally from loose phrasing	Dial back "ALWAYS"/"NEVER" emphasis; watch for overtriggering on tool/skill invocation language
Opus 4.7 (April 16, 2026)	Literal interpretation — will not silently generalize instructions; fewer default subagents; adaptive verbosity	4.6-validated prompts with vague descriptors, edge-case gestures, or unanchored triggers may silently no-op
Opus 4.8 (May 28, 2026)	Literal interpretation carries forward, with a recovery on reliability: better tool triggering (fewer skipped required tool calls), better compaction/long-context recovery, and more reliable effort calibration. Adaptive thinking is the only thinking mode (requests that set an extended-thinking budget now return 400); default effort is `high`	Most 4.7 remediations still apply. The one breaking change: any harness passing `thinking: {budget_tokens: N}` hard-fails, so migrate to `thinking: {type: "adaptive"}` plus `effort`. The 4.7 "skipped tool call" and "references not read" failure modes are softened but not eliminated; keep mechanical enforcement where adherence must be 100%

The Anthropic migration guide states explicitly:

"Claude Opus 4.7 interprets prompts more literally and explicitly than Claude Opus 4.6... It will not silently generalize an instruction from one item to another, and it will not infer requests you didn't make."

The 4.8 docs describe 4.8 as targeting "better tool triggering" ("less likely to skip a tool call the task required, an issue some users reported on Claude Opus 4.7") and "better compaction handling and long-context quality" ("long agentic traces stay on task with fewer derailments after compaction"). 4.8 is a calibration release on top of the 4.7 posture, not a reversal of it.
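
For harnesses that still pass a thinking budget, the migration described in the table can be sketched as a small request-rewriting helper. The helper name is hypothetical, and placing `effort` as a top-level request key is an assumption made for illustration; confirm the real parameter location in the Opus 4.8 documentation before relying on it.

```python
import copy


def migrate_thinking(params: dict, effort: str = "high") -> dict:
    """Return a copy of request params with a fixed thinking budget replaced.

    Assumption: `effort` is shown as a top-level key purely for illustration.
    """
    migrated = copy.deepcopy(params)
    thinking = migrated.get("thinking")
    if isinstance(thinking, dict) and "budget_tokens" in thinking:
        # Replace the budgeted form with the adaptive form named in the text.
        migrated["thinking"] = {"type": "adaptive"}
        migrated.setdefault("effort", effort)
    return migrated


old = {"model": "claude-opus-4-8", "thinking": {"type": "enabled", "budget_tokens": 8000}}
new = migrate_thinking(old)
```

Requests without a `thinking` budget pass through unchanged, so the helper can sit in front of every call during a staged migration.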

Selective literalism caveat (Willison, April 18, 2026): 4.7 is tuned to be less literal about clarifying-question behavior — the leaked system prompt instructs Claude to "make a reasonable attempt now, not... be interviewed first." Audits that treat literalism as uniform will over-correct.

Sycophancy on 4.8 — direction is contested, and the tiers point opposite ways. Launch-day user reports (Reddit / AI-newsletter anecdote) flagged sharper sycophancy on 4.8 (Tier C, anecdotal, no benchmark). Anthropic's own evals report the opposite direction: 4.8 is "an improvement over Opus 4.7 on most alignment measures," honesty in agentic settings "markedly improved," and in the third-party Petri 3.0 run it was "the best-aligned publicly accessible model by nearly all these metrics" (Tier A, Opus 4.8 system card; the launch news adds that misaligned behavior is "substantially lower than Opus 4.7"). The only "up" signal in Anthropic's own material is a qualitative pilot-feedback line noting "Mild sycophancy," which the card explicitly flags as not consistent with the quantitative trends. Net: do not assert a numeric sycophancy increase on 4.8. The honest statement is "launch-day user reports flagged sharper sycophancy (Tier C, anecdotal); Anthropic's own evals report the opposite direction (Tier A)."

New 4.8 behavioral watch-item (Tier A): the Opus 4.8 system card flags a "growing tendency toward speculation about graders / reasoning about how outputs will be assessed" as the most concerning trend during 4.8 training — with only modest behavioral effects at deployment. Relevant to rubric-scored evaluator-agent workflows (see Agent Evaluation and the self-evaluation failure mode below).

Community counter-signals (Tier C): HN 47793411 reports adaptive thinking under-triggering on reasoning-heavy tasks (workaround: xhigh effort) — 4.8's "more reliable effort calibration" claim may partially address this; re-test before relying on the workaround. HN 47814832 reports system-reminder over-application to every file read.

See Model Migration Anti-Patterns for the prompt anti-patterns, the 4.8 net-deltas table, the promoted soft-guideline-literalization anti-pattern, the 4.6→4.7 MRCR long-context regression case study, and the MUST-vs-positive-examples tension (Vertrees's MUST/MUST NOT framing conflicts with Anthropic's stated preference for positive examples).

Vertical Planning Principle (Horthy, Authority 4/5)

Models default to horizontal plans: all DB schema, then all services, then all API endpoints, then all frontend components. This produces 1200+ lines of untestable code before anything can be verified.

Vertical plans create testable checkpoints at each stage:

Mock API -> frontend (verify UI works with mocked data)
Real services -> API (verify backend works)
Database -> services (verify data layer)
Integration (verify everything together)

Harness implication: If your agent produces large untestable blocks, the issue may be plan orientation, not model capability. Instruct vertical slicing explicitly.
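
The checkpoint idea can be made concrete with a small runner that refuses to advance past a failing slice. This is a sketch, not Horthy's tooling: `Slice`, `run_vertical_plan`, and the `verify` callables are hypothetical stand-ins for whatever test command each stage uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Slice:
    name: str
    verify: Callable[[], bool]  # hypothetical check, e.g. runs a test suite


def run_vertical_plan(slices: list[Slice]) -> list[str]:
    """Verify slices in order and stop at the first failing checkpoint."""
    passed: list[str] = []
    for s in slices:
        if not s.verify():
            break
        passed.append(s.name)
    return passed


# Stage names follow the list above; the lambdas stand in for real checks.
plan = [
    Slice("Mock API -> frontend", lambda: True),
    Slice("Real services -> API", lambda: False),
    Slice("Database -> services", lambda: True),
    Slice("Integration", lambda: True),
]
completed = run_vertical_plan(plan)
```

Here `completed` contains only the first stage, so the failure is localized to the services layer before any database work is attempted.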

Source: Dexter Horthy (RPI/CRISPY creator), Authority 4/5.

Design Rule: Control Flow, Not Prompts

"Don't use prompts for control flow; use control flow for control flow."

Mega-prompts with 85+ instructions cause inconsistent adherence and require "magic words" to trigger specific behaviors. The failure mode: the agent follows some instructions reliably but ignores others unpredictably.

Fix: Split into discrete phases with <40 instructions each. Use actual control flow (hooks, scripts, staged prompts) to sequence the phases rather than hoping the model will self-sequence through a long instruction list.

This aligns with the ~150 instruction cap above — the cap isn't just about total count but about cognitive load per decision point.
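
The phase-splitting fix can be sketched as follows. `run_agent` is a hypothetical callable that sends one staged prompt and reports success; chunking purely by count is a simplification, since real phases would group related instructions.

```python
from typing import Callable

MAX_PER_PHASE = 39  # "<40 instructions each" from the fix above


def split_into_phases(instructions: list[str], size: int = MAX_PER_PHASE) -> list[list[str]]:
    """Chunk a long instruction list into consecutive phases."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [instructions[i:i + size] for i in range(0, len(instructions), size)]


def run_phases(phases: list[list[str]], run_agent: Callable[[str], bool]) -> int:
    """Run phases sequentially and return how many completed.

    The loop, not the prompt, decides what runs next.
    """
    completed = 0
    for phase in phases:
        prompt = "\n".join(phase)
        if not run_agent(prompt):
            break
        completed += 1
    return completed


phases = split_into_phases([f"rule {i}" for i in range(85)])
```

An 85-instruction prompt becomes three phases here, and a failed phase halts the sequence instead of being silently skipped.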

Source: Dexter Horthy (RPI/CRISPY creator), Authority 4/5.

Monitor Tool (Anthropic, April 2026)

New built-in tool for background process observation. Uses interrupt-based notification instead of polling loops — the agent no longer wastes tokens repeatedly checking subprocess status.

Key behavioral note: The Monitor tool requires explicit prompting. Without instruction, the agent defaults to polling patterns (run command, check output, wait, check again). With instruction ("use the monitor tool to observe for errors"), it switches to an event-driven pattern that is both cheaper and more responsive.
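
The difference between the two patterns is generic and can be illustrated outside Claude Code. The sketch below is not the Monitor tool's implementation; it only shows an event-driven consumer that blocks until the next output line arrives instead of repeatedly re-checking status. The function names and the "ERROR" marker are assumptions.

```python
import subprocess
from typing import Callable, Iterable


def watch(lines: Iterable[str], on_error: Callable[[str], None]) -> int:
    """Call `on_error` for each line containing 'ERROR'; return the count.

    Iteration blocks until the next line is available, so nothing is spent
    re-checking status between events.
    """
    hits = 0
    for line in lines:
        if "ERROR" in line:
            on_error(line.rstrip("\n"))
            hits += 1
    return hits


def watch_process(argv: list[str], on_error: Callable[[str], None]) -> int:
    """Stream a subprocess's stdout through `watch`."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        return watch(proc.stdout, on_error)
```

A polling version would instead run the command, sleep, and re-read the output in a loop, paying for every check even when nothing has changed.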

Source: Anthropic, April 2026.

Thinking and Reasoning

Extended Thinking Trade-offs (Boris Cherny)

"I use [Opus with extended thinking] for everything. It's slower but because it's more reliable there's less course correcting."

Factor	With Extended Thinking	Without
Latency	2-3x higher	Standard
Quality	Higher	Good
Steering corrections needed	Fewer	More frequent
Total time to completion	Often lower (fewer retries)	Higher if steering needed

Rule of thumb: Extended thinking saves net time on tasks that would otherwise require 2+ steering corrections.
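
A back-of-the-envelope model shows where the break-even falls. It assumes every steering correction costs one additional full turn and that the latency multiplier applies to every turn; both are simplifications introduced here, not figures from the source.

```python
def total_time(base_turn: float, corrections: int, latency_multiplier: float = 1.0) -> float:
    """Time for one initial turn plus one extra turn per steering correction."""
    if corrections < 0:
        raise ValueError("corrections must be non-negative")
    return base_turn * latency_multiplier * (1 + corrections)


# One time unit per turn; multipliers span the 2-3x range from the table.
without_thinking = total_time(1.0, corrections=2)                      # 3.0
with_thinking_2x = total_time(1.0, corrections=0, latency_multiplier=2.0)  # 2.0
with_thinking_3x = total_time(1.0, corrections=0, latency_multiplier=3.0)  # 3.0
```

Under these assumptions, a task needing two corrections without thinking takes as long as the slowest (3x) thinking run that needs none, which is consistent with the 2+ corrections rule of thumb.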

Writer/Reviewer Pattern (Boris Cherny, March 2026)

Split implementation and review into separate sessions:

Writer session: Implement the feature
Reviewer session: Review the implementation with fresh context

This exploits fresh context to catch issues the writer session became "blind" to after accumulating implementation context.

Multi-Agent Behavior

Subagent Context Isolation

Subagents have zero access to parent conversation history. Common mistakes:

Referencing "the code we discussed" in subagent prompts (it has no conversation history)
Expecting subagents to ask clarifying questions (they execute and return)
Assuming subagents know project conventions (include them in the prompt)
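
Because the subagent starts with no history, the delegating prompt has to carry everything it needs. A minimal sketch of a prompt builder follows; the function name and section labels are hypothetical, not a Claude Code API.

```python
def build_subagent_prompt(task: str, conventions: list[str], files: dict[str, str]) -> str:
    """Assemble a self-contained prompt: task, conventions, and file contents."""
    sections = [f"Task:\n{task.strip()}"]
    if conventions:
        bullet_list = "\n".join(f"- {c}" for c in conventions)
        sections.append(f"Project conventions:\n{bullet_list}")
    for path, content in files.items():
        # Inline the code instead of referring to "the code we discussed".
        sections.append(f"File {path}:\n{content.rstrip()}")
    sections.append(
        "Do not ask clarifying questions; state any assumptions in your final report."
    )
    return "\n\n".join(sections)


prompt = build_subagent_prompt(
    "Fix the failing test",
    ["Use pytest"],
    {"a.py": "x = 1\n"},
)
```

Each of the three mistakes listed above maps to one section of the generated prompt.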

Custom Subagent Gatekeeping Anti-Pattern (Boris Cherny)

Custom subagents (.claude/agents/) can "gatekeep context" and force rigid human workflows onto the agent. Instead of defining many custom subagents:

Give the main agent context in CLAUDE.md
Let it use native Task/Explore features for delegation
Reserve custom agents for truly specialized roles (security review, domain-specific validation)

Agent Teams vs Subagents (v2.1.32+)

Need	Use Subagents	Use Agent Teams
Task duration	Minutes	Hours to days
Communication	Report-back only	Agents communicate directly
Cost	Lower	Higher
Stability	Production-ready	Experimental

Agent Capability Boundaries

Jaggedness Principle (Karpathy, Authority 4/5)

Agent capability is not uniformly distributed — it is domain-structured. Agents excel in verifiable domains and stagnate in non-verifiable ones:

Domain Type	Examples	Agent Performance	Why
Verifiable	Code, tests, structured data, SQL, math	Rapidly improving	RL can optimize against clear correctness signals
Non-verifiable	Design taste, writing style, judgment calls, UX decisions	Stagnating	No ground truth to train against

The unpredictability of agent performance is not random — it follows this verifiable/non-verifiable axis. This has direct harness design implications:

Route to agents: Tasks with verifiable outputs (write a function, fix a test, generate SQL, refactor code)
Keep for humans: Tasks requiring subjective judgment (API design, naming conventions, UX flow, architectural trade-offs)
Hybrid: Agent drafts, human evaluates on subjective dimensions

Source: Andrej Karpathy, No Priors podcast, March 2026. Authority 4/5.

Poor Self-Evaluation Failure Mode (Anthropic, Authority 5/5)

Anthropic disclosed a specific failure pattern in Claude's self-evaluation: the model identifies legitimate issues then rationalizes them away.

Claude "talked itself into deciding they weren't a big deal and approved the work anyway."

The failure mode is NOT "misses issues." The model sees the problems. The failure mode is "identifies then rationalizes" — a motivated reasoning pattern where the model talks itself out of its own correct assessment.

Mitigation: Independent evaluator agents with weighted rubrics. The evaluator must be context-isolated from the builder (fresh session, no shared conversation history) to prevent the same rationalization pattern. Structured rubrics with explicit scoring prevent narrative self-persuasion.

This aligns with Boris Cherny's Writer/Reviewer pattern: the review session catches what the writer session rationalized away, because it has fresh context and no sunk cost in the implementation.
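
A weighted rubric with explicit scoring can be sketched as below. The criteria, weights, and thresholds are hypothetical. The per-criterion floor is a design choice added here so that one serious issue cannot be averaged away by strong scores elsewhere, which is the numeric analogue of the rationalization pattern.

```python
def rubric_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores, each expected in the range 0..1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[k] * weights[k] for k in weights) / total_weight


def approve(scores: dict[str, float], weights: dict[str, float],
            threshold: float = 0.8, hard_fail: float = 0.5) -> bool:
    """Approve only if the weighted score clears the threshold and no
    single criterion falls below the hard-fail floor."""
    if min(scores.values()) < hard_fail:
        return False
    return rubric_score(scores, weights) >= threshold


weights = {"correctness": 3.0, "tests": 2.0, "style": 1.0}
flawed = {"correctness": 1.0, "tests": 1.0, "style": 0.4}
```

The `flawed` submission has a high weighted score but is still rejected, because the style score falls below the floor.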

Source: Anthropic engineering blog, Authority 5/5.

Vendor-Side Quality Regression Case Study: The April 2026 Postmortem

Anthropic published a postmortem (engineering blog, 2026-04-23) acknowledging that Claude Code, Claude Agent SDK, and Claude Cowork users experienced real intelligence degradation starting in early March 2026, across Sonnet 4.6, Opus 4.6, and Opus 4.7. The API itself was unaffected; only Claude Code and the adjacent products were. Three independent bugs were identified, and all were reverted by April 20 (v2.1.116).

Bug	Date introduced	What happened	What it teaches harness designers
Reasoning-effort default change	March 4	Default switched from `high` to `medium` to address UI freezing; sacrificed intelligence. Reverted April 7 after user complaints.	Effort-level defaults are a load-bearing harness setting, not a cosmetic one. Surface `${CLAUDE_EFFORT}` in your skills (v2.1.120+) so degradations like this are detectable from inside the harness, not just from output quality.
Caching bug with extended thinking blocks	March 26	A prompt-caching optimization continuously cleared extended thinking blocks from sessions idle over one hour, rather than clearing once. Claude effectively lost mid-session reasoning context across turns.	Caching layers can silently amputate context the harness assumed was retained. Any "trust the cache to hold reasoning" assumption is fragile against vendor-side cache behavior changes.
System prompt verbosity instruction	April 16	Instruction limiting text-between-tool-calls to ≤25 words and final responses to ≤100 words "hurt coding quality" when combined with other prompt changes.	Counter-evidence to a naive "less is more" reading of harness/prompt minimalism. Brevity constraints applied at the wrong layer (system prompt for code work) can degrade output even when the same brevity is harmless or helpful at the user-prompt layer.

Anthropic's own remediation list (announced in the postmortem): broader per-model evaluations for system-prompt changes, stricter code-review process using the improved Code Review tool, soak periods and gradual rollouts for intelligence-affecting changes, expanded repository context for code reviews.

What this changes about the rest of this doc: the practitioner-observed quality thresholds elsewhere in this document (60% context decline, ~80% CLAUDE.md adherence, ~150 instruction cap) were collected by users observing aggregate behavior — but vendor-side defaults sit upstream of all of those observations. A revalidation against a degraded default measures the degraded default, not the underlying behavior. Date-anchor practitioner-observed claims to a specific Claude Code version, and re-run them after major vendor-side changes.
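
Date-anchoring can be as simple as storing the version and date next to each claim and comparing against known vendor-side changes. The record shape below is a hypothetical sketch; the change dates are the ones listed in the postmortem table, while the observation date and version string are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    claim: str
    observed_on: date
    claude_code_version: str  # version the observation was made on


def needs_revalidation(obs: Observation, vendor_changes: list[date]) -> bool:
    """True if any vendor-side change landed after the observation date."""
    return any(change > obs.observed_on for change in vendor_changes)


# Dates from the postmortem table: three introductions and the final revert.
changes = [date(2026, 3, 4), date(2026, 3, 26), date(2026, 4, 16), date(2026, 4, 20)]
# Placeholder observation date and version, for illustration only.
obs = Observation("CLAUDE.md adherence ~80%", date(2026, 3, 1), "placeholder-version")
```

An observation recorded before any of the listed changes is flagged, which is the prompt to re-run it on the current version.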

Source: Anthropic Engineering — April 23 Postmortem (2026-04-23). Tier A. Confirmed via WebFetch 2026-05-24.

Principle-Teaching Reduces Agentic Misalignment (Anthropic Research, May 2026)

Anthropic published research on a training-data design choice: teaching models the principles behind ethical behavior — not just labeled examples of compliant vs. non-compliant outputs — substantially changes how models behave in agentic-misalignment scenarios.

Training data approach	Blackmail-scenario rate	Token efficiency
Synthetic honeypot dataset (labeled examples)	22%	Baseline
"Difficult advice" dataset (reasoning about why)	~3%	~28× more efficient

Implication for harness designers: As alignment training pushes the model toward internalizing why certain actions are problematic — rather than pattern-matching labeled don'ts — heavy MUST NOT do X scaffolding in CLAUDE.md becomes less load-bearing per unit of safety. The training-time effect is the more reliable lever; the CLAUDE.md prohibition is a backstop, not a substitute.

Caveats:

This is a single Anthropic publication; "MUST NOT" rules in CLAUDE.md should not be removed on the basis of this research alone.
The 22% → 3% figure is for a specific blackmail-scenario evaluation, not a general agentic-safety metric. Don't extrapolate to "principle-teaching reduces all misalignment 7×."
The "28× token efficiency" is a training-data efficiency claim, not an inference-time efficiency claim. It does not mean prompts get shorter.

Source: Anthropic Research, "Teaching Claude why", 2026-05-08. Authority 5/5 for the alignment-training claim; Authority 4/5 when extrapolating to harness-design implications.

Auto Mode Behavior (v2.1.84+)

Two-Stage Classifier

Auto mode uses a Sonnet 4.6 classifier to pre-approve or pre-deny tool calls:

93% approval rate in production
Classifier runs before the main model sees the tool call
Non-interactive mode: aborts (doesn't skip) when approval would be needed

Implication: Auto mode is viable for most workflows. The 7% denial rate covers genuinely risky operations (file deletion, force push, etc.).
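
Because non-interactive mode aborts when approval would be needed, the per-call approval rate compounds over a run. The sketch below assumes every tool call is approved independently at the same 93% rate, which is a simplification: the published figure is an aggregate, and workloads made up mostly of safe calls should fare better.

```python
def p_run_completes(n_calls: int, approval_rate: float = 0.93) -> float:
    """Probability that all n tool calls are pre-approved, assuming each call
    is approved independently at the same rate (a simplifying assumption)."""
    if n_calls < 0:
        raise ValueError("n_calls must be non-negative")
    if not 0.0 <= approval_rate <= 1.0:
        raise ValueError("approval_rate must be between 0 and 1")
    return approval_rate ** n_calls


ten_call_run = p_run_completes(10)  # roughly 0.48 under these assumptions
```

Under this model, a ten-call unattended run finishes without hitting an approval gate a little less than half the time, so long non-interactive runs are worth designing around the gated operations.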

Gaps

Several widely-cited thresholds in this doc are load-bearing but carry single-source confidence. Explicit gap statements:

Gap: 60% context quality threshold. ⚠️ REVALIDATED 2026-05-30 and reclassified. Boris Cherny reports proactive intervention at 60% context; this is a practitioner intervention heuristic (Tier C), not a measured degradation onset, and the originally cited source page now returns 403. Independent benchmarks put the onset far earlier and make it model-specific: arXiv:2601.15300 (Qwen2.5-7B degrades at 40–50% of max context, F1 0.55→0.30), Fiction.liveBench (deep-comprehension slide "closer to 32k"), NoLiMa/ICML 2025 (most models below half their short-input score by 32k). Restated claim: onset begins far below the advertised window, at roughly 16–64k tokens (about 12–50% of a 128k window, but only about 2–6% of a 1M window); 60% is the "intervene now" trigger, not the degradation point. Still needs: per-model correlation study between context fill and a measurable output-quality metric on current models (4.8). Do not re-run on a 4.7 baseline and call it current.
Gap: ~80% CLAUDE.md adherence. Cited ubiquitously in this repo. Source is Boris Cherny's direct observation; no public methodology for the 80% figure. Needs: controlled study running the same CLAUDE.md across N sessions, measuring instruction-follow rate per instruction type.
Gap: ~150 instruction cap. Convergent evidence (Cherny + Horthy) upgrades confidence, but both sources reach it by observation, not measurement. Needs: ablation study varying CLAUDE.md instruction count and measuring adherence.
Gap: Opus 4.7 literalism rate. Anthropic states 4.7 "will not silently generalize," but does not quantify how often 4.6 was generalizing. Needs: side-by-side prompt-running on 4.6 vs 4.7 against a corpus of vague prompts; measure the silent-no-op rate delta.

These gaps do not invalidate the claims — they scope them. Practitioner-observed thresholds are still the best available evidence for these behaviors.

Quantified Claims Summary

Claim	Source	Confidence
CLAUDE.md followed ~80% of the time	Boris Cherny (March 2026)	High (Tier A practitioner)
Auto-compaction at ~83.5% context	Boris Cherny (March 2026)	High
60% context = intervention heuristic (not measured degradation onset)	Boris Cherny (March 2026); benchmarks put onset earlier & model-specific	Medium (reclassified 2026-05-30 — see Gaps)
~150 instruction cap for CLAUDE.md	Boris Cherny + Dexter Horthy (convergent)	High (upgraded: convergent evidence)
Auto mode 93% approval rate	Anthropic blog (March 2026)	High (Tier A)
Extended thinking = 2-3x latency	Boris Cherny (March 2026)	High
Skills use ~2% context budget each	Anthropic docs	High (Tier A)
Jaggedness: verifiable domains improve, non-verifiable stagnate	Karpathy (March 2026)	Medium-High (Authority 4/5, conceptual framework)
Self-evaluation: identifies then rationalizes issues away	Anthropic engineering blog	High (Tier A, vendor self-disclosure)
Monitor tool requires explicit prompting for interrupt-based mode	Anthropic (April 2026)	High (Tier A)
Mega-prompts with 85+ instructions cause inconsistent adherence	Horthy (CRISPY creator)	Medium-High (Authority 4/5)
Opus 4.7 interprets instructions literally; no silent generalization	Anthropic migration guide (April 2026)	High (Tier A, vendor docs)
Opus 4.7 literalism is selective — less literal on clarifying-question behavior	Willison (April 18, 2026)	Medium (Tier B, leaked system prompt analysis)
Opus 4.8 (May 28, 2026): better tool triggering, better compaction/long-context recovery, adaptive-only thinking (extended-thinking budgets return 400), default effort `high`	Anthropic 4.8 docs	High (Tier A, vendor docs)
Opus 4.8 alignment improved over 4.7; the claim of increased sycophancy is a Tier C anecdote contradicted by Tier A evals	Opus 4.8 system card + launch news (Tier A) vs launch-day user reports (Tier C)	A-vs-C conflict; favor Tier A (no numeric increase asserted)
Opus 4.8: "speculation about graders" flagged as most concerning training trend (modest behavioral effect)	Opus 4.8 system card	High (Tier A, vendor self-disclosure)

Sources

Boris Cherny interviews: Lenny's Podcast, Pragmatic Engineer, Threads mega-posts (March 2026)
Dexter Horthy (RPI/CRISPY creator): Vertical planning, control flow design rule, ~150 instruction convergent validation (via arXiv paper/Kyle's blog)
Andrej Karpathy: No Priors podcast (March 2026) — Jaggedness principle, verifiable vs non-verifiable domain axis
Nate B. Jones: Specification Gap, Agent Build Bible
CAII (skribblez2718): Johari Window methodology
RLM paper (Zhang, Kraska, Khattab): Context rot research
Anthropic Engineering Blog: Auto mode, agent skills (March 2026), self-evaluation failure mode, Monitor tool (April 2026)
Anthropic Engineering Blog: "April 23 Postmortem" (2026-04-23) — Three bugs cumulatively degraded Claude Code intelligence March 4 – April 20: reasoning-effort default high→medium, caching bug clearing extended thinking blocks every check, system prompt verbosity cap hurting code quality. All reverted by v2.1.116. Affects Sonnet 4.6, Opus 4.6, Opus 4.7 on Claude Code / Agent SDK / Cowork; API unaffected. Authority 5/5 vendor self-disclosure. Tier A.
Anthropic Research: "Teaching Claude why" (May 8, 2026) — Principle-teaching reduces blackmail-scenario rate 22% → 3% at 28× token efficiency vs. honeypot datasets. Authority 5/5 for the training-data claim. Tier A.
Anthropic Migration Guide (April 2026): Opus 4.7 literal interpretation, fewer default subagents, adaptive verbosity
Anthropic, "What's New Claude 4.8" (Tier A, fetched 2026-05-30): Opus 4.8 (released 2026-05-28, model ID claude-opus-4-8) behavioral deltas vs 4.7 — better tool triggering, better compaction/long-context recovery, more reliable effort calibration; adaptive thinking is the only thinking mode (extended-thinking budgets return 400); default effort high.
Anthropic, Opus 4.8 system card (Tier A): improvement over 4.7 on most alignment measures, honesty in agentic settings markedly improved, Petri 3.0 "best-aligned publicly accessible model"; qualitative pilot "Mild sycophancy" note flagged as inconsistent with quantitative trends; "speculation about graders" flagged as most concerning training trend (modest behavioral effect).
Anthropic, Claude Opus 4.8 launch news (Tier A): misaligned behavior substantially lower than 4.7.
Launch-day user reports of sharper 4.8 sycophancy (Tier C, anecdotal, no benchmark) — contradicted by Anthropic's own evals; no numeric increase asserted here.
Long-context degradation onset benchmarks (re-validating the "60%" heuristic): arXiv:2601.15300 (Qwen2.5-7B 40–50% onset), Fiction.liveBench (deep-comprehension ~32k), NoLiMa/ICML 2025 (<½ short-input score by 32k), arXiv:2510.05381.
Simon Willison (April 18, 2026): Opus 4.7 system-prompt analysis — selective literalism
Hacker News 47793411, 47814832: Community observation of 4.7 thinking-calibration and system-reminder over-application

Related Analysis

Harness Engineering — The ~80% CLAUDE.md adherence rate and 60% context threshold are primary motivators for harness enforcement design
Domain Knowledge Architecture — Context budget framework and progressive disclosure patterns build directly on the thresholds documented here
Agent-Driven Development — Commit burst patterns and ~80% adherence rate motivating hook-based security enforcement in practice
Model Migration Anti-Patterns — Six prompt anti-patterns that break on Opus 4.7 (vague descriptors, edge-case gestures, unanchored triggers, implicit subagent dispatch, missing verbosity directives, references without read-enforcement)

Merged from: johari-window-ambiguity.md, recursive-context-management.md

Last updated: May 2026 (Opus 4.8 deltas + sycophancy nuance + 60%-threshold revalidation). Prior: April 2026.

Related (from graph)

AUDIT-CONTEXT.md [EXTRACTED (1.00)] — references
INDEX.md [EXTRACTED (1.00)] — references
