feat: input-side token-saving upgrades + eval quality gate#579
feat: input-side token-saving upgrades + eval quality gate#579edubraqd wants to merge 14 commits into
Conversation
…ed reinforcement, one-shot nudge) In agentic Claude Code the weekly limit is dominated by INPUT (context replayed every turn), not output. These Tier-1 changes trim the input caveman itself adds and the structural output that compounds into input: - skills/caveman/SKILL.md: new `## Agentic Loop` section — result-first, no recap of files just read, plan once then deltas, one intent clause per tool batch, no redundant confirmations. Cross-references Auto-Clarity so security / irreversible / multi-step work stays full prose. - caveman-mode-tracker.js: emit the per-turn reinforcement every 3rd turn (1,4,7, ...) plus any turn that (re)activates the mode, instead of every turn. The counter lives in a separate flag file so the injected string stays byte-stable and keeps hitting the prompt cache. caveman-activate.js resets it each session. - caveman-activate.js: make the statusline-setup nudge one-shot via a .caveman-nudged marker, instead of re-appending ~90 tokens to the cached SessionStart prefix every session for users who never wire up the statusline. - caveman-help SKILL.md: document /caveman-compress on memory files as the headline input-token lever, plus statusline setup. - tests/test_hook_output.js: cover cadence, byte-stable cache-safe payloads, and the one-shot nudge. Update checksums.sha256. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
measure.py only counts tokens, which is gameable (a skill replying `k` scores -99% and "wins"). Add the missing correctness axis so a compression change is accepted only if it saves tokens AND stays correct. - evals/prompts/rubrics.json: per-prompt binary fact checklist, risk class, and verbatim invariants (tokens that must survive compression, e.g. TCP/EXPLAIN/ rebase) for every prompt in en.txt. - evals/judge.py: scores each (arm, prompt) answer — LLM judge (temp 0, optional --runs majority) for facts + deterministic verbatim check — into fidelity.json. Fails closed on unparseable judge output. - evals/gate.py: pure/offline two-gate decision. TOKEN gate (savings don't regress past a noise floor) AND QUALITY gate (mean fidelity drop <= tol, no prompt past its risk band's hard limit — normal 10pt, high 0pt — and every verbatim invariant holds). High-risk prompts (e.g. git-rebase history warning) allow zero fidelity loss. - tests/test_eval_gate.py: offline unit tests of the gate logic (no LLM/tiktoken), incl. the canonical "reply k to everything is REJECTED". - .github/workflows/eval-gate.yml: runs the offline gate self-test + token measurement on PRs touching SKILL.md / rules / evals. - evals/README.md: document the fidelity axis, the two-gate rule, and calibration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…(Tier 2) In agentic Claude Code the weekly token limit is dominated by INPUT — every turn re-sends every prior tool result. caveman's output compression never touches those bytes. This adds a PostToolUse hook that, when enabled, replaces oversized built-in tool results (Read/Bash/Grep/Glob) with a trimmed version via the public `updatedToolOutput` field, before they ever cost context. - src/hooks/caveman-trim-tool-result.js: pure, deterministic transform — strips ANSI/CR/whitespace noise (lossless), then for still-huge text keeps head+tail with a re-run marker. Never touches JSON-ish results. Fails open (emits nothing on any error, so the original result passes through). Determinism keeps the re-sent result byte-stable so the prompt cache stays warm. - OFF by default: the hook exits immediately unless CAVEMAN_TRIM_TOOL_RESULTS=1. Threshold tunable via CAVEMAN_TRIM_THRESHOLD (default 8000 chars). - bin/install.js --with-trim: wires a PostToolUse(Read|Bash|Grep|Glob) entry in settings.json, independent of the plugin (which registers no PostToolUse), so it never double-fires. Not wired by the plugin manifest — avoids a per-tool hook spawn for users who don't opt in. - bin/lib/settings.js: addCommandHook now supports an optional `matcher`; caveman-trim-tool-result.js added to MANAGED_HOOK_BASENAMES. - tests/test_trim_tool_result.js: 15 tests — determinism, lossless stripping, head/tail elision, JSON passthrough, token preservation, fail-open, env gate. - src/hooks/README.md: document opt-in wiring + runtime enable. checksums updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nt output
Locate-shaped work ("where is X", "what calls Y", "list uses", "map dir") is
where delegation pays off most: the verbose Grep/Read runs in the subagent and
never enters main context — only a ~60%-smaller path:line map returns. Nudge the
main thread toward it and bound the result so a delegation can't itself blow up.
- caveman-mode-tracker.js: on investigation-shaped prompts (tightly-scoped verb
regex), append an advisory nudge to prefer cavecrew-investigator over inline
Grep/Read. Assembled alongside the cadence-gated reinforcement as byte-stable
segments (no per-turn-varying token -> cache-safe). Advisory only — the model
still skips it for a one-line lookup.
- agents/cavecrew-investigator.md / cavecrew-reviewer.md: cap output at 25 rows/
findings with a "+N more" line, so a huge match set returns bounded and the
main thread knows it was capped.
- skills/cavecrew/SKILL.md: document the locate default + the row cap.
- tests/test_hook_output.js: cover the nudge (fires on locate prompts incl.
off-cadence, silent otherwise). checksums updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- evals/snapshots/fidelity.json: baseline correctness snapshot for the caveman arm — all 10 prompts at 100% fidelity, all verbatim invariants held (incl. the high-risk git-rebase "rewrites history" warning). Judged inline by Opus because judge.py's `claude -p` path returns 401 in this environment; regenerate with evals/judge.py once that auth is available. Pairs with the committed results.json. - evals/gate.py: replace the check/cross/em-dash glyphs in the report with ASCII so it doesn't crash on a Windows cp1252 console (UnicodeEncodeError). Verified end-to-end: gate baseline-vs-self ACCEPTs (token savings ~50%, 0 fidelity drop). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…(Tier 4) - skills/caveman/SKILL.md: strengthen the `ultra` level to collapse multi-sentence explanations into a single causal chain and cut transition sentences (not just abbreviate words), keeping the verbatim code-symbol/error-string exemption. - caveman-stats.js: /caveman-stats now reports the input vs output token split — caveman compresses output, but in agentic use the weekly limit is dominated by input (the whole context is re-sent every turn). Reframes the win and points at the input-side levers (/caveman-compress, the opt-in trim hook). Also pin number formatting to en-US so output is deterministic regardless of system locale (a pt-BR locale rendered "1.000" and broke the existing savings test). - tests/test_caveman_stats.js: cover the split (shown with input data, omitted without). checksums regenerated from the committed (LF) blobs. Deferred with reasons: aggressive auto-intensity (trivial->ultra) needs the Tier-3 gate to validate empirically first; caveman-shrink tools/call result compression is low-impact (MCP-proxy users only) and risks corrupting structured data — the Tier-2 trim hook already covers built-in tool results. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the inline-Opus baseline with a real judge.py run (claude-haiku-4-5, temp 0): the caveman arm scores 100% fidelity on all 10 prompts, every verbatim invariant held (incl. the high-risk git-rebase "rewrites history" warning). Per-fact verdicts now come from the judge model rather than inline judgment — same result as the inline pass, confirming it. Gate baseline-vs-self ACCEPTs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…uirement Two Windows-user bugs surfaced while regenerating fidelity.json: - run_claude hardcoded ["claude", ...]; on Windows the CLI is a .cmd shim that CreateProcess can't launch without a shell, so judge.py died before any judge call. Resolve via shutil.which (prefers claude.exe via PATHEXT) and route a .cmd/.bat shim through `cmd /c` — verified it resolves the local npm shim. - the docs said `uv run python evals/judge.py`, but judge.py is pure stdlib and the repo ships no pyproject.toml / uv. Use plain `python`; note tiktoken is needed only for gate.py/measure.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
caveman-shrink only compressed tools/list descriptions before. Add an opt-in (CAVEMAN_SHRINK_RESULTS=1) pass over tools/call result content[].text: a LOSSLESS strip of ANSI/terminal noise + whitespace only — never the prose compress() (that would corrupt data the model acts on). JSON-ish text and non-text content (images/resources) are skipped; deterministic so the re-sent result stays cache-stable. Tests cover the strip, the no-prose-compress guarantee, JSON passthrough, and determinism. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
/caveman auto lets caveman pick the intensity per answer instead of a fixed level: trivial fact/yes-no -> ultra fragments, routine explain/fix -> full, design/security/irreversible/multi-step -> full prose (Auto-Clarity wins). Opt-in, so it never surprises a user who set a fixed level — no behaviour change by default, hence no eval-gate dependency. - caveman-config.js: add `auto` to VALID_MODES (so /caveman auto and the flag round-trip; the mode-tracker slash handler already routes any valid level). - skills/caveman/SKILL.md: `auto` intensity row + frontmatter; caveman-help: row. - caveman-statusline.sh/.ps1: whitelist `auto` so the badge renders [CAVEMAN:AUTO]. - tests/test_hook_output.js: /caveman auto writes the flag. checksums updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ot just strings
The Bash tool_response is an object ({stdout, stderr, ...}); the string-only
guard made the trim hook a no-op for Bash. extractText() now pulls stdout (or
output/content/text/result) from an object, trims it, re-appends a short stderr
tail, and returns a string updatedToolOutput. Objects with no text field and
JSON-ish text still pass through.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… Code
Verified empirically on a live session (logging wrapper on the PostToolUse hook):
1. Claude Code does NOT apply a string updatedToolOutput when tool_response is a
structured object — and every built-in tool returns one (Bash {stdout,...},
Read {type,file}, Grep {mode,content,...}, Glob likewise). The hook fires and
emits a valid trimmed updatedToolOutput, but CC shows the original unchanged.
So it's a no-op for all four matched tools, not just Bash.
2. CC already persists oversized results natively (~30KB -> disk + ~2KB preview),
more aggressively than this hook would.
The hook code is correct + unit-tested; the limitation is in the PostToolUse API
surface. Left OFF by default. README note added so users/maintainers know its
real-world effect is ~0 until a future CC honors updatedToolOutput for objects
(which would require returning the object shape with a trimmed text field).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Heads-up on item 3 (the opt-in PostToolUse trim hook)Verified empirically on a live Claude Code session (logging wrapper on the hook): the trim hook currently has ~no real-world effect, for two reasons:
The hook code is correct and unit-tested; the limitation is in the PostToolUse API surface, not the implementation. It ships OFF by default, so it harms nothing — but it won't save tokens on current Claude Code. A doc note was added ( The other items in this PR (agentic-loop discipline, gated reinforcement, one-shot nudge, eval quality gate, cavecrew locate routing, structural ultra, /caveman-stats input/output split) are unaffected. |
…urce) PostToolUse can't trim built-in results (CC honors output replacement for MCP tools only); PreToolUse updatedInput IS applied to the real call (verified live). caveman-bound-tool-input.js caps an unbounded Read's `limit` (maxResultSizeChars is Infinity = no native protection) and an oversized Grep head_limit, at the source. Opt-in (--with-bound + CAVEMAN_BOUND_TOOL_INPUT=1). 13 tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lever)
The weekly token limit is dominated by cache_read: the entire conversation
context is re-sent on EVERY turn, so a long session that never resets is the
true burn (a 17k-turn session re-sending a ~500K context = billions of
cache_read tokens, far more than any output the caveman style saves).
mode-tracker now reads the transcript tail each turn, estimates the live
context size (input + cache_creation + cache_read of the last turn), and writes
a humanized value to .caveman-ctx. The statusline renders it color-coded
(green <180K / yellow <320K / red), so the user SEES the session ballooning.
When it crosses the soft/hard threshold the hook periodically nudges the model
to suggest /clear or a fresh session — rate-limited so it never spams.
- caveman-mode-tracker.js: readContextSize() (256KB tail read, fail-open),
humanizeTok(), .caveman-ctx write via safeWriteFlag, graduated guard segments
- caveman-statusline.{sh,ps1}: render `ctx 200K` color-coded; same symlink-
refuse + whitelist hardening as the flag/savings files
- 4 new tests (ctx write, no-transcript silence, hard guard at turn 20,
below-threshold silence); 13/13 pass
- checksums.sha256 regenerated for the 3 changed hook files
- src/hooks/README.md: document the meter + guard
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Why
In agentic Claude Code the weekly token limit is dominated by input — the whole context (system prompt, tool schemas, prior prose, every tool result) is re-sent every turn; caveman's output compression is the smallest slice. These changes target the input side and add a quality gate so compression never silently trades away correctness. A token saved on the re-sent prefix is re-billed every later turn (compounds); a token saved on output is billed once.
What
1. Agentic-loop output discipline + gated reinforcement + one-shot nudge (
24c8e03)skills/caveman/SKILL.md: new## Agentic Loopsection — result-first, no recap of files just read, plan once then deltas, one intent clause per tool batch. Cross-references Auto-Clarity so security/irreversible/multi-step stay full prose.caveman-mode-tracker.js: per-turn reinforcement now fires every 3rd turn (+ on mode (re)activation), not every turn — the counter lives in a separate flag file so the injected string stays byte-stable and keeps hitting the prompt cache.caveman-activate.js: statusline-setup nudge is one-shot (a.caveman-nudgedmarker) instead of re-appending ~90 tokens to the cached SessionStart prefix every session.2. Eval quality gate — fidelity axis + two-gate accept rule (
5681437,5c7e19a)measure.pyonly counts tokens, which is gameable (a skill replyingkscores -99% and "wins"). Addsevals/judge.py(LLM judge for rubric facts + deterministic verbatim check ->fidelity.json) andevals/gate.py(accept only if tokens hold AND fidelity holds; high-risk prompts like the git-rebase "rewrites history" warning allow zero fidelity loss).evals/prompts/rubrics.jsonfor all 10 prompts; a committedfidelity.jsonbaseline;tests/test_eval_gate.py(incl. the canonical "reply k -> REJECT");.github/workflows/eval-gate.yml.3. Opt-in PostToolUse tool-result trim (
de7da10)caveman-trim-tool-result.js: when enabled, replaces oversized Read/Bash/Grep/Glob results (via the publicupdatedToolOutput) with a deterministic, lossless-first trimmed version before they enter context. OFF by default (CAVEMAN_TRIM_TOOL_RESULTS=1); wired via--with-trim, independent of the plugin (no double-fire).4. cavecrew as the default locate route (
e2e7076)cavecrew-investigator(its verbose Grep/Read stays out of main context); cap investigator/reviewer output at 25 rows.5. Structural ultra + honest input/output split in /caveman-stats (
b01efd3)skills/caveman/SKILL.md: theultralevel now collapses multi-sentence explanation into one causal chain and cuts transition sentences (not just word abbreviation), keeping the verbatim code-symbol/error-string exemption.caveman-stats.js:/caveman-statsreports the input-vs-output token split — output is the demo, input is where the weekly budget goes — and points at the input-side levers. Number formatting pinned toen-USso output is locale-deterministic.Testing
tests/test_hook_output.js(8),tests/test_trim_tool_result.js(15),tests/test_eval_gate.py(7),tests/test_caveman_stats.js(31) all pass.Notes
safeWriteFlagtemp-leak fix is a separate PR (fix(hooks): prevent safeWriteFlag temp-file leak on Windows rename co… #578) — this branch deliberately leavescaveman-config.jsuntouched.fidelity.jsonwas judged inline (an authenticatedclaude -pwas unavailable in the author's environment); regenerate withevals/judge.pyonce CLI auth is set up.🤖 Generated with Claude Code