Independent measurement across 5 models: output reduction is real (~-31%) but total cost never dropped; the 65-75% claim does not reproduce #520
Replies: 1 comment
This matches what I see in my own real Claude Code usage, as opposed to a benchmark, and the dollar/token gap you flag is even starker on agentic coding. Over ~1,300 prompts across 26 days, output is about 0.6% of my total tokens (cache reads alone are ~97%) but about 17% of the dollar bill. Context (the cache being loaded and then re-read every turn) is ~82% of the bill, rent alone ~55%.

caveman leaves code verbatim, so what it trims is the prose part of the output. On my data prose is ~86% of generation, which is ~14% of the whole bill. So even the ~31% prose reduction you measured nets out to roughly 4% of the bill, and that is before the ~1k-token instruction riding along on every request claws some of it back. That is exactly the input-domination effect you saw on coding tasks.

So your conclusion holds up outside the benchmark too: real output-token savings, model-dependent, but no meaningful total-cost drop on coding work.

The thing almost nobody does is check it on their own logs. I open-sourced a tool that prices every prompt straight from the local ~/.claude files and can diff a before/after around the day you install something, so you measure the actual delta instead of trusting a headline number: https://github.com/romainfjgaspard/prompt-analytics-for-claude-code
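The arithmetic behind that "roughly 4%" can be checked in a few lines. This is only a sketch using the rounded shares quoted above; the function and constant names are made up for illustration, and it assumes the ~31% reduction applies uniformly to all prose output.

```python
# Rounded shares quoted in the comment above (one user's 26-day log).
OUTPUT_SHARE_OF_BILL = 0.17   # output tokens as a fraction of the dollar bill
PROSE_SHARE_OF_OUTPUT = 0.86  # prose (non-code) fraction of generation
PROSE_REDUCTION = 0.31        # output reduction measured in the original post


def bill_saving(output_share: float, prose_share: float, reduction: float) -> float:
    """Fraction of the total bill saved when only prose output shrinks."""
    return output_share * prose_share * reduction


prose_share_of_bill = OUTPUT_SHARE_OF_BILL * PROSE_SHARE_OF_OUTPUT
saving = bill_saving(OUTPUT_SHARE_OF_BILL, PROSE_SHARE_OF_OUTPUT, PROSE_REDUCTION)

print(f"prose share of bill: {prose_share_of_bill:.1%}")
print(f"upper bound on bill saving: {saving:.1%}")
```

With these rounded inputs the result lands between 4% and 5% of the bill, and it is an upper bound because it ignores the cost of the added instruction tokens.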
I measured the skill with deterministic gates on two self-hosted models (Qwen3.6-35B and Mistral-Small-4, run via opencode headless, with the skill injected verbatim as project rules and the injection verified with a canary instruction) and on three Claude models via API (Sonnet 4.6, Opus 4.8, Fable 5). N=3 runs per cell; chat answers were scored against frozen fact checklists, and coding tasks were gated by typecheck/rename verification.
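For readers unfamiliar with the idea, a frozen-checklist gate can be as simple as the sketch below. This is not the harness's actual scoring code (see the linked repository for that); it assumes case-insensitive substring matching, and `passes_checklist`, `arm_passes`, and the sample facts are hypothetical.

```python
def passes_checklist(answer: str, required_facts: list[str]) -> bool:
    """True if every required fact string appears in the answer (case-insensitive)."""
    text = answer.casefold()
    return all(fact.casefold() in text for fact in required_facts)


def arm_passes(answers: list[str], required_facts: list[str]) -> bool:
    """An arm passes only if every one of its runs passes the checklist."""
    return all(passes_checklist(a, required_facts) for a in answers)


# Hypothetical checklist and three runs (N=3) for one cell.
facts = ["three-way handshake", "SYN", "ACK"]
runs = [
    "TCP open: three-way handshake. SYN, then SYN-ACK, then ACK.",
    "Three-way handshake: client SYN, server SYN-ACK, client ACK.",
    "TCP uses a three-way handshake (SYN, SYN-ACK, ACK).",
]
print(arm_passes(runs, facts))
```

Because the checklist is fixed before any run, the same gate applies to both the baseline and the skill arm, so a shorter answer only counts if it keeps every fact.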
What reproduces: output-token reduction on chat-style answers, consistently around -31% on four of five models (best case -33% on Opus). Technical accuracy held: all checklists passed in both arms.
What does not reproduce: the 65-75% token-reduction claim, on any of the five models. Two reasons fall out of the data. First, the instruction itself rides along as ~1k input tokens on every request, and on coding tasks input dominates, so total tokens often went up (Qwen ts-rename: 89k baseline vs 111k with the skill). Second, measured in dollars on the Claude models, the caveman arm was never cheaper (e.g. Opus $0.554 vs $0.555; Fable 5 outputs got 18% longer and cost more). One model going the wrong direction entirely suggests the effect is also model-dependent.
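The first reason can be made concrete with a single-request cost model. This is a minimal sketch under stated assumptions: the per-million-token prices are placeholders, not any provider's actual rates, the instruction is billed as ordinary uncached input, and the function names are invented for illustration. Prompt caching and multi-turn re-reads of the context are ignored.

```python
def net_saving_usd(
    output_tokens: float,
    reduction: float,
    instruction_tokens: float,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
) -> float:
    """Dollars saved per request: shorter output minus the added instruction cost."""
    saved = output_tokens * reduction * price_out_per_mtok / 1e6
    added = instruction_tokens * price_in_per_mtok / 1e6
    return saved - added


def break_even_output_tokens(
    reduction: float,
    instruction_tokens: float,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
) -> float:
    """Baseline output length at which the skill pays for its own instruction."""
    return instruction_tokens * price_in_per_mtok / (reduction * price_out_per_mtok)


# Placeholder prices (USD per million tokens); substitute your model's real rates.
PRICE_IN, PRICE_OUT = 3.0, 15.0

threshold = break_even_output_tokens(0.31, 1000, PRICE_IN, PRICE_OUT)
print(f"break-even baseline output: {threshold:.0f} tokens per request")
print(f"net at 300 output tokens: ${net_saving_usd(300, 0.31, 1000, PRICE_IN, PRICE_OUT):+.6f}")
print(f"net at 2000 output tokens: ${net_saving_usd(2000, 0.31, 1000, PRICE_IN, PRICE_OUT):+.6f}")
```

Under these placeholder prices, a request has to produce several hundred baseline output tokens before the ~1k-token instruction pays for itself, which is why short agentic turns with large inputs tend to come out at break-even or worse.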
Suggestion: qualify the claim toward "up to ~33% shorter chat outputs, model-dependent, with no total-cost saving measured on agentic coding workloads", or scope it to the workloads where it holds. Happy to share raw runs: harness https://github.com/cipherfoxie/agent-bench, full writeup https://sovgrid.org/blog/caveman-local-benchmark/