Hi — we ran an A/B battery measuring codedb's effect on agent-level metrics (token usage + wall time of an LLM coding agent over representative tasks), separate from the tool-level ripgrep benchmarks you publish.
TL;DR: Tool-level speedup did not obviously translate to agent-level token reduction at our sample size. Likely a measurement problem on our end — would value your take on methodology.
Setup:
- 5-task battery across Python, Dart, and TypeScript code, plus cross-module navigation
- Baseline (grep + file-read tools only) vs treatment (codedb MCP registered, indexes pre-warmed)
- Fresh agent session per run (no shared context/cache); N = 1–2 per cell
- Ran the whole battery twice to check reproducibility (32 runs total); a rough sketch of the harness shape follows this list
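For context on what a "run" means above, here is a rough sketch of the harness shape. All names here (`run_battery`, `run_agent`, `ARMS`) are illustrative rather than our actual code, and the agent invocation is injected as a callable that returns at least a per-session token count:

```python
# Rough sketch of the harness shape (illustrative names, not our actual code).
# `run_agent(prompt, arm)` is assumed to start a fresh agent session with the
# given tool setup and return at least a total token count for that session.
import time
from typing import Callable

# baseline = grep + file-read tools only; codedb = codedb MCP registered,
# indexes pre-warmed before the session starts.
ARMS = ("baseline", "codedb")

def run_battery(tasks: dict[str, str],
                run_agent: Callable[[str, str], dict],
                n_per_cell: int = 2) -> list[dict]:
    """One fresh session per run; records token usage and wall time per cell."""
    rows = []
    for task_name, prompt in tasks.items():
        for arm in ARMS:
            for _ in range(n_per_cell):
                start = time.monotonic()
                result = run_agent(prompt, arm)  # no shared context or cache across runs
                rows.append({
                    "task": task_name,
                    "arm": arm,
                    "tokens": result["tokens"],
                    "wall_s": time.monotonic() - start,
                })
    return rows
```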
Round 1 vs Round 2, Δ% total tokens (treatment relative to baseline; negative = fewer tokens with codedb):
| Task type | R1 Δ% | R2 Δ% |
| --- | --- | --- |
| Python symbol caller lookup | −8.7% | +10.3% |
| Dart widget structural trace | −25.9% | +0.1% |
| Cross-module flow trace | −29.6% | +21.4% |
| TS config-value audit | +16.0% | −18.5% |
| Python async race-condition hunt | −0.6% | −24.2% |
Four of the five tasks flipped sign between rounds, and the fifth swung from roughly neutral (−0.6%) to a 24% reduction. Baseline token usage alone varied by up to 2× between runs of identical prompts. At this N, agent exploration stochasticity appears to dominate any effect from codedb.
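To put numbers on that, here is the trivial aggregation over the Δ% values from the table above (deltas copied verbatim; a more careful analysis would bootstrap over the raw per-run token counts rather than per-round deltas):

```python
# Quick aggregation of the per-task token deltas (%) from the table above.
import statistics

deltas = {
    "Python symbol caller lookup":      (-8.7, +10.3),
    "Dart widget structural trace":     (-25.9, +0.1),
    "Cross-module flow trace":          (-29.6, +21.4),
    "TS config-value audit":            (+16.0, -18.5),
    "Python async race-condition hunt": (-0.6, -24.2),
}

all_runs = [d for pair in deltas.values() for d in pair]
round_to_round = [abs(r1 - r2) for r1, r2 in deltas.values()]

print(f"mean delta: {statistics.mean(all_runs):+.1f}%")                  # ~= -6.0%
print(f"stdev of delta across runs: {statistics.stdev(all_runs):.1f}%")  # ~= 18.3%
print(f"mean |R1 - R2| per task: {statistics.mean(round_to_round):.1f}%")  # ~= 30.8%
```

A mean shift of roughly −6 points against an ~18-point run-to-run spread (and an ~31-point average swing per task between rounds) is what we mean by the effect being buried in noise at this N.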
Open questions:
- Recommended methodology for measuring agent-level impact? Any reference design you'd point to?
- Does codedb's benefit compound over long-running multi-turn agent loops (vs single-shot sessions)? That seems plausible but we haven't tested it.
- Any tell-tale signs in agent behavior that indicate codedb is being used well vs underutilized?
What we'd change next time:
- Deterministic-answer prompts (to remove agent exploration variance)
- N ≥ 10 per cell (see the rough sample-size sketch after this list)
- Test inside a long-running multi-turn agent loop
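On the N ≥ 10 bullet, a back-of-envelope sketch of how we'd size the next battery. The z-values are the standard ones for two-sided α = 0.05 at 80% power; the per-run noise figure in the example call is an assumption, not something we measured directly:

```python
# Normal-approximation sample-size estimate for a two-arm comparison.
# sigma_pct: assumed per-run stdev of token usage, as % of the mean.
# effect_pct: smallest token reduction we care about detecting.
from math import ceil

def runs_per_cell(effect_pct: float, sigma_pct: float) -> int:
    z_alpha, z_beta = 1.96, 0.84  # two-sided alpha = 0.05, power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sigma_pct / effect_pct) ** 2)

# e.g. to detect a 20% token reduction with per-run noise around 30% of the mean:
print(runs_per_cell(effect_pct=20, sigma_pct=30))  # -> 36 runs per cell
```

Since the required n scales with (σ/Δ)², the deterministic-answer prompts above matter at least as much as raising N: halving exploration variance cuts the required run count by roughly 4×.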
Happy to share more of the methodology if useful for future codedb benchmarking docs. Thanks for the tool — install was clean via the release binary and the MCP surface is well-designed.