Hi — we ran an A/B battery measuring codedb's effect on agent-level metrics (token usage + wall time of an LLM coding agent over representative tasks), separate from the tool-level ripgrep benchmarks you publish.
TL;DR: Tool-level speedup did not obviously translate to agent-level token reduction at our sample size. Likely a measurement problem on our end — would value your take on methodology.
Setup:
- 5-task battery across Python, Dart, and TypeScript code, plus cross-module navigation
- Baseline (grep + file-read tools only) vs treatment (codedb MCP registered, indexes pre-warmed)
- Fresh agent session per run (no shared context/cache); N = 1–2 per cell
- Ran the whole battery twice to check reproducibility (32 runs total); a rough sketch of the harness shape follows this list
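For context on what a "run" means above, here is a rough sketch of the harness shape. All names here (`run_battery`, `run_agent`, `ARMS`) are illustrative rather than our actual code, and the agent invocation is injected as a callable that returns at least a per-session token count:

```python
# Rough sketch of the harness shape (illustrative names, not our actual code).
# `run_agent(prompt, arm)` is assumed to start a fresh agent session with the
# given tool setup and return at least a total token count for that session.
import time
from typing import Callable

# baseline = grep + file-read tools only; codedb = codedb MCP registered,
# indexes pre-warmed before the session starts.
ARMS = ("baseline", "codedb")

def run_battery(tasks: dict[str, str],
                run_agent: Callable[[str, str], dict],
                n_per_cell: int = 2) -> list[dict]:
    """One fresh session per run; records token usage and wall time per cell."""
    rows = []
    for task_name, prompt in tasks.items():
        for arm in ARMS:
            for _ in range(n_per_cell):
                start = time.monotonic()
                result = run_agent(prompt, arm)  # no shared context or cache across runs
                rows.append({
                    "task": task_name,
                    "arm": arm,
                    "tokens": result["tokens"],
                    "wall_s": time.monotonic() - start,
                })
    return rows
```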
Round 1 vs Round 2, Δ% total tokens (treatment relative to baseline; negative = fewer tokens with codedb):
| Task type | R1 Δ% | R2 Δ% |
| --- | --- | --- |
| Python symbol caller lookup | −8.7% | +10.3% |
| Dart widget structural trace | −25.9% | +0.1% |
| Cross-module flow trace | −29.6% | +21.4% |
| TS config-value audit | +16.0% | −18.5% |
| Python async race-condition hunt | −0.6% | −24.2% |
Four of the five tasks flipped sign between rounds, and the fifth swung from roughly neutral (−0.6%) to a 24% reduction. Baseline token usage alone varied by up to 2× between runs of identical prompts. At this N, agent exploration stochasticity appears to dominate any effect from codedb.
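To put numbers on that, here is the trivial aggregation over the Δ% values from the table above (deltas copied verbatim; a more careful analysis would bootstrap over the raw per-run token counts rather than per-round deltas):

```python
# Quick aggregation of the per-task token deltas (%) from the table above.
import statistics

deltas = {
    "Python symbol caller lookup":      (-8.7, +10.3),
    "Dart widget structural trace":     (-25.9, +0.1),
    "Cross-module flow trace":          (-29.6, +21.4),
    "TS config-value audit":            (+16.0, -18.5),
    "Python async race-condition hunt": (-0.6, -24.2),
}

all_runs = [d for pair in deltas.values() for d in pair]
round_to_round = [abs(r1 - r2) for r1, r2 in deltas.values()]

print(f"mean delta: {statistics.mean(all_runs):+.1f}%")                  # ~= -6.0%
print(f"stdev of delta across runs: {statistics.stdev(all_runs):.1f}%")  # ~= 18.3%
print(f"mean |R1 - R2| per task: {statistics.mean(round_to_round):.1f}%")  # ~= 30.8%
```

A mean shift of roughly −6 points against an ~18-point run-to-run spread (and an ~31-point average swing per task between rounds) is what we mean by the effect being buried in noise at this N.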
Open questions:
- Recommended methodology for measuring agent-level impact? Any reference design you'd point to?
- Does codedb's benefit compound over long-running multi-turn agent loops (vs single-shot sessions)? That seems plausible but we haven't tested it.
- Any tell-tale signs in agent behavior that indicate codedb is being used well vs underutilized?
What we'd change next time:
- Deterministic-answer prompts (to remove agent exploration variance)
- N ≥ 10 per cell (see the rough sample-size sketch after this list)
- Test inside a long-running multi-turn agent loop
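On the N ≥ 10 bullet, a back-of-envelope sketch of how we'd size the next battery. The z-values are the standard ones for two-sided α = 0.05 at 80% power; the per-run noise figure in the example call is an assumption, not something we measured directly:

```python
# Normal-approximation sample-size estimate for a two-arm comparison.
# sigma_pct: assumed per-run stdev of token usage, as % of the mean.
# effect_pct: smallest token reduction we care about detecting.
from math import ceil

def runs_per_cell(effect_pct: float, sigma_pct: float) -> int:
    z_alpha, z_beta = 1.96, 0.84  # two-sided alpha = 0.05, power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sigma_pct / effect_pct) ** 2)

# e.g. to detect a 20% token reduction with per-run noise around 30% of the mean:
print(runs_per_cell(effect_pct=20, sigma_pct=30))  # -> 36 runs per cell
```

Since the required n scales with (σ/Δ)², the deterministic-answer prompts above matter at least as much as raising N: halving exploration variance cuts the required run count by roughly 4×.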
Happy to share more of the methodology if useful for future codedb benchmarking docs. Thanks for the tool — install was clean via the release binary and the MCP surface is well-designed.