Request for guidance — measuring agent-level (not tool-level) impact #302

@GrowtricsAI

Hi — we ran an A/B battery measuring codedb's effect on agent-level metrics (token usage + wall time of an LLM coding agent over representative tasks), separate from the tool-level ripgrep benchmarks you publish.

TL;DR: Tool-level speedup did not obviously translate to agent-level token reduction at our sample size. Likely a measurement problem on our end — would value your take on methodology.

Setup:

  • 5-task battery across Python, Dart, and TypeScript code, plus cross-module navigation
  • Baseline (grep + file-read tools only) vs treatment (codedb MCP registered, indexes pre-warmed)
  • Fresh agent session per run (no shared context/cache); N = 1–2 per cell
  • Ran the whole battery twice to check reproducibility (32 runs total); harness sketched below
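
Roughly, the harness looked like this (simplified sketch; run_agent_session is a placeholder for our internal session driver, not a codedb or MCP API):

    import itertools
    import time

    TASKS = [
        "python_symbol_caller_lookup",
        "dart_widget_structural_trace",
        "cross_module_flow_trace",
        "ts_config_value_audit",
        "python_async_race_hunt",
    ]
    # baseline: grep + file-read tools only; codedb: MCP registered, indexes pre-warmed
    ARMS = ["baseline", "codedb"]

    def run_agent_session(task: str, arm: str) -> int:
        """Placeholder: spawn a fresh agent session (no shared context or
        cache), run the task prompt, and return total tokens as reported
        by the model provider."""
        raise NotImplementedError

    results = []
    for rnd, task, arm in itertools.product((1, 2), TASKS, ARMS):
        t0 = time.monotonic()
        tokens = run_agent_session(task, arm)
        results.append({"round": rnd, "task": task, "arm": arm,
                        "tokens": tokens, "wall_s": time.monotonic() - t0})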

Round 1 vs Round 2, Δ% tokens (baseline → treatment):

Task type                          R1 Δ%     R2 Δ%
Python symbol caller lookup        −8.7%     +10.3%
Dart widget structural trace       −25.9%    +0.1%
Cross-module flow trace            −29.6%    +21.4%
TS config-value audit              +16.0%    −18.5%
Python async race-condition hunt   −0.6%     −24.2%

Four of the five tasks flipped sign between rounds, and the fifth (the async race-condition hunt) swung from −0.6% to −24.2%. Baseline tokens alone varied 2× run-to-run on identical prompts. Agent exploration stochasticity appears to dominate codedb's effect at our N.
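To put that variance in perspective, a standard two-sample power approximation gives the runs per arm needed to resolve a given relative effect. The CV below is an assumed value for illustration, not a statistic we computed from our runs:

    from math import ceil

    def n_per_arm(cv: float, effect: float,
                  z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
        """Runs per arm to detect a relative mean shift `effect` at ~80%
        power and 5% two-sided alpha, given a coefficient of variation
        `cv` in per-run token counts (standard two-sample z formula)."""
        return ceil(2 * ((z_alpha + z_beta) * cv / effect) ** 2)

    print(n_per_arm(cv=0.35, effect=0.20))  # 49 — well beyond N = 1–2 per cell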

Open questions:

  1. Recommended methodology for measuring agent-level impact? Any reference design you'd point to?
  2. Does codedb's benefit compound over long-running multi-turn agent loops (vs single-shot sessions)? That seems plausible but we haven't tested it.
  3. Any tell-tale signs in agent behavior that indicate codedb is being used well vs underutilized?

What we'd change next time:

  • Deterministic-answer prompts to remove agent exploration variance (example after this list)
  • N ≥ 10 per cell
  • Test inside a long-running multi-turn agent loop
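
For the first bullet, a deterministic-answer task pins a single correct output so runs score by exact match and variance shows up only in tokens and wall time, not in correctness. The prompt and target below are made up for illustration:

    # Hypothetical task spec: exactly one correct answer, scored by exact match.
    TASK = {
        "prompt": ("Which function in this repo calls `parse_config`? "
                   "Reply with the fully qualified name only."),
        "expected": "app.settings.load_settings",  # made-up target
    }

    def score(answer: str) -> bool:
        return answer.strip() == TASK["expected"]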

Happy to share more of the methodology if useful for future codedb benchmarking docs. Thanks for the tool — install was clean via the release binary and the MCP surface is well-designed.
