fix(evaluations): populate baseline performance metrics in computeBaseline()#1

Open
nanookclaw wants to merge 1 commit into alokemajumder:main from nanookclaw:fix/baseline-metrics-population

Conversation

@nanookclaw

Problem

computeBaseline() in control-plane/lib/evaluations.js declares four accumulators (totalTokens, totalResponseTime, totalErrors, totalToolCalls), but the evaluation loop only ever increments totalScore and, via an efficiency-criterion proxy, a partial totalToolCalls. The other accumulators are never touched.

The result: avgTokensPerSession, avgResponseTime, and errorRate are always 0 in every computed baseline, which silently disables drift detection for those three metrics in detectDrift(). The drift checks for tokenUsage and responseTime gate on bm.avgTokensPerSession > 0 and bm.avgResponseTime > 0 respectively — so they can never fire.

The root cause is that the evaluation schema had no fields to carry per-session performance data. Without somewhere to store tokensUsed, responseTime, etc. on individual evaluations, computeBaseline() had nothing to sum.
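A minimal sketch of the failure mode described above. The function bodies are illustrative, not the real implementation; only the metric and gate names (avgTokensPerSession, bm.avgTokensPerSession > 0) come from the code under discussion.

```javascript
// Illustrative reduction of the bug: the accumulator exists but is never
// incremented, so the baseline metric is always 0 and the drift gate
// (which requires a positive baseline) can never open.
function computeBaselineBuggy(evaluations) {
  let totalScore = 0;
  let totalTokens = 0; // declared but never incremented in the loop
  for (const ev of evaluations) {
    totalScore += ev.score; // only the score is accumulated
  }
  const n = evaluations.length;
  return {
    avgScore: n ? totalScore / n : 0,
    avgTokensPerSession: n ? totalTokens / n : 0, // always 0
  };
}

const bm = computeBaselineBuggy([
  { score: 0.9, tokensUsed: 1200 }, // tokensUsed is silently ignored
  { score: 0.8, tokensUsed: 900 },
]);
console.log(bm.avgTokensPerSession);     // 0
console.log(bm.avgTokensPerSession > 0); // false -- drift check never fires
```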

Fix

control-plane/lib/evaluations.js

  1. Add optional tokensUsed, responseTime, errorCount, toolCallCount fields to the evaluation object in createEvaluation(). These default to null when not supplied — fully backward compatible; existing evaluations are unaffected.

  2. Update computeBaseline() to accumulate from these fields when present, using per-metric sample counts (tokenCount, responseTimeCount, toolCallCount) so averages are computed only over evaluations that actually carry the data. Evaluations without performance fields contribute only to avgScore.
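Sketched, the accumulation change looks roughly like this. Field names (tokensUsed, responseTime, errorCount) and the per-metric count names are from the PR description; the body is a simplified stand-in for the real computeBaseline(), which also handles avgToolCalls and scoring details not shown here.

```javascript
// Per-metric sample counts: only evaluations that actually carry a field
// contribute to that field's average, so mixing old and new evaluations
// does not skew the baseline.
function computeBaseline(evaluations) {
  let totalScore = 0;
  let totalTokens = 0, tokenCount = 0;
  let totalResponseTime = 0, responseTimeCount = 0;
  let totalErrors = 0;

  for (const ev of evaluations) {
    totalScore += ev.score;
    if (ev.tokensUsed != null) { totalTokens += ev.tokensUsed; tokenCount++; }
    if (ev.responseTime != null) { totalResponseTime += ev.responseTime; responseTimeCount++; }
    if (ev.errorCount != null) totalErrors += ev.errorCount;
  }

  const n = evaluations.length;
  return {
    avgScore: n ? totalScore / n : 0,
    avgTokensPerSession: tokenCount ? totalTokens / tokenCount : 0,
    avgResponseTime: responseTimeCount ? totalResponseTime / responseTimeCount : 0,
    errorRate: n ? totalErrors / n : 0,
  };
}

// A legacy evaluation with no performance fields only affects avgScore:
const baseline = computeBaseline([
  { score: 1.0, tokensUsed: 100, responseTime: 200, errorCount: 0 },
  { score: 0.5 }, // pre-existing evaluation; fields absent
]);
console.log(baseline.avgTokensPerSession); // 100, not 50
```

Dividing by tokenCount rather than evaluations.length is the key design choice: it keeps the averages correct when only a subset of evaluations carries performance data.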

test/evaluations/evaluations.test.js — three new tests:

  • should populate performance metrics in baseline when evaluations include them — verifies avgTokensPerSession, avgResponseTime, avgToolCalls, and errorRate are computed correctly.
  • should leave metrics at 0 when no evaluations carry performance fields — verifies the zero-default path (existing behaviour) is unaffected.
  • should detect token usage drift when baseline includes performance metrics — end-to-end: evaluations with tokensUsed/responseTime → baseline → detectDrift() surfaces tokenUsage and responseTime factors.

Test results

# tests 45
# pass  45
# fail  0

All existing tests pass unchanged.

fix(evaluations): populate baseline performance metrics in computeBaseline()

The computeBaseline() function declared totalTokens, totalResponseTime,
and totalErrors accumulators but never incremented them in the evaluation
loop. This caused avgTokensPerSession, avgResponseTime, and errorRate to
always be 0 in computed baselines, silently disabling drift detection for
those three metrics in detectDrift().

The root issue was twofold:
1. The evaluation schema had no fields to carry per-session performance
   data (tokensUsed, responseTime, errorCount, toolCallCount).
2. The accumulation loop only summed ev.score and extracted the
   efficiency criterion score as a proxy for tool calls — leaving the
   other three counters permanently at zero.

Fix:
- Add optional tokensUsed, responseTime, errorCount, toolCallCount fields
  to the evaluation object created by createEvaluation(). These fields are
  null when not supplied, preserving backward compatibility.
- Update computeBaseline() to accumulate from these fields when present,
  using per-metric sample counts (tokenCount, responseTimeCount,
  toolCallCount) so averages remain correct when only a subset of
  evaluations carry performance data.

Tests added:
- 'should populate performance metrics in baseline when evaluations
  include them' — asserts avgTokensPerSession, avgResponseTime,
  avgToolCalls, and errorRate are computed correctly.
- 'should leave metrics at 0 when no evaluations carry performance
  fields' — asserts the zero-default path is unaffected.
- 'should detect token usage drift when baseline includes performance
  metrics' — asserts tokenUsage and responseTime drift factors are
  surfaced by detectDrift() once the baseline carries real values.

All 45 tests pass.
