fix(evaluations): populate baseline performance metrics in computeBaseline() #1
Open
nanookclaw wants to merge 1 commit into alokemajumder:main from
Conversation
The `computeBaseline()` function declared `totalTokens`, `totalResponseTime`, and `totalErrors` accumulators but never incremented them in the evaluation loop. This caused `avgTokensPerSession`, `avgResponseTime`, and `errorRate` to always be 0 in computed baselines, silently disabling drift detection for those three metrics in `detectDrift()`.

The root issue was twofold:

1. The evaluation schema had no fields to carry per-session performance data (`tokensUsed`, `responseTime`, `errorCount`, `toolCallCount`).
2. The accumulation loop only summed `ev.score` and extracted the efficiency criterion score as a proxy for tool calls, leaving the other three counters permanently at zero.

Fix:

- Add optional `tokensUsed`, `responseTime`, `errorCount`, `toolCallCount` fields to the evaluation object created by `createEvaluation()`. These fields are `null` when not supplied, preserving backward compatibility.
- Update `computeBaseline()` to accumulate from these fields when present, using per-metric sample counts (`tokenCount`, `responseTimeCount`, `toolCallCount`) so averages remain correct when only a subset of evaluations carry performance data.

Tests added:

- 'should populate performance metrics in baseline when evaluations include them': asserts `avgTokensPerSession`, `avgResponseTime`, `avgToolCalls`, and `errorRate` are computed correctly.
- 'should leave metrics at 0 when no evaluations carry performance fields': asserts the zero-default path is unaffected.
- 'should detect token usage drift when baseline includes performance metrics': asserts `tokenUsage` and `responseTime` drift factors are surfaced by `detectDrift()` once the baseline carries real values.

All 45 tests pass.
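The failure mode can be illustrated with a minimal sketch. The gate conditions (`bm.avgTokensPerSession > 0`, `bm.avgResponseTime > 0`) come from the PR description; the threshold logic and names below are hypothetical stand-ins for the real `detectDrift()`:

```javascript
// Hypothetical stand-in for detectDrift(): each check gates on the
// baseline metric being > 0, so a zero-filled baseline yields no factors.
function detectDriftSketch(baseline, current) {
  const factors = [];
  const bm = baseline.metrics;
  // Only compare token usage when the baseline actually carries a value.
  if (bm.avgTokensPerSession > 0 && current.tokensUsed > bm.avgTokensPerSession * 1.5) {
    factors.push('tokenUsage');
  }
  if (bm.avgResponseTime > 0 && current.responseTime > bm.avgResponseTime * 1.5) {
    factors.push('responseTime');
  }
  return factors;
}

// Before the fix: accumulators were never incremented, so averages are 0...
const brokenBaseline = { metrics: { avgTokensPerSession: 0, avgResponseTime: 0 } };
// ...and even a wildly anomalous session produces no drift factors.
const factors = detectDriftSketch(brokenBaseline, { tokensUsed: 99999, responseTime: 99999 });

// After the fix, a baseline with real averages lets the same checks fire.
const fixedBaseline = { metrics: { avgTokensPerSession: 1000, avgResponseTime: 200 } };
const fixedFactors = detectDriftSketch(fixedBaseline, { tokensUsed: 99999, responseTime: 99999 });
```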
Problem
`computeBaseline()` in `control-plane/lib/evaluations.js` declares four accumulators (`totalTokens`, `totalResponseTime`, `totalErrors`, `totalToolCalls`), but only `totalScore` and a partial `totalToolCalls` (via an efficiency criterion proxy) are ever incremented in the evaluation loop. The result: `avgTokensPerSession`, `avgResponseTime`, and `errorRate` are always 0 in every computed baseline, which silently disables drift detection for those three metrics in `detectDrift()`. The drift checks for `tokenUsage` and `responseTime` gate on `bm.avgTokensPerSession > 0` and `bm.avgResponseTime > 0` respectively, so they can never fire.

The root cause is that the evaluation schema had no fields to carry per-session performance data. Without somewhere to store `tokensUsed`, `responseTime`, etc. on individual evaluations, `computeBaseline()` had nothing to sum.

Fix
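The schema side of the change can be sketched as follows. The field names and `null` defaults come from the PR; the function signature and record shape are assumptions, since the real `createEvaluation()` likely carries more fields:

```javascript
// Sketch: optional performance fields default to null so that existing
// call sites that never pass them produce records identical in shape.
function createEvaluation(sessionId, score, options = {}) {
  return {
    sessionId,
    score,
    // New optional per-session performance data (null = not supplied).
    tokensUsed: options.tokensUsed ?? null,
    responseTime: options.responseTime ?? null,
    errorCount: options.errorCount ?? null,
    toolCallCount: options.toolCallCount ?? null,
  };
}

const legacy = createEvaluation('s1', 0.9); // old call site, unchanged
const rich = createEvaluation('s2', 0.8, { tokensUsed: 1200, responseTime: 340 });
```

Defaulting to `null` rather than 0 lets the baseline computation distinguish "not measured" from "measured as zero", which is what makes the per-metric sample counts work.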
`control-plane/lib/evaluations.js`:

- Add optional `tokensUsed`, `responseTime`, `errorCount`, `toolCallCount` fields to the evaluation object in `createEvaluation()`. These default to `null` when not supplied, so the change is fully backward compatible; existing evaluations are unaffected.
- Update `computeBaseline()` to accumulate from these fields when present, using per-metric sample counts (`tokenCount`, `responseTimeCount`, `toolCallCount`) so averages are computed only over evaluations that actually carry the data. Evaluations without performance fields contribute only to `avgScore`.

`test/evaluations/evaluations.test.js`: three new tests:

- asserts `avgTokensPerSession`, `avgResponseTime`, `avgToolCalls`, and `errorRate` are computed correctly when evaluations include performance data.
- asserts the zero-default path is unaffected when no evaluations carry performance fields.
- asserts that `tokensUsed`/`responseTime` supplied on evaluations flow through the baseline so that `detectDrift()` surfaces `tokenUsage` and `responseTime` factors.

Test results
All existing tests pass unchanged.
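The per-metric sample-count approach described above can be sketched as follows. This is a simplified stand-in for the actual `computeBaseline()`, keeping only the metrics named in the PR:

```javascript
// Sketch: average each performance metric only over the evaluations that
// carry it, so mixing legacy (null) and instrumented records stays correct.
function computeBaselineSketch(evaluations) {
  let totalScore = 0;
  let totalTokens = 0, tokenCount = 0;
  let totalResponseTime = 0, responseTimeCount = 0;
  let totalErrors = 0;

  for (const ev of evaluations) {
    totalScore += ev.score;
    // `!= null` skips both null and undefined, i.e. "not measured".
    if (ev.tokensUsed != null) { totalTokens += ev.tokensUsed; tokenCount++; }
    if (ev.responseTime != null) { totalResponseTime += ev.responseTime; responseTimeCount++; }
    if (ev.errorCount != null) totalErrors += ev.errorCount;
  }

  return {
    avgScore: totalScore / evaluations.length,
    // Divide by the per-metric sample count, not the total evaluation count.
    avgTokensPerSession: tokenCount > 0 ? totalTokens / tokenCount : 0,
    avgResponseTime: responseTimeCount > 0 ? totalResponseTime / responseTimeCount : 0,
    errorRate: totalErrors / evaluations.length,
  };
}

// One legacy evaluation (no performance data) plus two instrumented ones:
const baseline = computeBaselineSketch([
  { score: 0.9 },
  { score: 0.8, tokensUsed: 1000, responseTime: 200, errorCount: 0 },
  { score: 0.7, tokensUsed: 1200, responseTime: 300, errorCount: 1 },
]);
```

Here `avgTokensPerSession` is 1100 (2200 over 2 samples, not 3), which is the behavior the 'should populate performance metrics' test asserts.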