fix(evaluations): populate baseline performance metrics in computeBaseline() #1
Open
nanookclaw wants to merge 1 commit into alokemajumder:main from
Conversation
The `computeBaseline()` function declared `totalTokens`, `totalResponseTime`, and `totalErrors` accumulators but never incremented them in the evaluation loop. This caused `avgTokensPerSession`, `avgResponseTime`, and `errorRate` to always be 0 in computed baselines, silently disabling drift detection for those three metrics in `detectDrift()`.

The root issue was twofold:

1. The evaluation schema had no fields to carry per-session performance data (`tokensUsed`, `responseTime`, `errorCount`, `toolCallCount`).
2. The accumulation loop only summed `ev.score` and extracted the efficiency criterion score as a proxy for tool calls, leaving the other three counters permanently at zero.

Fix:

- Add optional `tokensUsed`, `responseTime`, `errorCount`, `toolCallCount` fields to the evaluation object created by `createEvaluation()`. These fields are `null` when not supplied, preserving backward compatibility.
- Update `computeBaseline()` to accumulate from these fields when present, using per-metric sample counts (`tokenCount`, `responseTimeCount`, `toolCallCount`) so averages remain correct when only a subset of evaluations carry performance data.

Tests added:

- 'should populate performance metrics in baseline when evaluations include them': asserts `avgTokensPerSession`, `avgResponseTime`, `avgToolCalls`, and `errorRate` are computed correctly.
- 'should leave metrics at 0 when no evaluations carry performance fields': asserts the zero-default path is unaffected.
- 'should detect token usage drift when baseline includes performance metrics': asserts `tokenUsage` and `responseTime` drift factors are surfaced by `detectDrift()` once the baseline carries real values.

All 45 tests pass.
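The failure mode can be illustrated with a minimal sketch. The gate conditions (`bm.avgTokensPerSession > 0`, `bm.avgResponseTime > 0`) come from the PR description; the threshold logic and names below are hypothetical stand-ins for the real `detectDrift()`:

```javascript
// Hypothetical stand-in for detectDrift(): each check gates on the
// baseline metric being > 0, so a zero-filled baseline yields no factors.
function detectDriftSketch(baseline, current) {
  const factors = [];
  const bm = baseline.metrics;
  // Only compare token usage when the baseline actually carries a value.
  if (bm.avgTokensPerSession > 0 && current.tokensUsed > bm.avgTokensPerSession * 1.5) {
    factors.push('tokenUsage');
  }
  if (bm.avgResponseTime > 0 && current.responseTime > bm.avgResponseTime * 1.5) {
    factors.push('responseTime');
  }
  return factors;
}

// Before the fix: accumulators were never incremented, so averages are 0...
const brokenBaseline = { metrics: { avgTokensPerSession: 0, avgResponseTime: 0 } };
// ...and even a wildly anomalous session produces no drift factors.
const factors = detectDriftSketch(brokenBaseline, { tokensUsed: 99999, responseTime: 99999 });

// After the fix, a baseline with real averages lets the same checks fire.
const fixedBaseline = { metrics: { avgTokensPerSession: 1000, avgResponseTime: 200 } };
const fixedFactors = detectDriftSketch(fixedBaseline, { tokensUsed: 99999, responseTime: 99999 });
```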
Problem
`computeBaseline()` in `control-plane/lib/evaluations.js` declares four accumulators (`totalTokens`, `totalResponseTime`, `totalErrors`, `totalToolCalls`), but only `totalScore` and a partial `totalToolCalls` (via an efficiency criterion proxy) are ever incremented in the evaluation loop. The result: `avgTokensPerSession`, `avgResponseTime`, and `errorRate` are always 0 in every computed baseline, which silently disables drift detection for those three metrics in `detectDrift()`. The drift checks for `tokenUsage` and `responseTime` gate on `bm.avgTokensPerSession > 0` and `bm.avgResponseTime > 0` respectively, so they can never fire.

The root cause is that the evaluation schema had no fields to carry per-session performance data. Without somewhere to store `tokensUsed`, `responseTime`, etc. on individual evaluations, `computeBaseline()` had nothing to sum.

Fix
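The schema side of the change can be sketched as follows. The field names and `null` defaults come from the PR; the function signature and record shape are assumptions, since the real `createEvaluation()` likely carries more fields:

```javascript
// Sketch: optional performance fields default to null so that existing
// call sites that never pass them produce records identical in shape.
function createEvaluation(sessionId, score, options = {}) {
  return {
    sessionId,
    score,
    // New optional per-session performance data (null = not supplied).
    tokensUsed: options.tokensUsed ?? null,
    responseTime: options.responseTime ?? null,
    errorCount: options.errorCount ?? null,
    toolCallCount: options.toolCallCount ?? null,
  };
}

const legacy = createEvaluation('s1', 0.9); // old call site, unchanged
const rich = createEvaluation('s2', 0.8, { tokensUsed: 1200, responseTime: 340 });
```

Defaulting to `null` rather than 0 lets the baseline computation distinguish "not measured" from "measured as zero", which is what makes the per-metric sample counts work.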
`control-plane/lib/evaluations.js`:

- Add optional `tokensUsed`, `responseTime`, `errorCount`, `toolCallCount` fields to the evaluation object in `createEvaluation()`. These default to `null` when not supplied, so the change is fully backward compatible; existing evaluations are unaffected.
- Update `computeBaseline()` to accumulate from these fields when present, using per-metric sample counts (`tokenCount`, `responseTimeCount`, `toolCallCount`) so averages are computed only over evaluations that actually carry the data. Evaluations without performance fields contribute only to `avgScore`.

`test/evaluations/evaluations.test.js`: three new tests:

- asserts `avgTokensPerSession`, `avgResponseTime`, `avgToolCalls`, and `errorRate` are computed correctly when evaluations include performance data.
- asserts the zero-default path is unaffected when no evaluations carry performance fields.
- asserts that `tokensUsed`/`responseTime` supplied on evaluations flow through the baseline so that `detectDrift()` surfaces `tokenUsage` and `responseTime` factors.

Test results
All existing tests pass unchanged.
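The per-metric sample-count approach described above can be sketched as follows. This is a simplified stand-in for the actual `computeBaseline()`, keeping only the metrics named in the PR:

```javascript
// Sketch: average each performance metric only over the evaluations that
// carry it, so mixing legacy (null) and instrumented records stays correct.
function computeBaselineSketch(evaluations) {
  let totalScore = 0;
  let totalTokens = 0, tokenCount = 0;
  let totalResponseTime = 0, responseTimeCount = 0;
  let totalErrors = 0;

  for (const ev of evaluations) {
    totalScore += ev.score;
    // `!= null` skips both null and undefined, i.e. "not measured".
    if (ev.tokensUsed != null) { totalTokens += ev.tokensUsed; tokenCount++; }
    if (ev.responseTime != null) { totalResponseTime += ev.responseTime; responseTimeCount++; }
    if (ev.errorCount != null) totalErrors += ev.errorCount;
  }

  return {
    avgScore: totalScore / evaluations.length,
    // Divide by the per-metric sample count, not the total evaluation count.
    avgTokensPerSession: tokenCount > 0 ? totalTokens / tokenCount : 0,
    avgResponseTime: responseTimeCount > 0 ? totalResponseTime / responseTimeCount : 0,
    errorRate: totalErrors / evaluations.length,
  };
}

// One legacy evaluation (no performance data) plus two instrumented ones:
const baseline = computeBaselineSketch([
  { score: 0.9 },
  { score: 0.8, tokensUsed: 1000, responseTime: 200, errorCount: 0 },
  { score: 0.7, tokensUsed: 1200, responseTime: 300, errorCount: 1 },
]);
```

Here `avgTokensPerSession` is 1100 (2200 over 2 samples, not 3), which is the behavior the 'should populate performance metrics' test asserts.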