feat(evals): unified trading matrix eval + CI evals typecheck lane#171
Conversation
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 60ce124b
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T10:59:05Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 3 (1 medium-concern, 1 low, 1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 216.3s (2 bridge agents) |
| Total | 216.3s |
💰 Value — sound-with-nits
Adds a unified profile×(persona×market) trading matrix eval that drives the real operator stack and scores cells on observable artifacts plus objective walk-forward backtests, wires it into the existing eval CLI behind --matrix, and adds a CI typecheck lane for evals; good change with a small local
- What it does: Introduces evals/src/trading/unified-trading-matrix.ts, which runs a runProfileMatrix-spined campaign sweeping operator LLM model variants (kimi-k2 / glm-4.7 / glm-5.1) against personas × trading market scenarios. Each cell provisions a real bot via runMultishotUserSim, pins the in-sandbox operator model via agentEnv, and scores the result as 60% artifact-based signal (scoreUserSimArtifact) + 40%
- Goals it achieves: Unify the trading eval around one real-operator integration surface that judges on genuine state changes (bot_artifacts / tick_side_effects) plus an objective backtest anchor, rather than prose-only scoring; enable comparative selection of which LLM model runs the operator best across personas and market regimes; ensure the eval TypeScript stays green going forward with a CI typecheck gate; preser
- Assessment: The change is coherent and in-grain. It consumes the agent-eval campaign substrate (runProfileMatrix, recordRunsToScorecard, assertRealBackend) correctly, reuses the existing multishot user-sim and walk-forward backtest modules, maintains back-compat for existing callers, and documents its design trade-offs honestly (e.g., multi-round degenerating to one round for the real-operator dispatch). The
- Better / existing approach: No existing equivalent for the unified trading matrix exists in this repo (grep for persona-profile-matrix / persona-conversation-matrix / persona-decision-loop surfaces only returns the new file), and runProfileMatrix from @tangle-network/agent-eval/campaign is the right substrate. The only materially simpler design available is local: llmCallWithUsage in evals/src/sim/llm-call.ts duplicates ~80%
🎯 Usefulness — sound-with-nits
A coherent, reachable real-operator matrix eval and a required CI typecheck lane; the only real risk is that the default kimi-k2 profile may not be runnable by the current opencode sidecar provider wiring.
- Integration: Yes.
package.json:20addseval:trading-matrix, which runsdist/evals/bin/agent-eval-trading-personas.jswithTRADING_UNIFIED_MATRIX=1.evals/src/bin/agent-eval-trading-personas.ts:26-46branches on--matrix/TRADING_UNIFIED_MATRIXtorunUnifiedTradingMatrix, while keeping the existingeval:trading-personasback-compat path..github/workflows/ci.yml:337-354adds a classified `Eva - Fit with existing patterns: It fits the existing eval substrate: it drives
@tangle-network/agent-eval/campaign'srunProfileMatrix(evals/src/trading/unified-trading-matrix.ts:65, :388), reuses theOperatorClientprovision/configure/chat flow already used by lifecycle/research/robustness evals, and writes to the same scorecard JSONL pattern used bypersona-agent-eval.tsandscorecard-integration.ts(`unified-tradi - Real-world viability: The heavy integration path is honest about its needs: it validates
operatorUrl, token/privateKey, and unknown/missing model keys up front (unified-trading-matrix.ts:347-361), serializes cells by default (maxConcurrency: 1), and exits non-zero on stub runs or no winner (agent-eval-trading-personas.ts:48-51). The main real-world gap is the default profile list includeskimi-k2as an operat
🔎 Heuristic Signals
🟡 Cruft: console debug added evals/src/bin/agent-eval-trading-personas.ts
- console.log(JSON.stringify(summary, null, 2))
💰 Value Audit
🟡 llmCallWithUsage duplicates the core llmCall stream loop [better-architecture] ``
evals/src/sim/llm-call.ts:137-173 and :180-221 share identical backend setup, AbortController/timer handling, stream consumption, and error-path return logic. The only difference is accumulating llm_call event usage. Refactoring to a shared inner helper, or extending LlmCallResult with optional usage so llmCall always accumulates it, would eliminate the duplicate code and prevent divergence in timeout/abort/error behavior.
🎯 Usefulness Audit
🟠 Default kimi-k2 profile may be unsupported by the operator sidecar [integration] ``
UNIFIED_PROFILE_MODELSdefaults to['kimi-k2', 'glm-4.7', 'glm-5.1'](evals/src/trading/unified-trading-matrix.ts:94), andagentEnvForModelsetsOPENCODE_MODEL_PROVIDER: 'moonshot'for that model (:145-146). Repo evidence of sidecar-supported providers is limited toanthropic,zai-coding-plan,openrouter, andgemini(trading-instance-blueprint-lib/src/operator_api.rs:2531-2545;arena/src/lib/config/aiProviders.ts:1;evals/src/sim/operator-client.ts:498-516). If opencode
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 44 | 60 | 44 |
| Confidence | 80 | 80 | 80 |
| Correctness | 44 | 60 | 44 |
| Security | 44 | 60 | 44 |
| Testing | 44 | 60 | 44 |
| Architecture | 44 | 60 | 44 |
Full multi-shot audit completed 4/4 planned shots over 7 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 7 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM llmCallWithUsage duplicates core stream loop from llmCall (~45 lines) — evals/src/sim/llm-call.ts
llmCallWithUsage (lines 180-221) is a near-identical copy of llmCall (lines 137-173) differing only in accumulating usage from
llm_callstream events and returningLlmCallUsageResult. Every aspect — abort controller wiring, text_delta accumulation, backend_error handling, try/catch/finally, timeout clearance — is duplicated. Any bug fix applied to one must be manually propagated to the other. Extract a shared core (e.g.,async function llmCallCore(opts, onEvent?): Promise<...>) that both callers compose.
🟠 MEDIUM No test coverage for new exports (4 files, 0 tests) — evals/src/sim/multishot-user-sim.ts
scoreUserSimArtifact (lines 304-328) is a new exported scoring function used by unifiedJudge to produce composite scores that feed profile rankings and promotion decisions. None of the four files in this shot have corresponding test files. The composite scoring formula — 0.20intent + 0.15constraints + 0.40committed + 0.15self_improvement + 0.10*conversation — is untested against known edge cases (empty artifacts, missing bot_artifacts, prose-only sessions). Similarly, agentEnvForModel, buildUnifiedScenarios, llmCallWithUsage, and the unified judge all lack tests. At minimum, unit tests for scoring edge cases and a smoke test for the matri
🟠 MEDIUM 'warn' integrity mode may still throw on stubs at post-campaign check — evals/src/trading/unified-trading-matrix.ts
Lines 502-505:
(options.integrity ?? 'assert') === 'off' ? summarizeBackendIntegrity(result.records) : assertRealBackend(result.records, { allowMixed: true }). Whenoptions.integrity === 'warn', the condition'warn' === 'off'is false, soassertRealBackendis called regardless. The comment at CLI bin line 44 says'warn' lets a dry run proceed without LLM keys— but ifassertRealBackendwithallowMixed:truethrows on stub records, the warn path crashes identically to asser
🟠 MEDIUM moonshot provider name is unverified — all other code uses zai-coding-plan exclusively — evals/src/trading/unified-trading-matrix.ts
Lines ~113-120: agentEnvForModel sets OPENCODE_MODEL_PROVIDER:'moonshot' for kimi-k2, but every other code path in the codebase (deterministicAgentEnv in operator-client.ts, chat-code-strategy-runner.ts, chat-sandbox-runner.ts, chat-mcp-strategy-runner.ts) hardcodes 'zai-coding-plan'. The sidecar constructs the opencode provider config from OPENCODE_MODEL_PROVIDER — if it doesn't know 'moonshot', the in-sandbox agent won't start and the bot never replies. This would silently break the entire kimi-k2 profile axis (1 of 3 profiles). The OPENCODE_MODEL_BASE_URL env is also new (not set by deterministicAgentEnv) — may or may not be recognized by the sidecar. Fix: verify the sidecar supports 'moonshot' as a provider name before relying on it, or route Moonshot through the same generic OpenAI-co
🟠 MEDIUM unifiedJudge relies on failed:true to exclude non-responding cells, but substrate does not honor it — evals/src/trading/unified-trading-matrix.ts
Lines ~186-192: unifiedJudge().score() returns {composite:0, failed:true} when operatorResponded is false, with comment 'cell failed (not scored as zero)'. But the substrate's cellComposite in @tangle-network/agent-eval (run-profile-matrix.ts) computes mean of ALL judgeScores without checking the 'failed' flag — confirmed by source inspection ('failed' does not appear in that file). The zero IS folded into the profile mean and pass rate, contradicting the judge's documented intent. Impact: a profile whose cells fail (e.g. kimi-k2 if the moonshot provider misconfigures) gets unfairly dragged down by zeros that should be excluded, skewing the model comparison the matrix exists to answer. Fix: either filter failed scores in the matrix's own meanScore/passRate computation, or remove the failed
🟡 LOW Broad evals trigger: root package.json/package-lock.json changes trigger evals lane regardless of relevance — .github/workflows/ci.yml
Lines 97-100: The case pattern
package.json|package-lock.json|tsconfig.json|tsconfig.*.jsonmatches ANY change to root package.json or package-lock.json, even if the change is unrelated to evals/ (e.g., adding an arena-only dep). Root package.json scripts show this is a workspace root with deps shared across arena, sdk-ts, and evals. The evals job only runs tsc --noEmit, so it won't fail spuriously — this is wasteful CI cycles, not a correctness bug. Consider narrowing toevals/*|package.jsononly when evals-related fields change, or accept the harmless over-trigger.
🟡 LOW Bypasses existing npm typecheck:evals script — .github/workflows/ci.yml
Line 354: Uses
npx tsc -p evals/tsconfig.json --noEmitdirectly. Root package.json line 24 already defines"typecheck:evals": "tsc -p evals/tsconfig.json --noEmit". Usingnpm run typecheck:evalsinstead of the raw npx command would keep the CI command in sync with the npm script if the tsconfig path or flags ever change.
🟡 LOW Dead patterns: tsconfig.json and tsconfig.*.json match no files in this repo — .github/workflows/ci.yml
Lines 98: Both
tsconfig.jsonandtsconfig.*.jsonare unqualified (no directory prefix). They match only at repo root. This repo has no root-level tsconfig.json or tsconfig.*.json (tsconfig files live under evals/, arena/, sdk-ts/). These are forward-looking patterns that currently match nothing. If a root tsconfig is added later, it would trigger evals for changes that may only affect arena or sdk-ts.
🟡 LOW Dead root tsconfig change-detection patterns — .github/workflows/ci.yml
Lines 98: the case patterns
tsconfig.json|tsconfig.*.jsonmatch only root-level files, but no root tsconfig.json exists (only evals/tsconfig.json, arena/tsconfig.json, sdk-ts/tsconfig.json do).evals/*already catches evals/tsconfig.json. These patterns are harmless and defensive (would trigger evals lane if a shared root tsconfig is added later), but are currently dead. Not blocking.
🟡 LOW Inlines tsc command instead of reusing typecheck:evals script — .github/workflows/ci.yml
Line 354:
npx tsc -p evals/tsconfig.json --noEmitduplicates the roottypecheck:evalsnpm script (tsc -p evals/tsconfig.json --noEmit). Usingnpm run typecheck:evalswould keep the CI invocation as the single source of truth for the typecheck command. Functionally equivalent; cosmetic consistency only.
🟡 LOW No tests for scoreUserSimArtifact extraction or unified matrix construction — evals/src/sim/multishot-user-sim.ts
scoreUserSimArtifact is extracted from userSimJudge with identical weights — a clean factoring, but the duplication (same 5-term composite formula now in two places) means a future change to one without the other silently diverges. No unit test pins the weights or verifies the two functions produce identical scores for the same input. Similarly, unified-trading-matrix.ts (606 lines, the core new eval) has no test file. The matrix needs live infra to run end-to-end, but the pure functions (buildUnifiedProfiles, buildUnifiedScenarios, modelOfProfile, agentEnvForModel, firstArtifact, campaignTokenUsage, scoreOf, meanScore) are all unit-testable.
🟡 LOW agentEnvForModel provider detection uses fragile URL string match — evals/src/trading/unified-trading-matrix.ts
Line 145:
const isMoonshot = cfg.baseUrl.includes('moonshot'). Mapping model → opencode provider name via URL substring is brittle: if the Moonshot API baseUrl changes (e.g., regional endpoint), the provider name flips to 'zai-coding-plan' and the opencode sidecar receives wrong credentials. The MODEL_CONFIG in llm-call.ts (line 56) already centralizes per-model routing — adding aproviderName: 'moonshot' | 'zai-coding-plan'field there and reading it directly in agentEnvForModel would
🟡 LOW metaSpendCall adds real LLM cost per cell and can cause false-negative integrity on transient failures — evals/src/trading/unified-trading-matrix.ts
Lines ~300-320: Each cell makes an extra LLM call (metaSpendCall) purely for token accounting, since the inner multishot campaign's LLM spend happens inside the sandbox (invisible to the eval). If this grounding call fails (rate limit, timeout, network), llmCallWithUsage returns usage:{input:0,output:0,costUsd:0}. Combined with the inner campaign's zero tokenUsage, the cell reports zero tokens total, and assertRealBackend flags the entire matrix as 'stub', failing loud even though the real operator ran fine. The comment says this is 'HONEST' but doesn't degrade gracefully. Fix: treat a failed grounding call as a warning, not a zero — or use a minimum token floor from the session transcript length to avoid a hard integrity failure.
🟡 LOW metaSpendCall fires for sessions with zero operator turns — evals/src/trading/unified-trading-matrix.ts
firstArtifact (line 546) checks
cell.artifact && cell.artifact.turns, but JS arrays are always truthy (including empty[]). This means even a session with zero turns returnsoperatorResponded: falsebut still triggers metaSpendCall (line 447) — an LLM call feeding '(no turns)' as the transcript. The call costs tokens for no useful signal and the combined judge already marks the cell as failed. Short-circuitmetaSpendCallwhensession.turns.length === 0to avoid this waste. The co
🟡 LOW Residual agent-eval version split (0.91 top-level vs 0.79 nested under agent-knowledge) — package-lock.json
agent-knowledge 1.7.0 hard-pins agent-eval to >=0.77.0 <0.80.0 (lockfile line 197), incompatible with the top-level ^0.91.0. npm nests agent-eval 0.79.0 under node_modules/@tangle-network/agent-knowledge/ (lockfile line 208). This is a 12-version gap. Impact is low: agent-knowledge is NOT imported by any eval TypeScript (verified via rg — only referenced in Rust activate.rs as a sandbox CLI tool and in prompt templates), so the eval pipeline uses only the top-level 0.91.0. This split existed before (was 0.91 vs 0.42, a 49-version gap) and t
🟡 LOW agent-knowledge@1.7.0 peer dep on agent-eval<0.80 untested against project agent-eval@0.91.0 — package-lock.json
agent-knowledge@1.7.0 declares peerDependency on @tangle-network/agent-eval with range >=0.77.0 <0.80.0, but the root project ships agent-eval@0.91.0. npm resolved this by nesting agent-eval@0.79.0 under agent-knowledge/node_modules/ — correct from a resolution standpoint, but this version combination (agent-knowledge 1.7 + agent-eval 0.91) was never tested by the agent-knowledge release. If project code passes agent-eval instances or types from root (0.91) into agent-knowledge APIs expecting types from 0.79, runtime errors or type mismatches could surface. Mitigation: the grep confirms agent-knowledge is consumed as a CLI/data package, not a library API, so blast radius is low. Fix: either bump agent-knowledge to a version that accepts agent-eval >=0.83 (aligning with agent-runtime@0.52 p
🟡 LOW agent-knowledge@1.7.0 peer-dep semver gap with agent-eval@0.91.0 — package.json
Line 41:
@tangle-network/agent-knowledgebumped to^1.7.0. This version declares@tangle-network/agent-eval: >=0.77.0 <0.80.0as a regular dependency, but the project root suppliesagent-eval@0.91.0(line 40). The lock file resolves it via npm peer hoisting (node_modules/@tangle-network/agent-knowledge/node_modules/@tangle-network/agent-evalhaspeer:true, version:0.79.0), but the hoisted actual is 0.91.0 which exceeds the 0.80.0 ceiling. At runtime, agent-knowledge receives agent-eval@0.91.0 types/APIs it was not validated against. No sour
tangletools · 2026-06-14T11:21:23Z · trace
60ce124 to
e0ec4e6
Compare
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — e0ec4e6e
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T12:25:44Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 4 (4 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 239.0s (2 bridge agents) |
| Total | 239.0s |
💰 Value — sound-with-nits
Adds a real-operator profile×persona×market matrix eval that degrades cleanly to the deterministic backtest when no operator URL is present, plus a CI typecheck lane for evals — coherent and in-grain, with only small local refactor opportunities.
- What it does: Extends
evals/src/trading/persona-agent-eval.tsfrom a deterministic backtest-only bridge into a unifiedrunTradingPersonaEvalsurface: whenoperatorUrl(or env) is present it runs arunProfileMatrixcampaign that sweeps operator LLM models (kimi-k2 / glm-4.7 / glm-5.1) against persona×market scenarios, each cell exercising the full operator stack viarunMultishotUserSim; without an oper - Goals it achieves: 1) Give the trading eval a single entry point that works both offline in CI (deterministic backtest) and against live operator infra (real bot artifacts + tick side-effects). 2) Judge real-operator cells with a blend of observable artifact signal and the objective, model-invariant walk-forward backtest ground truth. 3) Prevent TypeScript regressions in the eval code from reaching main via a requir
- Assessment: Good. The change is coherent: it reuses existing primitives (
OperatorClient,runMultishotUserSim,runProfileMatrix, scorecard wiring) rather than inventing parallel infrastructure, and the two-mode degradation is explicit and documented. The CI lane is correctly classified on evals/package/tsconfig changes and wired into the gate. The scoring weights, integrity assertions, and trace/RunRecor - Better / existing approach: No materially better architecture. Two small local improvements: (1)
llmCallWithUsageinevals/src/sim/llm-call.tsduplicates the body ofllmCall; they should share a single internal stream-accumulation helper. (2)agentEnvForModelinevals/src/trading/persona-agent-eval.tsbuilds operator sandbox env from model routing; it would be more reusable if it lived next to the model routing SOT
🎯 Usefulness — sound-with-nits
A coherent consolidation: one entry point runTradingPersonaEval unifies deterministic backtest and real operator-matrix modes, reuses established substrate primitives, and is reachable from the existing bin, full-eval, and CI typecheck lane.
- Integration: Wired and reachable. The bin
evals/src/bin/agent-eval-trading-personas.ts:35callsrunTradingPersonaEval;evals/src/full/full-eval-runner.ts:39calls it withoutoperatorUrlso CI/full-eval runs deterministic mode today;package.json:19exposeseval:trading-personas;.github/workflows/ci.yml:337-354adds theEvals typechecklane and the gate atci.yml:401requires it. Operator-mat - Fit with existing patterns: Fits the codebase's grain. It reuses the single
OperatorClient(evals/src/sim/operator-client.ts:2-16), the existingrunMultishotUserSimcampaign primitive (evals/src/sim/multishot-user-sim.ts:425), and the substraterunProfileMatrixfrom@tangle-network/agent-eval/campaign(confirmed exported and loaded atnode_modules/@tangle-network/agent-eval/package.jsoncampaign export). It con - Real-world viability: Holds up on the non-happy paths that matter. It requires
tokenorprivateKeyand throws a clear error if neither (persona-agent-eval.ts:441-446); defaultsintegrityto'assert'with an env override to'warn'; respectscostCeiling; and marks no-turn cells asfailedrather than scored zero (persona-agent-eval.ts:412-414).maxConcurrencydefaults to 1, which is safe for the operat
💰 Value Audit
🟡 llmCallWithUsage duplicates llmCall body [duplication] ``
evals/src/sim/llm-call.ts:180-221mirrors:137-173almost exactly, only adding usage accumulation onllm_callevents. Better: extract a shared internal helper that always accumulates usage and returns it;llmCallcan simply drop the usage field. This reduces drift and makes future usage-aware callers cheaper.
🟡 agentEnvForModel is local to the trading eval [better-architecture] ``
agentEnvForModelinevals/src/trading/persona-agent-eval.ts:344maps a logical model to the operator sandbox env (OPENCODE_MODEL_PROVIDER,OPENCODE_MODEL_NAME, etc.). The codebase already has model routing SOT insim/llm-call.ts(resolveModel/MODEL_CONFIG) and operator env defaults insim/operator-client.ts(deterministicAgentEnv). MovingagentEnvForModelnext to one of those would let other eval surfaces sweep operator models without copying the env-building logic.
🎯 Usefulness Audit
🟡 PR body describes surface that was consolidated away [ergonomics] ``
The PR body mentions
--matrix,eval:trading-matrix, andevals/src/trading/unified-trading-matrix.ts, but the actual commit (HEADe0ec4e6) folded the operator-matrix capability into the existingpersona-agent-eval.tsand deleted the standalone module/flag/script. The code is better unified, but the stale description could confuse reviewers; align the PR body with the commit message.
🟡 Operator-matrix mode writes scorecard records but skips regression diff [robustness] ``
In operator-matrix mode
persona-agent-eval.ts:527-532appendsRunRecords viarecordRunsToScorecard, yet the summary hardcodesregressed: false(persona-agent-eval.ts:576) and omits the formatted diff. Deterministic mode computes both viarecordScorecardAndDiff(persona-agent-eval.ts:225-230). Consider callingrecordScorecardAndDiffper profile in operator-matrix mode so regression detection works in both modes.
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
…x) + CI lane Follow-up to #169 (model-driven trading) / #170 (deps). ONE entry point — runTradingPersonaEval (evals/src/trading/persona-agent-eval.ts) — that degrades by infra, instead of separate modules: - No operator URL -> DETERMINISTIC mode: the Rust walk-forward backtest -> RunRecords + trace + scorecard (offline; what full-eval/CI run). Unchanged. - operatorUrl present -> OPERATOR-MATRIX mode: runProfileMatrix sweeps the PROFILE axis (operator model variants: kimi-k2/glm-4.7/glm-5.1, pinned into the REAL operator via agentEnv) x (persona x market). Each cell runs the FULL operator simulation (runMultishotUserSim -> real bot_artifacts + tick_side_effects), judged on real artifacts (60%) + objective backtest ground truth (40%) — not prose. Scorecard + assertRealBackend + byProfile/byPersona read straight from the matrix. Multi-round honestly degenerates to 1 (the provision->chat->capture cycle is single-pass; turns live inside each cell). Consolidation: folds the operator-matrix capability INTO the existing bridge file and DELETES the standalone module + the dual --matrix bin flag + the redundant npm script. One surface, one entry point, shared scorecard/profile/ground-truth helpers. The bin auto-degrades by --operator-url; full-eval routes through the same function. CI: new 'Evals typecheck' lane (node 22 + npm ci + tsc -p evals/tsconfig.json), classified on evals/ + package*.json + tsconfig, required in the gate. Deps: agent-runtime ^0.52, agent-knowledge ^1.7 (over #170's ^0.50/^1.5); agent-eval ^0.91. Validated: npm ci clean, tsc 0 errors.
e0ec4e6 to
ae57872
Compare
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — ae578723
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T12:34:22Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 5 (2 medium-concern, 3 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 248.3s (2 bridge agents) |
| Total | 248.3s |
💰 Value — sound-with-nits
Unifies the trading persona eval into a single degrading entry point (deterministic backtest vs. real operator model×persona matrix) and adds a CI typecheck lane — coherent and in the codebase's grain, with minor local duplication that should be shared.
- What it does: The change folds a real-operator profile/persona matrix eval into the existing
evals/src/trading/persona-agent-eval.tsbridge.runTradingPersonaEvalauto-routes: with no operator URL it runs the deterministic Rust walk-forward backtest (runPersonaSuite) and feeds the scorecard; withoperatorUrlit sweeps operator model variants (kimi-k2 / glm-4.7 / glm-5.1) × personas × markets via `runPro - Goals it achieves: From the code, the goals are: (1) replace multiple persona/profile eval surfaces with one entry point that degrades gracefully by available infra, so CI/local runs stay cheap while live operator testing is one flag away; (2) judge real operator behavior on observable state (
bot_artifacts+tick_side_effects) plus an objective backtest, not prose-only rubrics; (3) gate the TypeScript eval code - Assessment: Good change. The unified surface is simpler than the prior separate bridge, the degradation-by-infra design is appropriate for a codebase that runs both in cheap CI and expensive live-operator campaigns, and it reuses existing helpers (
runMultishotUserSim,scoreUserSimArtifact,buildTradingScorecardAgentProfile). The CI lane fills a real gap (no TS eval typecheck was required before). The co - Better / existing approach: A few local simplifications exist, but no materially different architecture is needed. I searched
evals/srcforrunProfileMatrix,scoreUserSimArtifact,llmCall, andllmCallWithUsageand checked the parent commit; no existing equivalent covers the unified matrix, so the change is not reinventing something already present.
🎯 Usefulness — sound-with-nits
A coherent consolidation: one eval entry point that degrades from deterministic CI backtest to a real operator model×persona×market matrix, wired into the bin, full-eval, and a new CI typecheck lane.
- Integration: Reachable. The unified function is called by
evals/src/full/full-eval-runner.ts:9,39(deterministic path) and the binevals/src/bin/agent-eval-trading-personas.ts:35, which is exposed asnpm run eval:trading-personas(package.json:19). The newEvals typecheckCI lane is classified in.github/workflows/ci.yml:98and required in the gate at line 401. (Note: the PR body mentions a `--matr - Fit with existing patterns: Fits the codebase grain. It consumes the substrate primitive
runProfileMatrixfrom@tangle-network/agent-eval/campaign(evals/src/trading/persona-agent-eval.ts:468) exactly as documented in the type definitions (node_modules/@tangle-network/agent-eval/dist/campaign/index.d.ts:945-1045), reusesrunMultishotUserSim/OperatorClient.configureSecretsfor real-operator provisioning, and reuse - Real-world viability: Mostly robust for realistic use. Auth supports token or private-key with refresh; it fails loudly if neither is supplied (
persona-agent-eval.ts:441-446). Concurrency defaults to 1, cost ceilings andassertRealBackend/summarizeBackendIntegrityguards are wired, and the judge marks cellsfailed(not zero) when the operator produces no turns (persona-agent-eval.ts:412-413). One profile-spec
💰 Value Audit
🟠 llmCall and llmCallWithUsage are near-identical implementations [duplication] ``
The two functions in
evals/src/sim/llm-call.tsshare almost the entire stream-consumption loop, differing only in whether they accumulatellm_callusage events (llm-call.ts:137-173vs:180-221). A singlellmCallthat always reports usage (e.g.,usageas an optional field onLlmCallResult) would remove the duplicate code and likely eliminate the per-cellmetaSpendCallworkaround inevals/src/trading/persona-agent-eval.ts:509-515, because the user-sim turns and judge calls would
🟡 Scoring weights are duplicated between scoreUserSimArtifact and userSimJudge [duplication] ``
The composite formula
0.20 intent + 0.15 constraints + 0.40 committed + 0.15 self-improve + 0.10 productiveappears verbatim inevals/src/sim/multishot-user-sim.ts:311-315and again in:357-361. The comment at:294-297explicitly notes they must stay in sync. A shared helper (e.g.,computeArtifactComposite(rubric, state)) would remove the sync risk.
🟡 Provider mapping in agentEnvForModel is inferred from baseUrl string [better-architecture] ``
In
evals/src/trading/persona-agent-eval.ts:347the provider is decided bycfg.baseUrl.includes('moonshot'). The single source of truth for models isMODEL_CONFIGinevals/src/sim/llm-call.ts:56-75; adding aproviderfield there would letagentEnvForModelread it directly and avoid a fragile string match if a new provider endpoint is added.
🎯 Usefulness Audit
🟠 kimi-k2 profile may not resolve in the real operator sidecar [integration] ``
Evidence:
agentEnvForModel(evals/src/trading/persona-agent-eval.ts:352-358) setsOPENCODE_MODEL_PROVIDER: 'moonshot'andOPENCODE_MODEL: 'moonshot/kimi-k2.6'for thekimi-k2profile, withSIDECAR_DEFAULT_HARNESS: 'opencode'. However, the sidecar opencode config written during activation (trading-blueprint-lib/src/jobs/activate.rs:122-163) only declareszai-coding-planandopenrouterproviders, and the arena's provider table (arena/src/lib/config/aiProviders.ts) also lacks `m
🟡 failOnRegression is exposed but ignored in operator-matrix mode [ergonomics] ``
Evidence:
TradingPersonaEvalOptions.failOnRegressionis in the shared options block (evals/src/trading/persona-agent-eval.ts:87) and is honored in deterministic mode (persona-agent-eval.ts:232-236), but operator-matrix mode appends to the scorecard and returnsregressed: falseunconditionally (persona-agent-eval.ts:576) without checking the option or runningdiffScorecard. Either wire the same regression throw in operator-matrix mode, or movefailOnRegressionout of the shared type
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
Follow-up to #169 (model-driven trading — merged) and #170 (deps bump — merged).
One unified trading eval
Collapses the three earlier persona/profile surfaces into a single
runUnifiedTradingMatrix(evals/src/trading/unified-trading-matrix.ts):runProfileMatrixspine. PROFILE axis = operator model variants (kimi-k2 / glm-4.7 / glm-5.1, pinned into the real operator viaagentEnv); SCENARIO axis = personas × market scenarios.runMultishotUserSimagainst the liveOperatorClient→ realbot_artifacts+tick_side_effects.assertRealBackendguard +byProfile/byPersonaaggregation preserved.--matrix, one scripteval:trading-matrix.CI evals typecheck lane
New
Evals typecheckjob (node 22 +npm ci+tsc -p evals/tsconfig.json --noEmit), classified onevals/+package*.json+tsconfigchanges and required in the CI gate — so the eval/matrix work is gated going forward (previously there was no TypeScript lane).Deps
agent-runtime
^0.50 → ^0.52(over #170), agent-knowledge^1.5 → ^1.7; agent-eval stays^0.91. BringsrunPersonaDispatch/runPersonified/loopUntil/runProfileMatrix(the persona-matrix + multi-round substrate).Validated locally:
npm ciclean,tsc -p evals/tsconfig.json0 errors,ci.ymlvalid YAML.