Skip to content

feat(evals): unified trading matrix eval + CI evals typecheck lane#171

Merged
drewstone merged 1 commit into
mainfrom
feat/unified-trading-eval
Jun 14, 2026
Merged

feat(evals): unified trading matrix eval + CI evals typecheck lane#171
drewstone merged 1 commit into
mainfrom
feat/unified-trading-eval

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Follow-up to #169 (model-driven trading — merged) and #170 (deps bump — merged).

One unified trading eval

Collapses the three earlier persona/profile surfaces into a single runUnifiedTradingMatrix (evals/src/trading/unified-trading-matrix.ts):

  • runProfileMatrix spine. PROFILE axis = operator model variants (kimi-k2 / glm-4.7 / glm-5.1, pinned into the real operator via agentEnv); SCENARIO axis = personas × market scenarios.
  • Each cell = the full operator simulation, not a prose stand-in: runMultishotUserSim against the live OperatorClient → real bot_artifacts + tick_side_effects.
  • Judged on real signal: artifact score (60%) + objective walk-forward backtest ground truth (40%). No prose-only scoring.
  • Scorecard timeline + assertRealBackend guard + byProfile/byPersona aggregation preserved.
  • Multi-round honestly degenerates to 1 round (provision→chat→capture is single-pass; the turns live inside each cell) — documented in the header.
  • One bin flag --matrix, one script eval:trading-matrix.

CI evals typecheck lane

New Evals typecheck job (node 22 + npm ci + tsc -p evals/tsconfig.json --noEmit), classified on evals/ + package*.json + tsconfig changes and required in the CI gate — so the eval/matrix work is gated going forward (previously there was no TypeScript lane).

Deps

agent-runtime ^0.50 → ^0.52 (over #170), agent-knowledge ^1.5 → ^1.7; agent-eval stays ^0.91. Brings runPersonaDispatch / runPersonified / loopUntil / runProfileMatrix (the persona-matrix + multi-round substrate).

Validated locally: npm ci clean, tsc -p evals/tsconfig.json 0 errors, ci.yml valid YAML.

tangletools
tangletools previously approved these changes Jun 14, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 60ce124b

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T10:59:05Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 3 (1 medium-concern, 1 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 216.3s (2 bridge agents)
Total 216.3s

💰 Value — sound-with-nits

Adds a unified profile×(persona×market) trading matrix eval that drives the real operator stack and scores cells on observable artifacts plus objective walk-forward backtests, wires it into the existing eval CLI behind --matrix, and adds a CI typecheck lane for evals; good change with a small local

  • What it does: Introduces evals/src/trading/unified-trading-matrix.ts, which runs a runProfileMatrix-spined campaign sweeping operator LLM model variants (kimi-k2 / glm-4.7 / glm-5.1) against personas × trading market scenarios. Each cell provisions a real bot via runMultishotUserSim, pins the in-sandbox operator model via agentEnv, and scores the result as 60% artifact-based signal (scoreUserSimArtifact) + 40%
  • Goals it achieves: Unify the trading eval around one real-operator integration surface that judges on genuine state changes (bot_artifacts / tick_side_effects) plus an objective backtest anchor, rather than prose-only scoring; enable comparative selection of which LLM model runs the operator best across personas and market regimes; ensure the eval TypeScript stays green going forward with a CI typecheck gate; preser
  • Assessment: The change is coherent and in-grain. It consumes the agent-eval campaign substrate (runProfileMatrix, recordRunsToScorecard, assertRealBackend) correctly, reuses the existing multishot user-sim and walk-forward backtest modules, maintains back-compat for existing callers, and documents its design trade-offs honestly (e.g., multi-round degenerating to one round for the real-operator dispatch). The
  • Better / existing approach: No existing equivalent for the unified trading matrix exists in this repo (grep for persona-profile-matrix / persona-conversation-matrix / persona-decision-loop surfaces only returns the new file), and runProfileMatrix from @tangle-network/agent-eval/campaign is the right substrate. The only materially simpler design available is local: llmCallWithUsage in evals/src/sim/llm-call.ts duplicates ~80%

🎯 Usefulness — sound-with-nits

A coherent, reachable real-operator matrix eval and a required CI typecheck lane; the only real risk is that the default kimi-k2 profile may not be runnable by the current opencode sidecar provider wiring.

  • Integration: Yes. package.json:20 adds eval:trading-matrix, which runs dist/evals/bin/agent-eval-trading-personas.js with TRADING_UNIFIED_MATRIX=1. evals/src/bin/agent-eval-trading-personas.ts:26-46 branches on --matrix/TRADING_UNIFIED_MATRIX to runUnifiedTradingMatrix, while keeping the existing eval:trading-personas back-compat path. .github/workflows/ci.yml:337-354 adds a classified `Eva
  • Fit with existing patterns: It fits the existing eval substrate: it drives @tangle-network/agent-eval/campaign's runProfileMatrix (evals/src/trading/unified-trading-matrix.ts:65, :388), reuses the OperatorClient provision/configure/chat flow already used by lifecycle/research/robustness evals, and writes to the same scorecard JSONL pattern used by persona-agent-eval.ts and scorecard-integration.ts (`unified-tradi
  • Real-world viability: The heavy integration path is honest about its needs: it validates operatorUrl, token/privateKey, and unknown/missing model keys up front (unified-trading-matrix.ts:347-361), serializes cells by default (maxConcurrency: 1), and exits non-zero on stub runs or no winner (agent-eval-trading-personas.ts:48-51). The main real-world gap is the default profile list includes kimi-k2 as an operat

🔎 Heuristic Signals

🟡 Cruft: console debug added evals/src/bin/agent-eval-trading-personas.ts

  • console.log(JSON.stringify(summary, null, 2))

💰 Value Audit

🟡 llmCallWithUsage duplicates the core llmCall stream loop [better-architecture] ``

evals/src/sim/llm-call.ts:137-173 and :180-221 share identical backend setup, AbortController/timer handling, stream consumption, and error-path return logic. The only difference is accumulating llm_call event usage. Refactoring to a shared inner helper, or extending LlmCallResult with optional usage so llmCall always accumulates it, would eliminate the duplicate code and prevent divergence in timeout/abort/error behavior.

🎯 Usefulness Audit

🟠 Default kimi-k2 profile may be unsupported by the operator sidecar [integration] ``

UNIFIED_PROFILE_MODELS defaults to ['kimi-k2', 'glm-4.7', 'glm-5.1'] (evals/src/trading/unified-trading-matrix.ts:94), and agentEnvForModel sets OPENCODE_MODEL_PROVIDER: 'moonshot' for that model (:145-146). Repo evidence of sidecar-supported providers is limited to anthropic, zai-coding-plan, openrouter, and gemini (trading-instance-blueprint-lib/src/operator_api.rs:2531-2545; arena/src/lib/config/aiProviders.ts:1; evals/src/sim/operator-client.ts:498-516). If opencode


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T110506Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 60ce124b

Readiness 44/100 · Confidence 80/100 · 17 findings (5 medium, 12 low)

deepseek glm aggregate
Readiness 44 60 44
Confidence 80 80 80
Correctness 44 60 44
Security 44 60 44
Testing 44 60 44
Architecture 44 60 44

Full multi-shot audit completed 4/4 planned shots over 7 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 7 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM llmCallWithUsage duplicates core stream loop from llmCall (~45 lines) — evals/src/sim/llm-call.ts

llmCallWithUsage (lines 180-221) is a near-identical copy of llmCall (lines 137-173) differing only in accumulating usage from llm_call stream events and returning LlmCallUsageResult. Every aspect — abort controller wiring, text_delta accumulation, backend_error handling, try/catch/finally, timeout clearance — is duplicated. Any bug fix applied to one must be manually propagated to the other. Extract a shared core (e.g., async function llmCallCore(opts, onEvent?): Promise<...>) that both callers compose.

🟠 MEDIUM No test coverage for new exports (4 files, 0 tests) — evals/src/sim/multishot-user-sim.ts

scoreUserSimArtifact (lines 304-328) is a new exported scoring function used by unifiedJudge to produce composite scores that feed profile rankings and promotion decisions. None of the four files in this shot have corresponding test files. The composite scoring formula — 0.20intent + 0.15constraints + 0.40committed + 0.15self_improvement + 0.10*conversation — is untested against known edge cases (empty artifacts, missing bot_artifacts, prose-only sessions). Similarly, agentEnvForModel, buildUnifiedScenarios, llmCallWithUsage, and the unified judge all lack tests. At minimum, unit tests for scoring edge cases and a smoke test for the matri

🟠 MEDIUM 'warn' integrity mode may still throw on stubs at post-campaign check — evals/src/trading/unified-trading-matrix.ts

Lines 502-505: (options.integrity ?? 'assert') === 'off' ? summarizeBackendIntegrity(result.records) : assertRealBackend(result.records, { allowMixed: true }). When options.integrity === 'warn', the condition 'warn' === 'off' is false, so assertRealBackend is called regardless. The comment at CLI bin line 44 says 'warn' lets a dry run proceed without LLM keys — but if assertRealBackend with allowMixed:true throws on stub records, the warn path crashes identically to asser

🟠 MEDIUM moonshot provider name is unverified — all other code uses zai-coding-plan exclusively — evals/src/trading/unified-trading-matrix.ts

Lines ~113-120: agentEnvForModel sets OPENCODE_MODEL_PROVIDER:'moonshot' for kimi-k2, but every other code path in the codebase (deterministicAgentEnv in operator-client.ts, chat-code-strategy-runner.ts, chat-sandbox-runner.ts, chat-mcp-strategy-runner.ts) hardcodes 'zai-coding-plan'. The sidecar constructs the opencode provider config from OPENCODE_MODEL_PROVIDER — if it doesn't know 'moonshot', the in-sandbox agent won't start and the bot never replies. This would silently break the entire kimi-k2 profile axis (1 of 3 profiles). The OPENCODE_MODEL_BASE_URL env is also new (not set by deterministicAgentEnv) — may or may not be recognized by the sidecar. Fix: verify the sidecar supports 'moonshot' as a provider name before relying on it, or route Moonshot through the same generic OpenAI-co

🟠 MEDIUM unifiedJudge relies on failed:true to exclude non-responding cells, but substrate does not honor it — evals/src/trading/unified-trading-matrix.ts

Lines ~186-192: unifiedJudge().score() returns {composite:0, failed:true} when operatorResponded is false, with comment 'cell failed (not scored as zero)'. But the substrate's cellComposite in @tangle-network/agent-eval (run-profile-matrix.ts) computes mean of ALL judgeScores without checking the 'failed' flag — confirmed by source inspection ('failed' does not appear in that file). The zero IS folded into the profile mean and pass rate, contradicting the judge's documented intent. Impact: a profile whose cells fail (e.g. kimi-k2 if the moonshot provider misconfigures) gets unfairly dragged down by zeros that should be excluded, skewing the model comparison the matrix exists to answer. Fix: either filter failed scores in the matrix's own meanScore/passRate computation, or remove the failed

🟡 LOW Broad evals trigger: root package.json/package-lock.json changes trigger evals lane regardless of relevance — .github/workflows/ci.yml

Lines 97-100: The case pattern package.json|package-lock.json|tsconfig.json|tsconfig.*.json matches ANY change to root package.json or package-lock.json, even if the change is unrelated to evals/ (e.g., adding an arena-only dep). Root package.json scripts show this is a workspace root with deps shared across arena, sdk-ts, and evals. The evals job only runs tsc --noEmit, so it won't fail spuriously — this is wasteful CI cycles, not a correctness bug. Consider narrowing to evals/*|package.json only when evals-related fields change, or accept the harmless over-trigger.

🟡 LOW Bypasses existing npm typecheck:evals script — .github/workflows/ci.yml

Line 354: Uses npx tsc -p evals/tsconfig.json --noEmit directly. Root package.json line 24 already defines "typecheck:evals": "tsc -p evals/tsconfig.json --noEmit". Using npm run typecheck:evals instead of the raw npx command would keep the CI command in sync with the npm script if the tsconfig path or flags ever change.

🟡 LOW Dead patterns: tsconfig.json and tsconfig.*.json match no files in this repo — .github/workflows/ci.yml

Lines 98: Both tsconfig.json and tsconfig.*.json are unqualified (no directory prefix). They match only at repo root. This repo has no root-level tsconfig.json or tsconfig.*.json (tsconfig files live under evals/, arena/, sdk-ts/). These are forward-looking patterns that currently match nothing. If a root tsconfig is added later, it would trigger evals for changes that may only affect arena or sdk-ts.

🟡 LOW Dead root tsconfig change-detection patterns — .github/workflows/ci.yml

Lines 98: the case patterns tsconfig.json|tsconfig.*.json match only root-level files, but no root tsconfig.json exists (only evals/tsconfig.json, arena/tsconfig.json, sdk-ts/tsconfig.json do). evals/* already catches evals/tsconfig.json. These patterns are harmless and defensive (would trigger evals lane if a shared root tsconfig is added later), but are currently dead. Not blocking.

🟡 LOW Inlines tsc command instead of reusing typecheck:evals script — .github/workflows/ci.yml

Line 354: npx tsc -p evals/tsconfig.json --noEmit duplicates the root typecheck:evals npm script (tsc -p evals/tsconfig.json --noEmit). Using npm run typecheck:evals would keep the CI invocation as the single source of truth for the typecheck command. Functionally equivalent; cosmetic consistency only.

🟡 LOW No tests for scoreUserSimArtifact extraction or unified matrix construction — evals/src/sim/multishot-user-sim.ts

scoreUserSimArtifact is extracted from userSimJudge with identical weights — a clean factoring, but the duplication (same 5-term composite formula now in two places) means a future change to one without the other silently diverges. No unit test pins the weights or verifies the two functions produce identical scores for the same input. Similarly, unified-trading-matrix.ts (606 lines, the core new eval) has no test file. The matrix needs live infra to run end-to-end, but the pure functions (buildUnifiedProfiles, buildUnifiedScenarios, modelOfProfile, agentEnvForModel, firstArtifact, campaignTokenUsage, scoreOf, meanScore) are all unit-testable.

🟡 LOW agentEnvForModel provider detection uses fragile URL string match — evals/src/trading/unified-trading-matrix.ts

Line 145: const isMoonshot = cfg.baseUrl.includes('moonshot'). Mapping model → opencode provider name via URL substring is brittle: if the Moonshot API baseUrl changes (e.g., regional endpoint), the provider name flips to 'zai-coding-plan' and the opencode sidecar receives wrong credentials. The MODEL_CONFIG in llm-call.ts (line 56) already centralizes per-model routing — adding a providerName: 'moonshot' | 'zai-coding-plan' field there and reading it directly in agentEnvForModel would

🟡 LOW metaSpendCall adds real LLM cost per cell and can cause false-negative integrity on transient failures — evals/src/trading/unified-trading-matrix.ts

Lines ~300-320: Each cell makes an extra LLM call (metaSpendCall) purely for token accounting, since the inner multishot campaign's LLM spend happens inside the sandbox (invisible to the eval). If this grounding call fails (rate limit, timeout, network), llmCallWithUsage returns usage:{input:0,output:0,costUsd:0}. Combined with the inner campaign's zero tokenUsage, the cell reports zero tokens total, and assertRealBackend flags the entire matrix as 'stub', failing loud even though the real operator ran fine. The comment says this is 'HONEST' but doesn't degrade gracefully. Fix: treat a failed grounding call as a warning, not a zero — or use a minimum token floor from the session transcript length to avoid a hard integrity failure.

🟡 LOW metaSpendCall fires for sessions with zero operator turns — evals/src/trading/unified-trading-matrix.ts

firstArtifact (line 546) checks cell.artifact && cell.artifact.turns, but JS arrays are always truthy (including empty []). This means even a session with zero turns returns operatorResponded: false but still triggers metaSpendCall (line 447) — an LLM call feeding '(no turns)' as the transcript. The call costs tokens for no useful signal and the combined judge already marks the cell as failed. Short-circuit metaSpendCall when session.turns.length === 0 to avoid this waste. The co

🟡 LOW Residual agent-eval version split (0.91 top-level vs 0.79 nested under agent-knowledge) — package-lock.json

agent-knowledge 1.7.0 hard-pins agent-eval to >=0.77.0 <0.80.0 (lockfile line 197), incompatible with the top-level ^0.91.0. npm nests agent-eval 0.79.0 under node_modules/@tangle-network/agent-knowledge/ (lockfile line 208). This is a 12-version gap. Impact is low: agent-knowledge is NOT imported by any eval TypeScript (verified via rg — only referenced in Rust activate.rs as a sandbox CLI tool and in prompt templates), so the eval pipeline uses only the top-level 0.91.0. This split existed before (was 0.91 vs 0.42, a 49-version gap) and t

🟡 LOW agent-knowledge@1.7.0 peer dep on agent-eval<0.80 untested against project agent-eval@0.91.0 — package-lock.json

agent-knowledge@1.7.0 declares peerDependency on @tangle-network/agent-eval with range >=0.77.0 <0.80.0, but the root project ships agent-eval@0.91.0. npm resolved this by nesting agent-eval@0.79.0 under agent-knowledge/node_modules/ — correct from a resolution standpoint, but this version combination (agent-knowledge 1.7 + agent-eval 0.91) was never tested by the agent-knowledge release. If project code passes agent-eval instances or types from root (0.91) into agent-knowledge APIs expecting types from 0.79, runtime errors or type mismatches could surface. Mitigation: the grep confirms agent-knowledge is consumed as a CLI/data package, not a library API, so blast radius is low. Fix: either bump agent-knowledge to a version that accepts agent-eval >=0.83 (aligning with agent-runtime@0.52 p

🟡 LOW agent-knowledge@1.7.0 peer-dep semver gap with agent-eval@0.91.0 — package.json

Line 41: @tangle-network/agent-knowledge bumped to ^1.7.0. This version declares @tangle-network/agent-eval: >=0.77.0 <0.80.0 as a regular dependency, but the project root supplies agent-eval@0.91.0 (line 40). The lock file resolves it via npm peer hoisting (node_modules/@tangle-network/agent-knowledge/node_modules/@tangle-network/agent-eval has peer:true, version:0.79.0), but the hoisted actual is 0.91.0 which exceeds the 0.80.0 ceiling. At runtime, agent-knowledge receives agent-eval@0.91.0 types/APIs it was not validated against. No sour


tangletools · 2026-06-14T11:21:23Z · trace

tangletools
tangletools previously approved these changes Jun 14, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — e0ec4e6e

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T12:25:44Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 4 (4 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 239.0s (2 bridge agents)
Total 239.0s

💰 Value — sound-with-nits

Adds a real-operator profile×persona×market matrix eval that degrades cleanly to the deterministic backtest when no operator URL is present, plus a CI typecheck lane for evals — coherent and in-grain, with only small local refactor opportunities.

  • What it does: Extends evals/src/trading/persona-agent-eval.ts from a deterministic backtest-only bridge into a unified runTradingPersonaEval surface: when operatorUrl (or env) is present it runs a runProfileMatrix campaign that sweeps operator LLM models (kimi-k2 / glm-4.7 / glm-5.1) against persona×market scenarios, each cell exercising the full operator stack via runMultishotUserSim; without an oper
  • Goals it achieves: 1) Give the trading eval a single entry point that works both offline in CI (deterministic backtest) and against live operator infra (real bot artifacts + tick side-effects). 2) Judge real-operator cells with a blend of observable artifact signal and the objective, model-invariant walk-forward backtest ground truth. 3) Prevent TypeScript regressions in the eval code from reaching main via a requir
  • Assessment: Good. The change is coherent: it reuses existing primitives (OperatorClient, runMultishotUserSim, runProfileMatrix, scorecard wiring) rather than inventing parallel infrastructure, and the two-mode degradation is explicit and documented. The CI lane is correctly classified on evals/package/tsconfig changes and wired into the gate. The scoring weights, integrity assertions, and trace/RunRecor
  • Better / existing approach: No materially better architecture. Two small local improvements: (1) llmCallWithUsage in evals/src/sim/llm-call.ts duplicates the body of llmCall; they should share a single internal stream-accumulation helper. (2) agentEnvForModel in evals/src/trading/persona-agent-eval.ts builds operator sandbox env from model routing; it would be more reusable if it lived next to the model routing SOT

🎯 Usefulness — sound-with-nits

A coherent consolidation: one entry point runTradingPersonaEval unifies deterministic backtest and real operator-matrix modes, reuses established substrate primitives, and is reachable from the existing bin, full-eval, and CI typecheck lane.

  • Integration: Wired and reachable. The bin evals/src/bin/agent-eval-trading-personas.ts:35 calls runTradingPersonaEval; evals/src/full/full-eval-runner.ts:39 calls it without operatorUrl so CI/full-eval runs deterministic mode today; package.json:19 exposes eval:trading-personas; .github/workflows/ci.yml:337-354 adds the Evals typecheck lane and the gate at ci.yml:401 requires it. Operator-mat
  • Fit with existing patterns: Fits the codebase's grain. It reuses the single OperatorClient (evals/src/sim/operator-client.ts:2-16), the existing runMultishotUserSim campaign primitive (evals/src/sim/multishot-user-sim.ts:425), and the substrate runProfileMatrix from @tangle-network/agent-eval/campaign (confirmed exported and loaded at node_modules/@tangle-network/agent-eval/package.json campaign export). It con
  • Real-world viability: Holds up on the non-happy paths that matter. It requires token or privateKey and throws a clear error if neither (persona-agent-eval.ts:441-446); defaults integrity to 'assert' with an env override to 'warn'; respects costCeiling; and marks no-turn cells as failed rather than scored zero (persona-agent-eval.ts:412-414). maxConcurrency defaults to 1, which is safe for the operat

💰 Value Audit

🟡 llmCallWithUsage duplicates llmCall body [duplication] ``

evals/src/sim/llm-call.ts:180-221 mirrors :137-173 almost exactly, only adding usage accumulation on llm_call events. Better: extract a shared internal helper that always accumulates usage and returns it; llmCall can simply drop the usage field. This reduces drift and makes future usage-aware callers cheaper.

🟡 agentEnvForModel is local to the trading eval [better-architecture] ``

agentEnvForModel in evals/src/trading/persona-agent-eval.ts:344 maps a logical model to the operator sandbox env (OPENCODE_MODEL_PROVIDER, OPENCODE_MODEL_NAME, etc.). The codebase already has model routing SOT in sim/llm-call.ts (resolveModel/MODEL_CONFIG) and operator env defaults in sim/operator-client.ts (deterministicAgentEnv). Moving agentEnvForModel next to one of those would let other eval surfaces sweep operator models without copying the env-building logic.

🎯 Usefulness Audit

🟡 PR body describes surface that was consolidated away [ergonomics] ``

The PR body mentions --matrix, eval:trading-matrix, and evals/src/trading/unified-trading-matrix.ts, but the actual commit (HEAD e0ec4e6) folded the operator-matrix capability into the existing persona-agent-eval.ts and deleted the standalone module/flag/script. The code is better unified, but the stale description could confuse reviewers; align the PR body with the commit message.

🟡 Operator-matrix mode writes scorecard records but skips regression diff [robustness] ``

In operator-matrix mode persona-agent-eval.ts:527-532 appends RunRecords via recordRunsToScorecard, yet the summary hardcodes regressed: false (persona-agent-eval.ts:576) and omits the formatted diff. Deterministic mode computes both via recordScorecardAndDiff (persona-agent-eval.ts:225-230). Consider calling recordScorecardAndDiff per profile in operator-matrix mode so regression detection works in both modes.


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T123327Z

…x) + CI lane

Follow-up to #169 (model-driven trading) / #170 (deps). ONE entry point —
runTradingPersonaEval (evals/src/trading/persona-agent-eval.ts) — that degrades
by infra, instead of separate modules:

- No operator URL -> DETERMINISTIC mode: the Rust walk-forward backtest ->
  RunRecords + trace + scorecard (offline; what full-eval/CI run). Unchanged.
- operatorUrl present -> OPERATOR-MATRIX mode: runProfileMatrix sweeps the
  PROFILE axis (operator model variants: kimi-k2/glm-4.7/glm-5.1, pinned into the
  REAL operator via agentEnv) x (persona x market). Each cell runs the FULL
  operator simulation (runMultishotUserSim -> real bot_artifacts +
  tick_side_effects), judged on real artifacts (60%) + objective backtest ground
  truth (40%) — not prose. Scorecard + assertRealBackend + byProfile/byPersona
  read straight from the matrix. Multi-round honestly degenerates to 1 (the
  provision->chat->capture cycle is single-pass; turns live inside each cell).

Consolidation: folds the operator-matrix capability INTO the existing bridge file
and DELETES the standalone module + the dual --matrix bin flag + the redundant
npm script. One surface, one entry point, shared scorecard/profile/ground-truth
helpers. The bin auto-degrades by --operator-url; full-eval routes through the
same function.

CI: new 'Evals typecheck' lane (node 22 + npm ci + tsc -p evals/tsconfig.json),
classified on evals/ + package*.json + tsconfig, required in the gate.

Deps: agent-runtime ^0.52, agent-knowledge ^1.7 (over #170's ^0.50/^1.5);
agent-eval ^0.91. Validated: npm ci clean, tsc 0 errors.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — ae578723

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T12:34:22Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 5 (2 medium-concern, 3 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 248.3s (2 bridge agents)
Total 248.3s

💰 Value — sound-with-nits

Unifies the trading persona eval into a single degrading entry point (deterministic backtest vs. real operator model×persona matrix) and adds a CI typecheck lane — coherent and in the codebase's grain, with minor local duplication that should be shared.

  • What it does: The change folds a real-operator profile/persona matrix eval into the existing evals/src/trading/persona-agent-eval.ts bridge. runTradingPersonaEval auto-routes: with no operator URL it runs the deterministic Rust walk-forward backtest (runPersonaSuite) and feeds the scorecard; with operatorUrl it sweeps operator model variants (kimi-k2 / glm-4.7 / glm-5.1) × personas × markets via `runPro
  • Goals it achieves: From the code, the goals are: (1) replace multiple persona/profile eval surfaces with one entry point that degrades gracefully by available infra, so CI/local runs stay cheap while live operator testing is one flag away; (2) judge real operator behavior on observable state (bot_artifacts + tick_side_effects) plus an objective backtest, not prose-only rubrics; (3) gate the TypeScript eval code
  • Assessment: Good change. The unified surface is simpler than the prior separate bridge, the degradation-by-infra design is appropriate for a codebase that runs both in cheap CI and expensive live-operator campaigns, and it reuses existing helpers (runMultishotUserSim, scoreUserSimArtifact, buildTradingScorecardAgentProfile). The CI lane fills a real gap (no TS eval typecheck was required before). The co
  • Better / existing approach: A few local simplifications exist, but no materially different architecture is needed. I searched evals/src for runProfileMatrix, scoreUserSimArtifact, llmCall, and llmCallWithUsage and checked the parent commit; no existing equivalent covers the unified matrix, so the change is not reinventing something already present.

🎯 Usefulness — sound-with-nits

A coherent consolidation: one eval entry point that degrades from deterministic CI backtest to a real operator model×persona×market matrix, wired into the bin, full-eval, and a new CI typecheck lane.

  • Integration: Reachable. The unified function is called by evals/src/full/full-eval-runner.ts:9,39 (deterministic path) and the bin evals/src/bin/agent-eval-trading-personas.ts:35, which is exposed as npm run eval:trading-personas (package.json:19). The new Evals typecheck CI lane is classified in .github/workflows/ci.yml:98 and required in the gate at line 401. (Note: the PR body mentions a `--matr
  • Fit with existing patterns: Fits the codebase grain. It consumes the substrate primitive runProfileMatrix from @tangle-network/agent-eval/campaign (evals/src/trading/persona-agent-eval.ts:468) exactly as documented in the type definitions (node_modules/@tangle-network/agent-eval/dist/campaign/index.d.ts:945-1045), reuses runMultishotUserSim/OperatorClient.configureSecrets for real-operator provisioning, and reuse
  • Real-world viability: Mostly robust for realistic use. Auth supports token or private-key with refresh; it fails loudly if neither is supplied (persona-agent-eval.ts:441-446). Concurrency defaults to 1, cost ceilings and assertRealBackend/summarizeBackendIntegrity guards are wired, and the judge marks cells failed (not zero) when the operator produces no turns (persona-agent-eval.ts:412-413). One profile-spec

💰 Value Audit

🟠 llmCall and llmCallWithUsage are near-identical implementations [duplication] ``

The two functions in evals/src/sim/llm-call.ts share almost the entire stream-consumption loop, differing only in whether they accumulate llm_call usage events (llm-call.ts:137-173 vs :180-221). A single llmCall that always reports usage (e.g., usage as an optional field on LlmCallResult) would remove the duplicate code and likely eliminate the per-cell metaSpendCall workaround in evals/src/trading/persona-agent-eval.ts:509-515, because the user-sim turns and judge calls would

🟡 Scoring weights are duplicated between scoreUserSimArtifact and userSimJudge [duplication] ``

The composite formula 0.20 intent + 0.15 constraints + 0.40 committed + 0.15 self-improve + 0.10 productive appears verbatim in evals/src/sim/multishot-user-sim.ts:311-315 and again in :357-361. The comment at :294-297 explicitly notes they must stay in sync. A shared helper (e.g., computeArtifactComposite(rubric, state)) would remove the sync risk.

🟡 Provider mapping in agentEnvForModel is inferred from baseUrl string [better-architecture] ``

In evals/src/trading/persona-agent-eval.ts:347 the provider is decided by cfg.baseUrl.includes('moonshot'). The single source of truth for models is MODEL_CONFIG in evals/src/sim/llm-call.ts:56-75; adding a provider field there would let agentEnvForModel read it directly and avoid a fragile string match if a new provider endpoint is added.

🎯 Usefulness Audit

🟠 kimi-k2 profile may not resolve in the real operator sidecar [integration] ``

Evidence: agentEnvForModel (evals/src/trading/persona-agent-eval.ts:352-358) sets OPENCODE_MODEL_PROVIDER: 'moonshot' and OPENCODE_MODEL: 'moonshot/kimi-k2.6' for the kimi-k2 profile, with SIDECAR_DEFAULT_HARNESS: 'opencode'. However, the sidecar opencode config written during activation (trading-blueprint-lib/src/jobs/activate.rs:122-163) only declares zai-coding-plan and openrouter providers, and the arena's provider table (arena/src/lib/config/aiProviders.ts) also lacks `m

🟡 failOnRegression is exposed but ignored in operator-matrix mode [ergonomics] ``

Evidence: TradingPersonaEvalOptions.failOnRegression is in the shared options block (evals/src/trading/persona-agent-eval.ts:87) and is honored in deterministic mode (persona-agent-eval.ts:232-236), but operator-matrix mode appends to the scorecard and returns regressed: false unconditionally (persona-agent-eval.ts:576) without checking the option or running diffScorecard. Either wire the same regression throw in operator-matrix mode, or move failOnRegression out of the shared type


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T124024Z

@drewstone drewstone merged commit 731aeab into main Jun 14, 2026
14 checks passed
@drewstone drewstone deleted the feat/unified-trading-eval branch June 14, 2026 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants