chore(deps): bump agent-runtime ^0.50 + agent-eval ^0.91 #170
tangletools
left a comment
✅ Auto-approved PR — d41e69ad
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T00:53:09Z
tangletools
left a comment
🟠 Value Audit — better-approach-exists
| | |
|---|---|
| Verdict | better-approach-exists |
| Concerns | 1 (1 strong-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 143.6s (2 bridge agents) |
| Total | 143.6s |
💰 Value — sound
Bumps the root evals package to canonical Tangle substrate versions (^0.91 agent-eval, ^0.50 agent-runtime) and keeps the existing single-point runtime abstraction intact; typecheck:evals passes.
- What it does: Updates `package.json:38-41` and `package-lock.json` to pin `@tangle-network/agent-eval` to `^0.91.0` (from `^0.70.0`) and `@tangle-network/agent-runtime` to `^0.50.0` (from `^0.36.0`). The lockfile now resolves agent-eval 0.91.0 and agent-runtime 0.50.0, matching the newer peer-dependency ranges agent-runtime declares (`@tangle-network/agent-eval >=0.83.0 <1.0.0`, `@tangle-network/sandbox >=0.1.2 <0.…`
- Goals it achieves: Closes the 14-minor fleet skew on agent-runtime and brings the eval harness onto the current substrate line so it receives upstream runtime/eval fixes, features, and compatible peer ranges. The repo's eval import surface (`createOpenAICompatibleBackend`, `runAgentTaskStream`, `AgentTaskSpec` from agent-runtime; campaign/trace APIs from agent-eval) is preserved as non-breaking across this jump.
- Assessment: Good change. It is scoped exactly to the package that consumes these dependencies (only `./package.json` references them; `arena/package.json`, `sdk-ts/package.json`, and the CJS tool package do not). The evals already centralize agent-runtime usage behind `evals/src/sim/llm-call.ts:33-34`, so the bump touches a single abstraction boundary rather than scattered call sites. Verification held: after `npm …`
- Better / existing approach: none; this is the right approach. Searched the workspace for `package.json` files and agent-runtime/agent-eval imports: only the root package depends on these libraries, and only `evals/src` imports them. There is no duplicate dependency set to consolidate and no alternative abstraction already present that should absorb this bump. A workspace-wide bump would be wrong because the other packages do no…
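The range claims above can be checked by hand. The sketch below is a minimal illustration, not the npm `semver` package: every helper name in it is hypothetical, and it handles only plain `x.y.z` versions. It relies on the standard npm rule that a caret range on a `0.x` version stops at the next minor, which is why the old `^0.36.0` pin could never have resolved to 0.50.0 without an explicit bump.

```typescript
// Hypothetical helpers for illustration only; real tooling should use the `semver` package.
type Version = [number, number, number];

// Parse a plain "x.y.z" string; pre-release and build tags are deliberately rejected.
function parseVersion(text: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(text);
  if (match === null) {
    throw new Error(`not a plain x.y.z version: ${text}`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Returns -1, 0, or 1, comparing major, then minor, then patch.
function compareVersions(a: Version, b: Version): number {
  const [aMajor, aMinor, aPatch] = a;
  const [bMajor, bMinor, bPatch] = b;
  if (aMajor !== bMajor) return aMajor < bMajor ? -1 : 1;
  if (aMinor !== bMinor) return aMinor < bMinor ? -1 : 1;
  if (aPatch !== bPatch) return aPatch < bPatch ? -1 : 1;
  return 0;
}

// Exclusive upper bound of a caret range: bump the left-most non-zero component.
function caretUpperBound(base: Version): Version {
  const [major, minor, patch] = base;
  if (major > 0) return [major + 1, 0, 0];
  if (minor > 0) return [0, minor + 1, 0];
  return [0, 0, patch + 1];
}

// True when `version` falls inside `^caretBase`.
function satisfiesCaret(version: string, caretBase: string): boolean {
  const v = parseVersion(version);
  const base = parseVersion(caretBase);
  return compareVersions(v, base) >= 0 && compareVersions(v, caretUpperBound(base)) < 0;
}

// True when lowerInclusive <= version < upperExclusive, e.g. a peer range ">=0.83.0 <1.0.0".
function satisfiesBounds(version: string, lowerInclusive: string, upperExclusive: string): boolean {
  const v = parseVersion(version);
  return (
    compareVersions(v, parseVersion(lowerInclusive)) >= 0 &&
    compareVersions(v, parseVersion(upperExclusive)) < 0
  );
}

// The old pin cannot reach the new runtime; the new eval satisfies the declared peer range.
const oldPinReachesNewRuntime = satisfiesCaret("0.50.0", "0.36.0");
const newEvalSatisfiesPeerRange = satisfiesBounds("0.91.0", "0.83.0", "1.0.0");
```

Under these assumptions the resolved agent-eval 0.91.0 sits inside the `>=0.83.0 <1.0.0` peer range quoted above, while the previous 0.70.0 would not.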
🎯 Usefulness — better-approach-exists
Bumps eval substrate to current runtime/eval and typechecks, but leaves the provisioned trading-agent sandbox pinned to the old versions in activate.rs, so the fleet skew the PR aims to close remains for deployed agents.
- Integration: The bumped runtime/eval is reachable from the eval substrate: `evals/src/sim/llm-call.ts:33-34` imports `createOpenAICompatibleBackend`, `runAgentTaskStream`, and `AgentTaskSpec` from `@tangle-network/agent-runtime`, and many eval modules import `@tangle-network/agent-eval` (e.g. `evals/src/product/chat-sandbox-runner.ts:5`, `evals/src/trading/lifecycle-runner.ts:3`, `evals/src/analysis/rlm-analys…`
- Fit with existing patterns: It fits the established eval pattern: all judge/user-sim LLM calls route through the single `llm-call.ts` helper that wraps runtime primitives (`evals/src/sim/llm-call.ts:1-34`). It does not compete with any existing pattern. The local `EvalProfile` divergence in `evals/src/profiles/types.ts:36-53` is intentional and untouched, which is consistent with the documented design.
- Real-world viability: The runtime 0.50 surface used by this repo (`createOpenAICompatibleBackend`, `runAgentTaskStream`, `AgentTaskSpec`) is stable and the new optional peer deps (`playwright`, `@tangle-network/sandbox`) are optional, so install-time breakage is unlikely. Error-path handling in `evals/src/sim/llm-call.ts:139-159` (backend_error events, AbortController timeout, catch-and-return) is unchanged and will co…
🎯 Usefulness Audit
🔴 Root deps bumped but provisioned trading-agent sandbox stays on old runtime/eval [integration]
`package.json:39,41` now pins `^0.91.0`/`^0.50.0`, but `trading-blueprint-lib/src/jobs/activate.rs:36-38` still emits `^0.70.0`/`^0.36.0` for the generated `trading-agent` package (`activate.rs:72-94`). The test `trading_agent_substrate_versions_match_root_package` at `activate.rs:1619-1635` explicitly enforces that the generated sandbox matches root `package.json`; because the constants were not updated, that invariant is broken. Update `TRADING_AGENT_AGENT_EVAL_VERSION` to `^0.91.0` and `T…`
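The invariant described in this finding amounts to a pin-by-pin comparison. The sketch below illustrates that idea in TypeScript under stated assumptions: `findPinMismatches` and both sample objects are hypothetical, the version strings are the ones quoted in the finding, and the real check lives in the Rust test in `activate.rs`, which this does not reproduce.

```typescript
// Hypothetical illustration of the "generated sandbox pins match root pins" invariant.
type Pins = Record<string, string>;

interface PinMismatch {
  name: string;
  root: string | undefined; // undefined when the root package does not pin this dependency
  generated: string;
}

// Report every dependency whose generated pin differs from the root pin.
function findPinMismatches(rootDeps: Pins, generatedDeps: Pins): PinMismatch[] {
  const mismatches: PinMismatch[] = [];
  for (const [name, generated] of Object.entries(generatedDeps)) {
    const root = rootDeps[name];
    if (root !== generated) {
      mismatches.push({ name, root, generated });
    }
  }
  return mismatches;
}

// Root pins after this PR, as quoted in the finding.
const rootPins: Pins = {
  "@tangle-network/agent-eval": "^0.91.0",
  "@tangle-network/agent-runtime": "^0.50.0",
};

// Pins the finding says the generated trading-agent package still receives.
const staleGeneratedPins: Pins = {
  "@tangle-network/agent-eval": "^0.70.0",
  "@tangle-network/agent-runtime": "^0.36.0",
};

const skew = findPinMismatches(rootPins, staleGeneratedPins);
for (const { name, root, generated } of skew) {
  console.log(`${name}: root ${root ?? "(unpinned)"} vs generated ${generated}`);
}
```

With the values quoted in the finding, both substrate packages are reported as skewed; once the generated pins equal the root pins the list is empty.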
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
…x) + CI lane (#171)

Follow-up to #169 (model-driven trading) / #170 (deps).

ONE entry point, `runTradingPersonaEval` (`evals/src/trading/persona-agent-eval.ts`), that degrades by infra, instead of separate modules:
- No operator URL -> DETERMINISTIC mode: the Rust walk-forward backtest -> RunRecords + trace + scorecard (offline; what full-eval/CI run). Unchanged.
- operatorUrl present -> OPERATOR-MATRIX mode: `runProfileMatrix` sweeps the PROFILE axis (operator model variants: kimi-k2/glm-4.7/glm-5.1, pinned into the REAL operator via `agentEnv`) x (persona x market). Each cell runs the FULL operator simulation (`runMultishotUserSim` -> real bot_artifacts + tick_side_effects), judged on real artifacts (60%) + objective backtest ground truth (40%), not prose. Scorecard + `assertRealBackend` + byProfile/byPersona read straight from the matrix. Multi-round honestly degenerates to 1 (the provision->chat->capture cycle is single-pass; turns live inside each cell).

Consolidation: folds the operator-matrix capability INTO the existing bridge file and DELETES the standalone module + the dual `--matrix` bin flag + the redundant npm script. One surface, one entry point, shared scorecard/profile/ground-truth helpers. The bin auto-degrades by `--operator-url`; full-eval routes through the same function.

CI: new 'Evals typecheck' lane (node 22 + `npm ci` + `tsc -p evals/tsconfig.json`), classified on evals/ + package*.json + tsconfig, required in the gate.

Deps: agent-runtime ^0.52, agent-knowledge ^1.7 (over #170's ^0.50/^1.5); agent-eval ^0.91.

Validated: `npm ci` clean, `tsc` 0 errors.
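The degrade-by-infra behaviour and the 60/40 judging split described in this commit message can be sketched in a few lines. Everything below is a hypothetical illustration: `selectEvalMode`, `enumerateCells`, `blendedScore`, and the persona and market names are invented for the example and are not the signatures of `runTradingPersonaEval` or `runProfileMatrix`. Treating an empty URL as absent is also an assumption.

```typescript
// Hypothetical sketch of mode selection, matrix enumeration, and score blending.
type EvalMode = "deterministic" | "operator-matrix";

// Assumption: a missing or blank operator URL means the offline deterministic path.
function selectEvalMode(operatorUrl?: string): EvalMode {
  const hasOperator = operatorUrl !== undefined && operatorUrl.trim() !== "";
  return hasOperator ? "operator-matrix" : "deterministic";
}

interface MatrixCell {
  profile: string;
  persona: string;
  market: string;
}

// One cell per profile x persona x market combination.
function enumerateCells(profiles: string[], personas: string[], markets: string[]): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const profile of profiles) {
    for (const persona of personas) {
      for (const market of markets) {
        cells.push({ profile, persona, market });
      }
    }
  }
  return cells;
}

// 60% weight on the real-artifact score, 40% on backtest ground truth (both assumed in [0, 1]).
function blendedScore(artifactScore: number, groundTruthScore: number): number {
  return 0.6 * artifactScore + 0.4 * groundTruthScore;
}

// Profiles come from the commit message; personas and markets are placeholders.
const cells = enumerateCells(
  ["kimi-k2", "glm-4.7", "glm-5.1"],
  ["cautious", "aggressive"],
  ["trending", "choppy"],
);
const offlineMode = selectEvalMode();
const liveMode = selectEvalMode("http://localhost:9000");
```

Three profiles crossed with two personas and two markets give twelve cells, each of which would run the full operator simulation in operator-matrix mode.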
Pure dependency bump of the Tangle substrate pins to the canonical fleet versions.
- `@tangle-network/agent-runtime`: `^0.36.0` → `^0.50.0`
- `@tangle-network/agent-eval`: `^0.70.0` → `^0.91.0`

Widest skew in the fleet (14 minors on runtime). The repo's evals import surface is the canonical pair `{createOpenAICompatibleBackend, runAgentTaskStream}` + `AgentTaskSpec`, which is non-breaking across this jump.

Verification: `npm run typecheck:evals` passes clean (exit 0) under the bumped pins. The one-shot judge divergence in `evals/src/profiles/types.ts` (`EvalProfile`) is intentional and left untouched. Merges cleanly into `main`.