chore(deps): bump agent-runtime ^0.50 + agent-eval ^0.91 by drewstone · Pull Request #170 · tangle-network/ai-trading-blueprint

drewstone · 2026-06-14T00:53:03Z

Pure dependency bump of the Tangle substrate pins to the canonical fleet versions.

@tangle-network/agent-runtime: ^0.36.0 → ^0.50.0
@tangle-network/agent-eval: ^0.70.0 → ^0.91.0

Widest skew in the fleet (14 minors on runtime). The repo's evals import surface is the canonical pair {createOpenAICompatibleBackend, runAgentTaskStream} + AgentTaskSpec — non-breaking across this jump.
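The pins had to move because, under npm's semver rules, a caret range on a 0.x version only admits releases within the same minor, so `^0.36.0` can never resolve to 0.50.0. Here is a minimal sketch of that rule. The helper names (`parseVersion`, `compareVersions`, `caretUpperBound`, `satisfiesCaret`) are illustrative, not taken from the `semver` package or any Tangle library, and prerelease tags are ignored.

```typescript
// Minimal sketch of npm caret-range semantics. Helper names are illustrative
// assumptions; prerelease and build metadata are deliberately not handled.
type Version = { major: number; minor: number; patch: number };

function parseVersion(text: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(text);
  if (!match) throw new Error(`not a plain x.y.z version: ${text}`);
  return { major: Number(match[1]), minor: Number(match[2]), patch: Number(match[3]) };
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: Version, b: Version): number {
  return a.major - b.major || a.minor - b.minor || a.patch - b.patch;
}

// ^X.Y.Z allows changes that do not modify the left-most non-zero component.
function caretUpperBound(base: Version): Version {
  if (base.major > 0) return { major: base.major + 1, minor: 0, patch: 0 };
  if (base.minor > 0) return { major: 0, minor: base.minor + 1, patch: 0 };
  return { major: 0, minor: 0, patch: base.patch + 1 };
}

function satisfiesCaret(candidate: string, range: string): boolean {
  if (!range.startsWith("^")) throw new Error(`expected a caret range: ${range}`);
  const base = parseVersion(range.slice(1));
  const version = parseVersion(candidate);
  return (
    compareVersions(version, base) >= 0 &&
    compareVersions(version, caretUpperBound(base)) < 0
  );
}

console.log(satisfiesCaret("0.50.0", "^0.36.0")); // false
console.log(satisfiesCaret("0.50.0", "^0.50.0")); // true
// Minor skew closed on agent-runtime:
console.log(parseVersion("0.50.0").minor - parseVersion("0.36.0").minor); // 14
```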

Verification: npm run typecheck:evals passes clean (exit 0) under the bumped pins. The one-shot judge divergence in evals/src/profiles/types.ts (EvalProfile) is intentional and left untouched. Merges cleanly into main.

tangletools

✅ Auto-approved PR — `d41e69ad`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T00:53:09Z

tangletools

🟠 Value Audit — better-approach-exists


| | |
| --- | --- |
| Verdict | better-approach-exists |
| Concerns | 1 (1 strong-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 143.6s (2 bridge agents) |
| Total | 143.6s |

💰 Value — sound

Bumps the root evals package to canonical Tangle substrate versions (^0.91 agent-eval, ^0.50 agent-runtime) and keeps the existing single-point runtime abstraction intact; typecheck:evals passes.

What it does: Updates package.json:38-41 and package-lock.json to pin @tangle-network/agent-eval to ^0.91.0 (from ^0.70.0) and @tangle-network/agent-runtime to ^0.50.0 (from ^0.36.0). The lockfile now resolves agent-eval 0.91.0 and agent-runtime 0.50.0, matching the newer peer-dependency ranges agent-runtime declares (@tangle-network/agent-eval >=0.83.0 <1.0.0, `@tangle-network/sandbox >=0.1.2 <0.
Goals it achieves: Closes the 14-minor fleet skew on agent-runtime and brings the eval harness onto the current substrate line so it receives upstream runtime/eval fixes, features, and compatible peer ranges. The repo's eval import surface (createOpenAICompatibleBackend, runAgentTaskStream, AgentTaskSpec from agent-runtime; campaign/trace APIs from agent-eval) is preserved as non-breaking across this jump.
Assessment: Good change. It is scoped exactly to the package that consumes these dependencies (only ./package.json references them; arena/package.json, sdk-ts/package.json, and the CJS tool package do not). The evals already centralize agent-runtime usage behind evals/src/sim/llm-call.ts:33-34, so the bump touches a single abstraction boundary rather than scattered call sites. Verification held: after `npm
Better / existing approach: none — this is the right approach. Searched the workspace for package.json files and agent-runtime/agent-eval imports: only the root package depends on these libraries, and only evals/src imports them. There is no duplicate dependency set to consolidate and no alternative abstraction already present that should absorb this bump. A workspace-wide bump would be wrong because the other packages do no

🎯 Usefulness — better-approach-exists

Bumps eval substrate to current runtime/eval and typechecks, but leaves the provisioned trading-agent sandbox pinned to the old versions in activate.rs, so the fleet skew the PR aims to close remains for deployed agents.

Integration: The bumped runtime/eval is reachable from the eval substrate: evals/src/sim/llm-call.ts:33-34 imports createOpenAICompatibleBackend, runAgentTaskStream, and AgentTaskSpec from @tangle-network/agent-runtime, and many eval modules import @tangle-network/agent-eval (e.g. evals/src/product/chat-sandbox-runner.ts:5, evals/src/trading/lifecycle-runner.ts:3, `evals/src/analysis/rlm-analys
Fit with existing patterns: It fits the established eval pattern: all judge/user-sim LLM calls route through the single llm-call.ts helper that wraps runtime primitives (evals/src/sim/llm-call.ts:1-34). It does not compete with any existing pattern. The local EvalProfile divergence in evals/src/profiles/types.ts:36-53 is intentional and untouched, which is consistent with the documented design.
Real-world viability: The runtime 0.50 surface used by this repo (createOpenAICompatibleBackend, runAgentTaskStream, AgentTaskSpec) is stable and the new optional peer deps (playwright, @tangle-network/sandbox) are optional, so install-time breakage is unlikely. Error-path handling in evals/src/sim/llm-call.ts:139-159 (backend_error events, AbortController timeout, catch-and-return) is unchanged and will co
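The error-path shape named above (an AbortController timeout combined with catch-and-return) can be sketched as follows. This is an illustration of the general pattern only, not the code in evals/src/sim/llm-call.ts: `callWithTimeout`, `CallResult`, and `hangingBackend` are hypothetical names, and the real helper's result type and event handling may differ.

```typescript
// Illustrative sketch of "AbortController timeout + catch-and-return".
// All names here are assumptions, not the repo's actual helper.
type CallResult = { ok: true; text: string } | { ok: false; error: string };

async function callWithTimeout(
  run: (signal: AbortSignal) => Promise<string>,
  timeoutMs: number,
): Promise<CallResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return { ok: true, text: await run(controller.signal) };
  } catch (err) {
    // Return the failure as data instead of rethrowing, so one bad call
    // does not abort the caller's whole run.
    const error = controller.signal.aborted
      ? "timeout"
      : err instanceof Error
        ? err.message
        : String(err);
    return { ok: false, error };
  } finally {
    clearTimeout(timer);
  }
}

// Fake backend that never answers and rejects only when aborted.
// A backend that ignores the signal would still hang; the pattern relies
// on the callee honoring cancellation.
function hangingBackend(signal: AbortSignal): Promise<string> {
  return new Promise((_resolve, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")), { once: true });
  });
}

callWithTimeout(hangingBackend, 10).then((result) => console.log(result));
```

Returning a tagged result rather than throwing keeps the failure visible to the caller as an ordinary value, which is what lets a harness record the failed call and continue.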

🎯 Usefulness Audit

🔴 Root deps bumped but provisioned trading-agent sandbox stays on old runtime/eval [integration]

package.json:39,41 now pins ^0.91.0 / ^0.50.0, but trading-blueprint-lib/src/jobs/activate.rs:36-38 still emits ^0.70.0 / ^0.36.0 for the generated trading-agent package (activate.rs:72-94). The test trading_agent_substrate_versions_match_root_package at activate.rs:1619-1635 explicitly enforces that the generated sandbox matches root package.json; because the constants were not updated, that invariant is broken. Update TRADING_AGENT_AGENT_EVAL_VERSION to ^0.91.0 and `T
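The invariant this finding describes (generated sandbox pins must equal the root package.json ranges) is enforced in the repo by a Rust test. Purely as an illustration of the comparison, here is a sketch in TypeScript. `findPinMismatches` and the two input objects are hypothetical; the version strings are the ones quoted in the finding.

```typescript
// Sketch only: the real check is the Rust test in activate.rs.
// `findPinMismatches` and both input shapes are hypothetical.
type Pins = Record<string, string>;

interface PinMismatch {
  name: string;
  rootRange: string | undefined; // undefined when the root does not declare it
  generatedRange: string;
}

function findPinMismatches(rootDependencies: Pins, generatedPins: Pins): PinMismatch[] {
  const mismatches: PinMismatch[] = [];
  for (const [name, generatedRange] of Object.entries(generatedPins)) {
    const rootRange: string | undefined = rootDependencies[name];
    if (rootRange !== generatedRange) {
      mismatches.push({ name, rootRange, generatedRange });
    }
  }
  return mismatches;
}

// State described by the audit: root bumped, generator still on the old ranges.
const rootDependencies: Pins = {
  "@tangle-network/agent-eval": "^0.91.0",
  "@tangle-network/agent-runtime": "^0.50.0",
};
const generatedPins: Pins = {
  "@tangle-network/agent-eval": "^0.70.0",
  "@tangle-network/agent-runtime": "^0.36.0",
};

const mismatches = findPinMismatches(rootDependencies, generatedPins);
console.log(mismatches.map((m) => `${m.name}: ${m.generatedRange} != ${m.rootRange}`));
```

With both generated ranges updated to match the root, the function returns an empty array, which is the condition the invariant expects.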

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T005715Z

…x) + CI lane (#171)

Follow-up to #169 (model-driven trading) / #170 (deps).

ONE entry point, runTradingPersonaEval (evals/src/trading/persona-agent-eval.ts), that degrades by infra instead of separate modules:

- No operator URL -> DETERMINISTIC mode: the Rust walk-forward backtest -> RunRecords + trace + scorecard (offline; what full-eval/CI run). Unchanged.
- operatorUrl present -> OPERATOR-MATRIX mode: runProfileMatrix sweeps the PROFILE axis (operator model variants: kimi-k2/glm-4.7/glm-5.1, pinned into the REAL operator via agentEnv) x (persona x market). Each cell runs the FULL operator simulation (runMultishotUserSim -> real bot_artifacts + tick_side_effects), judged on real artifacts (60%) + objective backtest ground truth (40%), not prose. Scorecard + assertRealBackend + byProfile/byPersona read straight from the matrix. Multi-round honestly degenerates to 1 (the provision->chat->capture cycle is single-pass; turns live inside each cell).

Consolidation: folds the operator-matrix capability INTO the existing bridge file and DELETES the standalone module + the dual --matrix bin flag + the redundant npm script. One surface, one entry point, shared scorecard/profile/ground-truth helpers. The bin auto-degrades by --operator-url; full-eval routes through the same function.

CI: new 'Evals typecheck' lane (node 22 + npm ci + tsc -p evals/tsconfig.json), classified on evals/ + package*.json + tsconfig, required in the gate.

Deps: agent-runtime ^0.52, agent-knowledge ^1.7 (over #170's ^0.50/^1.5); agent-eval ^0.91.

Validated: npm ci clean, tsc 0 errors.

chore(deps): bump agent-runtime ^0.50 + agent-eval ^0.91

d41e69a

tangletools approved these changes Jun 14, 2026

tangletools reviewed Jun 14, 2026

drewstone merged commit 2fc0ed7 into main Jun 14, 2026
13 checks passed

drewstone mentioned this pull request Jun 14, 2026

feat(evals): unified trading matrix eval + CI evals typecheck lane #171

Merged

drewstone merged 1 commit into main from chore/bump-substrate-0.50
