You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
-**AI agent runtime no longer blocks tool calls with hidden runtime budgets.** The runtime-enforced tool-call budget introduced in 4.5.28 has been removed because it could block execution tools such as `orders_submit_order` and invalidate agent trading results.
9
+
-**AI agent tool results and committee handoffs are no longer silently token-truncated by the benchmark path.** Large handoffs/tool results now preserve the model-visible evidence instead of replacing it with middle-truncated excerpts.
10
+
-**AI benchmark usage accounting now aggregates raw traces across all committee roles.** The provider benchmark runner no longer relies on a last-writer detail artifact that undercounted multi-agent token usage and cost.
11
+
-**Multi-agent detail parquet files now include all agent rows for stats-file backtests.** Observability artifacts for committee-style strategies include every role instead of only the most recent writer.
12
+
13
+
### Changed
14
+
-**The AI Investment Committee example uses prompt guidance instead of numeric tool-call caps.** The example still asks agents to be concise and targeted, but it does not impose hidden research/follow-up/portfolio tool budgets.
15
+
-**The AI committee provider benchmark plan marks the enforced 4.5.28 results invalid for trading conclusions.** The investigation doc records why hidden tool-call and truncation controls polluted the prior 14-day and three-month results, and documents the safer one-model smoke-test plan for Cerebras, direct DeepSeek Flash, and Together after billing propagation.
Copy file name to clipboardExpand all lines: docs/investigations/2026-05-20_AI_COMMITTEE_PROVIDER_BENCHMARK_PLAN.md
+79-6Lines changed: 79 additions & 6 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -3,6 +3,79 @@
3
3
**Date:** 2026-05-20
4
4
**Scope:** Benchmarking the LumiBot AI Investment Committee across Gemini, OpenAI, Together AI, Kimi, Qwen, Cerebras, and optional direct DeepSeek models.
5
5
6
+
## Correction: Hidden Safety Rails Invalidated The Enforced Results
7
+
8
+
The enforced 14-day and three-month benchmark results below are not valid
9
+
trading-performance evidence. They used hidden behavior controls that changed
10
+
the thing being measured:
11
+
12
+
- Runtime tool-call budgets returned budget-exceeded payloads instead of
13
+
executing additional tools. This blocked `orders_submit_order`, so models
14
+
that tried to trade could be forced into all-cash results.
15
+
- Prompt-level numeric tool-call budgets changed the agent behavior under test.
16
+
Benchmark prompts may ask agents to be concise and targeted, but they should
17
+
not impose arbitrary tool counts unless the experiment is explicitly about
18
+
constrained agents.
19
+
- Handoff/tool-result truncation changed the evidence available to downstream
20
+
agents. Context problems should be handled with narrower tools, structured
21
+
outputs, provider-appropriate model selection, or clear diagnostic failures,
22
+
not hidden middle truncation.
23
+
- The cost summaries undercounted usage because the compact summary read a
24
+
last-writer agent detail artifact instead of aggregating every committee role
25
+
from raw traces or a combined all-agent detail file.
26
+
27
+
Before spending on another full benchmark, remove those hidden controls, fix
28
+
usage aggregation, and rerun a very small smoke test that verifies order tools
29
+
execute normally.
30
+
31
+
## Post-Correction Smoke: 2026-05-20 Local
32
+
33
+
Fixes applied before rerunning:
34
+
35
+
- Removed runtime tool-call budget enforcement from the agent runtime.
36
+
- Removed prompt-level numeric tool-call budgets from the AI Investment
37
+
Committee example and benchmark runner.
38
+
- Removed handoff/tool-result truncation from the benchmark path.
39
+
- Fixed benchmark usage accounting to aggregate raw trace files across every
40
+
committee role instead of trusting a last-writer detail artifact.
- Estimated cost using static price map: `$0.522267` no-cache,
68
+
`$0.081489` cache-adjusted.
69
+
- Trading result: still `0%` return and cash-only, but this was a model
70
+
decision, not a budget block. Portfolio-manager traces for both days say
71
+
`NO TRADE`; no `orders_submit_order` call was blocked.
72
+
73
+
Next Together smoke should use a cheaper current model first:
74
+
`together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput` or
75
+
`together_ai/openai/gpt-oss-120b`. Kimi should mean Kimi K2.6 only, and only as
76
+
an expensive compatibility/quality sample, not as a cost-sensitive benchmark
77
+
default.
78
+
6
79
## Recommendation
7
80
8
81
Use the AI Investment Committee example as the primary benchmark. It is the right workload because it stresses the exact behavior we care about:
@@ -76,7 +149,7 @@ For fair model comparison, set all four to the same candidate model first. Mixed
76
149
|---|---|---|
77
150
| Qwen3 235B FP8 throughput |`together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput`| Very cheap throughput model. Good candidate for "can a low-cost open model actually trade?" |
78
151
| GPT-OSS 120B on Together |`together_ai/openai/gpt-oss-120b`| Open-weight reasoning baseline via Together. Low cost, tool support listed by Together. |
79
-
| Qwen3.6 Plus |`together_ai/Qwen/Qwen3.6-Plus`| Cheaper broad reasoning candidate. Test only if it passes tool-call smoke. |
152
+
| Qwen3.6 Plus |`together_ai/Qwen/Qwen3.6-Plus`| Cheaper broad reasoning candidate, but Together's current table does not clearly list function-calling support for it. Test only if a tool-call smoke confirms it works. |
80
153
| Kimi K2.6 |`together_ai/moonshotai/Kimi-K2.6`| Agentic/model-swarm positioning, 256K context, function calling listed by Together. Expensive enough that it should not be in the first cost-sensitive finalist set unless smoke quality is clearly strong. |
81
154
| Together DeepSeek V4 Pro |`together_ai/deepseek-ai/DeepSeek-V4-Pro`| Together-hosted DeepSeek option. More expensive than direct DeepSeek and no documented Together V4 Flash option was found, but it avoids sending requests to `api.deepseek.com`. Requires `TOGETHERAI_API_KEY`. |
82
155
@@ -247,7 +320,7 @@ Only run the finalists for two or three months. Recommended finalists likely:
247
320
- Together-hosted DeepSeek V4 Pro is not a cost winner. It is more expensive than direct DeepSeek V4 Pro and far more expensive than direct DeepSeek V4 Flash. Use it only if we specifically want DeepSeek behavior without calling DeepSeek's own API endpoint.
248
321
- Direct DeepSeek V4 Flash is the best raw cost bet, but it has a privacy posture Rob does not like for proprietary trading data. Keep it optional.
249
322
- Gemini 3.5 Flash is the likely closed-model quality/speed baseline. Google published strong tool-use and finance-agent model-card numbers, so it belongs in the benchmark.
250
-
- Kimi K2.5 looked promising in the first smoke because it used tools and placed a bounded order. Kimi K2.6 is still worth a smoke test only if the higher price is justified by quality.
323
+
- Kimi K2.5 looked promising in historical smoke runs because it used tools and placed bounded orders, but it should not be used going forward. If testing Kimi, use Kimi K2.6 only, and treat it as an expensive compatibility/quality sample.
251
324
- Cerebras is worth testing for speed, but the current ADK/LiteLLM path needs a message-normalization fix for `reasoning_content` before `cerebras/gpt-oss-120b` can complete.
252
325
- Qwen throughput is cheap and fast enough to keep testing, but the `list_fred_series` hallucinated tool call means we should watch tool discipline carefully.
253
326
@@ -396,7 +469,7 @@ Important benchmark runner fixes from this phase:
396
469
- The paid benchmark runner now prints JSON `model_start` and `model_finished` events so long runs are observable.
397
470
- Benchmark artifacts can be summarized with `/Users/robertgrzesik/Development/lumibot/scripts/summarize_ai_committee_provider_benchmarks.py`, which reads per-model `result.json` files and writes compact JSON/Markdown comparisons.
398
471
- The runner accepts `--agent-run-timeout-seconds` for slow provider qualifiers. This is a per-agent timeout, not the overall benchmark timeout. Keep the default for fast providers; use a higher value for Qwen/Kimi only if the model is making progress but individual calls exceed the runtime's default safety rail.
399
-
-The AI committee example now asks each role to produce a structured handoff under the strategy parameter `handoff_target_tokens`, default `24000`, and applies a reusable Lumibot token-budget helper at `handoff_max_tokens`, default `32000`, before passing text to the next role. If a model ignores the target, the helper middle-truncates with an explicit notice instead of silently chopping or crashing the strategy. A higher target does not force the model to use the full budget; the prompt explicitly says not to pad the handoff just to fill the token budget.
472
+
-Historical note, now reverted: the AI committee example briefly applied a reusable token-budget helper at `handoff_max_tokens`before passing text to the next role. This was a bad benchmark control because middle truncation changed the evidence seen by downstream agents.
400
473
401
474
Artifacts:
402
475
@@ -410,11 +483,11 @@ Results so far:
410
483
-`together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput`, uncapped: failed with `ContextWindowExceededError` after sending about `2,951,306` tokens into a `262,144` token context window. Root cause was oversized role handoffs in the committee example, not a bad API key.
411
484
-`together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput`, bounded-handoff rerun: hit the one-hour process timeout after `49` agent run summaries / `12` complete committee cycles with no repeated context-window failure. Partial usage: input `2,025,198`, output `41,640`, tool calls `363`, estimated cost `$0.430024`. The handoff contract fixed the failure mode, but Qwen needs a longer timeout to finish the qualifier.
412
485
- Parallel rerun after token-budgeted handoffs showed `deepseek/deepseek-v4-flash` can still exceed context with raw tool results: provider rejected about `7,033,087` requested tokens against a `1,048,576` context window. This exposed a second boundary: tool results, especially raw SEC/companyfacts-style payloads, must also be token-budgeted before entering model context.
413
-
-Runtime fix added after the DeepSeek failure: `lumibot.components.agents.context_budget.budget_text_by_tokens()` is now used at the tool-result boundary. Oversized tool results are replaced with an explicit bounded excerpt and a notice telling the model to call a narrower tool/query when more detail is needed.
414
-
- The first Qwen run after tool-result budgeting still failed with Together's generic `Input validation error` after an earlier 300-second agent timeout. The likely remaining issue is accumulated context/request shape, not credentials. The tool-result budget was changed from a hard-coded 12K token cap to a 4K default with `LUMIBOT_AGENT_TOOL_RESULT_MAX_TOKENS` override for model-specific reruns.
486
+
-Historical note, now reverted: runtime tool-result budgeting was added after the DeepSeek failure, then removed because bounded excerpts changed model-visible evidence and invalidated the benchmark.
487
+
- The first Qwen run after tool-result budgeting still failed with Together's generic `Input validation error` after an earlier 300-second agent timeout. The likely remaining issue was accumulated context/request shape, not credentials.
415
488
- Fixed-budget rerun state: Qwen rerun with 4K tool-result budget and 900-second per-agent timeout started in `/Users/robertgrzesik/Development/lumibot/artifacts/ai_committee_provider_benchmarks/20260520_144512`. DeepSeek, Gemini, Kimi, OpenAI, and Cerebras fixed-budget reruns started in `/Users/robertgrzesik/Development/lumibot/artifacts/ai_committee_provider_benchmarks/20260520_144705`.
416
489
- Cerebras fixed-budget rerun failed immediately with provider billing error: `Payment required to access this resource`. Earlier Cerebras qualifier passed mechanically, so the integration works, but the account/key needs billing credits before Cerebras can be included in the final three-month benchmark.
417
-
-Tool-call discipline fix: even with 4K tool-result caps, DeepSeek used `65` tools in the first evidence call. The AI Investment Committee example now passes prompt-level budgets for research/follow-up/portfolio tool calls (`24` / `8` / `6`), and the runtime enforces those budgets by returning a budget-exceeded notice after the role uses its allowed calls. This keeps the benchmark from measuring uncontrolled tool spraying.
490
+
-Historical note, now reverted: a runtime tool-call enforcement change was added after DeepSeek used `65` tools in one evidence call. That was the wrong fix because it blocked later execution tools and invalidated trading results.
0 commit comments