Commit 0d762a8

Merge pull request #1055 from Lumiwealth/version/4.5.29
Release 4.5.29
2 parents 43b627b + 949776c commit 0d762a8

12 files changed

Lines changed: 276 additions & 338 deletions

CHANGELOG.md

Lines changed: 13 additions & 1 deletion
@@ -1,6 +1,18 @@
11
# Changelog
22

3-
## 4.5.29 - Unreleased
3+
## 4.5.29 - 2026-05-20
4+
5+
Deploy marker: 4.5.29 release commit (`deploy 4.5.29`)
6+
7+
### Fixed
8+
- **AI agent runtime no longer blocks tool calls with hidden runtime budgets.** The runtime-enforced tool-call budget introduced in 4.5.28 has been removed because it could block execution tools such as `orders_submit_order` and invalidate agent trading results.
9+
- **AI agent tool results and committee handoffs are no longer silently token-truncated by the benchmark path.** Large handoffs/tool results now preserve the model-visible evidence instead of replacing it with middle-truncated excerpts.
10+
- **AI benchmark usage accounting now aggregates raw traces across all committee roles.** The provider benchmark runner no longer relies on a last-writer detail artifact that undercounted multi-agent token usage and cost.
11+
- **Multi-agent detail parquet files now include all agent rows for stats-file backtests.** Observability artifacts for committee-style strategies include every role instead of only the most recent writer.
12+
13+
### Changed
14+
- **The AI Investment Committee example uses prompt guidance instead of numeric tool-call caps.** The example still asks agents to be concise and targeted, but it does not impose hidden research/follow-up/portfolio tool budgets.
15+
- **The AI committee provider benchmark plan marks the enforced 4.5.28 results invalid for trading conclusions.** The investigation doc records why hidden tool-call and truncation controls polluted the prior 14-day and three-month results, and documents the safer one-model smoke-test plan for Cerebras, direct DeepSeek Flash, and Together after billing propagation.
416

517
## 4.5.28 - 2026-05-20
618

docs/investigations/2026-05-20_AI_COMMITTEE_PROVIDER_BENCHMARK_PLAN.md

Lines changed: 79 additions & 6 deletions
@@ -3,6 +3,79 @@
33
**Date:** 2026-05-20
44
**Scope:** Benchmarking the LumiBot AI Investment Committee across Gemini, OpenAI, Together AI, Kimi, Qwen, Cerebras, and optional direct DeepSeek models.
55

6+
## Correction: Hidden Safety Rails Invalidated The Enforced Results
7+
8+
The enforced 14-day and three-month benchmark results below are not valid
9+
trading-performance evidence. They used hidden behavior controls that changed
10+
the thing being measured:
11+
12+
- Runtime tool-call budgets returned budget-exceeded payloads instead of
13+
executing additional tools. This blocked `orders_submit_order`, so models
14+
that tried to trade could be forced into all-cash results.
15+
- Prompt-level numeric tool-call budgets changed the agent behavior under test.
16+
Benchmark prompts may ask agents to be concise and targeted, but they should
17+
not impose arbitrary tool counts unless the experiment is explicitly about
18+
constrained agents.
19+
- Handoff/tool-result truncation changed the evidence available to downstream
20+
agents. Context problems should be handled with narrower tools, structured
21+
outputs, provider-appropriate model selection, or clear diagnostic failures,
22+
not hidden middle truncation.
23+
- The cost summaries undercounted usage because the compact summary read a
24+
last-writer agent detail artifact instead of aggregating every committee role
25+
from raw traces or a combined all-agent detail file.
26+
27+
Before spending on another full benchmark, remove those hidden controls, fix
28+
usage aggregation, and rerun a very small smoke test that verifies order tools
29+
execute normally.
30+
31+
## Post-Correction Smoke: 2026-05-20 Local
32+
33+
Fixes applied before rerunning:
34+
35+
- Removed runtime tool-call budget enforcement from the agent runtime.
36+
- Removed prompt-level numeric tool-call budgets from the AI Investment
37+
Committee example and benchmark runner.
38+
- Removed handoff/tool-result truncation from the benchmark path.
39+
- Fixed benchmark usage accounting to aggregate raw trace files across every
40+
committee role instead of trusting a last-writer detail artifact.
41+
42+
Validation:
43+
44+
- Focused tests passed: `python3 -m pytest tests/test_agent_runtime_provider_keys.py tests/test_ai_investment_committee_example.py`.
45+
- Raw-trace usage aggregation was checked against an old OpenAI artifact and
46+
correctly found `248` traces across `evidence_researcher`,
47+
`bull_researcher`, `bear_researcher`, and `portfolio_manager`, instead of
48+
the broken `62`-call last-agent summary.
49+
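The aggregation fix described above can be illustrated with a minimal sketch. It assumes each raw trace is a dict with a `role` and a `usage` mapping; that shape, the `aggregate_usage` helper, and the synthetic token counts are hypothetical and are not taken from the LumiBot runner. The even split of 62 traces per role is also an assumption chosen to match the `248` versus `62` figures.

```python
from collections import Counter, defaultdict

ROLES = [
    "evidence_researcher",
    "bull_researcher",
    "bear_researcher",
    "portfolio_manager",
]


def aggregate_usage(traces):
    """Sum usage over every trace, grouped by role, and return a grand total."""
    per_role = defaultdict(Counter)
    for trace in traces:
        per_role[trace["role"]].update(trace["usage"])  # adds token counts
        per_role[trace["role"]]["traces"] += 1
    total = Counter()
    for counts in per_role.values():
        total.update(counts)
    return dict(per_role), total


# Synthetic data: 62 traces per role (assumed), with made-up token counts.
traces = [
    {"role": role, "usage": {"input_tokens": 1000, "output_tokens": 100}}
    for role in ROLES
    for _ in range(62)
]

per_role, total = aggregate_usage(traces)

# A last-writer artifact would only reflect the final role's rows.
last_writer_only = per_role["portfolio_manager"]

print(total["traces"], last_writer_only["traces"])
```

Summing per role first and then totalling keeps the per-role breakdown available for cost attribution, while the grand total no longer depends on which role wrote its detail file last.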
50+
Paid smoke attempts:
51+
52+
- Together Kimi K2.5 could not be rerun because Together returned
53+
`Credit limit exceeded` before the first model call. Kimi K2.5 is now a
54+
historical artifact only; do not include it in new benchmark slates.
55+
- After Rob added Together credits, a one-day Qwen3 235B throughput smoke was
56+
attempted with the historical `--max-model-calls 4` retry flag, but Together still returned
57+
`Credit limit exceeded` before the first model call. Do not retry repeatedly;
58+
Together's billing message says balances can take up to five minutes to
59+
update.
60+
- Direct DeepSeek V4 Flash was rerun over `2026-02-12` through `2026-02-14`,
61+
a small window that previously had blocked order attempts.
62+
- Artifact root:
63+
`<repo_root>/artifacts/ai_committee_provider_benchmarks/20260520_204010/deepseek_deepseek-v4-flash`.
64+
- Result: passed mechanically; `8` raw traces, `315` tool calls,
65+
`3,584,209` input tokens, `3,212,672` cached input tokens, `73,136` output
66+
tokens, `30,119` thinking tokens.
67+
- Estimated cost using static price map: `$0.522267` no-cache,
68+
`$0.081489` cache-adjusted.
69+
- Trading result: still `0%` return and cash-only, but this was a model
70+
decision, not a budget block. Portfolio-manager traces for both days say
71+
`NO TRADE`; no `orders_submit_order` call was blocked.
72+
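As a worked check of the cost figures above, here is a short sketch of a static-price-map estimate. The per-million-token prices are assumptions: they are not stated in this document and were chosen only because they reproduce the reported totals. The `estimate_cost` helper is likewise hypothetical. Thinking tokens are not priced separately here, since the reported totals are reproduced without them.

```python
# Assumed USD prices per million tokens; NOT taken from the source document.
PRICE_PER_MILLION = {"input": 0.14, "cached_input": 0.0028, "output": 0.28}


def estimate_cost(input_tokens, cached_input_tokens, output_tokens,
                  prices=PRICE_PER_MILLION):
    """Return (no_cache_cost, cache_adjusted_cost) in USD."""
    output_cost = output_tokens * prices["output"]

    # No-cache: every input token is billed at the full input price.
    no_cache = (input_tokens * prices["input"] + output_cost) / 1_000_000

    # Cache-adjusted: cached input tokens are billed at the cached price.
    uncached = input_tokens - cached_input_tokens
    cache_adjusted = (
        uncached * prices["input"]
        + cached_input_tokens * prices["cached_input"]
        + output_cost
    ) / 1_000_000
    return no_cache, cache_adjusted


no_cache, cache_adjusted = estimate_cost(
    input_tokens=3_584_209,
    cached_input_tokens=3_212_672,
    output_tokens=73_136,
)
print(f"{no_cache:.6f} {cache_adjusted:.6f}")  # 0.522267 0.081489
```

Under these assumed prices, roughly 90% of input tokens were cache hits, which accounts for most of the gap between the two estimates.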
73+
Next Together smoke should use a cheaper current model first:
74+
`together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput` or
75+
`together_ai/openai/gpt-oss-120b`. Any Kimi run should use Kimi K2.6 only, and only as
76+
an expensive compatibility/quality sample, not as a cost-sensitive benchmark
77+
default.
78+
679
## Recommendation
780

881
Use the AI Investment Committee example as the primary benchmark. It is the right workload because it stresses the exact behavior we care about:
@@ -76,7 +149,7 @@ For fair model comparison, set all four to the same candidate model first. Mixed
76149
|---|---|---|
77150
| Qwen3 235B FP8 throughput | `together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput` | Very cheap throughput model. Good candidate for "can a low-cost open model actually trade?" |
78151
| GPT-OSS 120B on Together | `together_ai/openai/gpt-oss-120b` | Open-weight reasoning baseline via Together. Low cost, tool support listed by Together. |
79-
| Qwen3.6 Plus | `together_ai/Qwen/Qwen3.6-Plus` | Cheaper broad reasoning candidate. Test only if it passes tool-call smoke. |
152+
| Qwen3.6 Plus | `together_ai/Qwen/Qwen3.6-Plus` | Cheaper broad reasoning candidate, but Together's current table does not clearly list function-calling support for it. Test only if a tool-call smoke confirms it works. |
80153
| Kimi K2.6 | `together_ai/moonshotai/Kimi-K2.6` | Agentic/model-swarm positioning, 256K context, function calling listed by Together. Expensive enough that it should not be in the first cost-sensitive finalist set unless smoke quality is clearly strong. |
81154
| Together DeepSeek V4 Pro | `together_ai/deepseek-ai/DeepSeek-V4-Pro` | Together-hosted DeepSeek option. More expensive than direct DeepSeek and no documented Together V4 Flash option was found, but it avoids sending requests to `api.deepseek.com`. Requires `TOGETHERAI_API_KEY`. |
82155

@@ -247,7 +320,7 @@ Only run the finalists for two or three months. Recommended finalists likely:
247320
- Together-hosted DeepSeek V4 Pro is not a cost winner. It is more expensive than direct DeepSeek V4 Pro and far more expensive than direct DeepSeek V4 Flash. Use it only if we specifically want DeepSeek behavior without calling DeepSeek's own API endpoint.
248321
- Direct DeepSeek V4 Flash is the best raw cost bet, but it has a privacy posture Rob does not like for proprietary trading data. Keep it optional.
249322
- Gemini 3.5 Flash is the likely closed-model quality/speed baseline. Google published strong tool-use and finance-agent model-card numbers, so it belongs in the benchmark.
250-
- Kimi K2.5 looked promising in the first smoke because it used tools and placed a bounded order. Kimi K2.6 is still worth a smoke test only if the higher price is justified by quality.
323+
- Kimi K2.5 looked promising in historical smoke runs because it used tools and placed bounded orders, but it should not be used going forward. If testing Kimi, use Kimi K2.6 only, and treat it as an expensive compatibility/quality sample.
251324
- Cerebras is worth testing for speed, but the current ADK/LiteLLM path needs a message-normalization fix for `reasoning_content` before `cerebras/gpt-oss-120b` can complete.
252325
- Qwen throughput is cheap and fast enough to keep testing, but the `list_fred_series` hallucinated tool call means we should watch tool discipline carefully.
253326

@@ -396,7 +469,7 @@ Important benchmark runner fixes from this phase:
396469
- The paid benchmark runner now prints JSON `model_start` and `model_finished` events so long runs are observable.
397470
- Benchmark artifacts can be summarized with `/Users/robertgrzesik/Development/lumibot/scripts/summarize_ai_committee_provider_benchmarks.py`, which reads per-model `result.json` files and writes compact JSON/Markdown comparisons.
398471
- The runner accepts `--agent-run-timeout-seconds` for slow provider qualifiers. This is a per-agent timeout, not the overall benchmark timeout. Keep the default for fast providers; use a higher value for Qwen/Kimi only if the model is making progress but individual calls exceed the runtime's default safety rail.
399-
- The AI committee example now asks each role to produce a structured handoff under the strategy parameter `handoff_target_tokens`, default `24000`, and applies a reusable Lumibot token-budget helper at `handoff_max_tokens`, default `32000`, before passing text to the next role. If a model ignores the target, the helper middle-truncates with an explicit notice instead of silently chopping or crashing the strategy. A higher target does not force the model to use the full budget; the prompt explicitly says not to pad the handoff just to fill the token budget.
472+
- Historical note, now reverted: the AI committee example briefly applied a reusable token-budget helper at `handoff_max_tokens` before passing text to the next role. This was a bad benchmark control because middle truncation changed the evidence seen by downstream agents.
400473

401474
Artifacts:
402475

@@ -410,11 +483,11 @@ Results so far:
410483
- `together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput`, uncapped: failed with `ContextWindowExceededError` after sending about `2,951,306` tokens into a `262,144` token context window. Root cause was oversized role handoffs in the committee example, not a bad API key.
411484
- `together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput`, bounded-handoff rerun: hit the one-hour process timeout after `49` agent run summaries / `12` complete committee cycles with no repeated context-window failure. Partial usage: input `2,025,198`, output `41,640`, tool calls `363`, estimated cost `$0.430024`. The handoff contract fixed the failure mode, but Qwen needs a longer timeout to finish the qualifier.
412485
- Parallel rerun after token-budgeted handoffs showed `deepseek/deepseek-v4-flash` can still exceed context with raw tool results: provider rejected about `7,033,087` requested tokens against a `1,048,576` context window. This exposed a second boundary: tool results, especially raw SEC/companyfacts-style payloads, must also be token-budgeted before entering model context.
413-
- Runtime fix added after the DeepSeek failure: `lumibot.components.agents.context_budget.budget_text_by_tokens()` is now used at the tool-result boundary. Oversized tool results are replaced with an explicit bounded excerpt and a notice telling the model to call a narrower tool/query when more detail is needed.
414-
- The first Qwen run after tool-result budgeting still failed with Together's generic `Input validation error` after an earlier 300-second agent timeout. The likely remaining issue is accumulated context/request shape, not credentials. The tool-result budget was changed from a hard-coded 12K token cap to a 4K default with `LUMIBOT_AGENT_TOOL_RESULT_MAX_TOKENS` override for model-specific reruns.
486+
- Historical note, now reverted: runtime tool-result budgeting was added after the DeepSeek failure, then removed because bounded excerpts changed model-visible evidence and invalidated the benchmark.
487+
- The first Qwen run after tool-result budgeting still failed with Together's generic `Input validation error` after an earlier 300-second agent timeout. The likely remaining issue was accumulated context/request shape, not credentials.
415488
- Fixed-budget rerun state: Qwen rerun with 4K tool-result budget and 900-second per-agent timeout started in `/Users/robertgrzesik/Development/lumibot/artifacts/ai_committee_provider_benchmarks/20260520_144512`. DeepSeek, Gemini, Kimi, OpenAI, and Cerebras fixed-budget reruns started in `/Users/robertgrzesik/Development/lumibot/artifacts/ai_committee_provider_benchmarks/20260520_144705`.
416489
- Cerebras fixed-budget rerun failed immediately with provider billing error: `Payment required to access this resource`. Earlier Cerebras qualifier passed mechanically, so the integration works, but the account/key needs billing credits before Cerebras can be included in the final three-month benchmark.
417-
- Tool-call discipline fix: even with 4K tool-result caps, DeepSeek used `65` tools in the first evidence call. The AI Investment Committee example now passes prompt-level budgets for research/follow-up/portfolio tool calls (`24` / `8` / `6`), and the runtime enforces those budgets by returning a budget-exceeded notice after the role uses its allowed calls. This keeps the benchmark from measuring uncontrolled tool spraying.
490+
- Historical note, now reverted: a runtime tool-call enforcement change was added after DeepSeek used `65` tools in one evidence call. That was the wrong fix because it blocked later execution tools and invalidated trading results.
418491
- Enforced-budget 14-day qualifier artifacts:
419492
- Summary JSON: `/Users/robertgrzesik/Development/lumibot/artifacts/ai_committee_provider_benchmarks/enforced_14d_compact_summary.json`.
420493
- Summary Markdown: `/Users/robertgrzesik/Development/lumibot/artifacts/ai_committee_provider_benchmarks/enforced_14d_summary.md`.

docsrc/agents.rst

Lines changed: 19 additions & 42 deletions
@@ -222,22 +222,19 @@ LumiBot handles all the common instructions internally through its base prompt.
222222
223223
Do not repeat instructions about position sizing, time safety, or tool usage. LumiBot already covers those in the base prompt.
224224

225-
Token-Budgeted Agent Handoffs
226-
-----------------------------
225+
Agent Handoffs
226+
--------------
227227

228228
Multi-agent strategies often pass one agent's output into the next agent. For
229229
example, an evidence researcher may hand a research pack to a bull researcher,
230230
then a bear researcher, then a portfolio manager. These handoffs should be
231-
large enough to preserve useful evidence, but they must stay inside the model's
232-
context window.
231+
large enough to preserve useful evidence while still being concise enough for
232+
the next model call.
233233

234-
Use prompt instructions for the normal behavior and token budgeting as the
235-
safety rail:
234+
Prefer prompt instructions and structured output requests:
236235

237236
.. code-block:: python
238237
239-
from lumibot.components.agents.context_budget import budget_text_by_tokens
240-
241238
result = self.agents["evidence_researcher"].run(
242239
task_prompt=(
243240
"Build a structured evidence handoff. "
@@ -247,51 +244,31 @@ safety rail:
247244
context={"handoff_target_tokens": 24000},
248245
)
249246
250-
evidence_pack = budget_text_by_tokens(
251-
result.summary or result.text,
252-
max_tokens=32000,
253-
label="evidence_pack",
254-
).text
247+
evidence_pack = result.summary or result.text
255248
256249
``handoff_target_tokens`` is the prompt target. It does not force the model to
257250
use that many tokens. It tells the model the upper bound for a complete,
258251
structured handoff. A good model can still return 5,000 or 8,000 tokens when
259252
that is enough.
260253

261-
``max_tokens`` is the hard safety rail before the next agent sees the handoff.
262-
If the text exceeds the budget, LumiBot keeps the beginning and end and inserts
263-
an explicit notice that token-budget truncation happened. This prevents a
264-
single verbose agent from pushing the next agent beyond a provider context
265-
window.
266-
267-
LumiBot also applies a token budget to very large tool results before they are
268-
sent back into the model. Full trace artifacts still record that the tool was
269-
called, but the model sees a bounded excerpt with an explicit truncation notice
270-
and can call a narrower tool or query if it needs more detail. This matters for
271-
large SEC company-facts payloads, filings, news bodies, and other raw data that
272-
can otherwise consume an entire provider context window.
273-
274-
The default tool-result budget is 4,000 estimated tokens per tool result. You
275-
can override it with ``LUMIBOT_AGENT_TOOL_RESULT_MAX_TOKENS`` when benchmarking
276-
models with unusually large context windows, but keep in mind that many tool
277-
results can accumulate inside one agent turn.
254+
Do not silently truncate handoffs or tool results in order to make a backtest
255+
fit a provider context window. Silent truncation changes the evidence the next
256+
agent sees and can turn a trading-quality benchmark into a benchmark of the
257+
truncation policy. If a handoff is too large, prefer narrower tools, better
258+
role prompts, provider-appropriate model selection, or a clear failure with
259+
diagnostics.
278260

279261
For 128K-context models, think about the combined context, not just one
280262
handoff. If the portfolio manager receives evidence, bull, and bear handoffs,
281263
three 32K-token handoffs can already consume roughly 96K tokens before the
282264
system prompt, tool schemas, runtime context, and the portfolio manager's own
283-
output. Larger budgets can be reasonable for bigger-context models, but they
284-
should be chosen intentionally.
285-
286-
Prompt-level tool discipline matters as much as token budgeting. Multi-agent
287-
strategies should tell research agents how many tool calls are reasonable for a
288-
turn, because dozens of individually bounded tool results can still create a
289-
large context. The AI Investment Committee example exposes
290-
``max_research_tool_calls``, ``max_followup_tool_calls``, and
291-
``max_portfolio_tool_calls`` strategy parameters for this reason. LumiBot also
292-
enforces these context budgets at runtime: once a role exceeds its configured
293-
tool-call count, additional tool calls return a budget-exceeded notice instead
294-
of executing.
265+
output.
266+
267+
Do not add hidden runtime tool-call budgets to trading benchmarks. Blocking
268+
tools can invalidate results by preventing execution tools, such as order
269+
submission, from running. If you need to control paid benchmark spend, use an
270+
explicit outer run cap such as ``LUMIBOT_AGENT_MAX_MODEL_CALLS`` and treat the
271+
run as failed when the cap is reached.
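One way to picture an explicit outer run cap is the sketch below. It is only an illustration: the source does not show how LumiBot reads ``LUMIBOT_AGENT_MAX_MODEL_CALLS``, and ``ModelCallCounter``, ``ModelCallCapReached``, and ``cap_from_env`` are hypothetical names. The point is that the cap fails the whole run loudly instead of quietly changing which tools execute.

```python
import os


class ModelCallCapReached(RuntimeError):
    """Raised when a run exceeds its explicit model-call cap."""


class ModelCallCounter:
    """Counts model calls for one run and fails the run past the cap."""

    def __init__(self, max_calls=None):
        self.max_calls = max_calls  # None means no cap
        self.calls = 0

    def record_call(self):
        self.calls += 1
        if self.max_calls is not None and self.calls > self.max_calls:
            raise ModelCallCapReached(
                f"model call {self.calls} exceeds cap of {self.max_calls}; "
                "treat this run as failed, not as a trading result"
            )


def cap_from_env(environ=os.environ, name="LUMIBOT_AGENT_MAX_MODEL_CALLS"):
    """Read the cap from the environment; unset or empty means no cap."""
    raw = environ.get(name)
    return int(raw) if raw else None


# Usage with an explicit mapping so the sketch does not touch the real env.
counter = ModelCallCounter(cap_from_env({"LUMIBOT_AGENT_MAX_MODEL_CALLS": "2"}))
counter.record_call()
counter.record_call()
try:
    counter.record_call()
    run_failed = False
except ModelCallCapReached:
    run_failed = True
```

Raising an exception keeps the cap visible in logs and artifacts, so a capped run cannot be mistaken for a model that chose not to trade.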
295272

296273
DuckDB and Time-Series Data
297274
----------------------------
