
feat(task): enforce a per-run cost budget for agents and sub-agents#3703

Merged
ak684 merged 1 commit into main from alona/sdk-subagent-budget
Jun 14, 2026


@ak684 ak684 commented Jun 14, 2026

Contributor

HUMAN:

Agents and sub-agents only had an iteration cap, not a cost ceiling — this adds a per-run budget so a verbose run or a sub-agent fan-out can't run away on spend. Reviewed the change and the deterministic + real-LLM tests, including a spawned sub-agent halting on its budget.

  • A human has tested these changes.

AGENT:


Why

A run was bounded only by the iteration cap (max_iteration_per_run), which limits step count but not spend. A verbose-but-productive agent, or a fan-out of sub-agents, could run away on cost with no hard ceiling. (Metrics.max_budget_per_task existed but was dead — stored/merged, never enforced.)

Summary

  • Add an optional max_budget_per_run (USD) on LocalConversation, enforced in the run loop next to the iteration cap (run() + arun()), preserving a FINISHED status set on the final step.
  • Wire it through AgentDefinition.max_budget_per_run + TaskManager so spawned sub-agents inherit the parent's budget or override it from their definition.
  • Surface a sub-agent run that ends in ERROR (budget or iteration cap) to the parent task, instead of reporting an empty "completed" result.

How to Test

Deterministic (no API):

uv run pytest tests/sdk/conversation/local/test_agent_status_transition.py  # run-loop halt + FINISHED-preserve
uv run pytest tests/tools/task tests/sdk/subagent                            # wiring + surfacing
uv run pytest tests/sdk tests/tools                                          # 5516 passed, 0 failed
uv run ruff check && uv run pre-commit run pyright --all-files               # clean

Real-LLM (proxy; cost a few cents), proving spend triggers the halt naturally:

  1. Enforcement — a spawned sub-agent with AgentDefinition(max_budget_per_run=$0.05)
    ran real terminal steps; real accumulated_cost reached $0.0529 and the run
    halted: MaxBudgetReached → status ERROR.
  2. Sub-agent flow surfacing — through the full TaskManager.start_task path, the
    parent Task reported:
    error / "Agent reached maximum budget limit ($0.0500); accumulated cost $0.0524."
    (previously this was a silent empty "completed").

Type

  • Feature

Notes

  • Design: the budget lives on the conversation (mirroring max_iteration_per_run) and is checked against conversation_stats.get_combined_metrics().accumulated_cost (total across the agent + condenser LLMs). The dormant per-LLM
    Metrics.max_budget_per_task is left untouched — aggregating it across a run's LLMs is awkward, and a run-level limit is the natural home.
  • Heads-up (intentional, pre-existing-gap fix): the parent-surfacing change also covers the iteration cap, which was equally silent before. So a sub-agent that hits max_iteration_per_run now surfaces as a task error too. Happy to scope this to budget-only if preferred.
  • Back-compat: max_budget_per_run is appended as the last named __init__ param (before **_) so no existing positional argument shifts; default None = no budget (no behavior change for existing runs). No LocalConversation is constructed positionally anywhere in the SDK/agent-server.

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:029f0cc-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-029f0cc-python \
  ghcr.io/openhands/agent-server:029f0cc-python

All tags pushed for this build

ghcr.io/openhands/agent-server:029f0cc-golang-amd64
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-golang-amd64
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-golang-amd64
ghcr.io/openhands/agent-server:029f0cc-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:029f0cc-golang-arm64
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-golang-arm64
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-golang-arm64
ghcr.io/openhands/agent-server:029f0cc-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:029f0cc-java-amd64
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-java-amd64
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-java-amd64
ghcr.io/openhands/agent-server:029f0cc-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:029f0cc-java-arm64
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-java-arm64
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-java-arm64
ghcr.io/openhands/agent-server:029f0cc-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:029f0cc-python-amd64
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-python-amd64
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-python-amd64
ghcr.io/openhands/agent-server:029f0cc-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:029f0cc-python-arm64
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-python-arm64
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-python-arm64
ghcr.io/openhands/agent-server:029f0cc-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:029f0cc-golang
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-golang
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-golang
ghcr.io/openhands/agent-server:029f0cc-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:029f0cc-java
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-java
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-java
ghcr.io/openhands/agent-server:029f0cc-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:029f0cc-python
ghcr.io/openhands/agent-server:029f0cc806075773b32e3e1bc10dfd398aee1b5e-python
ghcr.io/openhands/agent-server:alona-sdk-subagent-budget-python
ghcr.io/openhands/agent-server:029f0cc-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 029f0cc-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 029f0cc-python-amd64) are also available if needed

The iteration cap bounds step count but not spend, so a verbose-but-productive
run (or a fan-out of sub-agents) could run away on cost. Add an optional
max_budget_per_run (USD) on LocalConversation, enforced in the run loop next to
the iteration cap (preserving FINISHED), and wire it through
AgentDefinition.max_budget_per_run + TaskManager so spawned sub-agents inherit or
override a budget.

Also surface a sub-agent run that ends in ERROR (budget or iteration cap) to the
parent task instead of reporting an empty 'completed' result.
@github-actions

Contributor

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

@github-actions

Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

@github-actions

Contributor

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| openhands-sdk/openhands/sdk/conversation/impl/local_conversation.py | 773 | 61 | 92% | 95, 433–434, 467, 484, 629, 675, 744, 760, 836, 1125–1126, 1203–1204, 1207, 1335, 1338–1339, 1363, 1396–1397, 1400, 1406, 1487, 1494, 1497, 1500, 1504–1505, 1509–1510, 1513, 1520, 1545, 1549, 1552, 1571, 1623, 1626, 1665, 1672–1673, 1681, 1685–1687, 1694, 1806, 1811, 1921, 1923, 1927–1928, 1939–1940, 1965, 2160, 2164, 2234, 2241–2242 |
| openhands-sdk/openhands/sdk/subagent/schema.py | 150 | 12 | 92% | 62, 75, 94, 109, 139, 149, 181, 188–190, 285, 288 |
| openhands-tools/openhands/tools/task/manager.py | 179 | 120 | 32% | 82–84, 88–90, 100–101, 103–104, 108, 118, 122, 125, 128–133, 135, 141–142, 146, 150–153, 156–160, 182–183, 185–186, 191, 196, 203–205, 210–213, 222, 227, 234, 247–248, 250, 256, 261–262, 264, 274, 279, 285, 297–298, 300–303, 305, 323–324, 328–329, 331–332, 335, 337, 340, 343, 347–348, 350–353, 355–357, 361, 365–366, 368–373, 375–376, 378, 384, 389–391, 397–398, 402–404, 406, 409, 411–412, 423–424, 428, 435–436, 444, 448, 453, 455–456 |
| TOTAL | 31411 | 8853 | 71% | |

@all-hands-bot all-hands-bot left a comment

Collaborator


⚠️ QA Report: PASS WITH ISSUES

Functional QA passed: real SDK runs showed per-run budget enforcement for local conversations and sub-agents, with no functional regressions found; one non-functional CI check is failing for the human-only PR description field.

Does this PR achieve its stated goal?

Yes. The PR set out to enforce max_budget_per_run for agents and sub-agents, wire it through file-based/sub-agent definitions, and surface budget-limited sub-agent runs as task errors. I exercised those paths with real SDK conversations using the configured LLM proxy: on main, a tiny budget was ignored and the same workflows finished successfully; on this PR, the same workflows stopped with MaxBudgetReached, and TaskManager.start_task returned an errored task with the budget message. A finish-only conversation that spent more than the tiny budget still remained finished, matching the PR’s stated preservation behavior.

| Phase | Result |
| --- | --- |
| Environment Setup | uv sync --dev completed earlier; no tests/linters run locally. |
| CI Status | ⚠️ All refreshed checks were green except PR Description Check / Validate PR description failing and this QA job still in progress. |
| Functional Verification | ✅ Local conversation budget halt, sub-agent budget inheritance/error surfacing, file-agent budget parsing, finish preservation, and deterministic custom-tool smoke all behaved as expected. |

Functional Verification

Test 1: Local conversation budget enforcement with a real terminal task

Step 1 — Reproduce / establish baseline without the fix:
On main (6bf874e7), ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_real_budget.py:

status finished
accumulated_cost 0.04793500
error_codes []
error_details []
event_types ['SystemPromptEvent', 'MessageEvent', 'ActionEvent', 'ObservationEvent', 'MessageEvent']

This confirms the old behavior: even with max_budget_per_run=0.000001, the real LLM+terminal run spent above the budget and still finished, with no budget error.

Step 2 — Apply the PR's changes:
Checked out alona/sdk-subagent-budget at 029f0cc806075773b32e3e1bc10dfd398aee1b5e.

Step 3 — Re-run with the fix in place:
Ran the same command on the PR branch:

status error
accumulated_cost 0.02418500
error_codes ['MaxBudgetReached']
error_details ['Agent reached maximum budget limit ($0.0000); accumulated cost $0.0242.']
event_types ['SystemPromptEvent', 'MessageEvent', 'ActionEvent', 'ObservationEvent', 'ConversationErrorEvent']

This shows the run now halts after the first real tool-using step once accumulated cost exceeds the run budget.
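
The "$0.0000" in the error detail is a display artifact: the configured budget of 0.000001 rounds to zero at four decimal places. A short sketch, assuming the message uses four-decimal fixed-point formatting as the logged output suggests (`budget_message` is a hypothetical helper, not the SDK's function):

```python
def budget_message(budget: float, spent: float) -> str:
    # Assumption: four-decimal formatting, inferred from the logged error detail.
    return (
        f"Agent reached maximum budget limit (${budget:.4f}); "
        f"accumulated cost ${spent:.4f}."
    )


# A budget of one millionth of a dollar is displayed as $0.0000.
print(budget_message(0.000001, 0.024185))
```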

Test 2: Sub-agent budget inheritance and task error surfacing

Step 1 — Reproduce / establish baseline without the fix:
On main, ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_task_budget.py:

task_status completed
task_error None
task_result Command ran successfully: `qa-subagent-budget`
sub_status finished
sub_budget <missing>
sub_cost 0.01108100
sub_errors []

This confirms the previous sub-agent behavior: the parent’s budget argument was not present on the sub-conversation, and TaskManager.start_task reported completion despite spend above the tiny budget.

Step 2 — Apply the PR's changes:
Checked out the PR commit again.

Step 3 — Re-run with the fix in place:
Ran the same command on the PR branch:

task_status error
task_error Agent reached maximum budget limit ($0.0000); accumulated cost $0.0242.
task_result None
sub_status error
sub_budget 1e-06
sub_cost 0.02422000
sub_errors [('MaxBudgetReached', 'Agent reached maximum budget limit ($0.0000); accumulated cost $0.0242.')]

This verifies the budget is inherited by the spawned sub-conversation and the parent-facing Task now surfaces the budget stop as an error instead of an empty success.

Test 3: File-based sub-agent budget frontmatter

Step 1 — Reproduce / establish baseline without the fix:
On main, ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_agent_definition_budget.py:

name budgeted
max_budget_attr <missing>
metadata_has_budget True
metadata_budget 0.123

This shows max_budget_per_run was previously treated as unstructured metadata, not as a typed agent definition field.

Step 2 — Apply the PR's changes:
Checked out the PR commit again.

Step 3 — Re-run with the fix in place:
Ran the same command on the PR branch:

name budgeted
max_budget_attr 0.123
metadata_has_budget False
metadata_budget None

This verifies a real file-agent definition now exposes the budget as a typed value and no longer leaves it in generic metadata.

Test 4: FINISHED status is preserved when the agent completes over budget

On the PR branch, ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_finished_budget.py with a no-tool prompt and max_budget_per_run=0.000001:

status finished
accumulated_cost 0.01843500
error_codes []
message_events ['user', 'agent']

This confirms the PR’s stated behavior: an agent that completes on its final step is not converted to an error solely because that final LLM call pushes cost above the budget.

Test 5: Existing deterministic custom tool execution still works

On the PR branch, ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_smoke_tool.py, a deterministic TestLLM conversation with a registered ClientTool:

status ConversationExecutionStatus.FINISHED
final_events 5
errors []

This smoke-checks the changed run loop with a custom tool path and no budget configured, confirming the default None budget does not break normal deterministic SDK execution.

Issues Found

  • ⚠️ Non-functional CI issue: PR Description Check / Validate PR description is failing because the human-only PR description section is not completed. I did not edit it; a human contributor needs to update the HUMAN: note and checkbox in their own words.
  • Functional QA issues: None.

Verdict: PASS WITH ISSUES.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

@ak684 ak684 requested a review from all-hands-bot June 14, 2026 20:52

all-hands-bot commented Jun 14, 2026

Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Collaborator


Code Review

🟢 Good taste — Elegant, simple solution that follows existing patterns in the codebase.

Summary

This PR adds a per-run cost budget feature for agents and sub-agents. The implementation:

  • Adds max_budget_per_run parameter to LocalConversation
  • Checks accumulated cost against the budget at each iteration step
  • Preserves FINISHED status if the agent completes before budget check triggers
  • Propagates budgets from agent definitions to sub-conversations
  • Surfaces run-limit errors from sub-agents to parent tasks

Analysis

[local_conversation.py]

  • Budget check is placed correctly before the iteration check — this is the right priority order
  • The guard self._state.execution_status != ConversationExecutionStatus.FINISHED correctly prevents overriding a successful finish
  • _budget_exceeded_detail() and _emit_run_limit_error() are clean helper methods

[schema.py]

  • _extract_max_budget_per_run() handles bool/int/float/string types appropriately
  • Field is excluded from metadata via _METADATA_FIELDS (good)

[manager.py]

  • Budget inheritance chain is correct: definition value → parent value → None
  • Error surfacing in _run_task() now properly distinguishes ERROR vs successful completion
  • _run_error_detail() extracts the last error event for parent visibility

[Tests]

  • test_execution_status_error_on_max_budget: Pre-seeds spend to bypass TestLLM cost limitation — pragmatic approach ✅
  • test_finished_preserved_even_when_over_budget: Correctly verifies the FINISHED-over-budget edge case
  • Task manager tests cover both definition-sourced and inherited budgets

No Issues Found

  • No breaking changes to existing APIs
  • No type safety issues
  • No complexity concerns
  • No security concerns

[RISK ASSESSMENT]

  • ⚠️ Risk Assessment: 🟢 LOW
    This is a feature addition with no impact on existing behavior. The budget only applies when explicitly set and defaults to None (disabled).

VERDICT:
Worth merging — Clean implementation of a useful cost-control feature.

KEY INSIGHT:
The design correctly handles the edge case where an agent finishes successfully after accumulating costs — the FINISHED status is preserved rather than being overwritten by the budget error.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

@ak684 ak684 merged commit 84f1f9c into main Jun 14, 2026
42 of 43 checks passed
@ak684 ak684 deleted the alona/sdk-subagent-budget branch June 14, 2026 21:01

@all-hands-bot all-hands-bot left a comment

Collaborator


✅ QA Report: PASS

Verified the per-run budget feature through real SDK conversations, TaskManager sub-agent execution, file-based agent loading, and one real-LLM tool run; the PR achieves its stated goal.

Does this PR achieve its stated goal?

Yes. The PR set out to enforce a per-run USD budget for agents and sub-agents, wire it through file-based/registered sub-agents, and surface sub-agent budget/iteration-limit failures to the parent task. I exercised those paths directly: over-budget local runs now stop with MaxBudgetReached, finished runs remain finished, file-based max_budget_per_run loads as a first-class field, sub-agents inherit/override budgets, and sub-agent run-limit failures now return task_status: "error" with the budget detail instead of an empty completed result.

| Phase | Result |
| --- | --- |
| Environment Setup | ✅ Project bootstrap/dependency sync completed via the repo make build / uv sync --dev flow. |
| CI Status | gh pr checks: 35 successful, 2 skipped, 1 pending (QA Changes by OpenHands/qa-changes); 0 failing. |
| Functional Verification | ✅ Exercised SDK conversation, file-based agent definition, TaskManager sub-agent, inherited budget, and real LLM spend paths. |

Functional Verification

Test 1: LocalConversation stops on per-run cost budget

Step 1 — Establish baseline without the fix:
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_budget_run.py:

{
  "accepted_budget_arg": true,
  "has_max_budget_attr": false,
  "status": "error",
  "step_calls": [1, 2, 3],
  "error_codes": ["MaxIterationsReached"],
  "spent": 5.0
}

This shows the old SDK swallowed the unknown max_budget_per_run argument but did not enforce it; the run continued until the iteration cap.

Step 2 — Apply the PR changes:
Checked out 029f0cc806075773b32e3e1bc10dfd398aee1b5e.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_budget_run.py:

{
  "has_max_budget_attr": true,
  "max_budget_attr": 1.0,
  "status": "error",
  "step_calls": [1],
  "error_codes": ["MaxBudgetReached"],
  "error_details": ["Agent reached maximum budget limit ($1.0000); accumulated cost $5.0000."],
  "spent": 5.0
}

This confirms the budget is now recognized and stops the run before the iteration cap.

Test 2: Finished runs stay finished even when accumulated cost is above budget

Step 1 — Establish baseline:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_budget_finished.py on origin/main:

{
  "status": "finished",
  "step_calls": [1],
  "error_codes": [],
  "spent": 5.0
}

The old behavior finished because no budget was enforced.

Step 2 — Apply the PR changes:
Checked out 029f0cc806075773b32e3e1bc10dfd398aee1b5e.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_budget_finished.py:

{
  "status": "finished",
  "step_calls": [1],
  "error_codes": [],
  "spent": 5.0
}

This confirms the PR preserves FINISHED when the agent completes on the final/over-budget step, matching the PR description.

Test 3: File-based sub-agent definitions load max_budget_per_run

Step 1 — Establish baseline without the fix:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_agent_definition_budget.py on origin/main:

{
  "has_budget_attr": false,
  "budget": null,
  "metadata": {
    "max_budget_per_run": "2.5",
    "custom_note": "keep-me"
  }
}

This shows max_budget_per_run was only opaque metadata before the PR.

Step 2 — Apply the PR changes:
Checked out 029f0cc806075773b32e3e1bc10dfd398aee1b5e.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_agent_definition_budget.py:

{
  "has_budget_attr": true,
  "budget": 2.5,
  "metadata": {
    "custom_note": "keep-me"
  }
}

This confirms file-based sub-agent frontmatter now exposes the budget field and does not leave it duplicated in metadata.

Test 4: TaskManager enforces sub-agent override budget and surfaces failure

Step 1 — Establish baseline without the fix:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_subagent_budget.py on origin/main:

{
  "definition_has_budget": false,
  "parent_has_budget": false,
  "subagent_budget": null,
  "subagent_status": "error",
  "subagent_step_calls": [1, 2],
  "task_status": "completed",
  "task_result": "",
  "task_error": null
}

This reproduces the gap described in the PR: the sub-agent hit a run-limit error, but the parent task saw an empty completed result.

Step 2 — Apply the PR changes:
Checked out 029f0cc806075773b32e3e1bc10dfd398aee1b5e.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_subagent_budget.py:

{
  "definition_has_budget": true,
  "definition_budget": 1.0,
  "parent_budget": 7.0,
  "subagent_budget": 1.0,
  "subagent_status": "error",
  "subagent_step_calls": [1],
  "task_status": "error",
  "task_error": "Agent reached maximum budget limit ($1.0000); accumulated cost $5.0000."
}

This confirms a sub-agent definition budget overrides the parent budget and the parent task now receives the budget error.

Test 5: TaskManager inherits parent budget when definition has no override

Step 1 — Establish baseline without the fix:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_subagent_inherit_budget.py on origin/main:

{
  "parent_has_budget": false,
  "definition_budget": null,
  "subagent_budget": null,
  "subagent_status": "error",
  "subagent_step_calls": [1, 2],
  "task_status": "completed",
  "task_result": "",
  "task_error": null
}

The parent budget was not a real conversation setting, so nothing was inherited.

Step 2 — Apply the PR changes:
Checked out 029f0cc806075773b32e3e1bc10dfd398aee1b5e.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_subagent_inherit_budget.py:

{
  "parent_has_budget": true,
  "parent_budget": 1.0,
  "definition_budget": null,
  "subagent_budget": 1.0,
  "subagent_status": "error",
  "subagent_step_calls": [1],
  "task_status": "error",
  "task_error": "Agent reached maximum budget limit ($1.0000); accumulated cost $5.0000."
}

This confirms spawned sub-agents inherit the parent's budget when the definition does not override it.

Test 6: Real LLM spend triggers the budget halt naturally

Step 1 — Establish baseline:
The deterministic baseline above established that pre-PR conversations had no budget field/enforcement. I then exercised the PR with an actual LLM call and terminal tool action to ensure real accumulated cost, not only synthetic metrics, drives the halt.

Step 2 — Apply the PR changes:
Checked out 029f0cc806075773b32e3e1bc10dfd398aee1b5e with LLM_MODEL, LLM_BASE_URL, and LLM_API_KEY set.

Step 3 — Run with real LLM/tool execution:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_real_llm_budget.py:

{
  "status": "error",
  "error_codes": ["MaxBudgetReached"],
  "error_details": ["Agent reached maximum budget limit ($0.0000); accumulated cost $0.0086."],
  "spent": 0.008635,
  "event_types": ["SystemPromptEvent", "MessageEvent", "ActionEvent", "ObservationEvent", "ConversationErrorEvent"]
}

This confirms a real user-style SDK conversation with an LLM and terminal tool call accumulates spend and halts on the new budget ceiling.

Issues Found

None.

Verdict: PASS

This review was created by an AI agent (OpenHands) on behalf of the user.

@VascoSch92 VascoSch92 left a comment

Member


@ak684 this should be evaluated before being merged.

I don't think this solves the problem, since we never tell the sub-agent that it has no budget left.

Moreover, this should be enforced in the main convo, not just for sub-agents; the sub-agent convo would then inherit that budget. As it stands, the main convo and the sub-agent convo behave differently.

@VascoSch92

Member

There are a couple of problems with the implementation

  1. The budget is unreachable through the public Conversation factory (likely oversight)

LocalConversation.__init__ gained max_budget_per_run, but Conversation.__new__ (conversation.py) was not updated. It has no **kwargs and forwards every argument explicitly (max_iteration_per_run is forwarded at lines 190 and 211, but there is no budget equivalent), so the parameter is a hard error through the documented entry point:

  Conversation(...) RAISED TypeError:
      Conversation.__new__() got an unexpected keyword argument 'max_budget_per_run'
  LocalConversation(...) ACCEPTED -> max_budget_per_run=2.5

Net effect: a top-level/parent agent budget can only be set by instantiating LocalConversation directly or via sub-agent frontmatter, not via Conversation(...). Given the symmetry with max_iteration_per_run, this looks unintended.

  2. A STUCK sub-agent is reported to the parent as a successful (empty) task

_run_task (manager.py:361) only special-cases ERROR; every other terminal status falls through to the success branch (set_result(get_agent_final_response(...))). Sub-agents have stuck detection on by default, so a stuck sub-agent ends STUCK, get_agent_final_response returns "", and the parent is told the task completed:

  [control] standalone sub-conversation final status = stuck
  [via _run_task] task.status = completed
  [via _run_task] task.error  = None
  [via _run_task] task.result = '' 
  -> parent agent observation text = 'Task completed with no result.' (is_error=False)

This is the same "empty success" failure the commit's own comment says it is fixing — it just isn't caught for the stuck path (and, by inspection of the same branch, any non-ERROR terminal status such as PAUSED).

  3. The budget silently does nothing when cost isn't computed (limitation, please document)

_budget_exceeded_detail keys off get_combined_metrics().accumulated_cost. For any LLM where litellm has no pricing (custom/proxy/self-hosted models) this stays 0.0, so the cap never triggers — unlike the iteration cap, which is always effective:

  accumulated_cost after run = 0.0
  _budget_exceeded_detail()  = None
  final status               = error
  error codes emitted        = ['MaxIterationsReached']   # budget of $0.0001 never fired

  4. Sub-agent inherits the full budget, not the remaining budget. effective_max_budget (manager.py:256) gives each sub-agent the parent's full max_budget_per_run. The parent does eventually account for sub-agent spend (via _update_parent_metrics → get_combined_metrics), so it is a lagging aggregate cap. But because the check is post-step, multiple task calls in one parent step (or a single over-budget sub-agent) can overshoot before the parent re-checks. Fine if intended as a per-run soft ceiling; worth a doc sentence so it isn't read as a hard total cap.
  5. RemoteConversation has no budget plumbing or server-side enforcement.
  6. We are not showing the budget to the agent, so it cannot know how to act on it. Moreover, the agent has no notion of money or budget (different LLMs have different costs).
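
To make the overshoot in point 4 concrete, here is a toy calculation under stated assumptions: every sub-agent inherits the parent's full budget, each halts only after the first step that ends above it, and the parent re-checks only after its own step completes. `run_until_over` and `parent_step_spend` are hypothetical helpers, not SDK code:

```python
def run_until_over(budget: float, step_cost: float) -> float:
    """Spend step_cost per step; halt after the first step that ends over budget."""
    spent = 0.0
    while spent <= budget:
        spent += step_cost
    return spent


def parent_step_spend(budget: float, fan_out: int, sub_step_cost: float) -> float:
    """Total spend from one parent step that launches fan_out sub-agents."""
    # Each sub-agent gets the full budget, so the spends add up before
    # the parent's own post-step check sees any of them.
    return sum(run_until_over(budget, sub_step_cost) for _ in range(fan_out))


# Budget of $1.00, three sub-agents, $0.25 per sub-agent step.
print(parent_step_spend(1.0, 3, 0.25))
```

Under these assumptions a single parent step spends $3.75 against a $1.00 budget, which is why the limit reads more naturally as a soft per-run ceiling than as a hard total cap.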

Reproduction script

  import logging, tempfile
  from collections.abc import Sequence
  from typing import ClassVar

  logging.disable(logging.CRITICAL)  # silence SDK log noise
  
  from openhands.sdk import Agent
  from openhands.sdk.conversation import Conversation
  from openhands.sdk.conversation.impl.local_conversation import LocalConversation
  from openhands.sdk.conversation.state import ConversationExecutionStatus as Status
  from openhands.sdk.event.conversation_error import ConversationErrorEvent
  from openhands.sdk.llm import ImageContent, Message, MessageToolCall, TextContent
  from openhands.sdk.llm.utils.metrics import Metrics
  from openhands.sdk.testing import TestLLM
  from openhands.sdk.tool import ( 
      Action, Observation, Tool, ToolDefinition, ToolExecutor, register_tool,
  )
  from openhands.tools.task.manager import Task, TaskManager, TaskStatus


  # --- a tool that always returns the same observation (to drive a stuck loop) ---
  class A(Action):
      command: str
  class O(Observation):
      result: str
      @property
      def to_llm_content(self) -> Sequence[TextContent | ImageContent]:
          return [TextContent(text=self.result)]
  class Exec(ToolExecutor[A, O]):
      def __call__(self, action: A, conversation=None) -> O:
          return O(result="same-observation")
  class LoopTool(ToolDefinition[A, O]):
      name: ClassVar[str] = "test_tool"
      @classmethod
      def create(cls, conv_state=None, *, executor, **p):
          return [cls(description="t", action_type=A, observation_type=O, executor=executor)]

  register_tool("test_tool", LoopTool.create(executor=Exec())[0])
  
  def user(text): return Message(role="user", content=[TextContent(text=text)])
  def call():  # one identical tool call
      return Message(role="assistant", content=[TextContent(text="")],
                     tool_calls=[MessageToolCall(id="c", name="test_tool",
                                 arguments='{"command":"x"}', origin="completion")])
  def err_codes(events):
      return [e.code for e in events if isinstance(e, ConversationErrorEvent)]

  
  # ============================================================ BUG 1
  with tempfile.TemporaryDirectory() as d:
      agent = Agent(llm=TestLLM.from_messages([]), tools=[])
      raised = None
      try:  
          Conversation(agent=agent, workspace=d, max_budget_per_run=2.5, visualizer=None)
      except TypeError as e:
          raised = str(e)
      assert raised is not None, "NOT REPRODUCED: public factory accepted the kwarg"
      assert "max_budget_per_run" in raised
      # control: the underlying class *does* accept it
      lc = LocalConversation(agent=Agent(llm=TestLLM.from_messages([]), tools=[]),
                             workspace=d, max_budget_per_run=2.5, visualizer=None,
                             delete_on_close=False)
      assert lc.max_budget_per_run == 2.5
      print("BUG 1 REPRODUCED: public Conversation(...) rejects max_budget_per_run")
      print(f"          TypeError: {raised}")


  # ============================================================ BUG 2
  with tempfile.TemporaryDirectory() as d:
      conv = LocalConversation(
          agent=Agent(llm=TestLLM.from_messages([call() for _ in range(10)]),
                      tools=[Tool(name="test_tool")]),
          workspace=d, visualizer=None, delete_on_close=False,
          stuck_detection=False, max_iteration_per_run=3, max_budget_per_run=0.0001)
      conv.send_message(user("go"))
      conv.run()
      spent = conv.conversation_stats.get_combined_metrics().accumulated_cost
      codes = err_codes(conv.state.events)
      assert spent == 0.0, f"NOT REPRODUCED: cost was tracked ({spent})"
      assert conv._budget_exceeded_detail() is None, "NOT REPRODUCED: budget fired"
      assert "MaxBudgetReached" not in codes and "MaxIterationsReached" in codes, codes

      # positive control: when cost IS present, the same check DOES fire ->
      # proving the no-op is caused purely by accumulated_cost staying 0.
      conv.conversation_stats.usage_to_metrics["seed"] = Metrics(accumulated_cost=9.0)
      assert conv._budget_exceeded_detail() is not None, "control failed: budget should fire at $9"
      print("BUG 2 REPRODUCED: $0.0001 budget never fired (cost stayed 0.0); "
            f"only stopped by {codes}")
      print("          control: after seeding cost=$9, _budget_exceeded_detail() now fires")
  

  # ============================================================ BUG 3
  with tempfile.TemporaryDirectory() as d:
      # control: this scenario genuinely ends STUCK on its own
      ctrl = LocalConversation( 
          agent=Agent(llm=TestLLM.from_messages([call() for _ in range(12)]),
                      tools=[Tool(name="test_tool")]),
          workspace=d, visualizer=None, delete_on_close=False, max_iteration_per_run=30)
      ctrl.send_message(user("loop")); ctrl.run()
      assert ctrl.state.execution_status == Status.STUCK, ctrl.state.execution_status

      # same scenario through the patched TaskManager._run_task path
      parent = LocalConversation(agent=Agent(llm=TestLLM.from_messages([]), tools=[]),
                                 workspace=d, visualizer=None, delete_on_close=False)
      mgr = TaskManager(); mgr._ensure_parent(parent)
      sub = LocalConversation(
          agent=Agent(llm=TestLLM.from_messages([call() for _ in range(12)]),
                      tools=[Tool(name="test_tool")]),
          workspace=d, visualizer=None, delete_on_close=False, max_iteration_per_run=30)
      task = Task(id="task_00000001", conversation_id=sub.id, conversation=sub,
                  status=TaskStatus.RUNNING)
      mgr._tasks[task.id] = task
      done = mgr._run_task(task, "loop")

      assert done.status == TaskStatus.COMPLETED, f"NOT REPRODUCED: status={done.status}"
      assert done.error is None, f"NOT REPRODUCED: error={done.error}"
      assert done.result == "", f"NOT REPRODUCED: result={done.result!r}"
      print("BUG 3 REPRODUCED: a sub-agent that ends STUCK is reported to the parent as "
            f"status={done.status.value}, error=None, result='' "
            "(parent sees 'Task completed with no result.')")
  
  print("\nAll three assertions held -> all three issues reproduced on this commit.")

3 participants