small changes by lorenss-m · Pull Request #415 · hud-evals/hud-python

lorenss-m · 2026-06-03T03:40:38Z

Note

High Risk
Large breaking public API and agent/eval orchestration changes across auth-adjacent gateway clients, remote SSH execution, and computer-control paths.

Overview
This is a breaking SDK reshape around environments, tasks, and rollouts. The README and top-level hud exports now center on Environment + @env.task(), Variant / Taskset, and await agent(run) with rewards on run.trace, replacing the older hud.eval() / EvalContext / env.scenario story.

Agents are rebuilt on a slim Agent ABC and a shared ToolAgent loop keyed off a live Run. MCPAgent, lazy _runtime activation, and hud.trace() go away; patches and pretty errors load eagerly from hud/__init__.py. Provider agents (Claude, Gemini, OpenAI) wire native tools to capability clients — SSH for shell/editor, RFB for computer use, MCP proxy tools for discovered env tools — instead of forwarding through generic env MCP tool handlers.

New agent paths include optional BrowserUseAgent (CDP via browser-use) and ClaudeSDKAgent (remote claude CLI over SSH, with a local computer-use MCP bridge for RFB). create_agent now builds agents from config objects rather than Agent.create(). Unit tests were added for Claude/Gemini computer tool dispatch.

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Bugbot is set up for automated code reviews on this repo. Configure here.}

…agent-f-l

v6's outstanding commits refactor the old eval subsystem (manager/context/ instrument) and defer import-time patching via activate_runtime(). This branch already replaced that subsystem (Sandbox/Taskset/Variant) and rewrote the agent base, so none of v6's changes apply cleanly or usefully here. Recorded as an "ours" merge: v6 is marked merged, branch tree is kept verbatim.

cursor

Cursor Bugbot has reviewed your changes and found 7 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Configure here.}

cursor · 2026-06-05T22:00:43Z

+        self.max_output_tokens = config.max_output_tokens
+        self.thinking_level = config.thinking_level
+        self.include_thoughts = config.include_thoughts
+        self.excluded_predefined_functions = list(config.excluded_predefined_functions)


Config exclusions never applied

Medium Severity

GeminiConfig.excluded_predefined_functions is copied onto GeminiAgent but never passed when constructing GeminiComputerTool, so the tool always advertises the full predefined computer-use set regardless of user configuration.

Additional Locations (1)

hud/agents/tool_agent.py#L177-L180

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Configure here.}

cursor · 2026-06-05T22:00:43Z

+                scroll_x=sx,
+                scroll_y=sy,
+            )
+            return await self.screenshot()


Scroll magnitude default shrunk

Medium Severity

Gemini scroll actions now default magnitude to 3 VNC wheel clicks instead of the previous default of 800, so omitted or typical magnitudes produce far less scrolling than before after the RFB refactor.

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Configure here.}

cursor · 2026-06-05T22:00:43Z

+        )
+        if sibling_docs:
+            return [tool_result_msg, BetaMessageParam(role="user", content=sibling_docs)]
+        return tool_result_msg


Citation docs split wrongly

Medium Severity

When citations are enabled, citation document blocks are returned as a separate user message instead of living in the same user turn as the matching tool_result, diverging from the prior Anthropic message shape.

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Configure here.}

cursor · 2026-06-05T22:00:43Z

+        }
+        betas: list[str] | Omit = list(required_betas) if required_betas else Omit()
        tool_choice = BetaToolChoiceAutoParam(type="auto", disable_parallel_tool_use=True)
+        tools = cast("list[BetaToolUnionParam]", list(state.params))


Tool search defer dropped

Medium Severity

Large MCP tool catalogs no longer get defer_loading when ClaudeToolSearchTool is configured, because the threshold logic that marked generic function tools was removed from get_response.

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Configure here.}

cursor · 2026-06-05T22:00:43Z

+
+        run_cmd = self._build_cli_command(
+            prompt=prompt, max_steps=max_steps, system_prompt=system_prompt,
+            mcp_config_path=mcp_config_path,


Prompt file never used

Medium Severity

The agent writes the task prompt to .hud_prompt.txt over SFTP but still passes the full prompt on the claude CLI command line, so long prompts remain subject to shell length and quoting limits the file was meant to avoid.

Additional Locations (1)

hud/agents/claude/sdk/agent.py#L214-L215

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Configure here.}

cursor · 2026-06-05T22:00:43Z

+                        response=response,
+                        parts=parts or None,
+                    ),
+                ),


Computer URL field dropped

Medium Severity

Gemini computer-use tool results no longer include the required url field (and related metadata) on FunctionResponse, because formatting was centralized without the browser-specific fields the old GeminiComputerTool.format_result added.

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Configure here.}

cursor · 2026-06-05T22:00:43Z

+    "GeminiGlobTool",
+    "GeminiListTool",
+    "GeminiMCPProxyTool",
+    "GeminiMemoryTool",


Exported missing memory tool

Low Severity

__all__ still lists GeminiMemoryTool after memory.py was deleted in the same refactor, so importing that name from hud.agents.gemini.tools fails despite being part of the public export list.

^{Reviewed by Cursor Bugbot for commit cc7bb2d. Configure here.}

lorenss-m added 22 commits May 26, 2026 22:51

restructure + claude [in progress, openai/gemini not done]

e1d420c

rfb + runnable test [in progress}

e285d66

refactor openai + gemini

beecc36

fx

8181d2e

imp and warmup

f33c7ee

mm fix

3056a9f

claude sdk

1751b40

fx win outputs

ae04127

fx

9b0dec6

fx

3921da2

add bu fix claude

145759a

additions

ea185ce

fxs

fda0479

add impl tinker api support + reward system

3a11712

Merge branch 'v6' of https://github.com/hud-evals/hud-python into v6-…

429ec15

…agent-f-l

fix rollouts

d4b85b8

fix running

c07895e

add eval flows

c21f27d

telem

6563750

add legacy improvements, cleanup

542b7d4

cleanup

026fd9d

cleanup

52623b1

lorenss-m changed the title ~~V6 agent f l~~ small changes Jun 3, 2026

lorenss-m and others added 7 commits June 3, 2026 09:59

fxs

3684598

better legacy compat

b3fdb38

tests time

9b44b85

fxs

4ba5a0f

fix tests

29a0fb1

full tests and cleanup

4dcf91d

jdchawla29 marked this pull request as ready for review June 5, 2026 21:59

jdchawla29 merged commit 2a356e3 into v6 Jun 5, 2026
1 of 5 checks passed

cursor Bot reviewed Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

small changes#415

small changes#415
jdchawla29 merged 29 commits into
v6from
v6-agent-f-l

lorenss-m commented Jun 3, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 5, 2026

Uh oh!

cursor Bot Jun 5, 2026

Uh oh!

cursor Bot Jun 5, 2026

Uh oh!

cursor Bot Jun 5, 2026

Uh oh!

cursor Bot Jun 5, 2026

Uh oh!

cursor Bot Jun 5, 2026

Uh oh!

cursor Bot Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lorenss-m commented Jun 3, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 5, 2026

Choose a reason for hiding this comment

Config exclusions never applied

Uh oh!

cursor Bot Jun 5, 2026

Choose a reason for hiding this comment

Scroll magnitude default shrunk

Uh oh!

cursor Bot Jun 5, 2026

Choose a reason for hiding this comment

Citation docs split wrongly

Uh oh!

cursor Bot Jun 5, 2026

Choose a reason for hiding this comment

Tool search defer dropped

Uh oh!

cursor Bot Jun 5, 2026

Choose a reason for hiding this comment

Prompt file never used

Uh oh!

cursor Bot Jun 5, 2026

Choose a reason for hiding this comment

Computer URL field dropped

Uh oh!

cursor Bot Jun 5, 2026

Choose a reason for hiding this comment

Exported missing memory tool

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lorenss-m commented Jun 3, 2026 •

edited by cursor Bot

Loading