V6 contrainer mgmt by lorenss-m · Pull Request #416 · hud-evals/hud-python

lorenss-m · 2026-06-06T23:42:17Z

Note

High Risk
Renames the wire RPC method and changes session persistence semantics (breaking for non-SDK clients), rewires Harbor export/convert grading paths, and removes hud push.

Overview
This PR tightens v6 container workflows. The control channel now uses tasks.grade (replacing tasks.evaluate), and an env can keep a started task across a disconnect, so hud task start and a later hud task grade (or a reconnect) can separate prompting from scoring. Only an explicit bye or a new start tears the session down. The client stops sending bye on close to match that behavior.
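For non-SDK clients, the rename is the breaking part. The sketch below shows what a request using the new method name might look like, assuming the control channel carries plain JSON-RPC 2.0 envelopes. The envelope layout, the `build_request` helper, and the example answer are assumptions for illustration; only the method names and the `answer` parameter come from this PR.

```python
import json


def build_request(method: str, params: dict, request_id: int) -> str:
    """Serialize a JSON-RPC 2.0 request (hypothetical helper, not part of the SDK)."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )


# Before this PR a client would have sent method "tasks.evaluate".
# After it, the same call is expected under "tasks.grade".
grade_request = build_request("tasks.grade", {"answer": "42"}, request_id=1)
print(grade_request)
```

A client that still sends tasks.evaluate would presumably get a method-not-found error, which is why the change is flagged as breaking.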

New CLI: hud task (list / start / grade) plus top-level hud task-start, hud task-grade, hud task-list, with shared load_variants for hud eval and hud build. hud build bakes a variant catalog into the manifest. hud push and its tests are removed (deploy-first path).

Harbor: export now copies the full build context, neutralizes CMD/ENTRYPOINT, adds hud_entrypoint.sh (serve channel + park run via hud task start), and grades with hud task grade in test.sh; Harbor→HUD convert roots Workspace at the challenge WORKDIR via guest_path.

Docs: README reframed around the protocol and docker exec examples; new migrate-v6 guide in the docs nav.

Reviewed by Cursor Bugbot for commit bf60f0e.

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.


cursor · 2026-06-07T01:31:08Z

-                        evaluation = await active_runner.evaluate(params)
-                        active_runner = None
+                        evaluation = await self._active_runner.grade(params)
+                        self._active_runner = None


Shared runner across TCP sessions

High Severity

The held TaskRunner lives on the Environment instance, not per control-channel connection. Concurrent clients can interleave tasks.start and tasks.grade on the same async generator, corrupting or cancelling another client’s parked task.
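The interleaving can be reproduced with a toy model. This is a minimal sketch, not the real Environment: `FakeRunner`, `SharedSlotEnv`, `PerConnectionEnv`, and the `conn_id` argument are hypothetical names. It contrasts one runner slot per environment with one slot per connection.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeRunner:
    """Hypothetical stand-in for TaskRunner; it only remembers its task name."""
    task: str

    def grade(self, answer: str) -> dict:
        return {"task": self.task, "answer": answer}


@dataclass
class SharedSlotEnv:
    """One runner slot for the whole environment, as in the flagged diff."""
    active_runner: Optional[FakeRunner] = None

    def start(self, conn_id: str, task: str) -> None:
        # conn_id is ignored: a second client silently replaces the first runner.
        self.active_runner = FakeRunner(task)

    def grade(self, conn_id: str, answer: str) -> dict:
        runner, self.active_runner = self.active_runner, None
        if runner is None:
            raise RuntimeError("no task started")
        return runner.grade(answer)


@dataclass
class PerConnectionEnv:
    """One runner slot per connection id, so clients cannot touch each other's tasks."""
    runners: Dict[str, FakeRunner] = field(default_factory=dict)

    def start(self, conn_id: str, task: str) -> None:
        self.runners[conn_id] = FakeRunner(task)

    def grade(self, conn_id: str, answer: str) -> dict:
        runner = self.runners.pop(conn_id, None)
        if runner is None:
            raise RuntimeError("no task started on this connection")
        return runner.grade(answer)


def interleave(env) -> dict:
    """Client a starts, client b starts, then client a grades."""
    env.start("a", "task-a")
    env.start("b", "task-b")
    return env.grade("a", "42")
```

With the shared slot, client a ends up grading client b's task. With per-connection state, each client grades its own task, at the cost of losing the held session when the connection goes away.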


cursor · 2026-06-07T01:31:08Z

+            except HudProtocolError:
+                # No held session: run the whole lifecycle here (start then grade).
+                await client.start_task(variant.task, variant.args)
+                return await client.grade({"answer": answer_text})


Grade retries all protocol errors

High Severity

hud task grade catches every HudProtocolError and runs start_task again. Real grading failures from the env (JSON-RPC -32000) are treated like “no session,” replacing a parked run and re-running the task instead of surfacing the error.
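One possible shape for a fix is to fall back only when the error clearly means "no held session" and re-raise everything else. The sketch below assumes the error carries a numeric `code` and that the env reports a missing session with a distinct code. Both are assumptions: the `HudProtocolError` class here is a local stand-in, and `NO_SESSION` (-32001) is a made-up value.

```python
from typing import Callable


class HudProtocolError(Exception):
    """Local stand-in for the SDK error; the `code` attribute is an assumption."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(message)
        self.code = code


NO_SESSION = -32001  # hypothetical code for "no held session"


def grade_or_restart(grade: Callable[[], dict], start_and_grade: Callable[[], dict]) -> dict:
    """Try to grade the held session; restart only if no session exists."""
    try:
        return grade()
    except HudProtocolError as exc:
        if exc.code != NO_SESSION:
            # A real grading failure (for example -32000) must reach the caller.
            raise
        return start_and_grade()
```

Under this scheme a -32000 from the env propagates to the CLI user, and the parked run is left untouched.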


cursor · 2026-06-07T01:31:08Z

+        # Start and disconnect without grading; a persistent env keeps the session
+        # for a later `hud task grade` to resume.
+        async with launch(variant.env) as client:
+            return await client.start_task(variant.task, variant.args)


Start tears down ephemeral env

Medium Severity

hud task start wraps work in launch(), which stops a LocalSandbox when the CLI exits. Without --url or a listener on :8765, the started task and capabilities are destroyed, so a later hud task grade cannot resume agent work in the same substrate.
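The teardown follows from ordinary context-manager semantics. The toy below is a sketch under stated assumptions: `FakeSandbox` and this `launch` are hypothetical stand-ins, not the SDK's LocalSandbox or launch(). It shows that state created inside the `async with` block is gone once the block exits.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Optional


class FakeSandbox:
    """Hypothetical ephemeral substrate that forgets everything when stopped."""

    def __init__(self) -> None:
        self.running = False
        self.started_task: Optional[str] = None


@asynccontextmanager
async def launch(sandbox: FakeSandbox):
    """Start the sandbox for the duration of the block, then stop it."""
    sandbox.running = True
    try:
        yield sandbox
    finally:
        # Leaving the block stops the sandbox and discards the started task.
        sandbox.running = False
        sandbox.started_task = None


async def start_only(sandbox: FakeSandbox, task: str) -> str:
    """Mirror the flagged pattern: start a task, return, and let launch() exit."""
    async with launch(sandbox) as client:
        client.started_task = task
        return task


sandbox = FakeSandbox()
prompt = asyncio.run(start_only(sandbox, "demo-task"))
```

After `start_only` returns, the caller still has the prompt, but the sandbox holds no task for a later grade call to resume.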


cursor · 2026-06-07T01:31:08Z

+        variants = collect_variants(str(env_dir))
+    manifest["variants"] = [
+        {"slug": v.slug or v.default_slug(), "task": v.task, "args": v.args} for v in variants
+    ]


Build hides variant collect errors

Medium Severity

_read_env_manifest uses contextlib.suppress(Exception) around collect_variants, so import or scan failures produce an empty baked manifest["variants"] with no build error.
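The effect of a blanket `contextlib.suppress(Exception)` is easy to show in isolation. In this sketch, `collect_variants` is a hypothetical stand-in that always fails and the two reader functions are illustrative names. The strict variant is one possible alternative, not the project's chosen fix.

```python
import contextlib


def collect_variants(env_dir: str) -> list:
    """Hypothetical stand-in that always fails, like a broken import in the env."""
    raise ImportError(f"cannot import tasks from {env_dir}")


def read_variants_suppressed(env_dir: str) -> list:
    """Flagged pattern: any failure silently leaves the catalog empty."""
    variants: list = []
    with contextlib.suppress(Exception):
        variants = collect_variants(env_dir)
    return variants


def read_variants_strict(env_dir: str) -> list:
    """Alternative: turn a collection failure into a visible build error."""
    try:
        return collect_variants(env_dir)
    except Exception as exc:
        raise RuntimeError(f"variant collection failed for {env_dir}") from exc
```

With the suppressed reader, a broken env builds successfully with an empty variants list. With the strict reader, the build stops and the original exception stays attached as the cause.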


3-way merge of the pr415-hardening working tree onto v6 (#416). The telemetry decoupling landed separately; this brings over the agent/tool, eval, environment, client, and CLI hardening changes.

Notable conflict resolutions:

- env.py: keep the ephemeral, session-local task runner required by the concurrent Taskset contract (settled on `grade` naming). This drops #416's held-session-across-disconnects, which is incompatible with multiple concurrent rollouts sharing one Environment instance.
- harbor.py + test_harbor.py: keep #416's harbor rework.
- Dropped the `push` command.
- workspace.py: keep security hardening (no client-key-path logging; minimal, non-secret sandbox env passthrough).

ruff and pyright are clean. No new test failures vs the pr415 source: remaining failures are pre-existing (stale v5-surface contract tests and base edit/memory/deploy tests). Env integration tests pass outside the network sandbox.

lorenss-m added 4 commits June 6, 2026 15:42

cleanup and add task cli

40d5db6

rm push

4c7c5f1

improve readme and convert

55b3ce8

fxs

bf60f0e

lorenss-m marked this pull request as ready for review June 7, 2026 01:29

lorenss-m merged commit 2fb7aef into v6 Jun 7, 2026
3 of 6 checks passed

cursor Bot reviewed Jun 7, 2026

lorenss-m merged 4 commits into v6 from v6-contrainer-mgmt
