V6 contrainer mgmt #416

Merged: lorenss-m merged 4 commits into v6 from v6-contrainer-mgmt on Jun 7, 2026
lorenss-m (Contributor) commented Jun 6, 2026

Note

High Risk
Renames the wire RPC method and changes session persistence semantics (breaking for non-SDK clients), rewires Harbor export/convert grading paths, and removes hud push.

Overview
This PR tightens v6 container workflows. The control channel now uses tasks.grade, replacing tasks.evaluate. An env can also keep a started task across a disconnect, so hud task start and a later hud task grade (or a reconnect) can separate prompting from scoring. Only an explicit bye or a new start tears the session down, and the client no longer sends bye on close, to match that behavior.
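
For non-SDK clients, the rename mostly means sending a different method string. The sketch below is illustrative only: the review comments mention JSON-RPC error codes, so it assumes JSON-RPC 2.0 requests, but the one-request-per-line framing, the `params` shapes, the `tasks.start` payload, and the helper names `rpc_request` and `migrate_method` are assumptions rather than the HUD wire format.

```python
import json
from itertools import count

_ids = count(1)  # monotonically increasing request ids

def rpc_request(method: str, params: dict) -> str:
    """Serialize one JSON-RPC 2.0 request (framing is an assumption)."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    )

# The only rename named in this PR: tasks.evaluate became tasks.grade.
LEGACY_TO_V6 = {"tasks.evaluate": "tasks.grade"}

def migrate_method(method: str) -> str:
    """Map a pre-rename method name to its v6 name; pass others through."""
    return LEGACY_TO_V6.get(method, method)

# Hypothetical payloads, shown only to illustrate the start/grade split.
start = rpc_request("tasks.start", {"task": "example-task", "args": {}})
grade = rpc_request(migrate_method("tasks.evaluate"), {"answer": "42"})
```

A client that hard-codes tasks.evaluate would need the same one-line change wherever it builds requests.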

New CLI: hud task (list / start / grade) plus top-level hud task-start, hud task-grade, hud task-list, with shared load_variants for hud eval and hud build. hud build bakes a variant catalog into the manifest. hud push and its tests are removed (deploy-first path).

Harbor: export now copies the full build context, neutralizes CMD/ENTRYPOINT, adds hud_entrypoint.sh (serve channel + park run via hud task start), and grades with hud task grade in test.sh; Harbor→HUD convert roots Workspace at the challenge WORKDIR via guest_path.

Docs: README reframed around the protocol and docker exec examples; new migrate-v6 guide in the docs nav.

Reviewed by Cursor Bugbot for commit bf60f0e. Bugbot is set up for automated code reviews on this repo.

lorenss-m marked this pull request as ready for review June 7, 2026 01:29
lorenss-m merged commit 2fb7aef into v6 Jun 7, 2026
3 of 6 checks passed

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

Comment thread hud/environment/env.py
- evaluation = await active_runner.evaluate(params)
- active_runner = None
+ evaluation = await self._active_runner.grade(params)
+ self._active_runner = None

Shared runner across TCP sessions

High Severity

The held TaskRunner lives on the Environment instance, not per control-channel connection. Concurrent clients can interleave tasks.start and tasks.grade on the same async generator, corrupting or cancelling another client’s parked task.
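
A toy reproduction of the hazard, as a rough sketch: `Runner`, `SharedEnv`, and `PerSessionEnv` are hypothetical stand-ins, not the classes in hud/environment/env.py. It only shows how a runner slot stored on a shared object can be overwritten by a second session before the first one grades.

```python
import asyncio
from typing import Optional

class Runner:
    """Stand-in for a held task runner; remembers which task it was started for."""
    def __init__(self, task: str) -> None:
        self.task = task

    async def grade(self, answer: str) -> dict:
        return {"task": self.task, "answer": answer}

class SharedEnv:
    """Runner slot lives on the environment, so all sessions share it."""
    def __init__(self) -> None:
        self.runner: Optional[Runner] = None

    async def session(self, task: str, answer: str) -> dict:
        self.runner = Runner(task)   # "start"
        await asyncio.sleep(0)       # yield: another session may start here
        return await self.runner.grade(answer)  # may grade someone else's task

class PerSessionEnv:
    """Runner is local to the session, so sessions cannot clobber each other."""
    async def session(self, task: str, answer: str) -> dict:
        runner = Runner(task)
        await asyncio.sleep(0)
        return await runner.grade(answer)

async def run(env) -> list:
    return await asyncio.gather(env.session("a", "1"), env.session("b", "2"))

shared = asyncio.run(run(SharedEnv()))
isolated = asyncio.run(run(PerSessionEnv()))
print(shared[0]["task"], isolated[0]["task"])
```

In the shared case the first session ends up grading against the second session's runner, while the per-session variant keeps each answer paired with its own task.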


Comment thread hud/cli/task.py
except HudProtocolError:
# No held session: run the whole lifecycle here (start then grade).
await client.start_task(variant.task, variant.args)
return await client.grade({"answer": answer_text})

Grade retries all protocol errors

High Severity

hud task grade catches every HudProtocolError and runs start_task again. Real grading failures from the env (JSON-RPC -32000) are treated like “no session,” replacing a parked run and re-running the task instead of surfacing the error.
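
One possible shape for a narrower fallback, sketched under stated assumptions: `ProtocolError` stands in for HudProtocolError (whose real attributes are not shown here), `NO_SESSION` is a hypothetical error code that the env would have to emit for "nothing is parked", and `FakeClient` exists only to exercise the logic. Only -32000 for grading failures comes from the review comment.

```python
import asyncio

GRADING_FAILED = -32000  # code the review mentions for real grading failures
NO_SESSION = -32001      # hypothetical code for "no held session"

class ProtocolError(Exception):
    """Stand-in protocol error carrying a JSON-RPC error code."""
    def __init__(self, code: int, message: str) -> None:
        super().__init__(message)
        self.code = code

async def grade_or_run(client, task: str, args: dict, answer: str) -> dict:
    """Grade a parked session; start a fresh run only when none is held."""
    try:
        return await client.grade({"answer": answer})
    except ProtocolError as exc:
        if exc.code != NO_SESSION:
            raise  # surface real grading failures instead of re-running
    await client.start_task(task, args)
    return await client.grade({"answer": answer})

class FakeClient:
    """Test double: raises `first_error` once on grade, then succeeds."""
    def __init__(self, first_error) -> None:
        self.first_error = first_error
        self.started = 0

    async def start_task(self, task: str, args: dict) -> None:
        self.started += 1

    async def grade(self, payload: dict) -> dict:
        if self.first_error is not None:
            err, self.first_error = self.first_error, None
            raise err
        return {"score": 1.0}
```

With this split, a grading failure propagates to the caller and the parked run is left untouched; only the "no session" case triggers a new start.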


Comment thread hud/cli/task.py
# Start and disconnect without grading; a persistent env keeps the session
# for a later `hud task grade` to resume.
async with launch(variant.env) as client:
return await client.start_task(variant.task, variant.args)

Start tears down ephemeral env

Medium Severity

hud task start wraps work in launch(), which stops a LocalSandbox when the CLI exits. Without --url or a listener on :8765, the started task and capabilities are destroyed, so a later hud task grade cannot resume agent work in the same substrate.
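
The mechanism can be illustrated with a toy context manager. This is a sketch, not the real launch(): the `Sandbox` class and the `owned` flag are invented here to model "this process started the sandbox" versus "the sandbox was already running elsewhere".

```python
import asyncio
from contextlib import asynccontextmanager

class Sandbox:
    """Toy substrate that holds at most one started task."""
    def __init__(self) -> None:
        self.running = False
        self.task = None

@asynccontextmanager
async def launch(sandbox: Sandbox, owned: bool):
    """Yield the sandbox; tear it down on exit only if we started it."""
    if owned:
        sandbox.running = True
    try:
        yield sandbox
    finally:
        if owned:
            # Ephemeral case: the started task dies with the sandbox.
            sandbox.running = False
            sandbox.task = None

async def start(sandbox: Sandbox, owned: bool) -> None:
    async with launch(sandbox, owned) as sb:
        sb.task = "example-task"

ephemeral = Sandbox()
asyncio.run(start(ephemeral, owned=True))

persistent = Sandbox()
persistent.running = True  # pretend it was started out of band
asyncio.run(start(persistent, owned=False))
```

After `start` returns, the ephemeral sandbox has no task left to grade, while the externally managed one still holds it, which is the distinction the comment draws between a LocalSandbox and a persistent listener.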


Comment thread hud/cli/build.py
variants = collect_variants(str(env_dir))
manifest["variants"] = [
{"slug": v.slug or v.default_slug(), "task": v.task, "args": v.args} for v in variants
]

Build hides variant collect errors

Medium Severity

_read_env_manifest uses contextlib.suppress(Exception) around collect_variants, so import or scan failures produce an empty baked manifest["variants"] with no build error.
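
The difference between swallowing and surfacing the failure can be shown in a few lines. In this sketch `collect_variants` is a stub that always fails, and `read_manifest_silent` / `read_manifest_strict` are hypothetical names, not functions from hud/cli/build.py; `contextlib.suppress` is the standard-library context manager named in the comment.

```python
import contextlib

def collect_variants(env_dir: str) -> list:
    """Stub that simulates an import failure while scanning an env."""
    raise ImportError(f"cannot import tasks from {env_dir}")

def read_manifest_silent(env_dir: str) -> dict:
    """Pattern flagged by the review: errors vanish, catalog stays empty."""
    manifest = {"variants": []}
    with contextlib.suppress(Exception):
        manifest["variants"] = collect_variants(env_dir)
    return manifest

def read_manifest_strict(env_dir: str) -> dict:
    """Alternative: fail the build and keep the original cause attached."""
    try:
        variants = collect_variants(env_dir)
    except Exception as exc:
        raise RuntimeError(f"variant collection failed for {env_dir}") from exc
    return {"variants": variants}
```

The silent version returns an empty catalog that looks like a valid build; the strict version raises, with the original ImportError available as `__cause__`.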


jdchawla29 added a commit that referenced this pull request Jun 8, 2026
3-way merge of the pr415-hardening working tree onto v6 (#416). The telemetry
decoupling landed separately; this brings over the agent/tool, eval,
environment, client, and CLI hardening changes.

Notable conflict resolutions:
- env.py: keep the ephemeral, session-local task runner required by the
  concurrent Taskset contract (settled on `grade` naming). This drops
  #416's held-session-across-disconnects, which is incompatible with
  multiple concurrent rollouts sharing one Environment instance.
- harbor.py + test_harbor.py: keep #416's harbor rework.
- Dropped the `push` command.
- workspace.py: keep security hardening (no client-key-path logging;
  minimal, non-secret sandbox env passthrough).

ruff and pyright are clean. No new test failures vs the pr415 source:
remaining failures are pre-existing (stale v5-surface contract tests and
base edit/memory/deploy tests). Env integration tests pass outside the
network sandbox.