V6 contrainer mgmt#416
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit bf60f0e. Configure here.
| evaluation = await active_runner.evaluate(params) | ||
| active_runner = None | ||
| evaluation = await self._active_runner.grade(params) | ||
| self._active_runner = None |
There was a problem hiding this comment.
Shared runner across TCP sessions
High Severity
The held TaskRunner lives on the Environment instance, not per control-channel connection. Concurrent clients can interleave tasks.start and tasks.grade on the same async generator, corrupting or cancelling another client’s parked task.
Reviewed by Cursor Bugbot for commit bf60f0e. Configure here.
| except HudProtocolError: | ||
| # No held session: run the whole lifecycle here (start then grade). | ||
| await client.start_task(variant.task, variant.args) | ||
| return await client.grade({"answer": answer_text}) |
There was a problem hiding this comment.
Grade retries all protocol errors
High Severity
hud task grade catches every HudProtocolError and runs start_task again. Real grading failures from the env (JSON-RPC -32000) are treated like “no session,” replacing a parked run and re-running the task instead of surfacing the error.
Reviewed by Cursor Bugbot for commit bf60f0e. Configure here.
| # Start and disconnect without grading; a persistent env keeps the session | ||
| # for a later `hud task grade` to resume. | ||
| async with launch(variant.env) as client: | ||
| return await client.start_task(variant.task, variant.args) |
There was a problem hiding this comment.
Start tears down ephemeral env
Medium Severity
hud task start wraps work in launch(), which stops a LocalSandbox when the CLI exits. Without --url or a listener on :8765, the started task and capabilities are destroyed, so a later hud task grade cannot resume agent work in the same substrate.
Reviewed by Cursor Bugbot for commit bf60f0e. Configure here.
| variants = collect_variants(str(env_dir)) | ||
| manifest["variants"] = [ | ||
| {"slug": v.slug or v.default_slug(), "task": v.task, "args": v.args} for v in variants | ||
| ] |
There was a problem hiding this comment.
Build hides variant collect errors
Medium Severity
_read_env_manifest uses contextlib.suppress(Exception) around collect_variants, so import or scan failures produce an empty baked manifest["variants"] with no build error.
Reviewed by Cursor Bugbot for commit bf60f0e. Configure here.
3-way merge of the pr415-hardening working tree onto v6 (#416). The telemetry decoupling landed separately; this brings over the agent/tool, eval, environment, client, and CLI hardening changes. Notable conflict resolutions: - env.py: keep the ephemeral, session-local task runner required by the concurrent Taskset contract (settled on `grade` naming). This drops #416's held-session-across-disconnects, which is incompatible with multiple concurrent rollouts sharing one Environment instance. - harbor.py + test_harbor.py: keep #416's harbor rework. - Dropped the `push` command. - workspace.py: keep security hardening (no client-key-path logging; minimal, non-secret sandbox env passthrough). ruff and pyright are clean. No new test failures vs the pr415 source: remaining failures are pre-existing (stale v5-surface contract tests and base edit/memory/deploy tests). Env integration tests pass outside the network sandbox.


Note
High Risk
Renames the wire RPC method and changes session persistence semantics (breaking for non-SDK clients), rewires Harbor export/convert grading paths, and removes
hud push.Overview
This PR tightens v6 container workflows: the control channel now uses
tasks.grade(replacingtasks.evaluate), and an env can keep a started task across disconnect sohud task startand a laterhud task grade(or reconnect) can split prompt vs scoring—only explicitbyeor a new start tears the session down. The client stops sendingbyeon close to match that behavior.New CLI:
hud task(list/start/grade) plus top-levelhud task-start,hud task-grade,hud task-list, with sharedload_variantsforhud evalandhud build.hud buildbakes a variant catalog into the manifest.hud pushand its tests are removed (deploy-first path).Harbor: export now copies the full build context, neutralizes CMD/ENTRYPOINT, adds
hud_entrypoint.sh(serve channel + park run viahud task start), and grades withhud task gradeintest.sh; Harbor→HUD convert rootsWorkspaceat the challengeWORKDIRviaguest_path.Docs: README reframed around the protocol and docker exec examples; new
migrate-v6guide in the docs nav.Reviewed by Cursor Bugbot for commit bf60f0e. Bugbot is set up for automated code reviews on this repo. Configure here.