Test requirement: validate the Harbor (Terminus 2) harness end-to-end
Follow-up to #218 / PR #220, which adds the harbor harness (the Harbor framework's Terminus 2 agent driving the browser via the agent-browser CDP CLI). The harness is registered in the harness registry and has been smoke-tested on a single task. Harbor can now be used to run ClawBench, but it needs a full test pass before we rely on it.
Ask
Run a representative test pass with --harness harbor and confirm it works at scale:
- [...] clawbench-harbor and run a V2 batch: clawbench-batch --models <paid-model> --cases-suite v2 --all-cases --harness harbor --no-judge --max-concurrent 4.
- [...] actions.jsonl (Terminus actually drives the browser), interception fires on solvable tasks, and there are no systematic 0-action / empty-trajectory failures.
- [...] hermes: run the same model on the same tasks under hermes and harbor; intercept rates should be broadly comparable (note: harbor's agent-messages.jsonl is shell-shaped since Terminus drives the browser through a CLI; actions.jsonl is identical because it comes from the recorder).
- [...] /v1beta/openai → LiteLLM native gemini/<model>) works under harbor.
Known caveats to watch
- Use a paid/stronger model: free models were flaky in the smoke test (empty completions).
- Terminus is a terminal agent; its transcript shape differs from pi/hermes even though actions.jsonl matches.
- Pins: agent-browser@0.26.0, harbor==0.13.1.
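
The 0-action / empty-trajectory check and the hermes-vs-harbor comparison in the Ask could be scripted roughly as below. This is a minimal sketch, not ClawBench tooling. It assumes a hypothetical output layout with one subdirectory per task run, each containing the recorder's actions.jsonl. The function names are invented for illustration. How an intercept is recorded is not stated here, so the comparison takes plain task-to-bool mappings supplied by the caller.

```python
from pathlib import Path


def count_actions(actions_file: Path) -> int:
    """Count non-blank JSONL records in one actions.jsonl file.

    A missing file counts as zero actions.
    """
    if not actions_file.is_file():
        return 0
    with actions_file.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def summarize_runs(batch_dir: Path) -> dict:
    """Map each run directory name to its recorded action count.

    ASSUMPTION: batch_dir holds one subdirectory per task run, and each
    run writes actions.jsonl at its top level. Adjust to the real layout.
    """
    return {
        run.name: count_actions(run / "actions.jsonl")
        for run in sorted(batch_dir.iterdir())
        if run.is_dir()
    }


def zero_action_runs(summary: dict) -> list:
    """Return the run names that recorded no actions at all."""
    return [name for name, count in summary.items() if count == 0]


def intercept_rate_gap(hermes: dict, harbor: dict) -> float:
    """Return harbor's intercept rate minus hermes's, over shared tasks.

    Both arguments map task name -> whether interception fired. Only
    tasks present in both mappings are compared.
    """
    shared = sorted(set(hermes) & set(harbor))
    if not shared:
        raise ValueError("no tasks in common between the two runs")
    hermes_rate = sum(bool(hermes[t]) for t in shared) / len(shared)
    harbor_rate = sum(bool(harbor[t]) for t in shared) / len(shared)
    return harbor_rate - hermes_rate
```

Calling zero_action_runs(summarize_runs(Path("<batch-output-dir>"))) lists the runs worth inspecting first. A gap near zero from intercept_rate_gap is consistent with "broadly comparable"; what counts as close enough is a judgment call this sketch leaves open.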