Test requirement: validate the Harbor (Terminus 2) harness end-to-end on a full corpus

## Test requirement: validate the Harbor (Terminus 2) harness end-to-end

Follow-up to #218 / PR #220, which added the `harbor` harness (the Harbor framework's Terminus 2 agent driving the browser via the `agent-browser` CDP CLI). The harness is registered in the harness registry and has been **smoke-tested on a single task** only. Harbor can now be used to run ClawBench, but **before we rely on it, it needs a full test pass.**

### Ask
Run a representative test pass with `--harness harbor` and confirm it works at scale:

- [ ] Build `clawbench-harbor` and run a V2 batch: `clawbench-batch --models <paid-model> --cases-suite v2 --all-cases --harness harbor --no-judge --max-concurrent 4`.
- [ ] **Acceptance:** the large majority of tasks produce a **non-empty `actions.jsonl`** (Terminus actually drives the browser), interception fires on solvable tasks, and there are **no systematic 0-action / empty-trajectory failures**.
- [ ] **Parity check vs `hermes`:** run the same model on the same tasks under `hermes` and `harbor`; intercept rates should be broadly comparable (note: `harbor`'s `agent-messages.jsonl` is shell-shaped since Terminus drives the browser through a CLI; `actions.jsonl` is identical because it comes from the recorder).
- [ ] Confirm the Gemini base_url handling (`/v1beta/openai` → LiteLLM native `gemini/<model>`) works under harbor.
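
A rough sketch of how the acceptance check could be scripted once the batch finishes. It assumes a hypothetical output layout of one subdirectory per task, each holding that task's `actions.jsonl`; the function names and the layout are illustrative and not taken from the ClawBench code, so adjust paths to whatever `clawbench-batch` actually writes.

```python
import json
from pathlib import Path


def count_actions(actions_file: Path) -> int:
    """Return the number of non-blank JSON records in an actions.jsonl file.

    A missing file counts as zero actions. A malformed line raises
    json.JSONDecodeError so that corrupt output is not silently counted.
    """
    if not actions_file.is_file():
        return 0
    count = 0
    with actions_file.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            json.loads(line)
            count += 1
    return count


def summarize_run(run_dir: Path) -> dict:
    """Summarize one batch output directory.

    Assumption (hypothetical layout): run_dir/<task-name>/actions.jsonl.
    """
    task_dirs = sorted(p for p in run_dir.iterdir() if p.is_dir())
    empty = [p.name for p in task_dirs if count_actions(p / "actions.jsonl") == 0]
    total = len(task_dirs)
    return {
        "total": total,
        "empty": empty,  # tasks with 0 actions or no actions.jsonl at all
        "non_empty_rate": (total - len(empty)) / total if total else 0.0,
    }
```

Running `summarize_run` on the `harbor` output directory and on the matching `hermes` output directory gives two `non_empty_rate` values and two `empty` lists to compare for the parity check. A long or clustered `empty` list under `harbor` would point at the systematic 0-action failure this issue is meant to rule out.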

### Known caveats to watch
- Use a **paid/stronger model** — free models were flaky in the smoke test (empty completions).
- Terminus is a terminal agent; its transcript shape differs from `pi`/`hermes` even though `actions.jsonl` matches.
- Pins: `agent-browser@0.26.0`, `harbor==0.13.1`.
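
Before starting a long batch it may be worth confirming that the environment matches the Python pin. The sketch below is a minimal check using the standard library's `importlib.metadata`; it assumes `harbor` is installed as a Python distribution under that name in the environment where the script runs (for example inside the `clawbench-harbor` image). The helper names are illustrative, and the `agent-browser@0.26.0` pin is not covered because it is not a `==`-style Python requirement.

```python
from importlib import metadata


def parse_pin(pin: str) -> tuple[str, str]:
    """Split an exact pin such as 'harbor==0.13.1' into (name, version)."""
    name, sep, version = pin.partition("==")
    name, version = name.strip(), version.strip()
    if not sep or not name or not version:
        raise ValueError(f"not an exact '==' pin: {pin!r}")
    return name, version


def installed_matches(pin: str, lookup=metadata.version) -> bool:
    """Return True if the installed version equals the pinned version.

    `lookup` defaults to importlib.metadata.version and is injectable so the
    check can be exercised without the package being installed.
    """
    name, expected = parse_pin(pin)
    try:
        return lookup(name) == expected
    except metadata.PackageNotFoundError:
        return False
```

Calling `installed_matches("harbor==0.13.1")` returns `False` both when the package is missing and when a different version is installed, so a failed check is a prompt to inspect the environment by hand.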


Test requirement: validate the Harbor (Terminus 2) harness end-to-end on a full corpus #222
