Test requirement: validate the Harbor (Terminus 2) harness end-to-end on a full corpus #222

@reacher-z

Test requirement: validate the Harbor (Terminus 2) harness end-to-end

Follow-up to #218 / PR #220, which added the harbor harness (the Harbor framework's Terminus 2 agent driving the browser via the agent-browser CDP CLI). It is registered in the harness registry and has been smoke-tested on a single task. Harbor can now be used to run ClawBench, but before we rely on it, it needs a full test pass.

Ask

Run a representative test pass with --harness harbor and confirm it works at scale:

  • Build clawbench-harbor and run a V2 batch: clawbench-batch --models <paid-model> --cases-suite v2 --all-cases --harness harbor --no-judge --max-concurrent 4.
  • Acceptance: the large majority of tasks produce a non-empty actions.jsonl (Terminus actually drives the browser), interception fires on solvable tasks, and there are no systematic 0-action / empty-trajectory failures.
  • Parity check vs hermes: run the same model on the same tasks under hermes and harbor; intercept rates should be broadly comparable (note: harbor's agent-messages.jsonl is shell-shaped since Terminus drives the browser through a CLI; the actions.jsonl format is identical because it comes from the recorder).
  • Confirm the Gemini base_url handling (/v1beta/openai → LiteLLM native gemini/<model>) works under harbor.
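
The acceptance criterion on non-empty actions.jsonl could be checked with a small script along these lines. This is only a sketch: it assumes each task writes its output to its own sub-directory of a run directory, with actions.jsonl directly inside it. That layout and the names `count_actions` and `summarize_run` are hypothetical, not taken from ClawBench, so adjust the paths to match the real batch output.

```python
from pathlib import Path


def count_actions(path: Path) -> int:
    """Count non-blank lines in an actions.jsonl file; return 0 if it is missing."""
    if not path.is_file():
        return 0
    with path.open(encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def summarize_run(run_dir: Path) -> dict:
    """Summarize one batch run.

    ASSUMPTION: run_dir contains one sub-directory per task, and each task
    directory holds its actions.jsonl at the top level.
    """
    task_dirs = sorted(p for p in run_dir.iterdir() if p.is_dir())
    # Tasks with a missing or empty actions.jsonl are the 0-action failures.
    empty = [p.name for p in task_dirs if count_actions(p / "actions.jsonl") == 0]
    total = len(task_dirs)
    return {
        "total": total,
        "empty": empty,
        "non_empty_rate": (total - len(empty)) / total if total else 0.0,
    }
```

Calling `summarize_run` once on the harbor output directory and once on the hermes output directory gives the two non-empty rates side by side, and the `empty` lists show whether the 0-action tasks cluster in a systematic way.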

Known caveats to watch

  • Use a paid/stronger model — free models were flaky in the smoke test (empty completions).
  • Terminus is a terminal agent; its transcript shape differs from pi/hermes even though actions.jsonl matches.
  • Pins: agent-browser@0.26.0, harbor==0.13.1.
