How "done" is determined. Treat the floors as floors, not nice-to-haves.
| Step | Command | When |
|---|---|---|
| Build | go build ./... |
Always, before claiming done |
| Focused tests | /test-affected (or go test -race -count=1 ./<package>/...) |
During iteration |
| Race suite | go test -race -timeout 10m ./... (or /race) |
Floor for runtime/backend changes |
| E2E | go test -tags e2e ./e2e/... |
When the change crosses the api → orchestrator → providers/sandbox/mcp chain |
The race suite is the floor — not a nice-to-have — for any change that touches internal/gateway, internal/router, internal/providers, internal/orchestrator, internal/sandbox, retention/state wiring, or other request execution paths.
E2E build tags: //go:build e2e is always required, plus optional ollama and docker sub-tags. Use PROVIDER_FAKE_KIND=local to skip pricebook preflight on synthetic models.
| Step | Command | When |
|---|---|---|
| Type check | cd ui && bun run typecheck |
First sanity check after any .ts/.tsx edit (tsgo -b, fast) |
| Tests | cd ui && bun run test |
Before claiming done — vitest |
| Watch mode | cd ui && bun run test:watch |
During iteration |
| Snapshot update | cd ui && bun run test -- -u |
When a UI shape change is intentional |
Never bun test — it skips the testing-library DOM setup and panics on document[isPrepared]. Always bun run test.
When updating snapshots, review the diff carefully. Accidental snapshot churn is the most common silent regression vector.
Full matrix in ../skills/tester/SKILL.md.
- Unit: data-to-UI transformation, wire-shape passthrough, error classification, storage scoping, retry/failover decisions, streaming wire shape, agent_loop tool dispatch, MCP tool registration, conditional rendering of critical states.
- E2E: api → orchestrator → providers/sandbox/mcp chain end-to-end; subprocess lifecycle; startup/config semantics; new SSE event sequences operators rely on; public HTTP contract changes that downstream SDKs see.
- Build passes (
go build ./...). - Race suite passes for runtime/backend work.
- Inbound and outbound wire shapes are tested independently.
- New env knobs documented in
.env.exampleAND the relevantdocs/<feature>.md. - Error paths return the right HTTP status with a useful message.
- OTel attributes are populated for new spans.
- New metric labels are emitted through
internal/telemetryguardrails, with tests for closed-set and free-form dimensions when adding a new label. - New event types follow the event-protocol taxonomy and appear in
docs/events.mdwith payload shape.
bun run typecheckclean.bun run testclean.- Main workflow obvious in seconds; current state visible; labels read like product UI.
- Loading, empty, and error states are explicit.
- Layout works at mobile and desktop widths.
- Tests cover the risky logic.
After edits, run git status --short && git diff --stat and confirm the change is cohesive. No accidental drift, no unrelated formatting, no half-staged hunks.
Required when automated tests cannot fake the real-world failure mode. Examples:
- SSE reconnect behavior under network blips.
- Sandbox network egress allowlist (real subprocess + real network).
- OTel exporter wiring against a real collector.
- UI focus management, scroll restoration, multi-tab drag-and-drop.
When manual smoke is needed, state the steps in the change summary so the operator can replay them. "Looks fine" is not a manual smoke report — name what was checked and what was not.