Record once → replay forever → deterministic tests for AI agents.
Agent Cassette is a lightweight record-and-replay harness for agent workflows. It captures structured run traces (LLM calls + tool calls) as they execute so you can replay behavior offline, write regression tests, and measure token/latency impact without hitting external APIs again.
v0: Explicit, wrapper-based (Stable & Type-Safe).
v1: (Planned) Network interceptor plugin.
Agent runs are often flaky and expensive:
- Non-determinism: The same prompt can produce different outputs.
- Slow feedback: Integration tests spend 90% of their time waiting on network calls.
- Cost: Repeated debugging burns tokens and money.
Cassette turns "I swear it failed yesterday" into a replayable, immutable artifact.
(See examples/node-red-generator.ts)
This tool is designed for platforms like FlowFuse or Node-RED where AI agents generate executable code. Agent Cassette provides a Regression Testing Harness that ensures:
- Strict Schema Validation: Agents must output valid JSON structures (e.g., correct `wires` and `coordinates`).
- Semantic Safety: If an agent generates unsafe code (e.g., missing `return msg`), the system detects it and swaps in a safe fallback.
- Deterministic Replay: We record a "Golden Run" of a complex flow generation. CI pipelines can replay this instantly (0 cost) to prove that model upgrades (e.g., GPT-4o → GPT-5) don't break the JSON schema.
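The first two guarantees can be sketched as a small validator. This is a minimal illustration, not the project's actual implementation: `FlowNode`, `validateFlowNode`, `applySemanticFallback`, and `FALLBACK_FUNC` are hypothetical names, and the field shapes (`wires` as an array of string arrays, numeric `x`/`y`) are assumed from the usual Node-RED flow JSON layout.

```typescript
// Hypothetical shape of a generated Node-RED function node (assumption, not the project's type).
interface FlowNode {
  id: string;
  type: string;
  x: number;
  y: number;
  wires: string[][];
  func?: string;
}

// Safe default body swapped in when generated code fails the semantic check.
const FALLBACK_FUNC = "return msg;";

// Structural check: returns a list of problems, empty when the node is valid.
function validateFlowNode(candidate: unknown): string[] {
  if (typeof candidate !== "object" || candidate === null) {
    return ["node must be an object"];
  }
  const errors: string[] = [];
  const node = candidate as Record<string, unknown>;
  if (typeof node.id !== "string") errors.push("id must be a string");
  if (typeof node.type !== "string") errors.push("type must be a string");
  if (typeof node.x !== "number" || typeof node.y !== "number") {
    errors.push("x and y must be numbers");
  }
  const wires = node.wires;
  const wiresOk =
    Array.isArray(wires) &&
    wires.every(
      (port: unknown) =>
        Array.isArray(port) && port.every((id: unknown) => typeof id === "string")
    );
  if (!wiresOk) errors.push("wires must be an array of string arrays");
  return errors;
}

// Semantic check (a naive text match): a function node whose body never
// contains `return msg` gets the fallback body instead.
function applySemanticFallback(node: FlowNode): FlowNode {
  if (node.type !== "function") return node;
  const returnsMsg = /\breturn\s+msg\b/.test(node.func ?? "");
  return returnsMsg ? node : { ...node, func: FALLBACK_FUNC };
}
```

A real check would likely parse the function body rather than pattern-match it; the regex here only keeps the sketch short.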
sequenceDiagram
participant User
participant Cassette
participant OpenAI
participant Runtime as FlowFuse/Runtime
rect rgb(240, 248, 255)
Note over User,Runtime: Record Mode (Golden Run)
User->>Cassette: Call Agent
Cassette->>OpenAI: Forward Request
OpenAI-->>Cassette: Return Code
Cassette->>Cassette: Validate Schema
alt Validation Fails
Cassette->>Cassette: Apply Fallback Code
end
Cassette->>Cassette: Save to JSONL
Cassette->>Runtime: Execute Side Effect
Runtime-->>Cassette: Return Status
end
rect rgb(255, 245, 238)
Note over User,Runtime: Replay Mode (CI/Docker)
User->>Cassette: Call Agent
Cassette->>Cassette: Match Semantic Hash
Cassette-->>User: Return Saved Response (0ms)
Note over Runtime: Side Effect SKIPPED
end
Text diagram (if Mermaid doesn't render)
┌─────────────────────────────────────────────────────────────────┐
│ RECORD MODE (Golden Run) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User ──────► Cassette ──────► OpenAI │
│ │ │ │
│ │◄───────────────┘ (Return Code) │
│ │ │
│ ▼ │
│ Validate Schema ──► [FAIL?] ──► Apply Fallback │
│ │ │
│ ▼ │
│ Save to JSONL │
│ │ │
│ ▼ │
│ Runtime ──► Execute Side Effect │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ REPLAY MODE (CI/Docker) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User ──────► Cassette │
│ │ │
│ Match Semantic Hash │
│ │ │
│ ▼ │
│ User ◄─────── Return Saved Response (0ms, 0 tokens) │
│ │
│ [Runtime SKIPPED - Safe for Production] │
│ │
└─────────────────────────────────────────────────────────────────┘
Cassette wraps async functions and records {request_identity → result} as JSONL (one JSON object per line). JSONL is append-friendly and crash-safe: if a run dies mid-flight, earlier lines remain valid.
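A minimal sketch of that storage layer, assuming the request identity is a SHA-256 hash over a key-sorted serialization of the request. The names `CassetteEntry`, `canonicalize`, `requestIdentity`, `appendEntry`, and `loadCassette` are hypothetical, and the project's real hashing scheme may differ.

```typescript
import { createHash } from "node:crypto";
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// One cassette line: the request identity and the recorded result.
interface CassetteEntry {
  key: string;
  result: unknown;
}

// Serialize with sorted object keys so {a, b} and {b, a} yield the same string.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v: unknown) => canonicalize(v)).join(",") + "]";
  }
  if (typeof value === "object" && value !== null) {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
    return "{" + fields.join(",") + "}";
  }
  // JSON.stringify returns undefined for undefined and functions at runtime.
  return JSON.stringify(value) ?? "null";
}

// Request identity: SHA-256 of the canonical request (an assumed scheme).
function requestIdentity(request: unknown): string {
  return createHash("sha256").update(canonicalize(request)).digest("hex");
}

// Append one entry as a single line; earlier lines are never rewritten.
function appendEntry(path: string, entry: CassetteEntry): void {
  appendFileSync(path, JSON.stringify(entry) + "\n", "utf8");
}

// Load entries, skipping any line that fails to parse
// (for example, a truncated final line left by a crash).
function loadCassette(path: string): Map<string, unknown> {
  const entries = new Map<string, unknown>();
  if (!existsSync(path)) return entries;
  for (const line of readFileSync(path, "utf8").split("\n")) {
    if (line.trim() === "") continue;
    try {
      const entry = JSON.parse(line) as CassetteEntry;
      entries.set(entry.key, entry.result);
    } catch {
      // Unparsable line: ignore it and keep the entries read so far.
    }
  }
  return entries;
}
```

Because each entry is a complete line, a partial write can only damage the line being written, which is why the loader can drop it and still trust everything before it.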
Modes:
| Mode | Behavior |
|---|---|
| `record` | Call real function, validate result, append entry |
| `replay` | Match semantic hash, return recorded result (network is mocked) |
| `passthrough` | Call without recording |
| `auto` | Replay if cassette exists, otherwise record |
npm install
# Create your local env file
cp .env.example .env
Requires OpenAI API Key. Captures the run trace to disk.
export OPENAI_API_KEY="sk-..."
npm run nodered:record
No API Key required. Instant feedback.
unset OPENAI_API_KEY
npm run nodered:replay
(Notice the 0ms latency and 100% token savings)
Prove the code runs anywhere (no local dependencies).
docker build -t agent-cassette .
docker run agent-cassette
We use ESLint and Prettier to maintain high engineering standards.
# Run Unit Tests
npm test
# Check Code Quality
npm run lint
- Architecture: Manual wrapping of specific functions.
- Status: ✅ Stable, Docker-ready, Type-Safe.
- Trade-off: High control, but requires code changes to integrate.
- Goal: "Drop-in" recording without changing application code.
- Strategy: Implement the Proxy Pattern using `undici` dispatchers or `msw` to intercept HTTP traffic at the network layer.
- Benefit: Zero-touch integration for existing codebases.
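To illustrate the proxy idea behind this planned plugin, here is a simplified sketch that wraps a fetch-like function instead of hooking into `undici` or `msw`. It does not use either library's API; `FetchLike`, `SimpleResponse`, and `interceptFetch` are hypothetical names, and a real interceptor would sit at the HTTP layer rather than around a single function.

```typescript
// Minimal fetch-like contract for this sketch (an assumption, not undici's or msw's API).
interface SimpleResponse {
  status: number;
  body: string;
}

type FetchLike = (
  url: string,
  init?: { method?: string; body?: string }
) => Promise<SimpleResponse>;

// Proxy a fetch-like function: record real responses, or serve recorded ones.
// Callers keep the same signature, so application code does not change.
function interceptFetch(
  realFetch: FetchLike,
  tape: Map<string, SimpleResponse>,
  mode: "record" | "replay"
): FetchLike {
  return async (url, init = {}) => {
    // Identity is method + URL + body; headers are ignored in this sketch.
    const key = [init.method ?? "GET", url, init.body ?? ""].join(" ");

    if (mode === "replay") {
      const saved = tape.get(key);
      if (saved === undefined) throw new Error(`Unrecorded request: ${key}`);
      return saved; // no network access
    }

    const response = await realFetch(url, init);
    tape.set(key, response);
    return response;
  };
}
```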
- Goal: Visualize the "Drift."
- Strategy: A Web UI to diff "Record" vs "Replay" traces.
- Benefit: Deeply understand failures (e.g., "Prompt changed on line 4").
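The comparison such a UI would rest on can be sketched as a step-by-step trace diff. This is a hypothetical illustration of the planned feature, not existing code: `TraceStep`, `Drift`, and `diffTraces` are invented names, and the real trace format may carry different fields.

```typescript
// Hypothetical trace step: one recorded LLM or tool call.
interface TraceStep {
  key: string;
  prompt: string;
  output: string;
}

// One detected difference between the recorded and replayed traces.
interface Drift {
  index: number;
  field: "key" | "prompt" | "output" | "missing";
  recorded?: string;
  replayed?: string;
}

// Compare two traces position by position and list every differing field.
function diffTraces(recorded: TraceStep[], replayed: TraceStep[]): Drift[] {
  const drifts: Drift[] = [];
  const length = Math.max(recorded.length, replayed.length);
  for (let i = 0; i < length; i++) {
    const a = recorded[i];
    const b = replayed[i];
    if (a === undefined || b === undefined) {
      // One trace is shorter: the step exists on only one side.
      drifts.push({ index: i, field: "missing" });
      continue;
    }
    for (const field of ["key", "prompt", "output"] as const) {
      if (a[field] !== b[field]) {
        drifts.push({ index: i, field, recorded: a[field], replayed: b[field] });
      }
    }
  }
  return drifts;
}
```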