Testing

Daniel Babjak edited this page Mar 27, 2026 · 21 revisions

Test Pyramid

          /\
         /  \      Security (127)
        /    \     - 3 files: injection, audit, invariants
       /------\
      /        \   Governance (30+)
     /          \  - policy, approval, operator, channel
    /------------\
   /              \ Routing & Adversarial (40+)
  /                \- eval, confusion, regression
 /------------------\
|                    | E2E Effectiveness (44)
|                    | - full agent flow, wiring
+--------------------+
|                    | Integration (34)
|                    | - cross-module, finance
+--------------------+
|                    | Unit (~580)
|                    | - individual modules
+--------------------+
| Layer | Tests | Files | Token Cost |
|---|---|---|---|
| Unit | ~580 | 28 test files | $0.00 |
| Integration | 34 | test_integration.py | $0.00 |
| E2E | 44 | test_e2e_effectiveness.py | $0.00 |
| Security | 127 | test_security.py, test_security_audit.py, test_security_invariants.py | $0.00 |
| Routing | 40+ | test_routing_eval.py, test_routing_adversarial.py, test_routing_confusion.py | $0.00 |
| Governance | 30+ | test_tool_governance.py, test_policy_regression.py, test_policy_simulation.py, test_approval_queue.py, test_multi_step_approval.py, test_operator_controls.py, test_channel_policy.py | $0.00 |
| Memory | 30+ | test_provenance.py, test_memory_conflicts.py, test_memory_consolidation.py, test_memory_separation.py, test_memory_inspection.py | $0.00 |
| Finance | 20+ | test_budget_policy.py, test_risk_templates.py, test_finance_approval.py, test_proposal_lifecycle.py | $0.00 |
| Workspace | 15+ | test_workspace_persistence.py, test_workspace_recovery.py, test_workspace_limits.py | $0.00 |
| Regression | 14 | test_audit_v2_regressions.py | $0.00 |
| Other | 30+ | test_smoke.py, test_action_envelope.py, test_agent_status.py, test_explanation.py, etc. | $0.00 |
| Total | 1,260+ | current suite | $0.00 |

All tests are offline. No API calls, no network, no Docker needed.

Recent Coverage Additions (v1.5.0)

  • persisted JobPlan handoff record and execution trace coverage
  • workspace join and builder delivery lifecycle coverage
  • repo-aware verification discovery coverage
  • policy-driven post-build review coverage

Running Tests

# All tests
.venv/bin/python -m pytest tests/ -q

# Specific layer
.venv/bin/python -m pytest tests/test_security_audit.py -v
.venv/bin/python -m pytest tests/test_policy_regression.py -v
.venv/bin/python -m pytest tests/test_audit_v2_regressions.py -v

# Single test class
.venv/bin/python -m pytest tests/test_integration.py::TestFinanceIntegration -v

# With coverage
.venv/bin/python -m pytest tests/ --cov=agent --cov-report=term-missing

CI Quality Gates (Hard)

All gates are hard failures: no step is suffixed with || true, so a failing gate fails the build.

| Gate | Tool | Threshold |
|---|---|---|
| Lint | ruff | 0 errors |
| Type check | mypy | 0 errors (10 core modules) |
| Tests | pytest | 0 failures |
| Test count | pytest | >= 1000 tests |
| Performance | timeout | < 60 seconds |
| Architecture | grep | No duplicate persona, no hardcoded paths, sandbox default = "1" |
| Security | pytest | test_security_audit.py passes |
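
The test count gate can be sketched as a small check on the output of `pytest --collect-only -q`. This is a hypothetical helper, not the project's CI script: the function names are illustrative, and it assumes the summary line has the form `N tests collected in X.XXs`, which recent pytest versions print.

```python
import re

MIN_TESTS = 1000  # threshold taken from the gate table above


def collected_count(collect_output: str) -> int:
    """Extract the number of collected tests from pytest's summary line."""
    match = re.search(r"(\d+) tests? collected", collect_output)
    if match is None:
        raise ValueError("no collection summary found in pytest output")
    return int(match.group(1))


def passes_count_gate(collect_output: str, minimum: int = MIN_TESTS) -> bool:
    """Return True when at least `minimum` tests were collected."""
    return collected_count(collect_output) >= minimum
```

A gate like this guards against tests being silently deleted or skipped at collection time, which a plain "0 failures" check would not notice.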

Key Test Categories

Security Tests (127)

  • Prompt injection: 10+ attack patterns in EN + SK
  • Hardcoded secrets: AST scan of all .py files
  • SQL safety: parameterized queries only
  • Sandbox isolation: Docker flags, image whitelist
  • API auth: mutation endpoints require Bearer token
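
As an illustration of what an AST-based secret scan might look like, the sketch below flags assignments of a non-empty string literal to a variable whose name looks secret-like. The name pattern and the `find_hardcoded_secrets` helper are assumptions made for this example; the actual checks in the security tests may differ.

```python
import ast
import re

# Hypothetical pattern for variable names that suggest a credential.
SECRET_NAME = re.compile(r"(password|secret|token|api_key)", re.IGNORECASE)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line number, variable name) for suspicious string assignments."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        # Only non-empty string literals count; lookups such as
        # os.environ["TOKEN"] are not ast.Constant nodes and are ignored.
        if not (isinstance(value, ast.Constant)
                and isinstance(value.value, str)
                and value.value):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and SECRET_NAME.search(target.id):
                findings.append((node.lineno, target.id))
    return findings
```

Working on the syntax tree rather than raw text avoids false positives from comments and docstrings that merely mention words like "token".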

Governance Tests (30+)

  • Tool policy: deny-by-default, channel restriction, owner-only, safe mode
  • Approval: multi-step, TTL expiry, same-person dedup
  • Operator controls: disable/enable, lockdown/unlock
  • Channel policy: trust levels, response filtering
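
The approval behaviours listed above (multi-step, TTL expiry, same-person dedup) can be illustrated with a small model. This is a sketch under stated assumptions, not the project's approval queue: `ApprovalRequest` and its fields are hypothetical, and time is passed in explicitly so the behaviour can be tested without sleeping.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ApprovalRequest:
    required: int        # number of distinct approvers needed (multi-step)
    ttl_seconds: float   # how long the request stays valid
    created_at: float = field(default_factory=time.monotonic)
    approvers: set = field(default_factory=set)

    def expired(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        return now - self.created_at > self.ttl_seconds

    def approve(self, person: str, now=None) -> bool:
        """Record an approval; return True once enough distinct people approved."""
        if self.expired(now):
            raise TimeoutError("approval request expired")
        # A set ignores repeats, so the same person cannot approve twice.
        self.approvers.add(person)
        return len(self.approvers) >= self.required
```

Tests for such a model would typically cover three cases: a repeated approval from one person does not advance the count, a second distinct approver completes the request, and any approval after the TTL is rejected.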

Regression Tests (14)

  • Restricted channel enforcement (5 tests)
  • Approval enforcement (2 tests)
  • Deny-by-default completeness (2 tests)
  • Status lifecycle — no stuck states (3 tests)
  • Channel file access policy (1 test)
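
A "no stuck states" regression check can be framed as a reachability question: from every status, some terminal status must be reachable. The sketch below assumes the lifecycle is available as a transition table; the `stuck_states` helper and the example statuses in the usage note are hypothetical and do not reflect the project's real status names.

```python
def can_finish(state, transitions, terminal):
    """Return True if some terminal state is reachable from `state`."""
    seen, stack = set(), [state]
    while stack:
        current = stack.pop()
        if current in terminal:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(transitions.get(current, ()))
    return False


def stuck_states(transitions, terminal):
    """Return every state from which no terminal state can be reached."""
    return {s for s in transitions if not can_finish(s, transitions, terminal)}
```

For example, with `{"pending": {"running"}, "running": {"done", "failed"}, "done": set(), "failed": set()}` and terminal states `{"done", "failed"}`, the result is empty; adding a status that only loops back to itself would make that status appear in the result.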
