
Testing

Daniel Babjak edited this page Apr 8, 2026 · 21 revisions

Agent Life Space is built test-first. The whole suite is offline (no API calls, no Docker required to run the unit/integration layers, no network), runs in under 30 seconds locally, and is enforced as a hard CI gate.

Total: 1762 passed, 4 skipped, 0 failures as of v1.35.0. The 4 skips are legacy semantic-router tests that need the optional sentence_transformers model — they're skipped by default, not failing.


Test pyramid

                       ▲
                       │
                       │   Security (129)
                       │   - injection, audit, invariants, vault, telegram guards, headless CLI
                       │
                       │   Architecture invariants
                       │   - no hardcoded paths, sandbox default = "1", no orchestrator imports back
                       │
                       │   Governance (60+)
                       │   - tool policy, approval queue, multi-step approval, channel policy,
                       │     operator controls, control plane, deployment contracts
                       │
                       │   Routing & Adversarial (40+)
                       │   - eval, confusion matrix, regression, semantic guard
                       │
                       │   E2E effectiveness (44)
                       │   - full agent wiring, e2e flows
                       │
                       │   Integration (34)
                       │   - cross-module flows, finance, control plane jobs
                       │
                       │   Domain (300+)
                       │   - build, review, brain, memory, finance, control plane, vault
                       │
                       │   Unit (~800)
                       │   - individual modules
                       │
                       ▼
| Layer | Tests | Files |
|---|---|---|
| Unit | ~800 | 30+ test files (one per source module) |
| Domain | ~300 | test_build_domain.py, test_review_domain.py, test_brain_core.py, ... |
| Integration | 34 | test_integration.py |
| E2E | 44 | test_e2e_effectiveness.py |
| Security | 129 | test_security.py, test_security_audit.py, test_security_invariants.py |
| Routing | 40+ | test_routing_eval.py, test_routing_adversarial.py, test_routing_confusion.py |
| Governance | 60+ | test_tool_governance.py, test_policy_regression.py, test_approval_queue.py, test_multi_step_approval.py, test_operator_controls.py, test_channel_policy.py, test_control_plane*.py, test_deployment_contracts.py |
| Memory | 30+ | test_provenance.py, test_memory_*.py |
| Finance | 20+ | test_budget_policy.py, test_risk_templates.py, test_finance_approval.py, test_proposal_lifecycle.py |
| Workspace | 15+ | test_workspace_persistence.py, test_workspace_recovery.py, test_workspace_limits.py |
| Logging | 50+ | test_logger.py, test_log_retention.py |
| Vault | 31 | test_vault.py (v2 format, legacy migration, wrong-key fail-fast, crash safety) |
| Operator API | 50+ | test_operator_api.py, test_dashboard_settlement.py |
| Telegram | 50+ | test_telegram_operator.py, test_telegram_handler.py |
| Architecture | 14 | test_architecture_invariants.py |
| Total | 1762 | across the whole tests/ tree |

All tests are offline. No API calls. Full suite token cost: $0.00.


Running tests

# Whole suite — under 30s
.venv/bin/python -m pytest tests/ -q

# A single test file
.venv/bin/python -m pytest tests/test_vault.py -v

# A single class
.venv/bin/python -m pytest tests/test_brain_core.py::TestTelegramCliProgrammingDenyGuard -v

# A single test
.venv/bin/python -m pytest tests/test_log_retention.py::TestRetentionEnvContractIsUnified::test_legacy_days_env_is_promoted_to_hours -v

# With coverage
.venv/bin/python -m pytest tests/ --cov=agent --cov-report=term-missing

# Skip slow / network tests (none in this repo, but the marker exists)
.venv/bin/python -m pytest tests/ -q -m "not slow"

With -q (quiet mode), pytest prints one character per test, a dot for each pass. tests/conftest.py configures pytest to:

  • treat DeprecationWarning as error (CI level)
  • enable asyncio mode (so async def test_* works without decorators)
  • use a temp directory for every test that needs filesystem state (no cross-test pollution)
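
To make the first bullet concrete, here is a minimal sketch of what "DeprecationWarning as error" means in practice. It uses only the standard-library `warnings` module; `legacy_api` and `call_with_strict_warnings` are hypothetical names for illustration and are not taken from this repo's conftest.py.

```python
import warnings


def legacy_api():
    """Hypothetical function that still works but emits a DeprecationWarning."""
    warnings.warn("legacy_api is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def call_with_strict_warnings(fn):
    """Call fn with DeprecationWarning promoted to an exception.

    This mirrors the effect of running pytest with -W error::DeprecationWarning.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            return ("ok", fn())
        except DeprecationWarning as exc:
            return ("error", str(exc))
```

Under the strict filter a call that would normally just log a warning fails outright, which is why a deprecated dependency call shows up as a test failure rather than as noise in the output.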

CI gates (hard, no || true)

GitHub Actions runs everything below on every push and PR. Any failure blocks merge.

| Gate | Tool | Threshold |
|---|---|---|
| Lint | ruff check agent/ tests/ | 0 errors |
| Auto-fix imports | ruff check --fix --select I | applied automatically |
| Type check | mypy agent --ignore-missing-imports | 0 errors across 112 source files |
| Tests | pytest tests/ -q -W error::DeprecationWarning | 0 failures |
| Test count | pytest -q final line | ≥ 1300 (currently 1762) |
| Performance | timeout 90 pytest tests/ | Under 90 seconds wallclock |
| Architecture invariants | pytest tests/test_architecture_invariants.py -v | All pass |
| Security audit | pytest tests/test_security_audit.py -v | All pass |
| Review eval (smoke) | pytest tests/test_review_eval_smoke.py -q | All pass |
| Review eval (golden) | pytest tests/test_review_eval_golden.py -q | All pass |
| Release readiness | python -m agent --release-readiness ... | ready: true (with AGENT_RELEASE_READINESS_SKIP_LLM_PROBE=1 on CI runners) |
| Operator typecheck | cd operator && npm ci && npm run typecheck | 0 errors |

The release-readiness gate skips the live LLM probe on CI because the GitHub Actions runners don't have the Claude CLI installed. The skipped probe is still recorded in the readiness payload — see Logging and the AGENT_RELEASE_READINESS_SKIP_LLM_PROBE env var.
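
The test-count gate listed above reads the final line of `pytest -q` and compares it against a floor. The sketch below shows one way such a check could work; the function names are hypothetical, the sample summary line is illustrative, and the actual CI workflow may parse the output differently.

```python
import re

MIN_TESTS = 1300  # floor taken from the "Test count" gate


def passed_count(summary_line: str) -> int:
    """Extract the 'N passed' count from pytest's final summary line."""
    match = re.search(r"(\d+) passed", summary_line)
    if match is None:
        raise ValueError(f"no 'passed' count in: {summary_line!r}")
    return int(match.group(1))


def meets_test_count_gate(summary_line: str, minimum: int = MIN_TESTS) -> bool:
    """Return True when the number of passing tests is at or above the floor."""
    return passed_count(summary_line) >= minimum
```

A floor like this guards against a refactor that silently stops collecting a whole test file: the suite would still be green, but the count would drop.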


Architecture invariants

tests/test_architecture_invariants.py enforces structural rules at CI time. These are the rules that "obviously" should be true but degrade silently if you don't test them:

| Test | Invariant |
|---|---|
| test_no_hardcoded_paths | No os.path.expanduser("~") or hardcoded /Users/ / /home/ paths in agent/ |
| test_no_duplicate_persona | The system prompt template appears in exactly one place |
| test_sandbox_default_is_one | AGENT_SANDBOX_ONLY defaults to "1" in llm_provider.py |
| test_skip_permissions_is_guarded | --dangerously-skip-permissions is always behind a sandbox check |
| test_no_orchestrator_import_back | No module under agent/build/, agent/review/, agent/control/, etc. imports AgentOrchestrator directly |
| test_security_model_doc_exists | docs/SECURITY_MODEL.md is present and covers required sections |
| test_no_circular_imports | The dependency graph has no cycles at module level |
| test_storage_layers_use_parameterized_queries | No raw f"SELECT ... {var}" patterns in storage layers |

These tests catch architecture drift early. If a refactor accidentally introduces a circular import or a hardcoded path, CI tells you on the first push.
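
As an illustration of how an invariant such as test_no_orchestrator_import_back can be checked statically, the sketch below parses source text with the standard-library `ast` module and looks for a `from ... import AgentOrchestrator` statement. The helper name is hypothetical, and the real test may walk the file tree or match imports differently.

```python
import ast

FORBIDDEN_NAME = "AgentOrchestrator"


def imports_forbidden_name(source: str, forbidden: str = FORBIDDEN_NAME) -> bool:
    """Return True if the source contains `from <module> import <forbidden>`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            # alias.name is the imported symbol, regardless of any `as` rename.
            if any(alias.name == forbidden for alias in node.names):
                return True
    return False
```

Working on the syntax tree rather than on raw text means a mention of the name in a comment or string does not trigger a false positive. This sketch does not catch `import package.module` followed by attribute access; a stricter check would need to handle that case too.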


Security tests (129)

Three files, all run on every commit:

tests/test_security.py (66 tests)

| Class | Tests |
|---|---|
| TestPromptInjection | 10+ attack patterns (EN + SK), hard block + soft block |
| TestSafeMode | Non-owner restrictions in groups, command list filtering |
| TestOwnerEnforcement | Whitelist enforcement, owner-only commands, identity capture |
| TestChannelPolicy | Trust levels, response filtering per channel |
| TestApprovalGate | Approval-required tools, multi-step paths |

tests/test_security_audit.py (50+ tests)

| Class | Tests |
|---|---|
| TestNoHardcodedSecrets | AST scan for embedded API keys, tokens, passwords |
| TestSqlSafety | All queries use parameterization; dynamic DDL uses whitelist |
| TestEvalExecBan | No eval() / exec() in agent/ |
| TestVaultIntegration | Vault is encrypted at rest, fail-fast without key |
| TestSandboxIsolation | Docker flags present, image whitelist enforced |
| TestApiAuth | Mutation endpoints require Bearer; no ?key= fallback |
| TestLogRedaction | Secrets never reach log output |
| TestSubprocessSafety | All subprocess calls quoted, no shell injection |
| TestEnvVarSecurity | No os.environ[KEY] (use .get(KEY, default)) |
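
Several of these audits are source scans rather than behavioural tests. As a minimal sketch of the idea behind a ban on eval() / exec(), the following uses the standard-library `ast` module to find direct calls. `find_banned_calls` is a hypothetical helper, not the repo's implementation.

```python
import ast

BANNED_CALLS = {"eval", "exec"}


def find_banned_calls(source: str) -> list[tuple[str, int]]:
    """Return (name, line_number) for each direct eval()/exec() call in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)  # bare name, not obj.method()
            and node.func.id in BANNED_CALLS
        ):
            hits.append((node.func.id, node.lineno))
    return sorted(hits, key=lambda hit: hit[1])
```

Because only bare-name calls are matched, `ast.literal_eval(...)` is not flagged: its callee is an attribute access, not the builtin `eval`.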

tests/test_security_invariants.py (13 tests)

Architecture-level invariants pulled out from the broader audit suite:

  • Sandbox default = "1"
  • --dangerously-skip-permissions is guarded
  • Tool capability manifest is complete (every executor tool has an entry)
  • HIGH risk tools are owner-only AND safe-mode-blocked
  • EXTERNAL side-effect tools are owner-only
  • Read-only tools have NONE side effects
  • Persona module is the only place that defines the system prompt
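
The manifest rules in the list above can be expressed as a single pass over a capability table. The sketch below assumes a hypothetical `ToolCapability` record; the field names, the non-HIGH risk levels, and the "LOCAL" side-effect value are illustrative assumptions, not the repo's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCapability:
    """Hypothetical manifest entry; field names are assumptions."""
    risk: str          # e.g. "LOW", "MEDIUM", "HIGH"
    side_effect: str   # e.g. "NONE", "LOCAL", "EXTERNAL"
    read_only: bool
    owner_only: bool
    safe_mode_blocked: bool


def manifest_violations(manifest, executor_tools):
    """Return human-readable violations; an empty list means all rules hold."""
    violations = []
    # Completeness: every executor tool needs a manifest entry.
    for tool_name in sorted(set(executor_tools) - set(manifest)):
        violations.append(f"{tool_name}: missing manifest entry")
    for name in sorted(manifest):
        cap = manifest[name]
        if cap.risk == "HIGH" and not (cap.owner_only and cap.safe_mode_blocked):
            violations.append(f"{name}: HIGH risk tool must be owner-only and safe-mode-blocked")
        if cap.side_effect == "EXTERNAL" and not cap.owner_only:
            violations.append(f"{name}: EXTERNAL side-effect tool must be owner-only")
        if cap.read_only and cap.side_effect != "NONE":
            violations.append(f"{name}: read-only tool must have NONE side effects")
    return violations
```

A test built on a helper like this simply asserts that the returned list is empty, so a newly added tool without the right flags fails CI with a message naming the tool and the rule it broke.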

Notable test files added in v1.35.0

| File | Purpose |
|---|---|
| tests/test_vault.py | v2 format spec, legacy v1 migration, wrong-key fail-fast, crash safety (31 tests) |
| tests/test_log_retention.py | tier resolver, retention manager, env contract unification (50+ tests) |
| tests/test_llm_runtime.py | runtime override resolver, brain backend resolution (8 tests) |
| tests/test_brain_core.py::TestTelegramCliProgrammingDenyGuard | 5 scenarios for the brain fail-closed guard |
| tests/test_brain_core.py::TestExplicitWorkQueueDetector | 10 scenarios for the anti-echo work-queue guard |
| tests/test_brain_core.py::TestShortFollowupGetsHistory | regression for "ano" losing conversation history |
| tests/test_telegram_operator.py::TestTypingIndicatorCleanup | typing task leak detection |
| tests/test_telegram_operator.py::TestAgentCronStop | cron stop awaits cancelled tasks |
| tests/test_review_domain.py::TestReviewStorageSqlHardening | SQL injection guard for review storage |

Test fixtures

All test fixtures use neutral hostnames (acme-host-*, example.com, agent-test) — no operator-specific identifiers. A fresh clone of the repo has zero personal data baked into tests.

This was a deliberate cleanup in v1.35.0 after the credential leak post-mortem. See docs/SECURITY_INCIDENT_2026-04-07.md.


What we don't test

  • Live LLM calls. Every test uses a MagicMock or a recorded fixture. The release-readiness gate runs a single live probe on operator hosts, not in CI.
  • Docker container execution. Sandbox tests verify the flags passed to docker run, not that an actual container runs. Container-level tests are out-of-scope.
  • Cross-process integration. SQLite WAL behaviour with two processes is tested at the unit level (locking, retry) but not via real concurrent processes.
  • Network ops at scale. Rate limiting is tested with mocked clocks. Real load testing is the operator's job.
  • Telegram Bot API. Mocked. We test our handler, not Telegram's wire protocol.
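
To show what "verify the flags passed to docker run, not that an actual container runs" looks like, here is a minimal sketch using `unittest.mock.patch`. `run_in_sandbox` is a hypothetical launcher, and the flags `--rm`, `--network=none`, and `--read-only` are illustrative choices; the repo's actual flag set is not documented on this page.

```python
import subprocess
from unittest.mock import patch


def run_in_sandbox(image: str, command: list[str]) -> None:
    """Hypothetical sandbox launcher: builds and runs a locked-down `docker run`."""
    argv = ["docker", "run", "--rm", "--network=none", "--read-only", image, *command]
    subprocess.run(argv, check=True)


def captured_docker_argv(image: str, command: list[str]) -> list[str]:
    """Invoke the launcher with subprocess.run mocked out; return the argv it built."""
    with patch("subprocess.run") as fake_run:
        run_in_sandbox(image, command)
    fake_run.assert_called_once()
    # First positional argument of the single recorded call is the argv list.
    return fake_run.call_args.args[0]
```

Nothing is executed here, so the check runs on a machine without Docker installed. The trade-off is the one stated above: the test proves the right command line is constructed, not that the container behaves as intended.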

Adding a new test

  1. Find the right file. Module → tests/test_<module>.py. New cross-cutting concern → new file with a descriptive name (e.g. tests/test_log_retention.py).
  2. Use the existing fixtures. tmp_path and monkeypatch are pytest built-ins, and conftest.py already sets up async support. Don't reinvent.
  3. Name the class after the behaviour. TestVaultV2Format, not TestVault. Each class is one specific contract.
  4. Make it deterministic. No real time (time.time()), no real network, no real Docker. Use freezegun or monkeypatch if you need clock control.
  5. Run it. pytest tests/test_<your_file>.py -v.
  6. Confirm full suite still passes. pytest tests/ -q. CI runs this for you, but better to know locally first.
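
Steps 3 and 4 can be sketched together: a class named after one behaviour, with the clock pinned through pytest's `monkeypatch` fixture so the result never depends on when the test runs. `is_expired` and the class below are hypothetical examples, not code from this repo.

```python
import time


def is_expired(created_at: float, ttl_seconds: float) -> bool:
    """Hypothetical code under test: reads the wall clock via time.time()."""
    return time.time() - created_at >= ttl_seconds


class TestTokenExpiresAfterTtl:
    """One class, one contract: a token expires exactly when its TTL elapses."""

    def test_not_expired_before_ttl(self, monkeypatch):
        # Pin the clock to 59 seconds after creation.
        monkeypatch.setattr(time, "time", lambda: 1_000.0 + 59.0)
        assert not is_expired(created_at=1_000.0, ttl_seconds=60.0)

    def test_expired_at_ttl(self, monkeypatch):
        # Pin the clock to exactly the TTL boundary.
        monkeypatch.setattr(time, "time", lambda: 1_000.0 + 60.0)
        assert is_expired(created_at=1_000.0, ttl_seconds=60.0)
```

pytest restores `time.time` automatically after each test, so the patch cannot leak into other tests.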

Test count history

| Release | Tests |
|---|---|
| v1.35.0 | 1762 |
| v1.34.0 | ~1734 |
| v1.33.0 | ~1700 |
| v1.32.0 | ~1670 |
| v1.31.0 | 1631 |
| v1.30.0 | 1627 |
| v1.29.0 | 1617 |
| v1.25.0 | 1541 |

Tests grow with features. We don't delete tests when features change — we update them. The suite is the contract, not the documentation.
