Testing
Agent Life Space is built test-first. The whole suite is offline (no API calls, no Docker required to run the unit/integration layers, no network), runs in under 30 seconds locally, and is enforced as a hard CI gate.
Total: 1762 passed, 4 skipped, 0 failures as of v1.35.0. The 4 skips are legacy semantic-router tests that need the optional `sentence_transformers` model; they are skipped by default, not failing.
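Skipping a test when an optional dependency is absent can be sketched with the standard library alone. This is an illustration only: the class and test names are hypothetical, and the real suite may rely on a pytest marker or `pytest.importorskip` instead.

```python
import importlib.util
import unittest

# True only when the optional package is importable in this environment.
HAS_SENTENCE_TRANSFORMERS = (
    importlib.util.find_spec("sentence_transformers") is not None
)


class TestSemanticRouterLegacy(unittest.TestCase):  # hypothetical name
    @unittest.skipUnless(
        HAS_SENTENCE_TRANSFORMERS,
        "optional sentence_transformers model not installed",
    )
    def test_router_needs_optional_model(self):
        # Only reached when the optional dependency is present.
        self.assertTrue(HAS_SENTENCE_TRANSFORMERS)


def run_suite() -> unittest.TestResult:
    """Run the class above and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        TestSemanticRouterLegacy
    )
    result = unittest.TestResult()
    suite.run(result)
    return result
```

A skipped test is reported separately from a failure, which is why the run still counts as green when the model is missing.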
▲
│
│ Security (129)
│ - injection, audit, invariants, vault, telegram guards, headless CLI
│
│ Architecture invariants
│ - no hardcoded paths, sandbox default = "1", no orchestrator imports back
│
│ Governance (60+)
│ - tool policy, approval queue, multi-step approval, channel policy,
│ operator controls, control plane, deployment contracts
│
│ Routing & Adversarial (40+)
│ - eval, confusion matrix, regression, semantic guard
│
│ E2E effectiveness (44)
│ - full agent wiring, e2e flows
│
│ Integration (34)
│ - cross-module flows, finance, control plane jobs
│
│ Domain (300+)
│ - build, review, brain, memory, finance, control plane, vault
│
│ Unit (~800)
│ - individual modules
│
▼
| Layer | Tests | Files |
|---|---|---|
| Unit | ~800 | 30+ test files (one per source module) |
| Domain | ~300 | `test_build_domain.py`, `test_review_domain.py`, `test_brain_core.py`, ... |
| Integration | 34 | `test_integration.py` |
| E2E | 44 | `test_e2e_effectiveness.py` |
| Security | 129 | `test_security.py`, `test_security_audit.py`, `test_security_invariants.py` |
| Routing | 40+ | `test_routing_eval.py`, `test_routing_adversarial.py`, `test_routing_confusion.py` |
| Governance | 60+ | `test_tool_governance.py`, `test_policy_regression.py`, `test_approval_queue.py`, `test_multi_step_approval.py`, `test_operator_controls.py`, `test_channel_policy.py`, `test_control_plane*.py`, `test_deployment_contracts.py` |
| Memory | 30+ | `test_provenance.py`, `test_memory_*.py` |
| Finance | 20+ | `test_budget_policy.py`, `test_risk_templates.py`, `test_finance_approval.py`, `test_proposal_lifecycle.py` |
| Workspace | 15+ | `test_workspace_persistence.py`, `test_workspace_recovery.py`, `test_workspace_limits.py` |
| Logging | 50+ | `test_logger.py`, `test_log_retention.py` |
| Vault | 31 | `test_vault.py` (v2 format, legacy migration, wrong-key fail-fast, crash safety) |
| Operator API | 50+ | `test_operator_api.py`, `test_dashboard_settlement.py` |
| Telegram | 50+ | `test_telegram_operator.py`, `test_telegram_handler.py` |
| Architecture | 14 | `test_architecture_invariants.py` |
| Total | 1762 | across the whole `tests/` tree |
All tests are offline. No API calls. Full suite token cost: $0.00.
# Whole suite — under 30s
.venv/bin/python -m pytest tests/ -q
# A single test file
.venv/bin/python -m pytest tests/test_vault.py -v
# A single class
.venv/bin/python -m pytest tests/test_brain_core.py::TestTelegramCliProgrammingDenyGuard -v
# A single test
.venv/bin/python -m pytest tests/test_log_retention.py::TestRetentionEnvContractIsUnified::test_legacy_days_env_is_promoted_to_hours -v
# With coverage
.venv/bin/python -m pytest tests/ --cov=agent --cov-report=term-missing
# Skip slow / network tests (none in this repo, but the marker exists)
.venv/bin/python -m pytest tests/ -q -m "not slow"

The default `-q` mode prints one dot per test. `tests/conftest.py` configures pytest to:
- treat `DeprecationWarning` as an error (CI level)
- enable asyncio mode (so `async def test_*` works without decorators)
- use a temp directory for every test that needs filesystem state (no cross-test pollution)
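The effect of treating `DeprecationWarning` as an error can be reproduced with the standard `warnings` module. The sketch below is not taken from `conftest.py`; `legacy_helper` is a hypothetical function, and the filter is scoped to one block rather than the whole test session.

```python
import warnings


def legacy_helper() -> int:
    """Hypothetical function that still emits a DeprecationWarning."""
    warnings.warn("legacy_helper is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def call_with_strict_warnings() -> str:
    """Call legacy_helper with DeprecationWarning escalated to an exception."""
    with warnings.catch_warnings():
        # Comparable to running pytest with -W error::DeprecationWarning,
        # but limited to this block.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            legacy_helper()
        except DeprecationWarning as exc:
            return f"blocked: {exc}"
    return "allowed"
```

Under the strict filter the call never returns its value; the warning surfaces as an exception, which is what turns a deprecated call into a failing test.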
GitHub Actions runs everything below on every push and PR. Any failure blocks merge.
| Gate | Tool | Threshold |
|---|---|---|
| Lint | `ruff check agent/ tests/` | 0 errors |
| Auto-fix imports | `ruff check --fix --select I` | applied automatically |
| Type check | `mypy agent --ignore-missing-imports` | 0 errors across 112 source files |
| Tests | `pytest tests/ -q -W error::DeprecationWarning` | 0 failures |
| Test count | `pytest -q` final line | ≥ 1300 (currently 1762) |
| Performance | `timeout 90 pytest tests/` | Under 90 seconds wallclock |
| Architecture invariants | `pytest tests/test_architecture_invariants.py -v` | All pass |
| Security audit | `pytest tests/test_security_audit.py -v` | All pass |
| Review eval (smoke) | `pytest tests/test_review_eval_smoke.py -q` | All pass |
| Review eval (golden) | `pytest tests/test_review_eval_golden.py -q` | All pass |
| Release readiness | `python -m agent --release-readiness ...` | `ready: true` (with `AGENT_RELEASE_READINESS_SKIP_LLM_PROBE=1` on CI runners) |
| Operator typecheck | `cd operator && npm ci && npm run typecheck` | 0 errors |
The release-readiness gate skips the live LLM probe on CI because the GitHub Actions runners don't have the Claude CLI installed. The skipped probe is still recorded in the readiness payload — see Logging and the AGENT_RELEASE_READINESS_SKIP_LLM_PROBE env var.
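The test-count gate reads the final line of `pytest -q` and compares it with the 1300 floor. A minimal sketch of that check follows, assuming the summary line contains the usual `N passed` fragment; the function names are hypothetical and the real workflow may well do this in shell.

```python
import re

MIN_TESTS = 1300  # floor quoted in the gate table


def passed_count(summary_line: str) -> int:
    """Extract the number of passed tests from pytest's final -q line."""
    match = re.search(r"(\d+) passed", summary_line)
    if match is None:
        raise ValueError(f"no 'N passed' fragment in: {summary_line!r}")
    return int(match.group(1))


def count_gate_ok(summary_line: str, minimum: int = MIN_TESTS) -> bool:
    """True when the suite ran at least `minimum` passing tests."""
    return passed_count(summary_line) >= minimum
```

For example, `count_gate_ok("1762 passed, 4 skipped in 24.81s")` is true (the timing here is made up for illustration), while a summary reporting 1299 passed tests would fail the gate even with zero failures.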
tests/test_architecture_invariants.py enforces structural rules at CI time. These are the rules that "obviously" should be true but degrade silently if you don't test them:
| Test | Invariant |
|---|---|
| `test_no_hardcoded_paths` | No `os.path.expanduser("~")` or hardcoded `/Users/` / `/home/` paths in `agent/` |
| `test_no_duplicate_persona` | The system prompt template appears in exactly one place |
| `test_sandbox_default_is_one` | `AGENT_SANDBOX_ONLY` defaults to `"1"` in `llm_provider.py` |
| `test_skip_permissions_is_guarded` | `--dangerously-skip-permissions` is always behind a sandbox check |
| `test_no_orchestrator_import_back` | No module under `agent/build/`, `agent/review/`, `agent/control/`, etc. imports `AgentOrchestrator` directly |
| `test_security_model_doc_exists` | `docs/SECURITY_MODEL.md` is present and covers required sections |
| `test_no_circular_imports` | The dependency graph has no cycles at module level |
| `test_storage_layers_use_parameterized_queries` | No raw `f"SELECT ... {var}"` patterns in storage layers |
These tests catch architecture drift early. If a refactor accidentally introduces a circular import or a hardcoded path, CI tells you on the first push.
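A structural check such as the hardcoded-path rule can be approximated with a line scan over source text. The patterns below are modelled on the description of `test_no_hardcoded_paths`; they are an assumption, and the actual test may use a different (for example AST-based) approach.

```python
import re

# Assumed patterns: a home-directory expansion and quoted absolute
# paths under /Users/ or /home/.
FORBIDDEN = [
    re.compile(r"os\.path\.expanduser\(\s*['\"]~"),
    re.compile(r"['\"]/(?:Users|home)/"),
]


def find_hardcoded_paths(source: str) -> list[int]:
    """Return 1-based line numbers that match a forbidden path pattern."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FORBIDDEN):
            hits.append(lineno)
    return hits
```

An invariant test would read every file under `agent/`, call a function like this on each, and assert that the combined list of hits is empty.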
Three files, all run on every commit: `test_security.py`, `test_security_audit.py`, and `test_security_invariants.py`.
| Class | Tests |
|---|---|
| `TestPromptInjection` | 10+ attack patterns (EN + SK), hard block + soft block |
| `TestSafeMode` | Non-owner restrictions in groups, command list filtering |
| `TestOwnerEnforcement` | Whitelist enforcement, owner-only commands, identity capture |
| `TestChannelPolicy` | Trust levels, response filtering per channel |
| `TestApprovalGate` | Approval-required tools, multi-step paths |

| Class | Tests |
|---|---|
| `TestNoHardcodedSecrets` | AST scan for embedded API keys, tokens, passwords |
| `TestSqlSafety` | All queries use parameterization; dynamic DDL uses whitelist |
| `TestEvalExecBan` | No `eval()` / `exec()` in `agent/` |
| `TestVaultIntegration` | Vault is encrypted at rest, fail-fast without key |
| `TestSandboxIsolation` | Docker flags present, image whitelist enforced |
| `TestApiAuth` | Mutation endpoints require Bearer; no `?key=` fallback |
| `TestLogRedaction` | Secrets never reach log output |
| `TestSubprocessSafety` | All subprocess calls quoted, no shell injection |
| `TestEnvVarSecurity` | No `os.environ[KEY]` (use `.get(KEY, default)`) |
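An `eval()` / `exec()` ban of the kind `TestEvalExecBan` describes can be sketched with the standard `ast` module. This is a simplified illustration, not the project's implementation: it only flags direct calls by name and would miss aliased or attribute-based calls such as `builtins.eval`.

```python
import ast

BANNED_CALLS = {"eval", "exec"}


def find_banned_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for every direct eval()/exec() call in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BANNED_CALLS
        ):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)
```

Working on the syntax tree rather than on raw text means that the word `eval` in a comment or a string literal does not trigger a false positive.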
Architecture-level invariants pulled out from the broader audit suite:
- Sandbox default = `"1"`
- `--dangerously-skip-permissions` is guarded
- Tool capability manifest is complete (every executor tool has an entry)
- HIGH risk tools are owner-only AND safe-mode-blocked
- EXTERNAL side-effect tools are owner-only
- Read-only tools have NONE side effects
- Persona module is the only place that defines the system prompt
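The tool-related invariants in the list above lend themselves to a single table-driven check. The sketch below assumes a manifest shape that is not described in the source: `ToolSpec`, its fields, and the enum values are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


class SideEffect(Enum):
    NONE = "none"
    LOCAL = "local"
    EXTERNAL = "external"


@dataclass(frozen=True)
class ToolSpec:  # hypothetical manifest entry
    name: str
    risk: Risk
    side_effect: SideEffect
    owner_only: bool
    safe_mode_blocked: bool
    read_only: bool = False


def manifest_violations(
    manifest: dict[str, ToolSpec], executor_tools: set[str]
) -> list[str]:
    """Return one message per broken invariant; empty means compliant."""
    problems = []
    # Completeness: every executor tool needs a manifest entry.
    for name in sorted(executor_tools - set(manifest)):
        problems.append(f"{name}: missing manifest entry")
    for spec in manifest.values():
        if spec.risk is Risk.HIGH and not (
            spec.owner_only and spec.safe_mode_blocked
        ):
            problems.append(
                f"{spec.name}: HIGH risk tool must be owner-only and safe-mode-blocked"
            )
        if spec.side_effect is SideEffect.EXTERNAL and not spec.owner_only:
            problems.append(f"{spec.name}: EXTERNAL side effect must be owner-only")
        if spec.read_only and spec.side_effect is not SideEffect.NONE:
            problems.append(f"{spec.name}: read-only tool must have NONE side effects")
    return problems
```

Returning messages instead of raising on the first problem lets one test run report every violation at once.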

| File | Purpose |
|---|---|
| `tests/test_vault.py` | v2 format spec, legacy v1 migration, wrong-key fail-fast, crash safety (31 tests) |
| `tests/test_log_retention.py` | tier resolver, retention manager, env contract unification (50+ tests) |
| `tests/test_llm_runtime.py` | runtime override resolver, brain backend resolution (8 tests) |
| `tests/test_brain_core.py::TestTelegramCliProgrammingDenyGuard` | 5 scenarios for the brain fail-closed guard |
| `tests/test_brain_core.py::TestExplicitWorkQueueDetector` | 10 scenarios for the anti-echo work-queue guard |
| `tests/test_brain_core.py::TestShortFollowupGetsHistory` | regression for "ano" losing conversation history |
| `tests/test_telegram_operator.py::TestTypingIndicatorCleanup` | typing task leak detection |
| `tests/test_telegram_operator.py::TestAgentCronStop` | cron stop awaits cancelled tasks |
| `tests/test_review_domain.py::TestReviewStorageSqlHardening` | SQL injection guard for review storage |
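The idea behind a SQL injection guard can be shown with an in-memory SQLite database. The table and column names here are hypothetical and unrelated to the actual review storage schema; the point is only that a bound parameter is treated as data, never as SQL.

```python
import sqlite3


def find_reviews_by_author(conn: sqlite3.Connection, author: str) -> list[tuple]:
    """Look up rows with a bound parameter instead of string formatting."""
    # The '?' placeholder makes sqlite3 pass `author` strictly as a value.
    cur = conn.execute(
        "SELECT id, author FROM reviews WHERE author = ?", (author,)
    )
    return cur.fetchall()


def demo() -> tuple[list[tuple], list[tuple]]:
    """Run one normal lookup and one classic injection attempt."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, author TEXT)")
    conn.executemany(
        "INSERT INTO reviews (author) VALUES (?)", [("alice",), ("bob",)]
    )
    normal = find_reviews_by_author(conn, "alice")
    injected = find_reviews_by_author(conn, "alice' OR '1'='1")
    conn.close()
    return normal, injected
```

The injection string matches no row because it is compared literally against the `author` column; built with an f-string, the same input would have returned every row.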
All test fixtures use neutral hostnames (acme-host-*, example.com, agent-test) — no operator-specific identifiers. A fresh clone of the repo has zero personal data baked into tests.
This was a deliberate cleanup in v1.35.0 after the credential leak post-mortem. See docs/SECURITY_INCIDENT_2026-04-07.md.
Deliberately out of scope for the suite:
- Live LLM calls. Every test uses a `MagicMock` or a recorded fixture. The release-readiness gate runs a single live probe on operator hosts, not in CI.
- Docker container execution. Sandbox tests verify the flags passed to `docker run`, not that an actual container runs. Container-level tests are out of scope.
- Cross-process integration. SQLite WAL behaviour with two processes is tested at the unit level (locking, retry) but not via real concurrent processes.
- Network ops at scale. Rate limiting is tested with mocked clocks. Real load testing is the operator's job.
- Telegram Bot API. Mocked. We test our handler, not Telegram's wire protocol.
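Checking the flags passed to `docker run` without starting a container can be sketched with `unittest.mock`. Everything project-specific below is an assumption: `run_in_sandbox`, the image whitelist, and the particular set of flags are hypothetical stand-ins for whatever the real sandbox launcher does.

```python
import subprocess
from unittest.mock import MagicMock, patch

ALLOWED_IMAGES = {"python:3.12-slim"}  # hypothetical whitelist


def run_in_sandbox(image: str, command: list[str]) -> subprocess.CompletedProcess:
    """Hypothetical launcher that builds a locked-down `docker run` call."""
    if image not in ALLOWED_IMAGES:
        raise ValueError(f"image not whitelisted: {image}")
    argv = [
        "docker", "run", "--rm",
        "--network", "none",  # no network access inside the container
        "--read-only",        # root filesystem mounted read-only
        image, *command,
    ]
    return subprocess.run(argv, capture_output=True, check=False)


def sandbox_argv_for(image: str, command: list[str]) -> list[str]:
    """Capture the argv that would reach docker, without running it."""
    with patch("subprocess.run", return_value=MagicMock(returncode=0)) as fake_run:
        run_in_sandbox(image, command)
    return fake_run.call_args.args[0]
```

Because `subprocess.run` is replaced for the duration of the call, the check works on machines without Docker installed, which keeps it inside the offline suite.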
To add a test:
- Find the right file. Module → `tests/test_<module>.py`. New cross-cutting concern → new file with a descriptive name (e.g. `tests/test_log_retention.py`).
- Use the existing fixtures. `conftest.py` has `tmp_path`, `monkeypatch`, async support. Don't reinvent.
- Name the class after the behaviour. `TestVaultV2Format`, not `TestVault`. Each class is one specific contract.
- Make it deterministic. No real time (`time.time()`), no real network, no real Docker. Use `freezegun` or `monkeypatch` if you need clock control.
- Run it. `pytest tests/test_<your_file>.py -v`.
- Confirm the full suite still passes. `pytest tests/ -q`. CI runs this for you, but it is better to know locally first.
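The naming and determinism rules above can be illustrated with a small self-contained case. The helper `is_expired` and the class name are hypothetical, and the sketch uses `unittest.mock.patch` from the standard library where the project would more likely use the `monkeypatch` fixture or `freezegun`.

```python
import time
import unittest
from unittest.mock import patch


def is_expired(created_at: float, ttl_seconds: float) -> bool:
    """Hypothetical helper under test; it reads the wall clock."""
    return time.time() - created_at > ttl_seconds


class TestExpiryUsesStrictTtlBoundary(unittest.TestCase):
    """One class, one contract: expiry happens strictly after the TTL."""

    def test_not_expired_at_exact_ttl(self):
        # Clock pinned to 1000.0, so the age is exactly 100 seconds.
        with patch("time.time", return_value=1_000.0):
            self.assertFalse(is_expired(created_at=900.0, ttl_seconds=100.0))

    def test_expired_one_second_later(self):
        # Clock pinned to 1001.0, so the age is 101 seconds.
        with patch("time.time", return_value=1_001.0):
            self.assertTrue(is_expired(created_at=900.0, ttl_seconds=100.0))
```

With the clock pinned, both tests give the same result on every run and on every machine, regardless of how slow the runner is.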
| Release | Tests |
|---|---|
| v1.35.0 | 1762 |
| v1.34.0 | ~1734 |
| v1.33.0 | ~1700 |
| v1.32.0 | ~1670 |
| v1.31.0 | 1631 |
| v1.30.0 | 1627 |
| v1.29.0 | 1617 |
| v1.25.0 | 1541 |
Tests grow with features. We don't delete tests when features change — we update them. The suite is the contract, not the documentation.