Logging
Agent Life Space writes structured JSON events from every layer (structlog everywhere). Routing those events into useful retention windows is what this page is about.
TL;DR. Two file sinks. Long tier (~30 days) for things you'll want next month: lifecycle, build, finance, audit, security, vault, ERROR/CRITICAL/AUDIT. Short tier (~6 hours) for verbose pipeline diagnostics that you only care about while a bug is hot. A cron loop runs LogRetentionManager.prune_all() hourly and ages files out by mtime. There is no cleanup script you have to remember to run.
Code: agent/logs/logger.py (router + setup) and agent/logs/retention.py (tier resolver + prune sweep). Tests: tests/test_log_retention.py.
Operators have two retention requirements that pull in opposite directions:
- "What did the agent do last month?" — needs build/finance/audit/security events for ~30 days. These are low-volume, high-value.
- "Why did the brain pipeline behave weird in the last 30 minutes?" — needs every dispatcher hit, every cache lookup, every typing indicator, every poll cycle. Volume is huge, value drops to zero a few hours later.
Putting both into one log file means either keeping everything for 30 days (disk-heavy and grep-unfriendly), or keeping it for 6 hours and losing the audit trail. Tiered logging gives you both.
$AGENT_LOG_DIR/                        ← default: <data_dir>/logs
├── long/
│   ├── agent-long.log                 ← active long-tier file
│   ├── agent-long.log.2026-04-07
│   ├── agent-long.log.2026-04-06
│   └── ...
└── short/
    ├── agent-short.log                ← active short-tier file
    ├── agent-short.log.2026-04-08-21
    ├── agent-short.log.2026-04-08-20
    └── ...
The long file rotates daily at midnight UTC. The short file rotates hourly. Both are written by Python's stdlib TimedRotatingFileHandler.
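The two rotation schedules can be sketched with the stdlib handler alone. This is a minimal illustration, not the project's `setup_tiered_logging()`: the helper name `build_tier_handlers` and the `backupCount` value are assumptions made for the example.

```python
from __future__ import annotations

from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def build_tier_handlers(log_dir: str | Path) -> dict[str, TimedRotatingFileHandler]:
    """Create long/ and short/ under log_dir and return one handler per tier."""
    root = Path(log_dir)
    handlers: dict[str, TimedRotatingFileHandler] = {}
    # "midnight" rolls the file once a day, "H" rolls it once an hour.
    for tier, when in (("long", "midnight"), ("short", "H")):
        tier_dir = root / tier
        tier_dir.mkdir(parents=True, exist_ok=True)
        handlers[tier] = TimedRotatingFileHandler(
            str(tier_dir / f"agent-{tier}.log"),
            when=when,
            interval=1,
            backupCount=10_000,  # assumed "generous" value; deletion is left to the sweep
            utc=True,            # rotate on UTC boundaries rather than local time
            encoding="utf-8",
        )
    return handlers
```

The stdlib default suffix for hourly rotation is `%Y-%m-%d_%H` (underscore before the hour). The file names in the layout above use a dash there, so the real setup presumably overrides the handler's `suffix`; the sketch leaves the default in place.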
The cron-side LogRetentionManager then deletes any file in long/ older than AGENT_LOG_LONG_RETENTION_HOURS and any file in short/ older than AGENT_LOG_SHORT_RETENTION_HOURS. The handler's backupCount is set generously so it never races the prune sweep.
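An mtime-based sweep of one tier directory could look roughly like the following. It is a sketch of the idea only: `prune_older_than` is a hypothetical helper, and the real `LogRetentionManager.prune_all()` may differ in what it skips and what it logs.

```python
from __future__ import annotations

import time
from pathlib import Path


def prune_older_than(
    directory: str | Path,
    max_age_hours: float,
    now: float | None = None,
) -> list[Path]:
    """Delete regular files in `directory` whose mtime is older than the window."""
    current = time.time() if now is None else now
    cutoff = current - max_age_hours * 3600
    removed: list[Path] = []
    for path in sorted(Path(directory).glob("*")):
        # Only plain files are considered; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Calling it once per tier, for example `prune_older_than(log_dir / "short", 6)` and `prune_older_than(log_dir / "long", 720)`, mirrors the two retention windows. The sketch does not special-case an active file that has been idle for longer than its window.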
Every log record runs through resolve_tier(level, event) (agent/logs/retention.py). The decision is deterministic — same input, same output, no LLM, no probability.
                     ┌────────────────────────────┐
                     │ resolve_tier(level, event) │
                     └──────────────┬─────────────┘
                                    │
            ┌───────────────────────┼─────────────────────────────┐
            │                       │                             │
ERROR / CRITICAL / AUDIT?        DEBUG?                    INFO / WARNING?
            │                       │                             │
          long                      │                             │
                                    │                             │
                         event matches                event matches a long-tier
                         a short-tier prefix?         prefix (build_*, finance_*,
                         (brain_*, telegram_poll_*,   vault_*, agent_started, ...)?
                         cache_*, dispatch_*, ...)?               │
                                    │                             │
                                  short                         long
                                    │                             │
                                    └────── otherwise default = long
Why default to long? Lose nothing by default. If a new event is added to the codebase and nobody updates the prefix tables, it lands in long-tier and gets noticed in the next review. The wrong direction (silent data loss) is much harder to fix.
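The decision order described above can be written as a small pure function. The sketch below is an illustration, not the code in agent/logs/retention.py: the name `resolve_tier_sketch` and the abbreviated prefix tuples are assumptions, and DEBUG is routed through the same prefix checks as INFO/WARNING, which may not match the real resolver.

```python
# Abbreviated, illustrative prefix tuples; the real tables are larger.
LONG_PREFIXES = ("build_", "finance_", "vault_", "agent_started")
SHORT_PREFIXES = ("brain_", "telegram_poll_", "cache_", "dispatch_")
ALWAYS_LONG_LEVELS = frozenset({"ERROR", "CRITICAL", "AUDIT"})


def resolve_tier_sketch(level: str, event: str) -> str:
    """Return "long" or "short" for a (level, event) pair."""
    # Severity wins first: errors and audit records are always kept long.
    if level.upper() in ALWAYS_LONG_LEVELS:
        return "long"
    # Known high-value events stay long regardless of level.
    if event.startswith(LONG_PREFIXES):
        return "long"
    # Known noisy events go to the short tier.
    if event.startswith(SHORT_PREFIXES):
        return "short"
    # Anything unrecognised is kept rather than aged out early.
    return "long"
```

Because the function depends only on its two arguments, the same `(level, event)` pair always yields the same tier, and an event name nobody has classified yet falls through to the final `return "long"`.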
Defined in _LONG_TIER_EVENTS (frozenset). Any event whose name starts with one of these always goes long:
| Category | Prefixes |
|---|---|
| Lifecycle | agent_started, agent_stopped, agent_initialized, shutdown_*, startup_* |
| Build | build_started, build_completed, build_failed, build_codegen_*, build_acceptance_*, build_delivery_*, codegen_fallback_guard |
| Review | review_started, review_completed, review_blocked |
| Delivery / approvals | delivery_*, approval_granted, approval_denied, approval_pending |
| Finance | finance_proposal, finance_approved, finance_completed, finance_rejected, budget_hard_cap, budget_stop_loss |
| Vault / security | vault_*, auth_failure, auth_success, prompt_injection_*, command_blocked_non_owner |
| Gateway | gateway_call_*, settlement_* |
| Persistence / DB | *_storage_initialized, *_db_recovery_* |
Defined in _SHORT_TIER_EVENTS. These prefixes go to short-tier even at INFO level:
| Category | Prefixes |
|---|---|
| Brain pipeline | brain_pipeline_*, brain_cache_*, brain_rag_*, brain_dispatch_* |
| Telegram polling | telegram_poll_*, telegram_typing_* |
| Internal dispatch | dispatch_internal, dispatcher_* |
| Semantic | semantic_router_*, semantic_cache_* |
| Tool router | tool_router_* |
| Heartbeat | heartbeat_*, watchdog_tick_* |
These are exactly the events you want when you're tailing the file during an active debug session, but that you don't want filling up disk after the bug is fixed.
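Both tables write their entries in `*` notation, and two long-tier entries (`*_storage_initialized`, `*_db_recovery_*`) put the wildcard at the front, which a plain `startswith` check cannot express. One way to match every entry uniformly is glob matching from the stdlib. This is a sketch under that assumption; the source does not say how the real tables handle the leading-wildcard entries, and `matches_any` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# A few entries copied from the tables above, kept in their `*` notation.
LONG_PATTERNS = (
    "agent_started",
    "build_codegen_*",
    "vault_*",
    "*_storage_initialized",
    "*_db_recovery_*",
)
SHORT_PATTERNS = ("brain_cache_*", "telegram_poll_*", "dispatcher_*", "heartbeat_*")


def matches_any(event: str, patterns: tuple[str, ...]) -> bool:
    """True if the event name matches at least one glob-style pattern."""
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
    return any(fnmatchcase(event, pattern) for pattern in patterns)
```

With glob semantics an entry without a wildcard, such as `agent_started`, matches only that exact name, which is stricter than the "starts with" rule stated for the long-tier table.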
All env vars are read at boot by agent/__main__.py.
| Env var | Default | What |
|---|---|---|
| AGENT_LOG_TIERED | 1 | 1 enables tiered logging. 0 falls back to the legacy single-file AgentLogger. |
| AGENT_LOG_DIR | <data_dir>/logs | Where the tier subdirectories live. __main__.py pins this into the env so the cron sweep agrees. |
| AGENT_LOG_LONG_RETENTION_HOURS | 720 (30 days) | Files in long/ older than this are deleted by the cron sweep. |
| AGENT_LOG_SHORT_RETENTION_HOURS | 6 | Files in short/ older than this are deleted by the cron sweep. |
| AGENT_LOG_LONG_RETENTION_DAYS | (deprecated) | Honoured once for backward compat. Emits a deprecation warning and is internally promoted to hours. Will be removed in a future release. |
Both halves of the system (the rotating handler at boot, and the cron prune sweep at runtime) read from AGENT_LOG_LONG_RETENTION_HOURS. Setting only the deprecated _DAYS variable used to leave them out of sync — handler with one window, sweep with another. The unification was the LOW finding closed in v1.35.0.
Tested in tests/test_log_retention.py::TestRetentionEnvContractIsUnified.
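Promoting the deprecated variable to hours in one place is what keeps the handler and the sweep in agreement. A minimal sketch of such a resolver follows. The function name is hypothetical, and the precedence rule (the `_HOURS` variable wins when both are set) is an assumption rather than documented behaviour.

```python
from __future__ import annotations

import warnings
from collections.abc import Mapping

DEFAULT_LONG_HOURS = 720  # 30 days, matching the table above


def resolve_long_retention_hours(env: Mapping[str, str]) -> int:
    """Return the long-tier window in hours from an environment mapping."""
    hours = env.get("AGENT_LOG_LONG_RETENTION_HOURS")
    if hours:
        return int(hours)
    days = env.get("AGENT_LOG_LONG_RETENTION_DAYS")
    if days:
        warnings.warn(
            "AGENT_LOG_LONG_RETENTION_DAYS is deprecated; "
            "use AGENT_LOG_LONG_RETENTION_HOURS",
            DeprecationWarning,
            stacklevel=2,
        )
        return int(days) * 24  # promote days to hours
    return DEFAULT_LONG_HOURS
```

Passing `os.environ` at boot and again in the cron job would give both halves the same number, because each goes through the same function.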
def setup_tiered_logging(
    log_dir: str | Path,
    *,
    long_retention_hours: int = 720,
    short_retention_hours: int = 6,
    rotate_when: str = "midnight",
) -> dict[str, str]:
    # 1. Create long/ and short/ subdirectories.
    # 2. Build two TimedRotatingFileHandler instances.
    # 3. Wrap them in _TierRouter (a stdlib Handler that fans out to one of two child handlers).
    # 4. Drop any pre-existing StreamHandler from the root logger so we don't double-emit to stdout.
    # 5. Drop any pre-existing _TierRouter from a previous setup_tiered_logging() call (for tests).
    # 6. Add the new _TierRouter to the root logger.
    # 7. RECONFIGURE structlog to use stdlib BoundLogger + LoggerFactory so events
    #    actually go through the root logger and reach the _TierRouter.
    # 8. Return the resolved file paths so the caller can log them.

Step 7 is the one that bites every operator who tries to roll their own structlog integration. Without it, structlog stays on PrintLoggerFactory and your events go to stdout, never to disk, even though the _TierRouter is wired up correctly. The fix is in v1.35.0; tests in TestSetupTieredLoggingActuallyWritesFiles lock it in.
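Steps 3 to 6 can be illustrated with a small fan-out handler. The class below is a sketch and not the project's `_TierRouter`: the names `TierRouterSketch`, `install_router`, and the injected `pick_tier` callable are all assumptions made for the example.

```python
from __future__ import annotations

import logging
from typing import Callable


class TierRouterSketch(logging.Handler):
    """Forward each record to exactly one of two child handlers."""

    def __init__(
        self,
        long_handler: logging.Handler,
        short_handler: logging.Handler,
        pick_tier: Callable[[logging.LogRecord], str],
    ) -> None:
        super().__init__()
        self._children = {"long": long_handler, "short": short_handler}
        self._pick_tier = pick_tier

    def emit(self, record: logging.LogRecord) -> None:
        try:
            tier = self._pick_tier(record)
            # Unknown tier names fall back to long, so nothing is dropped.
            self._children.get(tier, self._children["long"]).handle(record)
        except Exception:
            self.handleError(record)

    def close(self) -> None:
        # Close both children before the router itself.
        for child in self._children.values():
            child.close()
        super().close()


def install_router(router: TierRouterSketch, logger: logging.Logger | None = None) -> None:
    """Replace stale routers and plain stream handlers, then attach `router`."""
    target = logging.getLogger() if logger is None else logger
    for handler in list(target.handlers):
        is_plain_stream = isinstance(handler, logging.StreamHandler) and not isinstance(
            handler, logging.FileHandler  # FileHandler subclasses StreamHandler
        )
        if isinstance(handler, TierRouterSketch) or is_plain_stream:
            target.removeHandler(handler)
    target.addHandler(router)
```

Step 7 is separate from this class: it corresponds to passing `structlog.stdlib.LoggerFactory()` and `structlog.stdlib.BoundLogger` to `structlog.configure`, so that structlog events become stdlib log records and reach whatever handler is installed on the root logger.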
# Most useful 90% of the time
tail -F .agent_runtime/logs/short/agent-short.log | jq .

jq . is optional but pretty-prints the JSON. Without it you get one event per line, which is also fine.
# Build events always go long
# -h drops the file-name prefix grep adds when the glob matches several files, so jq sees plain JSON
grep -h build_failed .agent_runtime/logs/long/agent-long.log* | jq .

grep -h vault_decryption_failed .agent_runtime/logs/long/agent-long.log* | jq .

grep -h approval_denied .agent_runtime/logs/long/agent-long.log* | jq -r '.timestamp + " " + .reason'

grep cron_log_retention_pruned .agent_runtime/logs/long/agent-long.log | jq .

The prune sweep emits this event whenever it actually deletes something. If you don't see it, either nothing was old enough, or the sweep is hitting the wrong directory (check AGENT_LOG_DIR matches between __main__.py and the cron output).
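When jq is not installed, the same lookups can be done with a few lines of Python over the JSON-per-line files. This is a sketch: `find_events` is a hypothetical helper, and it assumes the event name is stored under the `event` key (structlog's default), which the source does not state explicitly.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterator


def find_events(tier_dir: str | Path, base_name: str, event: str) -> Iterator[dict[str, Any]]:
    """Yield every JSON record with the given event name from a log file and its rotations."""
    for path in sorted(Path(tier_dir).glob(base_name + "*")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip lines that are not valid JSON
                if isinstance(record, dict) and record.get("event") == event:
                    yield record
```

For example, `list(find_events(".agent_runtime/logs/long", "agent-long.log", "build_failed"))` collects the same records as the first grep command above.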
# Stop the agent, edit .env, restart
echo 'AGENT_LOG_LONG_RETENTION_HOURS=2160' >> .env # 90 days
systemctl --user restart agent-life-space

Both halves of the system read the new value at boot.
- Deterministic tier resolution. Same (level, event) always lands in the same tier. No probabilistic routing.
- No double emission. A pre-existing _TierRouter is removed before adding the new one, so reconfiguring at runtime (e.g. in tests) doesn't duplicate events.
- No lost events on shutdown. The _TierRouter.close() override closes both inner handlers before tearing itself down, so pytest's ResourceWarning can't catch us mid-write.
- No silent disk fill. The cron sweep runs hourly. If the operator forgets to check disk usage for a year, the long-tier window still bounds growth.
- No leaked secrets. Every event passes through redact_secrets() before serialization. Keys, tokens, passwords don't reach disk.
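The redaction guarantee can be pictured as a key-based mask applied to the event dict before it is serialized. The sketch below is purely illustrative: the real `redact_secrets()` is not shown in this page, so the function name suffix, the marker list, and the placeholder string are all assumptions.

```python
from typing import Any

SENSITIVE_MARKERS = ("key", "token", "password", "secret")  # assumed marker list
REDACTED = "***REDACTED***"  # assumed placeholder


def redact_secrets_sketch(event: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the event dict with sensitive-looking values masked."""
    clean: dict[str, Any] = {}
    for name, value in event.items():
        if any(marker in name.lower() for marker in SENSITIVE_MARKERS):
            clean[name] = REDACTED  # mask by field name, whatever the value is
        elif isinstance(value, dict):
            clean[name] = redact_secrets_sketch(value)  # recurse into nested dicts
        else:
            clean[name] = value
    return clean
```

Matching on field names keeps the check cheap and deterministic, at the cost of missing secrets that are embedded inside free-text values.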
- It doesn't ship logs anywhere. Loki, Datadog, Splunk integration is your problem — but the JSON-per-line format is friendly to anything that consumes structured logs.
- It doesn't compress rotated files. We'd rather grep them as-is than fight zgrep. If disk pressure is real, point an external rotator at the same directory and run it after the prune sweep.
- It doesn't index events. There's no built-in search API. grep + jq is all the tooling you need at this scale.
- It doesn't replicate. The whole point is sovereign by default. If you want HA logging, set up rsync to a second box on your own.