Logging

Daniel Babjak edited this page Apr 8, 2026 · 1 revision

Tiered logging

Agent Life Space writes structured JSON events from every layer (structlog everywhere). Routing those events into useful retention windows is what this page is about.

TL;DR. Two file sinks. Long tier (~30 days) for things you'll want next month: lifecycle, build, finance, audit, security, vault, ERROR/CRITICAL/AUDIT. Short tier (~6 hours) for verbose pipeline diagnostics that you only care about while a bug is hot. A cron loop runs LogRetentionManager.prune_all() hourly and ages files out by mtime. There is no cleanup script you have to remember to run.

Code: agent/logs/logger.py (router + setup) and agent/logs/retention.py (tier resolver + prune sweep). Tests: tests/test_log_retention.py.


Why two tiers

Operators have two retention requirements that pull in opposite directions:

  1. "What did the agent do last month?" This needs build/finance/audit/security events for ~30 days. These are low-volume, high-value.
  2. "Why did the brain pipeline behave weirdly in the last 30 minutes?" This needs every dispatcher hit, every cache lookup, every typing indicator, every poll cycle. Volume is huge, and the value drops to zero a few hours later.

Putting both into one log file means either keeping everything for 30 days (disk-heavy and grep-unfriendly), or keeping it for 6 hours and losing the audit trail. Tiered logging gives you both.


File layout

$AGENT_LOG_DIR/                      ← default: <data_dir>/logs
├── long/
│   ├── agent-long.log               ← active long-tier file
│   ├── agent-long.log.2026-04-07
│   ├── agent-long.log.2026-04-06
│   └── ...
└── short/
    ├── agent-short.log              ← active short-tier file
    ├── agent-short.log.2026-04-08-21
    ├── agent-short.log.2026-04-08-20
    └── ...

The long file rotates daily at midnight UTC. The short file rotates hourly. Both are written by Python's stdlib TimedRotatingFileHandler.
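
As a rough illustration of the rotation schedule, the two handlers could be built along these lines. This is a minimal sketch, not the code in agent/logs/logger.py: the function name build_tier_handlers and the backup_count value are assumptions made for the example. It also keeps the stdlib default suffixes, and the stdlib hourly suffix is %Y-%m-%d_%H, so the hyphenated short-tier names in the layout above imply a customized suffix that this sketch does not reproduce.

```python
from __future__ import annotations

from logging.handlers import TimedRotatingFileHandler
from pathlib import Path


def build_tier_handlers(
    log_dir: Path, backup_count: int = 1000
) -> tuple[TimedRotatingFileHandler, TimedRotatingFileHandler]:
    """Return (long_handler, short_handler) rooted under log_dir.

    Hypothetical helper: the name and the backup_count default are
    placeholders chosen for this sketch.
    """
    long_dir = log_dir / "long"
    short_dir = log_dir / "short"
    long_dir.mkdir(parents=True, exist_ok=True)
    short_dir.mkdir(parents=True, exist_ok=True)

    # Daily rotation at midnight; utc=True pins the boundary to UTC.
    long_handler = TimedRotatingFileHandler(
        long_dir / "agent-long.log",
        when="midnight",
        utc=True,
        backupCount=backup_count,
        encoding="utf-8",
        delay=True,  # open the file lazily, on the first record
    )
    # Hourly rotation for the verbose short tier.
    short_handler = TimedRotatingFileHandler(
        short_dir / "agent-short.log",
        when="H",
        interval=1,
        utc=True,
        backupCount=backup_count,
        encoding="utf-8",
        delay=True,
    )
    return long_handler, short_handler
```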

The cron-side LogRetentionManager then deletes any file in long/ older than AGENT_LOG_LONG_RETENTION_HOURS and any file in short/ older than AGENT_LOG_SHORT_RETENTION_HOURS. The handler's backupCount is set generously so it never races the prune sweep.
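
The mtime-based sweep described above can be pictured with a small helper. This is a sketch under stated assumptions, not LogRetentionManager itself: the function name prune_older_than and its signature are hypothetical, and the real sweep may treat the active file or subdirectories differently.

```python
from __future__ import annotations

import time
from pathlib import Path


def prune_older_than(
    directory: Path, retention_hours: float, now: float | None = None
) -> list[Path]:
    """Delete regular files in directory whose mtime is older than the window.

    Hypothetical helper for illustration; returns the paths it removed.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_hours * 3600
    removed: list[Path] = []
    if not directory.is_dir():
        return removed
    for path in sorted(directory.iterdir()):
        # Only plain files are candidates; age is judged by mtime alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Called once per tier, for example prune_older_than(log_dir / "short", 6), this bounds each directory by its own retention window.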


Tier resolver

Every log record runs through resolve_tier(level, event) (agent/logs/retention.py). The decision is deterministic — same input, same output, no LLM, no probability.

                       ┌──────────────────────────┐
                       │ resolve_tier(level, event)│
                       └────────────┬─────────────┘
                                    │
            ┌───────────────────────┼─────────────────────────────┐
            │                       │                             │
   ERROR / CRITICAL / AUDIT?    DEBUG?                       INFO / WARNING?
            │                       │                             │
          long                      │                             │
                                    │                             │
                            event matches              event matches a long-tier
                            a short-tier prefix?       prefix (build_*, finance_*,
                            (brain_*, telegram_poll_*,  vault_*, agent_started, ...)?
                             cache_*, dispatch_*, ...)?     │
                                    │                       │
                                  short                    long
                                    │                       │
                                    └────── otherwise default = long

Why default to long? Lose nothing by default. If a new event is added to the codebase and nobody updates the prefix tables, it lands in long-tier and gets noticed in the next review. The wrong direction (silent data loss) is much harder to fix.
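
A minimal sketch of the decision, assuming the checks run in the order shown (level first, then long-tier prefixes, then short-tier prefixes, then the default). The prefix tuples are an abbreviated subset taken from the diagram, and their names are placeholders; the real tables live in agent/logs/retention.py and the real check order may differ.

```python
# Abbreviated, illustrative tables; not the real _LONG_TIER_EVENTS / _SHORT_TIER_EVENTS.
ALWAYS_LONG_LEVELS = frozenset({"ERROR", "CRITICAL", "AUDIT"})
LONG_PREFIXES = ("agent_started", "build_", "finance_", "vault_")
SHORT_PREFIXES = ("brain_", "telegram_poll_", "cache_", "dispatch_")


def resolve_tier(level: str, event: str) -> str:
    """Return "long" or "short" for a (level, event) pair. Pure function."""
    if level.upper() in ALWAYS_LONG_LEVELS:
        return "long"
    # Assumed order: a long-tier prefix wins over a short-tier prefix.
    if event.startswith(LONG_PREFIXES):
        return "long"
    if event.startswith(SHORT_PREFIXES):
        return "short"
    # Unknown events default to long so nothing is lost silently.
    return "long"
```

Because the function depends only on its arguments, a table-driven unit test over (level, event) pairs is enough to pin the routing down.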

Long-tier event prefixes

Defined in _LONG_TIER_EVENTS (frozenset). Any event whose name starts with one of these always goes long:

| Category | Prefixes |
|---|---|
| Lifecycle | agent_started, agent_stopped, agent_initialized, shutdown_*, startup_* |
| Build | build_started, build_completed, build_failed, build_codegen_*, build_acceptance_*, build_delivery_*, codegen_fallback_guard |
| Review | review_started, review_completed, review_blocked |
| Delivery / approvals | delivery_*, approval_granted, approval_denied, approval_pending |
| Finance | finance_proposal, finance_approved, finance_completed, finance_rejected, budget_hard_cap, budget_stop_loss |
| Vault / security | vault_*, auth_failure, auth_success, prompt_injection_*, command_blocked_non_owner |
| Gateway | gateway_call_*, settlement_* |
| Persistence / DB | *_storage_initialized, *_db_recovery_* |

Short-tier event prefixes

Defined in _SHORT_TIER_EVENTS. These prefixes go to short-tier even at INFO level:

| Category | Prefixes |
|---|---|
| Brain pipeline | brain_pipeline_*, brain_cache_*, brain_rag_*, brain_dispatch_* |
| Telegram polling | telegram_poll_*, telegram_typing_* |
| Internal dispatch | dispatch_internal, dispatcher_* |
| Semantic | semantic_router_*, semantic_cache_* |
| Tool router | tool_router_* |
| Heartbeat | heartbeat_*, watchdog_tick_* |

These are exactly the events you want when you're tailing the file during an active debug session, but that you don't want filling up disk after the bug is fixed.


Configuration

All env vars are read at boot by agent/__main__.py.

| Env var | Default | Description |
|---|---|---|
| AGENT_LOG_TIERED | 1 | 1 enables tiered logging. 0 falls back to the legacy single-file AgentLogger. |
| AGENT_LOG_DIR | <data_dir>/logs | Where the tier subdirectories live. __main__.py pins this into the env so the cron sweep agrees. |
| AGENT_LOG_LONG_RETENTION_HOURS | 720 (30 days) | Files in long/ older than this are deleted by the cron sweep. |
| AGENT_LOG_SHORT_RETENTION_HOURS | 6 | Files in short/ older than this are deleted by the cron sweep. |
| AGENT_LOG_LONG_RETENTION_DAYS | (deprecated) | Honoured once for backward compat. Emits a deprecation warning and is internally promoted to hours. Will be removed in a future release. |

A note on the env contract

Both halves of the system (the rotating handler at boot, and the cron prune sweep at runtime) read from AGENT_LOG_LONG_RETENTION_HOURS. Setting only the deprecated _DAYS variable used to leave them out of sync — handler with one window, sweep with another. The unification was the LOW finding closed in v1.35.0.

Tested in tests/test_log_retention.py::TestRetentionEnvContractIsUnified.
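
One way to picture the unified contract is a single resolver that both the boot-time handler setup and the cron sweep would call. This is a sketch, not the project's code: the function name resolve_long_retention_hours is hypothetical, and the precedence (HOURS wins when both variables are set) is an assumption.

```python
from __future__ import annotations

import os
import warnings
from collections.abc import Mapping

DEFAULT_LONG_RETENTION_HOURS = 720  # 30 days, per the table above


def resolve_long_retention_hours(env: Mapping[str, str] | None = None) -> int:
    """Resolve the long-tier window in hours from an env mapping."""
    env = os.environ if env is None else env
    hours = env.get("AGENT_LOG_LONG_RETENTION_HOURS")
    if hours:
        # Assumed precedence: the HOURS variable always wins.
        return int(hours)
    days = env.get("AGENT_LOG_LONG_RETENTION_DAYS")
    if days:
        warnings.warn(
            "AGENT_LOG_LONG_RETENTION_DAYS is deprecated; "
            "use AGENT_LOG_LONG_RETENTION_HOURS",
            DeprecationWarning,
            stacklevel=2,
        )
        return int(days) * 24  # promote days to hours
    return DEFAULT_LONG_RETENTION_HOURS
```

Routing both consumers through one function like this is what keeps the handler and the sweep from drifting apart.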


How setup_tiered_logging actually works

from __future__ import annotations

from pathlib import Path


def setup_tiered_logging(
    log_dir: str | Path,
    *,
    long_retention_hours: int = 720,
    short_retention_hours: int = 6,
    rotate_when: str = "midnight",
) -> dict[str, str]:
    # 1. Create long/ and short/ subdirectories.
    # 2. Build two TimedRotatingFileHandler instances.
    # 3. Wrap them in _TierRouter (a stdlib Handler that fans out to one of two child handlers).
    # 4. Drop any pre-existing StreamHandler from the root logger so we don't double-emit to stdout.
    # 5. Drop any pre-existing _TierRouter from a previous setup_tiered_logging() call (for tests).
    # 6. Add the new _TierRouter to the root logger.
    # 7. RECONFIGURE structlog to use stdlib BoundLogger + LoggerFactory so events
    #    actually go through the root logger and reach the _TierRouter.
    # 8. Return the resolved file paths so the caller can log them.
    ...  # outline only; the implementation lives in agent/logs/logger.py

Step 7 is the one that bites every operator who tries to roll their own structlog integration. Without it, structlog stays on PrintLoggerFactory and your events go to stdout, never to disk — even though the _TierRouter is wired up correctly. The fix is in v1.35.0; tests in TestSetupTieredLoggingActuallyWritesFiles lock it in.


Operator workflow

Tail the live debug log

# Most useful 90% of the time
tail -F .agent_runtime/logs/short/agent-short.log | jq .

jq . is optional but pretty-prints the JSON. Without it you get one event per line, which is also fine.

Find a build failure from yesterday

# Build events always go long; -h drops grep's filename prefix so jq sees pure JSON
grep -h build_failed .agent_runtime/logs/long/agent-long.log* | jq .

Find every wrong-key vault attempt this month

grep -h vault_decryption_failed .agent_runtime/logs/long/agent-long.log* | jq .

Find every approval that was denied last week

grep -h approval_denied .agent_runtime/logs/long/agent-long.log* | jq -r '.timestamp + " " + .reason'
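
For scripted reports, the grep-and-jq lookups above can be approximated in Python. This is a sketch, assuming each line is one JSON object with the event name under an "event" key (structlog's default); the helper name iter_events is hypothetical.

```python
from __future__ import annotations

import json
from collections.abc import Iterator
from pathlib import Path


def iter_events(
    log_dir: Path, event: str, pattern: str = "agent-long.log*"
) -> Iterator[dict]:
    """Yield parsed records named `event` from the active and rotated files."""
    for path in sorted(log_dir.glob(pattern)):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if event not in line:  # cheap substring pre-filter, like grep
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip partially written or non-JSON lines
                if isinstance(record, dict) and record.get("event") == event:
                    yield record
```

For example, iterating iter_events(Path(".agent_runtime/logs/long"), "approval_denied") and printing each record's timestamp and reason fields mirrors the jq expression above.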

See what got pruned in the last hour

grep cron_log_retention_pruned .agent_runtime/logs/long/agent-long.log | jq .

The prune sweep emits this event whenever it actually deletes something. If you don't see it, either nothing was old enough, or the sweep is hitting the wrong directory (check that AGENT_LOG_DIR matches between __main__.py and the cron output).

Override retention temporarily

# Append the override to .env, then restart the agent
echo 'AGENT_LOG_LONG_RETENTION_HOURS=2160' >> .env  # 90 days
systemctl --user restart agent-life-space

Both halves of the system read the new value at boot.


Things the tiered logger guarantees

  • Deterministic tier resolution. Same (level, event) always lands in the same tier. No probabilistic routing.
  • No double emission. A pre-existing _TierRouter is removed before adding the new one, so reconfiguring at runtime (e.g. in tests) doesn't duplicate events.
  • No lost events on shutdown. The _TierRouter.close() override closes both inner handlers before tearing itself down — pytest's ResourceWarning can't catch us mid-write.
  • No silent disk fill. The cron sweep runs hourly. If the operator forgets to check disk usage for a year, the long-tier window still bounds growth.
  • No leaked secrets. Every event passes through redact_secrets() before serialization. Keys, tokens, passwords don't reach disk.
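
The no-double-emission and close-on-shutdown guarantees can be illustrated with a small fan-out handler. This is a sketch, not the project's _TierRouter: the class name SketchTierRouter, the install_router helper, and the way the event name is recovered from the record (parsing the JSON message) are all assumptions.

```python
import json
import logging


class SketchTierRouter(logging.Handler):
    """Fan each record out to exactly one of two child handlers."""

    def __init__(self, long_handler, short_handler, resolve):
        super().__init__()
        self._long = long_handler
        self._short = short_handler
        self._resolve = resolve  # callable (level, event) -> "long" | "short"

    def emit(self, record):
        message = record.getMessage()
        try:
            event = json.loads(message).get("event", "")
        except (ValueError, AttributeError):
            event = message  # not a JSON object; fall back to the raw text
        tier = self._resolve(record.levelname, event)
        target = self._short if tier == "short" else self._long
        target.handle(record)

    def close(self):
        # Close both children before tearing the router itself down.
        try:
            self._long.close()
            self._short.close()
        finally:
            super().close()


def install_router(router, logger=None):
    """Attach router, first removing and closing any earlier router."""
    logger = logger or logging.getLogger()
    for existing in list(logger.handlers):
        if isinstance(existing, SketchTierRouter):
            logger.removeHandler(existing)
            existing.close()
    logger.addHandler(router)
```

Calling install_router twice leaves a single router attached, which is the property that keeps repeated setup calls in tests from duplicating events.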

Things it doesn't do

  • It doesn't ship logs anywhere. Loki, Datadog, Splunk integration is your problem — but the JSON-per-line format is friendly to anything that consumes structured logs.
  • It doesn't compress rotated files. We'd rather grep them as-is than fight zgrep. If disk pressure is real, point an external rotator at the same directory and run it after the prune sweep.
  • It doesn't index events. There's no built-in search API. grep + jq is all the tooling you need at this scale.
  • It doesn't replicate. The whole point is to be sovereign by default. If you want HA logging, set up rsync to a second box on your own.
