Daniel Babjak edited this page Apr 8, 2026 · 1 revision

Vault

The vault stores everything Agent Life Space considers a secret: API keys, OAuth tokens, wallet mnemonics, third-party credentials, internal HMAC signing keys. It is the only file in the project that is both encrypted and required at boot.

This page is the canonical specification of the on-disk format, the cryptographic primitives, the migration story, and the failure modes. Code: agent/vault/secrets.py. Tests: tests/test_vault.py.

Format version: v2 (single file) as of v1.35.0. The legacy v1 format (raw Fernet token + sidecar salt.bin) is auto-migrated on first open. There is no v0.


Cryptographic primitives

Primitive | Choice | Source
Cipher | Fernet (AES-128-CBC + HMAC-SHA256) | cryptography.fernet.Fernet
KDF | PBKDF2-HMAC-SHA256, 480 000 iterations | cryptography.hazmat.primitives.kdf.pbkdf2.PBKDF2HMAC
Salt | 16 bytes from secrets.token_bytes(16), embedded in v2 header | stdlib secrets module
Master key | Operator-supplied via AGENT_VAULT_KEY env var (≥ 24 bytes recommended) | operator
Authenticated payload | The whole orjson-serialised secrets dict | always

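The PBKDF2 iteration count is 480 000. Note that OWASP's Password Storage Cheat Sheet has recommended 600 000 iterations for PBKDF2-HMAC-SHA256 since 2023, so check the current guidance before relying on this value. The count is a class constant used by _derive_fernet; bumping it requires a one-shot vault rotation.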

We deliberately don't roll our own AEAD. Fernet is a well-known audited construction with a clear failure mode (InvalidToken on tampered or wrong-key blobs).
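The key derivation described above can be sketched with the standard library alone. This is a minimal illustration, not the code in agent/vault/secrets.py: the function name derive_fernet_key and the example master key are hypothetical, and hashlib.pbkdf2_hmac stands in for the PBKDF2HMAC class the project uses (both compute standard PBKDF2). The result has the shape of key that cryptography.fernet.Fernet accepts.

```python
import base64
import hashlib
import secrets

PBKDF2_ITERATIONS = 480_000  # figure quoted in the primitives table
SALT_LEN = 16                # bytes, as in the v2 header


def derive_fernet_key(master_key: bytes, salt: bytes) -> bytes:
    """Stretch the master key into a 32-byte key, url-safe base64 encoded."""
    raw = hashlib.pbkdf2_hmac(
        "sha256", master_key, salt, PBKDF2_ITERATIONS, dklen=32
    )
    return base64.urlsafe_b64encode(raw)


# Hypothetical inputs for illustration only.
salt = secrets.token_bytes(SALT_LEN)
key = derive_fernet_key(b"example-master-key-not-a-real-one", salt)
```

Because the salt is an input to the derivation, the same master key yields a different Fernet key for every vault, which is why the salt has to travel with the blob.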


On-disk format (v2)

┌────────────────┬─────────────────────┬────────────────────────────────────────────┐
│ b"ALSv2\n"     │  16 bytes salt      │  Fernet token (variable length)            │
│  6 bytes magic │  random per-vault   │  base64-url, includes IV + ciphertext +    │
│                │                     │  HMAC tag + version byte                   │
└────────────────┴─────────────────────┴────────────────────────────────────────────┘
        ↑               ↑                           ↑
        │               │                           │
        │               │                           └── encrypts: orjson.dumps({"NAME": "value", ...})
        │               │
        │               └── input to PBKDF2 along with master_key
        │
        └── version magic. Any future format change picks a new magic and detects v2 by prefix.
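The layout in the diagram can be expressed as a pair of helpers. This is a sketch under the assumptions stated in the diagram (6-byte magic, 16-byte salt, the rest is the Fernet token); pack_v2 and split_v2 are hypothetical names, not functions from the project.

```python
from __future__ import annotations

V2_MAGIC = b"ALSv2\n"  # 6 bytes, as in the diagram
V2_SALT_LEN = 16


def pack_v2(salt: bytes, token: bytes) -> bytes:
    """Concatenate magic, salt, and Fernet token into one blob."""
    if len(salt) != V2_SALT_LEN:
        raise ValueError("salt must be 16 bytes")
    return V2_MAGIC + salt + token


def split_v2(raw: bytes) -> tuple[bytes, bytes] | None:
    """Return (salt, token) for a v2 blob, or None for anything else."""
    if not raw.startswith(V2_MAGIC):
        return None  # no v2 magic: treat as legacy v1
    body = raw[len(V2_MAGIC):]
    if len(body) < V2_SALT_LEN:
        raise ValueError("truncated v2 header")
    return body[:V2_SALT_LEN], body[V2_SALT_LEN:]


# Placeholder token bytes; a real token comes from Fernet.encrypt().
blob = pack_v2(b"\x01" * 16, b"gAAAAA-placeholder-token")
salt, token = split_v2(blob)
```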

The whole thing lives in one file: <AGENT_PROJECT_ROOT>/agent/vault/secrets.enc. There is no salt.bin sidecar in v2. There are no temp files, lock files, or journal files at rest.

Why one file?

The previous v1→v2 migration used a separate salt.bin and a multi-step write sequence (write secrets.enc.tmp → write salt.bin → os.replace). A SIGKILL between the salt write and the swap could leave the operator with the new salt and the old encrypted blob, which is unrecoverable on next boot. Codex flagged this as a MED finding.

Single file means the salt and the blob are physically inseparable. There is no point at which the process can crash and leave a half-applied write: the salt and the blob are replaced together by one rename, so they are atomic by construction.


Atomic writes

Every vault write goes through SecretsManager._atomic_write:

def _atomic_write(self, target: Path, data: bytes) -> None:
    tmp = target.with_suffix(target.suffix + ".tmp")
    # Clean up any leftover temp from a prior crash.
    if tmp.exists():
        tmp.unlink()
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # write all bytes
        written = 0
        view = memoryview(data)
        while written < len(view):
            n = os.write(fd, view[written:])
            written += n
        os.fsync(fd)              # contents are durable on disk
    finally:
        os.close(fd)
    os.replace(tmp, target)       # POSIX atomic rename
    # Best-effort fsync of the parent directory so the rename
    # itself is durable across power loss.
    dir_fd = os.open(target.parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

The parent-directory fsync is the key step that's missing from a naive Path.write_bytes(...) + os.replace(...) pattern. Without it, the rename can be lost on power loss even though the file contents are on disk.

Crash safety table

Crash point | On-disk state
Before any write | unchanged
Mid os.write | secrets.enc.tmp exists with partial bytes; secrets.enc unchanged. The next boot deletes the leftover temp.
After os.write, before fsync(fd) | secrets.enc.tmp may have lost the tail; secrets.enc unchanged.
After fsync(fd), before os.replace | secrets.enc.tmp is durable but unused; secrets.enc unchanged.
After os.replace, before fsync(dir) | New secrets.enc is in place and durable on most filesystems. Without the dir fsync the rename itself can still be lost on power loss, which leaves the previous good blob. The dir fsync is the belt-and-braces.
After fsync(dir) | Fully committed.

The fundamental invariant: the agent never reads a partially written secrets.enc. Either it sees the previous good blob or the new good blob.
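The invariant can be checked with a small simulation. The sketch below is a standalone copy of the temp-file plus rename pattern, not the project's method: it omits the parent-directory fsync for portability, and it simulates a crash by making os.replace raise, which is only an approximation of a real SIGKILL or power loss.

```python
import os
import tempfile
from pathlib import Path
from unittest import mock


def atomic_write(target: Path, data: bytes) -> None:
    """Write to <target>.tmp, fsync it, then rename over the target."""
    tmp = target.with_suffix(target.suffix + ".tmp")
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        written = 0
        while written < len(view):
            written += os.write(fd, view[written:])
        os.fsync(fd)
    finally:
        os.close(fd)
    os.replace(tmp, target)


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "secrets.enc"
    target.write_bytes(b"old good blob")

    # Simulate a crash after fsync(fd) but before the rename.
    with mock.patch("os.replace", side_effect=OSError("simulated crash")):
        try:
            atomic_write(target, b"new good blob")
        except OSError:
            pass
    after_crash = target.read_bytes()    # the previous blob is still there

    atomic_write(target, b"new good blob")  # an uninterrupted write
    after_success = target.read_bytes()
```

A reader of secrets.enc sees either the old bytes or the new bytes; the partially written data only ever exists under the .tmp name.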


Wrong-key fail-fast

A wrong master key — operator typo on first boot, partial .env rollback, leaked old key being tested — used to silently destroy the vault. The old _load() returned {} on InvalidToken. A subsequent set_secret would re-encrypt the empty dict with the wrong key and overwrite the legacy blob. The legacy data was unrecoverable.

The v1.35.0 fix introduces VaultDecryptionError:

def _load(self, *, allow_missing: bool = True) -> dict[str, str]:
    if not self._secrets_file.exists():
        if allow_missing:
            return {}
        raise VaultDecryptionError(...)

    try:
        raw = self._secrets_file.read_bytes()
        if raw.startswith(self._V2_HEADER):
            blob = raw[len(self._V2_HEADER) + self._V2_SALT_LEN:]
        else:
            blob = raw  # legacy v1
        return cast("dict[str, str]", orjson.loads(self._fernet.decrypt(blob)))

    except InvalidToken as e:
        raise VaultDecryptionError(
            "Vault decryption failed (wrong master key or corrupted "
            "secrets.enc). Refusing to proceed — a write in this "
            "state would silently overwrite the existing blob and "
            "destroy any secrets recoverable with the correct key."
        ) from e

Write paths (set_secret, delete_secret) call _load() directly. A wrong key raises VaultDecryptionError and the encrypted blob on disk is never touched.

Read paths (get_secret, list_secrets, has_secret) use the _safe_load_for_read() helper, which catches VaultDecryptionError and returns an empty result. The agent can therefore boot with the wrong key and log a clear warning, and the operator can fix .env without a crash.
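The split between strict write paths and tolerant read paths can be modelled in a few lines. ToyVault below is a hypothetical in-memory stand-in, not the real SecretsManager: a key_ok flag replaces actual Fernet decryption, and only the control flow is meant to match the description above.

```python
from __future__ import annotations


class VaultDecryptionError(Exception):
    """Stand-in for the exception defined in agent/vault/secrets.py."""


class ToyVault:
    """In-memory model; key_ok simulates whether decryption succeeds."""

    def __init__(self, data: dict[str, str], key_ok: bool) -> None:
        self._data = dict(data)
        self._key_ok = key_ok

    def _load(self) -> dict[str, str]:
        if not self._key_ok:
            raise VaultDecryptionError("wrong master key or corrupted blob")
        return dict(self._data)

    def _safe_load_for_read(self) -> dict[str, str]:
        try:
            return self._load()
        except VaultDecryptionError:
            return {}  # read paths degrade to "nothing found"

    def get_secret(self, name: str) -> str | None:
        return self._safe_load_for_read().get(name)

    def set_secret(self, name: str, value: str) -> None:
        current = self._load()  # raises before anything is written
        current[name] = value
        self._data = current


bad = ToyVault({"TOKEN": "abc"}, key_ok=False)
read_result = bad.get_secret("TOKEN")
try:
    bad.set_secret("TOKEN", "xyz")
    write_blocked = False
except VaultDecryptionError:
    write_blocked = True
```

The important property is the ordering inside set_secret: the load happens first, so a decryption failure aborts the call before the stored data is replaced.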

Test coverage:

TestVaultWrongKeyWriteFailFast::test_wrong_key_set_secret_raises_and_preserves_legacy
TestVaultWrongKeyWriteFailFast::test_wrong_key_delete_secret_also_raises
TestVaultWrongKeyWriteFailFast::test_wrong_key_read_path_returns_none_no_crash

Legacy v1 → v2 migration

When SecretsManager.__init__ opens an existing secrets.enc that does not start with the ALSv2\n magic, it treats the file as legacy v1.

Two legacy variants

Variant | Era | Salt source
v1 with salt.bin | 1.34-era random salt | adjacent salt.bin file
v1 without salt.bin | pre-1.34 static salt | hardcoded _LEGACY_SALT = b"agent-life-space-vault-salt-v1"

The _locate_legacy_salt() helper picks the right salt automatically.
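A plausible shape for that selection is sketched below. The real helper's signature and internals are not shown on this page, so locate_legacy_salt and its vault_dir parameter are assumptions; only the two salt sources come from the table.

```python
import tempfile
from pathlib import Path

# Static salt quoted in the table for pre-1.34 vaults.
LEGACY_STATIC_SALT = b"agent-life-space-vault-salt-v1"


def locate_legacy_salt(vault_dir: Path) -> bytes:
    """Prefer the sidecar salt.bin; fall back to the static salt."""
    salt_file = vault_dir / "salt.bin"
    if salt_file.is_file():
        return salt_file.read_bytes()
    return LEGACY_STATIC_SALT


with tempfile.TemporaryDirectory() as d:
    vault_dir = Path(d)
    static_case = locate_legacy_salt(vault_dir)          # no sidecar yet
    (vault_dir / "salt.bin").write_bytes(b"\x07" * 16)
    random_case = locate_legacy_salt(vault_dir)          # sidecar present
```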

Migration steps

_open_or_init_vault(master_key)
  │
  ├─ secrets.enc starts with ALSv2 → use the embedded salt → done
  │
  ├─ secrets.enc starts with anything else (v1) →
  │     │
  │     ├─ legacy_salt = _locate_legacy_salt()
  │     ├─ legacy_fernet = _derive_fernet(master_key, legacy_salt)
  │     │
  │     ├─ try: plaintext = legacy_fernet.decrypt(raw)
  │     │      └─ InvalidToken → wrong key → return legacy_fernet (read-only)
  │     │                                    NEVER touch the file
  │     │
  │     └─ success → _migrate_to_v2(master_key, plaintext):
  │           │
  │           ├─ new_salt = secrets_module.token_bytes(16)
  │           ├─ new_fernet = _derive_fernet(master_key, new_salt)
  │           ├─ token = new_fernet.encrypt(plaintext)
  │           ├─ v2_blob = ALSv2_HEADER + new_salt + token
  │           ├─ _atomic_write(secrets.enc, v2_blob)   ← single op
  │           ├─ _cleanup_legacy_salt_file()           ← best effort
  │           └─ self._fernet = new_fernet
  │
  └─ secrets.enc does not exist → fresh install → fresh random salt → return new_fernet

_migrate_to_v2 is the only place an existing vault is re-encrypted under a new salt. It is a single atomic write, the same _atomic_write used by every normal set_secret call. If it fails (disk full, IO error), the legacy blob stays intact and the agent runs read-only with the legacy fernet until the operator investigates.

After successful migration:

  • secrets.enc is the new v2 blob.
  • salt.bin is removed (best effort, non-fatal if it fails).
  • A vault_migrated_to_v2_single_file_format log event lands in the long-tier log file.

Test coverage:

TestLegacyV1Compat::test_v1_static_salt_vault_reads_and_migrates
TestLegacyV1Compat::test_v1_random_salt_vault_reads_and_migrates_drops_salt_file
TestLegacyV1Compat::test_v1_wrong_key_does_not_touch_file
TestVaultV2MigrationCrashSafety::test_migration_uses_atomic_swap_no_partial_state
TestVaultV2MigrationCrashSafety::test_migration_failure_leaves_legacy_blob_untouched
TestVaultV2MigrationCrashSafety::test_migration_preserves_multiple_secrets

Public API

import os

from agent.vault.secrets import SecretsManager, VaultDecryptionError

vault = SecretsManager(vault_dir="agent/vault", master_key=os.environ["AGENT_VAULT_KEY"])

vault.set_secret("ANTHROPIC_API_KEY", "sk-ant-...")     # raises VaultDecryptionError on wrong key
value = vault.get_secret("ANTHROPIC_API_KEY")           # returns None on miss / wrong key
vault.delete_secret("ANTHROPIC_API_KEY")                # raises VaultDecryptionError on wrong key
vault.has_secret("ANTHROPIC_API_KEY")                   # bool, tolerates wrong key
vault.list_secrets()                                    # list[str], tolerates wrong key
vault.is_ready                                          # bool: can encrypt/decrypt?

vault.get_audit_log()                                   # in-memory ring buffer (1000 entries)
vault.clear_cache()                                     # drop the in-memory cache
SecretsManager.generate_key()                           # str: a fresh random Fernet key

The audit log records every set / get / get_cached / get_miss / delete / list event with a UTC timestamp. It is bounded to 1000 entries; once full, the oldest entry is evicted first.
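A bounded, oldest-first buffer of this kind is commonly built on collections.deque. The sketch below shows the eviction behaviour only; the entry fields (ts, event, name) and the record function are assumptions, since this page does not document the real entry format.

```python
from collections import deque
from datetime import datetime, timezone

AUDIT_LOG_MAX = 1000
audit_log = deque(maxlen=AUDIT_LOG_MAX)  # drops the oldest entry when full


def record(event: str, name: str) -> None:
    """Append one audit entry with a UTC timestamp (hypothetical fields)."""
    audit_log.append(
        {
            "ts": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "name": name,
        }
    )


# Push 1005 events; the first five fall off the front.
for i in range(1005):
    record("get", f"SECRET_{i}")
```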


Operator workflow

First-time setup

# Generate a master key
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Put it in your .env (gitignored)
echo "AGENT_VAULT_KEY=<paste-generated-key>" >> .env
# Or store in a real password manager and source it before booting

Adding a secret

The vault is unlocked at agent boot. Once running, you have a few options:

Channel | How
Python REPL | vault = SecretsManager(master_key=...); vault.set_secret(...)
Inside the agent | tools that need a secret call agent.vault.get_secret(name)
Setup doctor | not a write surface; read only

There is no Telegram command to write the vault, by design. Adding secrets through a chat surface would risk leaking them into a transcript.

Rotating the master key

There is no in-place rotation API. To rotate:

  1. Decrypt the current vault with the existing key and record every secret's value (list_secrets() for the names, get_secret() for each one). You will need these values in step 6.
  2. Stop the agent.
  3. Set the new AGENT_VAULT_KEY in .env.
  4. Delete secrets.enc (back it up first).
  5. Boot the agent with the new key — the next set_secret call writes a fresh v2 blob with the new salt.
  6. Re-add each secret.

This is intentionally manual. Vault rotation is rare and the operator should be deliberate about it.
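The export and re-add steps can be sketched as two small helpers built only on the public methods listed under Public API (list_secrets, get_secret, set_secret). The helper names and the FakeVault class are hypothetical; FakeVault is a dict-backed stand-in so the sketch runs without the real package. With a real vault, keep the exported dict in memory only and never write it to disk unencrypted.

```python
from __future__ import annotations


def export_secrets(vault) -> dict[str, str]:
    """Read every secret out of an unlocked vault (step 1)."""
    exported: dict[str, str] = {}
    for name in vault.list_secrets():
        value = vault.get_secret(name)
        if value is None:
            # get_secret returns None on a miss or a wrong key.
            raise RuntimeError(f"could not read {name!r}; check the master key")
        exported[name] = value
    return exported


def import_secrets(vault, secrets_by_name: dict[str, str]) -> None:
    """Write every secret into the fresh vault (step 6)."""
    for name, value in secrets_by_name.items():
        vault.set_secret(name, value)


class FakeVault:
    """Dict-backed stand-in exposing the three methods used above."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def list_secrets(self) -> list[str]:
        return sorted(self._data)

    def get_secret(self, name: str) -> str | None:
        return self._data.get(name)

    def set_secret(self, name: str, value: str) -> None:
        self._data[name] = value


old_vault = FakeVault()
old_vault.set_secret("EXAMPLE_TOKEN", "value-1")
old_vault.set_secret("OTHER_TOKEN", "value-2")

exported = export_secrets(old_vault)  # with the old key, before step 2
new_vault = FakeVault()               # stands in for the vault after step 5
import_secrets(new_vault, exported)
```

Since list_secrets tolerates a wrong key and returns an empty result, an unexpectedly empty export is a sign that the vault was opened with the wrong key, not that it is safe to delete secrets.enc.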

Recovering from a corrupted vault

If secrets.enc is unreadable:

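# 1. Check if it's a key issue (most likely).
#    Load .env into the environment first, e.g.: set -a; . ./.env; set +a
python -c "import os; from agent.vault.secrets import SecretsManager; \
  m = SecretsManager(vault_dir='agent/vault', master_key=os.environ['AGENT_VAULT_KEY']); \
  print(m.get_secret('AGENT_API_KEY'))"
#    None means either a wrong key or no secret stored under that name.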

# 2. If you have a backup of secrets.enc:
mv agent/vault/secrets.enc agent/vault/secrets.enc.broken
cp /backup/secrets.enc agent/vault/

# 3. Last resort — start fresh:
mv agent/vault/secrets.enc agent/vault/secrets.enc.broken
# Boot the agent — it will create an empty v2 vault on first set_secret.
# Re-enter every secret manually.

Things the vault does NOT do

  • It does not store secrets unencrypted at rest. Never has, never will.
  • It does not log secret values. The redact_secrets() processor in agent/logs/logger.py strips known secret-like keys before any log line is written.
  • It does not transmit secrets over the network. The agent reads from the vault on demand, uses the value once, and the in-memory cache is bounded.
  • It does not allow remote unlock. The master key must be present in the process environment at boot. There is no unlock HTTP endpoint.
  • It does not have a recovery code. Lose the master key, lose the data. Back up your .env to a real password manager.
