Secrets Rotation and Key Fencing for Zero Downtime


A practical runbook to rotate API keys, signing secrets, and JWKS without breaking RAG pipelines or agent flows. It uses dual-accept windows, kid tagging, cache fences, and rollback so that rotations are invisible to users and stable under load.

When to use this page

  • Incoming 401/403 spikes during deploys or scale-out.
  • Webhook verification fails after a provider rotates its signing secret.
  • Mixed keys across regions or functions after a partial rollout.
  • Long-lived workers holding stale secrets in memory.
  • SOC or policy requires periodic rotation with audit.

Acceptance targets

  • Zero user-visible errors during the rotation window. No increase in p95 latency.
  • 0 auth failures in synthetic probes across all regions and paths.
  • Old secret fully revoked within the planned TTL and verified by logs.
  • For RAG: no citation loss and ΔS(question, retrieved) ≤ 0.45 on the probe set after rotation.

The two-keys-live pattern

Always rotate with two keys in play: the current key used for signing and a next key accepted for verification. Tag every credential with a key id so validation does not guess.

Key rules

  1. Always include a stable kid on every token or signature.
  2. Verify against active_kids = {current, next}. Sign with current only.
  3. Flip in one step by promoting next → current when error rate is flat.
  4. Revoke the old key after caches drain and long jobs finish.
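
The key rules above can be sketched with plain HMAC-SHA256. This is a minimal illustration, not a production verifier: the `KEYS` dict, the demo secret values, and the function names are assumptions made for the example, and a real deployment would resolve secrets through `secret_ref` in the secret manager.

```python
import hashlib
import hmac

# Hypothetical in-memory key ring; real secrets would come from the
# secret manager via secret_ref, never from source code.
KEYS = {
    "k_2025_08a": b"old-demo-secret",
    "k_2025_09a": b"new-demo-secret",
}
CURRENT_KID = "k_2025_08a"
ACTIVE_KIDS = {"k_2025_08a", "k_2025_09a"}  # {current, next}


def sign(payload: bytes):
    """Sign with the current key only and return (kid, hex signature)."""
    digest = hmac.new(KEYS[CURRENT_KID], payload, hashlib.sha256).hexdigest()
    return CURRENT_KID, digest


def verify(payload: bytes, kid: str, signature: str) -> bool:
    """Accept a signature only if its kid is in the active set."""
    if kid not in ACTIVE_KIDS:
        return False  # never guess a key for an unknown kid
    expected = hmac.new(KEYS[kid], payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Because the verifier selects the secret by kid, promoting next to current only changes `CURRENT_KID`; the accept set and the verification path stay untouched during the flip.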

Minimal KV schema

{
  "secrets_epoch": "2025-08-27T12:00:00Z",
  "active": { "kid": "k_2025_08a", "secret_ref": "sm://wfgyp/prod/k_2025_08a" },
  "next":   { "kid": "k_2025_09a", "secret_ref": "sm://wfgyp/prod/k_2025_09a" },
  "accept_kids": ["k_2025_08a", "k_2025_09a"],
  "revoke_after_s": 259200, 
  "notes": "roll weekly; accept_kids must be length 2 max"
}

HTTP header examples

Authorization: Bearer <access_token_with_kid>
X-Key-Id: k_2025_08a

For JOSE tokens, put kid in the JOSE header. For HMAC signatures, send the key id in X-Key-Id.
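
One way to read the key id on the receiving side is sketched below. The helper names `kid_from_jose` and `extract_kid` are hypothetical. The kid is read from the token header before any signature check, so it should be treated only as a lookup hint and never as proof of authenticity.

```python
import base64
import json
from typing import Mapping, Optional


def kid_from_jose(token: str) -> Optional[str]:
    """Read kid from the (still unverified) header of a compact JOSE token."""
    try:
        header_b64 = token.split(".", 1)[0]
        padded = header_b64 + "=" * (-len(header_b64) % 4)  # restore base64 padding
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    kid = header.get("kid") if isinstance(header, dict) else None
    return kid if isinstance(kid, str) else None


def extract_kid(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer X-Key-Id; otherwise fall back to the bearer token's JOSE header."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "x-key-id" in lowered:
        return lowered["x-key-id"]
    auth = lowered.get("authorization", "")
    if auth.startswith("Bearer "):
        return kid_from_jose(auth[len("Bearer "):])
    return None
```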


Zero-downtime rotation timeline

T-24h

  • Create next in your secret manager.
  • Update accept_kids = {current, next}.
  • Push config to all regions and functions. Do not sign with next yet.

T-0

  • Keep signing with current and keep accepting both.
  • Warm all regions. Run probes for 10 minutes.
  • Flip signer to next when probes are flat.

T+1h to T+72h

  • Keep accepting both while streams drain and long jobs finish.
  • Watch auth errors by kid.
  • After the TTL (revoke_after_s), remove the old key (the previous current) from accept_kids and revoke it in the secret manager.

Rollback

  • If errors rise after the flip, restore signer to the previous current. Leave accept set unchanged.
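
The flip, rollback, and revoke steps can be modeled as pure functions over the KV document. This is a sketch under assumptions: the `previous` field is hypothetical (it is not part of the minimal schema above) and is used here only to remember which key to roll back to or revoke.

```python
import copy


def promote(cfg: dict) -> dict:
    """Flip the signer: next becomes active. accept_kids is left unchanged."""
    out = copy.deepcopy(cfg)
    out["previous"] = out["active"]  # hypothetical field, kept for rollback
    out["active"] = out.pop("next")
    return out


def rollback(cfg: dict) -> dict:
    """Restore the previous signer. The accept set stays as it is."""
    out = copy.deepcopy(cfg)
    out["next"] = out["active"]
    out["active"] = out.pop("previous")
    return out


def revoke_old(cfg: dict) -> dict:
    """After the TTL, stop accepting the old kid."""
    out = copy.deepcopy(cfg)
    old = out.pop("previous")
    out["accept_kids"] = [k for k in out["accept_kids"] if k != old["kid"]]
    return out
```

Keeping each transition as a copy-and-return function makes it easy to diff the document before pushing it to all regions, and rollback never has to touch the accept set.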

Long-lived workers and caches

  • Serverless functions may cache secrets in memory across warm invocations. Read secrets on each invocation, or cache them with a short TTL and revalidate with a conditional check such as If-Modified-Since.
  • Edge and CDN caches can pin config. Add secrets_epoch to cache keys or use a config version header.
  • For JWKS, publish both public keys and set short cache-control. Clients must refresh on kid miss.
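
Two of the fences above are sketched here: a cache key that includes secrets_epoch, and a JWKS cache that refetches when it sees an unknown kid. The `fetch_jwks` callable and the `{kid: key}` return shape are assumptions for illustration; a real client would fetch and parse the provider's JWKS document and should rate-limit refetches so that bogus kids cannot trigger a fetch storm.

```python
from typing import Callable, Dict


def config_cache_key(route: str, secrets_epoch: str) -> str:
    """Including the epoch means a rotation naturally misses stale entries."""
    return f"{route}|secrets_epoch={secrets_epoch}"


class JwksCache:
    """Caches public keys by kid and refetches once when a kid is unknown."""

    def __init__(self, fetch_jwks: Callable[[], Dict[str, str]]):
        self._fetch = fetch_jwks  # hypothetical callable returning {kid: key}
        self._keys: Dict[str, str] = {}

    def get(self, kid: str) -> str:
        if kid not in self._keys:
            self._keys = dict(self._fetch())  # refresh on kid miss
        if kid not in self._keys:
            raise KeyError(f"unknown kid: {kid}")
        return self._keys[kid]
```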

Open: Edge Cache Invalidation


Webhook secret rotation

  • Accept signatures made with either of two secrets. If the sender provides a kid, verify with that key first. If not, try both secrets, and only inside the rotation window.
  • When a request verifies with the legacy key, still return a 2xx, and add a deprecation header to prompt the sender to upgrade keys.
  • Log the kid used for every valid request and alert when legacy kid usage drops below a threshold.
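
A receiver following these rules might look like the sketch below. The header name `X-Key-Deprecated`, the function signature, and the choice to try only the named key when a kid is present are assumptions; real providers define their own signature formats, so the HMAC-SHA256 hex digest here is illustrative only.

```python
import hashlib
import hmac
from typing import Dict, Optional, Tuple

DEPRECATION_HEADER = "X-Key-Deprecated"  # hypothetical header name


def verify_webhook(
    body: bytes,
    signature: str,
    kid: Optional[str],
    secrets: Dict[str, bytes],
    current_kid: str,
    legacy_kid: str,
    in_rotation_window: bool,
) -> Tuple[int, Dict[str, str], Optional[str]]:
    """Return (status, response headers, kid that verified the request)."""
    if kid is not None:
        candidates = [kid] if kid in secrets else []
    elif in_rotation_window:
        candidates = [current_kid, legacy_kid]  # no kid: try both, window only
    else:
        candidates = [current_kid]

    for candidate in candidates:
        expected = hmac.new(secrets[candidate], body, hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected.encode(), signature.encode()):
            # Legacy key still works, but the sender is nudged to upgrade.
            headers = {DEPRECATION_HEADER: "true"} if candidate == legacy_kid else {}
            return 200, headers, candidate
    return 401, {}, None
```

The returned kid is what the last bullet asks to log, so the caller can count legacy usage per sender.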

Third-party provider keys

  • Inject provider API keys from a secret manager at request time. Never bake into build artifacts.
  • Keep a per-provider accept_kids and test the new key against a small slice of shadow traffic before promoting.
  • If a provider rotates without kid, schedule a short freeze window and fall back to a secondary provider if available.

Related ops: Bootstrap Ordering · Pre-Deploy Collapse


Observability you must add

  • Counters of 401/403 by route, region, and kid.
  • Ratio of requests verified by current vs next.
  • Time to first successful verification with next in each region.
  • Number of long-lived executions running past the flip time.
  • JWKS cache hit and miss per kid.
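
The first two signals can be prototyped with in-process counters before wiring a real metrics backend. The counter names and the `record` helper are hypothetical; in production these would be emitted to whatever metrics system the platform already uses.

```python
from collections import Counter

auth_failures = Counter()    # keyed by (route, region, kid, status)
verified_by_kid = Counter()  # keyed by kid


def record(route: str, region: str, kid: str, status: int) -> None:
    """Count auth failures by route, region, and kid; count successes by kid."""
    if status in (401, 403):
        auth_failures[(route, region, kid, status)] += 1
    elif 200 <= status < 300:
        verified_by_kid[kid] += 1


def next_ratio(next_kid: str) -> float:
    """Share of successfully verified requests that used the next key."""
    total = sum(verified_by_kid.values())
    return verified_by_kid[next_kid] / total if total else 0.0
```

Watching `next_ratio` climb after the flip gives a direct read on how quickly traffic moves to the new key.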

CI policy to prevent unsafe rotations

Fail the build when any of the following is true:

  1. accept_kids does not contain exactly two values while a rotation is in progress.
  2. next.secret_ref is missing or unreadable.
  3. revoke_after_s is not set or exceeds the policy maximum.
  4. Routes that verify webhooks are not wired to read X-Key-Id or JOSE kid.
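
The first three rules lend themselves to a small check that a CI job could run against the KV document. This is a sketch: `MAX_REVOKE_AFTER_S` is a made-up policy value, and `can_read` stands in for whatever call confirms that the secret manager can resolve a reference. Rule 4 depends on the routing code and is not covered here.

```python
from typing import Callable, List

MAX_REVOKE_AFTER_S = 7 * 24 * 3600  # hypothetical policy maximum (7 days)


def rotation_policy_errors(cfg: dict, can_read: Callable[[str], bool]) -> List[str]:
    """Return a list of policy violations; an empty list means the build may pass."""
    errors = []
    if len(set(cfg.get("accept_kids", []))) != 2:
        errors.append("accept_kids must contain exactly two values")
    ref = cfg.get("next", {}).get("secret_ref")
    if not ref or not can_read(ref):
        errors.append("next.secret_ref is missing or unreadable")
    ttl = cfg.get("revoke_after_s")
    if not isinstance(ttl, int) or ttl <= 0 or ttl > MAX_REVOKE_AFTER_S:
        errors.append("revoke_after_s is unset or exceeds the policy maximum")
    return errors
```

A CI step would exit non-zero whenever the returned list is non-empty and print the messages as the failure reason.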

Copy-paste verifier for serverless functions

On cold start:
1. Fetch LIMITS.json and SECRETS.json.
2. Pin {accept_kids, active.kid} to memory with TTL = 60s.
3. Expose verify(req):
   - extract kid from header or token header.
   - if kid in accept_kids → choose secret by kid and verify.
   - else refresh secrets and retry once, then fail.

On each request:
- If now - secrets_cache_ts > 60s → refresh in the background.
- Emit metric: {route, region, kid_used, verified=true|false}.
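
The steps above could be turned into something like the following. It is a sketch under assumptions: `load_secrets` is a hypothetical loader that returns `{kid: secret}` for the kids in accept_kids, signatures are HMAC-SHA256 hex digests, the refresh is synchronous rather than in the background, and metric emission is left out for brevity.

```python
import hashlib
import hmac
import time
from typing import Callable, Dict

CACHE_TTL_S = 60


class CachedVerifier:
    def __init__(
        self,
        load_secrets: Callable[[], Dict[str, bytes]],  # hypothetical loader
        clock: Callable[[], float] = time.monotonic,
    ):
        self._load = load_secrets
        self._clock = clock
        self._secrets: Dict[str, bytes] = {}
        self._loaded_at = 0.0
        self.refresh()  # cold start: pin accepted keys to memory

    def refresh(self) -> None:
        self._secrets = dict(self._load())
        self._loaded_at = self._clock()

    def verify(self, kid: str, payload: bytes, signature: str) -> bool:
        if self._clock() - self._loaded_at > CACHE_TTL_S:
            self.refresh()  # synchronous here; the steps above suggest background
        if kid not in self._secrets:
            self.refresh()  # unknown kid: refresh and retry once, then fail
            if kid not in self._secrets:
                return False
        expected = hmac.new(self._secrets[kid], payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected.encode(), signature.encode())
```

Injecting the clock keeps the TTL logic testable without sleeping, and the refresh on an unknown kid lets a warm worker pick up next without waiting for the TTL to expire.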

Typical failure patterns → exact fix

  • Only one key accepted during rollout: old workers keep signing while new verifiers reject. Add dual-accept and reduce cache TTL. Open: Deployment Deadlock

  • Webhook fails because sender changed secret before receivers: allow two secrets and prefer kid when present. Add retry with jitter and short backoff. Open: Bootstrap Ordering

  • Large headers during rotation: multiple auth headers overflow 8–16 KB. Collapse to one X-Key-Id and one signature. Open: Prompt Injection

  • Silent partial revocation: some regions still accept the old key due to pinned caches. Force refresh and invalidate edge caches by secrets_epoch. Open: Edge Cache Invalidation


Verification checklist

  • Blue-green probe calls succeed with both keys before the flip.
  • After flip, 95 percent of traffic verifies with next within 10 minutes.
  • After revoke, zero successful verifications with old kid.
  • RAG probe answers remain unchanged and pass ΔS ≤ 0.45.

