Invalidate stale warm pods on env_vars change #80
Conversation
Adds the building blocks for env_vars invalidation. envVarsHash() produces
a stable 16-char sha256 over the encrypted agent.env_vars map (sorted
keys); empty / null / non-object inputs collapse to 'empty'. buildMeta()
stamps the result on every Sandbox CR via the lap.env_vars_hash label so
warm pods carry the env baseline they were spawned with.
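A minimal sketch of that scheme, assuming the map serializes as JSON of sorted [key, value] pairs (the actual serialization in src/server/k8s.ts may differ):

```typescript
import { createHash } from "node:crypto";

// Sketch of the hashing scheme described above; the serialization is an
// assumption, the real envVarsHash lives in src/server/k8s.ts.
export function envVarsHash(envVars: unknown): string {
  // Empty, null, and non-object inputs all collapse to the sentinel 'empty'.
  if (
    envVars === null ||
    typeof envVars !== "object" ||
    Array.isArray(envVars) ||
    Object.keys(envVars).length === 0
  ) {
    return "empty";
  }
  const map = envVars as Record<string, string>;
  // Sort keys so the hash is stable regardless of insertion order.
  const entries = Object.keys(map).sort().map((k) => [k, map[k]]);
  return createHash("sha256")
    .update(JSON.stringify(entries))
    .digest("hex")
    .slice(0, 16); // 16-char prefix is plenty for label-sized uniqueness
}
```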
deleteStaleWarmPods({ agent_id, keepHash }) lists Sandbox CRs for the
agent, filters to warm pods (litellm-warm-task-id present, no
litellm-session-id), and deletes the ones whose hash doesn't match.
Returns the names of deleted sandboxes. Session-claimed pods are never
touched — those sessions finish with the env they were born with.
Deletes the Sandbox CR (and sibling NodePort Service), not just the bare
Pod, because deleting a Pod alone would leave the Sandbox CR behind and
the controller would re-create the Pod with the stale PodSpec.
After a successful prisma.agent.update, when body.env_vars was supplied, compute the new envVarsHash and fire deleteStaleWarmPods in the background. The PATCH response doesn't wait on the cluster API — the worker's warm-pool reconciler refills within a few seconds, and any failure is logged with console.warn rather than surfaced to the user (the user already got their update saved; warm-pool refill is async infra work). PATCH bodies that don't touch env_vars skip the call entirely.
Greptile Summary

This PR closes a correctness gap where warm K8s sandbox pods spawned with an old env_vars baseline could still be claimed by new sessions after the agent's env_vars changed.
Confidence Score: 3/5

The warm-pod invalidation path has two correctness gaps that can leave stale pods alive or tear down active sessions under normal operating conditions: the sequential deletion loop in deleteStaleWarmPods aborts on the first failed delete, and the list-then-delete window can tear down a pod that a session claimed in the meantime (both detailed in the comments below).
| Filename | Overview |
|---|---|
| src/server/k8s.ts | Adds envVarsHash, LABEL_ENV_VARS_HASH stamping in buildMeta, and deleteStaleWarmPods; two issues: the sequential-delete loop leaves remaining stale pods untouched when any single delete throws, and a TOCTOU race where a pod claimed between list and delete gets torn down mid-session. |
| src/app/api/v1/managed_agents/agents/[agent_id]/route.ts | Wires deleteStaleWarmPods fire-and-forget into the PATCH handler on env_vars change; logic is correct for the K8s path but directly couples the route to @/server/k8s, bypassing the backend dispatcher. |
Reviews (1): Last reviewed commit: "feat(agents PATCH): invalidate stale war..."
```typescript
const deleted: string[] = [];
for (const item of items) {
  const name = item.metadata?.name;
  if (!name) continue;
  const labels = item.metadata?.labels ?? {};
  // Only warm pods: tagged with a warm_task_id and NOT claimed by a session.
  const isWarm = Boolean(labels[LABEL_WARM_TASK_ID]) && !labels[LABEL_SESSION_ID];
  if (!isWarm) continue;
  // Keep pods that already carry the current hash. Pods spawned before
  // this label was introduced have an empty/undefined hash, so they fall
  // through and get recycled on the next env_vars PATCH — acceptable
  // one-time churn.
  if (labels[LABEL_ENV_VARS_HASH] === keepHash) continue;
  await Promise.all([deleteSandbox(name), deleteService(name)]);
  deleted.push(name);
}
return { deleted };
```
TOCTOU race: session-claimed pod can be deleted mid-loop
deleteStaleWarmPods reads labels at list time (line 570) and uses that stale snapshot to decide whether a pod is "warm". Between the list call and the individual deleteSandbox call, the warm-pool manager could claim a pod for a new session (adding LABEL_SESSION_ID to the Sandbox's labels). The pod appears unclaimed in the snapshot, so the delete runs and tears down an active session's Sandbox. The per-pod delete in the loop should re-fetch the Sandbox or use a DeleteOptions precondition to guard against this window.
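A hedged sketch of the precondition approach, assuming @kubernetes/client-node's object-style CustomObjectsApi (the exact client signature varies by version); GROUP/VERSION/NAMESPACE/PLURAL are placeholders for the real Sandbox CRD coordinates in src/server/k8s.ts:

```typescript
import { CustomObjectsApi, KubeConfig } from "@kubernetes/client-node";

const GROUP = "lap.example.com";   // placeholder
const VERSION = "v1";              // placeholder
const NAMESPACE = "default";       // placeholder
const PLURAL = "sandboxes";        // placeholder
const LABEL_SESSION_ID = "litellm-session-id";

const kc = new KubeConfig();
kc.loadFromDefault();
const customApi = kc.makeApiClient(CustomObjectsApi);

type SandboxCR = {
  metadata?: { labels?: Record<string, string>; resourceVersion?: string };
};

// Re-read the Sandbox just before deleting and pass its resourceVersion as
// a delete precondition. If the warm-pool manager claims the pod in the
// window (adding litellm-session-id, which bumps resourceVersion), the
// apiserver rejects the delete with 409 Conflict instead of tearing down
// an active session.
async function deleteIfStillWarm(name: string): Promise<boolean> {
  const current = (await customApi.getNamespacedCustomObject({
    group: GROUP, version: VERSION, namespace: NAMESPACE, plural: PLURAL, name,
  })) as SandboxCR;

  if (current.metadata?.labels?.[LABEL_SESSION_ID]) {
    return false; // claimed since the original list call
  }

  await customApi.deleteNamespacedCustomObject({
    group: GROUP, version: VERSION, namespace: NAMESPACE, plural: PLURAL, name,
    body: { preconditions: { resourceVersion: current.metadata?.resourceVersion } },
  });
  return true;
}
```

A caller would treat a 409 from the delete as "pod was claimed, skip it" rather than as a failure.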
Partial invalidation on deletion error
The for...of loop awaits each pod deletion sequentially. If deleteSandbox or deleteService throws (any non-404 K8s error — rate-limit, transient apiserver blip, RBAC denial), the error propagates out of the loop immediately and every remaining stale pod is left untouched. The caller catches the error with console.warn and moves on, leaving the warm pool partially invalidated. Subsequent sessions can still bind to surviving stale pods. Wrapping each iteration in its own try/catch (or collecting errors and continuing) would ensure all pods are attempted.
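A sketch of that reshape, keeping the loop from the hunk above but isolating each pod's deletion; the failed list is an addition for observability:

```typescript
// Each pod's deletion is attempted independently, so one transient K8s
// error no longer abandons the rest of the stale set.
const deleted: string[] = [];
const failed: { name: string; error: unknown }[] = [];
for (const item of items) {
  const name = item.metadata?.name;
  if (!name) continue;
  const labels = item.metadata?.labels ?? {};
  const isWarm = Boolean(labels[LABEL_WARM_TASK_ID]) && !labels[LABEL_SESSION_ID];
  if (!isWarm) continue;
  if (labels[LABEL_ENV_VARS_HASH] === keepHash) continue;
  try {
    await Promise.all([deleteSandbox(name), deleteService(name)]);
    deleted.push(name);
  } catch (error) {
    // Record the failure and keep going; the caller can log the partial result.
    failed.push({ name, error });
  }
}
return { deleted, failed };
```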
```typescript
if (body.env_vars !== undefined) {
  const newHash = envVarsHash(updated.env_vars);
  void deleteStaleWarmPods({ agent_id, keepHash: newHash }).catch((err) => {
    console.warn(
      `[env_vars patch] failed to invalidate stale warm pods for ${agent_id}:`,
      err,
    );
  });
}
```
K8s-specific import bypasses the backend dispatcher
deleteStaleWarmPods is imported directly from @/server/k8s, bypassing whatever backend-agnostic dispatcher @/server/sandbox.ts exposes. On an ECS/Fargate deployment, customApi() tries to load a kubeconfig that doesn't exist. The error is silently swallowed by the .catch(console.warn), so warm-pod invalidation silently becomes a no-op for non-K8s deployments.
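A hypothetical shape for routing the call through such a dispatcher; the SANDBOX_BACKEND switch and the invalidateStaleWarmPods name are invented for illustration, not the repo's actual @/server/sandbox.ts API:

```typescript
// Hypothetical dispatcher sketch; names are illustrative only.
import * as k8s from "@/server/k8s";

export async function invalidateStaleWarmPods(args: {
  agent_id: string;
  keepHash: string;
}): Promise<{ deleted: string[] }> {
  switch (process.env.SANDBOX_BACKEND) {
    case "k8s":
      return k8s.deleteStaleWarmPods(args);
    default:
      // ECS/Fargate or any non-K8s backend: an explicit no-op, so the
      // route never triggers a kubeconfig load that cannot succeed.
      return { deleted: [] };
  }
}
```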
Summary
- Stamps the lap.env_vars_hash label on every warm pod at spawn (sha256 of the agent's env_vars in canonical form).

Why
buildContainerEnv already injects agent.env_vars into the pod spec at pod-create time. But pre-existing warm pods keep stale env (K8s container env is immutable). Sessions binding to stale warm pods don't see the new vars — observable to users as "I added GITHUB_TOKEN but the agent says it's not set". This PR closes that window deterministically.

Design choice
Chose hydrate-at-spawn + invalidate-on-change over a runtime POST /env on the harness because:
- Env set at spawn stays consistent with what operators see in kubectl describe pod, rather than drifting behind a runtime mutation endpoint.
- The harness has already read its env into process.env anyway — invalidate-and-recycle is the only way to actually guarantee correctness.

Cost: one-time ~30s warm-pool refill after each env_vars edit. In-flight sessions need a Restart to pick up new vars (intentional — quietly swapping tokens mid-conversation would be more surprising than not).
Canonical form
Hash is over the encrypted form of env_vars (i.e. what's stored in prisma.agent.env_vars). The spawn-time hash (in buildMeta) and the invalidation-time hash (in the PATCH route) both read the encrypted shape — no decrypt step in either place. Keeps the canonical representation single-sourced and avoids leaking decryption work into label computation.

Tests
The repo currently has no test infrastructure configured in package.json (no test script, no jest/vitest dep). Skipping the suggested unit tests rather than scaffolding test infra from scratch in this PR. Manual verification steps below cover the same surface.

Test plan
- Verify a session-claimed pod (litellm-session-id label set) is NOT deleted.
- A PATCH that leaves env_vars UNCHANGED (e.g. just a name update) should not delete any pods.
- Pre-existing warm pods lack the lap.env_vars_hash label; they'll be treated as "not matching" and deleted on the next env_vars PATCH. Acceptable one-time churn. (A small listing helper for inspecting the pool during these checks is sketched below.)
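A hedged verification helper for the manual test plan, reusing the placeholder GROUP/VERSION/NAMESPACE/PLURAL and customApi client from the precondition sketch above; the lap.agent_id label selector is an assumption about how Sandboxes are tagged per agent:

```typescript
// Lists an agent's Sandbox CRs and prints each pod's warm/claimed status
// and env_vars hash label. The `lap.agent_id` selector is a guess.
async function dumpWarmPoolState(agentId: string): Promise<void> {
  const res = (await customApi.listNamespacedCustomObject({
    group: GROUP, version: VERSION, namespace: NAMESPACE, plural: PLURAL,
    labelSelector: `lap.agent_id=${agentId}`,
  })) as { items: Array<{ metadata?: { name?: string; labels?: Record<string, string> } }> };

  for (const item of res.items) {
    const labels = item.metadata?.labels ?? {};
    console.log(
      item.metadata?.name,
      labels["litellm-session-id"] ? "claimed" : "warm",
      "env_vars_hash:", labels["lap.env_vars_hash"] ?? "(none)",
    );
  }
}
```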