
Invalidate stale warm pods on env_vars change #80

Open
ishaan-berri wants to merge 2 commits into main from litellm_env-vars-hydration

Conversation

@ishaan-berri
Contributor

Summary

  • Adds a lap.env_vars_hash label to every warm pod at spawn (sha256 of the agent's env_vars in canonical form).
  • On PATCH /agents/:id with env_vars changed, deletes warm pods whose hash label doesn't match the new value.
  • The worker's warm-pool reconciler refills with fresh pods that have the current env baked in.

Why

buildContainerEnv already injects agent.env_vars into the pod spec at pod-create time. But pre-existing warm pods keep stale env (K8s container env is immutable). Sessions binding to stale warm pods don't see the new vars — observable to users as "I added GITHUB_TOKEN but the agent says it's not set". This PR closes that window deterministically.

Design choice

Chose hydrate-at-spawn + invalidate-on-change over a runtime POST /env on the harness because:

  • Pod env determined at one point in time (pod-create) — auditable via kubectl describe pod.
  • No new HTTP surface on sandbox pods.
  • Already-running subprocesses can't see a runtime-mutated process.env anyway — invalidate-and-recycle is the only way to actually guarantee correctness.
  • Agent-pinned warm pools stay agent-pinned; no cross-agent leak risk introduced.

Cost: one-time ~30s warm-pool refill after each env_vars edit. In-flight sessions need a Restart to pick up new vars (intentional — quietly swapping tokens mid-conversation would be more surprising than requiring a restart).

Canonical form

Hash is over the encrypted form of env_vars (i.e. what's stored in prisma.agent.env_vars). The spawn-time hash (in buildMeta) and the invalidation-time hash (in the PATCH route) both read the encrypted shape — no decrypt step in either place. Keeps the canonical representation single-sourced and avoids leaking decryption work into label computation.
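For reference, a minimal sketch of what this hashing could look like, assuming Node's built-in crypto module and a plain string map as the encrypted env_vars shape (the real helper lives in src/server/k8s.ts and may differ in detail):

import { createHash } from "crypto";

// Sketch: sort keys, hash the encrypted key/value pairs as a canonical JSON
// string, and truncate to 16 hex chars so the result fits in a K8s label value.
// Empty, null, or non-object inputs collapse to the literal string "empty".
function envVarsHash(envVars: unknown): string {
  if (!envVars || typeof envVars !== "object" || Array.isArray(envVars)) return "empty";
  const entries = Object.entries(envVars as Record<string, string>).sort(([a], [b]) => a.localeCompare(b));
  if (entries.length === 0) return "empty";
  return createHash("sha256").update(JSON.stringify(entries)).digest("hex").slice(0, 16);
}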

Tests

The repo currently has no test infrastructure configured in package.json (no test script, no jest/vitest dep). Skipping the suggested unit tests rather than scaffolding test infra from scratch in this PR. Manual verification steps below cover the same surface.

Test plan

  • Create an agent, spawn a session — confirm the pod has the agent's current env_vars.
  • PATCH the agent with new env_vars. Within ~5s, all idle warm pods for that agent should be deleted; new ones should spawn with the new vars.
  • Verify a pod that's already claimed by a session (litellm-session-id label set) is NOT deleted.
  • PATCH with env_vars UNCHANGED (e.g. just a name update) should not delete any pods.
  • Existing pods migration: pods spawned before this PR don't have the lap.env_vars_hash label; they'll be treated as "not matching" and deleted on the next env_vars PATCH. Acceptable one-time churn.

…Pods

Adds the building blocks for env_vars invalidation. envVarsHash() produces
a stable 16-char sha256 over the encrypted agent.env_vars map (sorted
keys); empty / null / non-object inputs collapse to 'empty'. buildMeta()
stamps the result on every Sandbox CR via the lap.env_vars_hash label so
warm pods carry the env baseline they were spawned with.
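As an illustration of the label stamping described above (the label name matches the one used elsewhere in this PR; the surrounding buildMeta shape is an assumption for this sketch):

const LABEL_ENV_VARS_HASH = "lap.env_vars_hash";

// Illustrative only: buildMeta merges the hash (via envVarsHash, sketched earlier)
// into the labels the Sandbox CR is created with, so each warm pod records the
// env baseline it was spawned from.
function buildMeta(agent: { env_vars: unknown }, baseLabels: Record<string, string>) {
  return {
    labels: {
      ...baseLabels,
      [LABEL_ENV_VARS_HASH]: envVarsHash(agent.env_vars),
    },
  };
}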

deleteStaleWarmPods({ agent_id, keepHash }) lists Sandbox CRs for the
agent, filters to warm pods (litellm-warm-task-id present, no
litellm-session-id), and deletes the ones whose hash doesn't match.
Returns the names of deleted sandboxes. Session-claimed pods are never
touched — those sessions finish with the env they were born with.

Deletes the Sandbox CR (and sibling NodePort Service), not just the bare
Pod, because deleting a Pod alone would leave the Sandbox CR behind and
the controller would re-create the Pod with the stale PodSpec.
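In rough outline, that corresponds to a helper shaped like the sketch below; the listSandboxes call and the agent-id label selector are illustrative stand-ins, and the concrete filtering loop is quoted verbatim in the review thread further down.

// Hypothetical signature and contract for the helper described above.
async function deleteStaleWarmPods(opts: {
  agent_id: string;
  keepHash: string;
}): Promise<{ deleted: string[] }> {
  // List this agent's Sandbox CRs (helper and label name assumed for illustration).
  const items = await listSandboxes({ labelSelector: `agent-id=${opts.agent_id}` });
  // Filter to warm, unclaimed pods and delete hash mismatches
  // (see the loop quoted in the Greptile comment below).
  return { deleted: [] };
}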
After a successful prisma.agent.update, when body.env_vars was supplied,
compute the new envVarsHash and fire deleteStaleWarmPods in the
background. The PATCH response doesn't wait on the cluster API — the
worker's warm-pool reconciler refills within a few seconds, and any
failure is logged with console.warn rather than surfaced to the user
(the user already got their update saved; warm-pool refill is async
infra work). PATCH bodies that don't touch env_vars skip the call
entirely.

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR closes a correctness gap where warm K8s sandbox pods spawned with an old env_vars payload would continue serving sessions after an agent's env_vars were updated via PATCH, because container env is immutable once a pod is running.

  • Adds envVarsHash (SHA-256 over sorted encrypted entries, truncated to 16 hex chars) and stamps every newly-spawned Sandbox CR with a lap.env_vars_hash label at pod-create time via buildMeta.
  • Introduces deleteStaleWarmPods, called fire-and-forget from the PATCH route after a successful DB update, which lists warm Sandbox CRs for the agent and deletes any whose hash label doesn't match the new value; session-claimed pods are explicitly skipped.

Confidence Score: 3/5

The warm-pod invalidation path has two correctness gaps that can leave stale pods alive or tear down active sessions under normal operating conditions.

The sequential deletion loop in deleteStaleWarmPods propagates the first K8s error out of the for...of immediately, abandoning every remaining stale pod. A TOCTOU race also allows a pod claimed for a new session to be deleted mid-session.

src/server/k8s.ts — specifically the deleteStaleWarmPods loop (lines 566-582).

Important Files Changed

  • src/server/k8s.ts: Adds envVarsHash, LABEL_ENV_VARS_HASH stamping in buildMeta, and deleteStaleWarmPods; two issues: the sequential-delete loop leaves remaining stale pods when any single delete throws, and a TOCTOU race where a pod claimed between list and delete gets torn down mid-session.
  • src/app/api/v1/managed_agents/agents/[agent_id]/route.ts: Wires deleteStaleWarmPods fire-and-forget into the PATCH handler on env_vars change; logic is correct for the K8s path but directly couples the route to @/server/k8s, bypassing the backend dispatcher.

Reviews (1): Last reviewed commit: "feat(agents PATCH): invalidate stale war..."

Comment thread src/server/k8s.ts
Comment on lines +566 to +582
const deleted: string[] = [];
for (const item of items) {
  const name = item.metadata?.name;
  if (!name) continue;
  const labels = item.metadata?.labels ?? {};
  // Only warm pods: tagged with a warm_task_id and NOT claimed by a session.
  const isWarm = Boolean(labels[LABEL_WARM_TASK_ID]) && !labels[LABEL_SESSION_ID];
  if (!isWarm) continue;
  // Keep pods that already carry the current hash. Pods spawned before
  // this label was introduced have an empty/undefined hash, so they fall
  // through and get recycled on the next env_vars PATCH — acceptable
  // one-time churn.
  if (labels[LABEL_ENV_VARS_HASH] === keepHash) continue;
  await Promise.all([deleteSandbox(name), deleteService(name)]);
  deleted.push(name);
}
return { deleted };

P1 TOCTOU race: session-claimed pod can be deleted mid-loop

deleteStaleWarmPods reads labels at list time (line 570) and uses that stale snapshot to decide whether a pod is "warm". Between the list call and the individual deleteSandbox call, the warm-pool manager could claim a pod for a new session (adding LABEL_SESSION_ID to the Sandbox's labels). The pod appears unclaimed in the snapshot, so the delete runs and tears down an active session's Sandbox. The per-pod delete in the loop should re-fetch the Sandbox or use a deleteOption precondition to guard against this window.
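A sketch of the re-check approach, placed inside the existing per-pod loop just before the delete; getSandbox is a hypothetical single-Sandbox read helper, and a resourceVersion precondition on the delete call would close the window more tightly but needs the client's delete-options plumbing:

// Inside the per-pod loop, immediately before deleting (sketch only):
const fresh = await getSandbox(name); // hypothetical single-Sandbox read
const freshLabels = fresh?.metadata?.labels ?? {};
if (freshLabels[LABEL_SESSION_ID]) continue; // claimed since the list call: leave it alone
await Promise.all([deleteSandbox(name), deleteService(name)]);
deleted.push(name);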

Comment thread src/server/k8s.ts
Comment on lines +566 to +582

P1 Partial invalidation on deletion error

The for...of loop awaits each pod deletion sequentially. If deleteSandbox or deleteService throws (any non-404 K8s error — rate-limit, transient apiserver blip, RBAC denial), the error propagates out of the loop immediately and every remaining stale pod is left untouched. The caller catches the error with console.warn and moves on, leaving the warm pool partially invalidated. Subsequent sessions can still bind to surviving stale pods. Wrapping each iteration in its own try/catch (or collecting errors and continuing) would ensure all pods are attempted.
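A minimal sketch of the suggested shape, keeping the existing checks but isolating each deletion so one failure doesn't abandon the remaining pods (the failed collection is illustrative, not part of the PR):

const deleted: string[] = [];
const failed: { name: string; err: unknown }[] = [];
for (const item of items) {
  const name = item.metadata?.name;
  if (!name) continue;
  // ...same warm-pod and hash checks as in the quoted loop above...
  try {
    await Promise.all([deleteSandbox(name), deleteService(name)]);
    deleted.push(name);
  } catch (err) {
    // Record and continue so every remaining stale pod is still attempted.
    failed.push({ name, err });
  }
}
return { deleted, failed };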

Comment on lines +85 to +93
if (body.env_vars !== undefined) {
  const newHash = envVarsHash(updated.env_vars);
  void deleteStaleWarmPods({ agent_id, keepHash: newHash }).catch((err) => {
    console.warn(
      `[env_vars patch] failed to invalidate stale warm pods for ${agent_id}:`,
      err,
    );
  });
}

P2 K8s-specific import bypasses the backend dispatcher

deleteStaleWarmPods is imported directly from @/server/k8s, bypassing whatever backend-agnostic dispatcher @/server/sandbox.ts exposes. On an ECS/Fargate deployment, customApi() tries to load a kubeconfig that doesn't exist. The error is silently swallowed by the .catch(console.warn), so warm-pod invalidation silently becomes a no-op for non-K8s deployments.
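One low-touch option, sketched here with an assumed SANDBOX_BACKEND env var (the real backend switch may live in @/server/sandbox.ts and look different): gate the call so non-K8s deployments skip invalidation explicitly instead of failing inside customApi():

// Illustrative guard only; adjust to however the repo actually selects its sandbox backend.
if (process.env.SANDBOX_BACKEND === "k8s" && body.env_vars !== undefined) {
  const newHash = envVarsHash(updated.env_vars);
  void deleteStaleWarmPods({ agent_id, keepHash: newHash }).catch((err) => {
    console.warn(`[env_vars patch] failed to invalidate stale warm pods for ${agent_id}:`, err);
  });
}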
