
Invalidate stale warm pods on env_vars change #80

Open
ishaan-berri wants to merge 2 commits into main from litellm_env-vars-hydration

Conversation

@ishaan-berri
Contributor

Summary

  • Adds a lap.env_vars_hash label to every warm pod at spawn (sha256 of the agent's env_vars in canonical form).
  • On PATCH /agents/:id with env_vars changed, deletes warm pods whose hash label doesn't match the new value.
  • The worker's warm-pool reconciler refills with fresh pods that have the current env baked in.

Why

buildContainerEnv already injects agent.env_vars into the pod spec at pod-create time. But pre-existing warm pods keep stale env (K8s container env is immutable). Sessions binding to stale warm pods don't see the new vars — observable to users as "I added GITHUB_TOKEN but the agent says it's not set". This PR closes that window deterministically.

Design choice

Chose hydrate-at-spawn + invalidate-on-change over a runtime POST /env on the harness because:

  • Pod env determined at one point in time (pod-create) — auditable via kubectl describe pod.
  • No new HTTP surface on sandbox pods.
  • Already-running subprocesses can't see a runtime-mutated process.env anyway — invalidate-and-recycle is the only way to actually guarantee correctness.
  • Agent-pinned warm pools stay agent-pinned; no cross-agent leak risk introduced.

Cost: one-time ~30s warm-pool refill after each env_vars edit. In-flight sessions need a Restart to pick up new vars (intentional — quietly swapping tokens mid-conversation would be more surprising than requiring a restart).

Canonical form

Hash is over the encrypted form of env_vars (i.e. what's stored in prisma.agent.env_vars). The spawn-time hash (in buildMeta) and the invalidation-time hash (in the PATCH route) both read the encrypted shape — no decrypt step in either place. Keeps the canonical representation single-sourced and avoids leaking decryption work into label computation.
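For reference, a minimal sketch of what this hashing could look like, assuming Node's built-in crypto module and a plain string map as the encrypted env_vars shape (the real helper lives in src/server/k8s.ts and may differ in detail):

import { createHash } from "crypto";

// Sketch: sort keys, hash the encrypted key/value pairs as a canonical JSON
// string, and truncate to 16 hex chars so the result fits in a K8s label value.
// Empty, null, or non-object inputs collapse to the literal string "empty".
function envVarsHash(envVars: unknown): string {
  if (!envVars || typeof envVars !== "object" || Array.isArray(envVars)) return "empty";
  const entries = Object.entries(envVars as Record<string, string>).sort(([a], [b]) => a.localeCompare(b));
  if (entries.length === 0) return "empty";
  return createHash("sha256").update(JSON.stringify(entries)).digest("hex").slice(0, 16);
}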

Tests

The repo currently has no test infrastructure configured in package.json (no test script, no jest/vitest dep). Skipping the suggested unit tests rather than scaffolding test infra from scratch in this PR. Manual verification steps below cover the same surface.

Test plan

  • Create an agent, spawn a session — confirm the pod has the agent's current env_vars.
  • PATCH the agent with new env_vars. Within ~5s, all idle warm pods for that agent should be deleted; new ones should spawn with the new vars.
  • Verify a pod that's already claimed by a session (litellm-session-id label set) is NOT deleted.
  • PATCH with env_vars UNCHANGED (e.g. just a name update) should not delete any pods.
  • Existing pods migration: pods spawned before this PR don't have the lap.env_vars_hash label; they'll be treated as "not matching" and deleted on the next env_vars PATCH. Acceptable one-time churn.

…Pods

Adds the building blocks for env_vars invalidation. envVarsHash() produces
a stable 16-char sha256 over the encrypted agent.env_vars map (sorted
keys); empty / null / non-object inputs collapse to 'empty'. buildMeta()
stamps the result on every Sandbox CR via the lap.env_vars_hash label so
warm pods carry the env baseline they were spawned with.
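As an illustration of the label stamping described above (the label name matches the one used elsewhere in this PR; the surrounding buildMeta shape is an assumption for this sketch):

const LABEL_ENV_VARS_HASH = "lap.env_vars_hash";

// Illustrative only: buildMeta merges the hash (via envVarsHash, sketched earlier)
// into the labels the Sandbox CR is created with, so each warm pod records the
// env baseline it was spawned from.
function buildMeta(agent: { env_vars: unknown }, baseLabels: Record<string, string>) {
  return {
    labels: {
      ...baseLabels,
      [LABEL_ENV_VARS_HASH]: envVarsHash(agent.env_vars),
    },
  };
}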

deleteStaleWarmPods({ agent_id, keepHash }) lists Sandbox CRs for the
agent, filters to warm pods (litellm-warm-task-id present, no
litellm-session-id), and deletes the ones whose hash doesn't match.
Returns the names of deleted sandboxes. Session-claimed pods are never
touched — those sessions finish with the env they were born with.

Deletes the Sandbox CR (and sibling NodePort Service), not just the bare
Pod, because deleting a Pod alone would leave the Sandbox CR behind and
the controller would re-create the Pod with the stale PodSpec.
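In rough outline, that corresponds to a helper shaped like the sketch below; the listSandboxes call and the agent-id label selector are illustrative stand-ins, and the concrete filtering loop is quoted verbatim in the review thread further down.

// Hypothetical signature and contract for the helper described above.
async function deleteStaleWarmPods(opts: {
  agent_id: string;
  keepHash: string;
}): Promise<{ deleted: string[] }> {
  // List this agent's Sandbox CRs (helper and label name assumed for illustration).
  const items = await listSandboxes({ labelSelector: `agent-id=${opts.agent_id}` });
  // Filter to warm, unclaimed pods and delete hash mismatches
  // (see the loop quoted in the Greptile comment below).
  return { deleted: [] };
}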
After a successful prisma.agent.update, when body.env_vars was supplied,
compute the new envVarsHash and fire deleteStaleWarmPods in the
background. The PATCH response doesn't wait on the cluster API — the
worker's warm-pool reconciler refills within a few seconds, and any
failure is logged with console.warn rather than surfaced to the user
(the user already got their update saved; warm-pool refill is async
infra work). PATCH bodies that don't touch env_vars skip the call
entirely.

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR closes a correctness gap where warm K8s sandbox pods spawned with an old env_vars payload would continue serving sessions after an agent's env_vars were updated via PATCH, because container env is immutable once a pod is running.

  • Adds envVarsHash (SHA-256 over sorted encrypted entries, truncated to 16 hex chars) and stamps every newly-spawned Sandbox CR with a lap.env_vars_hash label at pod-create time via buildMeta.
  • Introduces deleteStaleWarmPods, called fire-and-forget from the PATCH route after a successful DB update, which lists warm Sandbox CRs for the agent and deletes any whose hash label doesn't match the new value; session-claimed pods are explicitly skipped.

Confidence Score: 3/5

The warm-pod invalidation path has two correctness gaps that can leave stale pods alive or tear down active sessions under normal operating conditions.

The sequential deletion loop in deleteStaleWarmPods propagates the first K8s error out of the for...of immediately, abandoning every remaining stale pod. A TOCTOU race also allows a pod claimed for a new session to be deleted mid-session.

src/server/k8s.ts — specifically the deleteStaleWarmPods loop (lines 566-582).

Important Files Changed

  • src/server/k8s.ts: Adds envVarsHash, LABEL_ENV_VARS_HASH stamping in buildMeta, and deleteStaleWarmPods; two issues: the sequential-delete loop leaves remaining stale pods when any single delete throws, and a TOCTOU race where a pod claimed between list and delete gets torn down mid-session.
  • src/app/api/v1/managed_agents/agents/[agent_id]/route.ts: Wires deleteStaleWarmPods fire-and-forget into the PATCH handler on env_vars change; logic is correct for the K8s path but directly couples the route to @/server/k8s, bypassing the backend dispatcher.

Reviews (1): Last reviewed commit: "feat(agents PATCH): invalidate stale war..."

Comment thread src/server/k8s.ts
Comment on lines +566 to +582
const deleted: string[] = [];
for (const item of items) {
  const name = item.metadata?.name;
  if (!name) continue;
  const labels = item.metadata?.labels ?? {};
  // Only warm pods: tagged with a warm_task_id and NOT claimed by a session.
  const isWarm = Boolean(labels[LABEL_WARM_TASK_ID]) && !labels[LABEL_SESSION_ID];
  if (!isWarm) continue;
  // Keep pods that already carry the current hash. Pods spawned before
  // this label was introduced have an empty/undefined hash, so they fall
  // through and get recycled on the next env_vars PATCH — acceptable
  // one-time churn.
  if (labels[LABEL_ENV_VARS_HASH] === keepHash) continue;
  await Promise.all([deleteSandbox(name), deleteService(name)]);
  deleted.push(name);
}
return { deleted };

P1 TOCTOU race: session-claimed pod can be deleted mid-loop

deleteStaleWarmPods reads labels at list time (line 570) and uses that stale snapshot to decide whether a pod is "warm". Between the list call and the individual deleteSandbox call, the warm-pool manager could claim a pod for a new session (adding LABEL_SESSION_ID to the Sandbox's labels). The pod appears unclaimed in the snapshot, so the delete runs and tears down an active session's Sandbox. The per-pod delete in the loop should re-fetch the Sandbox or use a deleteOption precondition to guard against this window.
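A sketch of the re-check approach, placed inside the existing per-pod loop just before the delete; getSandbox is a hypothetical single-Sandbox read helper, and a resourceVersion precondition on the delete call would close the window more tightly but needs the client's delete-options plumbing:

// Inside the per-pod loop, immediately before deleting (sketch only):
const fresh = await getSandbox(name); // hypothetical single-Sandbox read
const freshLabels = fresh?.metadata?.labels ?? {};
if (freshLabels[LABEL_SESSION_ID]) continue; // claimed since the list call: leave it alone
await Promise.all([deleteSandbox(name), deleteService(name)]);
deleted.push(name);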

Comment thread src/server/k8s.ts
Comment on lines +566 to +582

P1 Partial invalidation on deletion error

The for...of loop awaits each pod deletion sequentially. If deleteSandbox or deleteService throws (any non-404 K8s error — rate-limit, transient apiserver blip, RBAC denial), the error propagates out of the loop immediately and every remaining stale pod is left untouched. The caller catches the error with console.warn and moves on, leaving the warm pool partially invalidated. Subsequent sessions can still bind to surviving stale pods. Wrapping each iteration in its own try/catch (or collecting errors and continuing) would ensure all pods are attempted.
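A minimal sketch of the suggested shape, keeping the existing checks but isolating each deletion so one failure doesn't abandon the remaining pods (the failed collection is illustrative, not part of the PR):

const deleted: string[] = [];
const failed: { name: string; err: unknown }[] = [];
for (const item of items) {
  const name = item.metadata?.name;
  if (!name) continue;
  // ...same warm-pod and hash checks as in the quoted loop above...
  try {
    await Promise.all([deleteSandbox(name), deleteService(name)]);
    deleted.push(name);
  } catch (err) {
    // Record and continue so every remaining stale pod is still attempted.
    failed.push({ name, err });
  }
}
return { deleted, failed };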

Comment on lines +85 to +93
if (body.env_vars !== undefined) {
  const newHash = envVarsHash(updated.env_vars);
  void deleteStaleWarmPods({ agent_id, keepHash: newHash }).catch((err) => {
    console.warn(
      `[env_vars patch] failed to invalidate stale warm pods for ${agent_id}:`,
      err,
    );
  });
}

P2 K8s-specific import bypasses the backend dispatcher

deleteStaleWarmPods is imported directly from @/server/k8s, bypassing whatever backend-agnostic dispatcher @/server/sandbox.ts exposes. On an ECS/Fargate deployment, customApi() tries to load a kubeconfig that doesn't exist. The error is silently swallowed by the .catch(console.warn), so warm-pod invalidation silently becomes a no-op for non-K8s deployments.
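One low-touch option, sketched here with an assumed SANDBOX_BACKEND env var (the real backend switch may live in @/server/sandbox.ts and look different): gate the call so non-K8s deployments skip invalidation explicitly instead of failing inside customApi():

// Illustrative guard only; adjust to however the repo actually selects its sandbox backend.
if (process.env.SANDBOX_BACKEND === "k8s" && body.env_vars !== undefined) {
  const newHash = envVarsHash(updated.env_vars);
  void deleteStaleWarmPods({ agent_id, keepHash: newHash }).catch((err) => {
    console.warn(`[env_vars patch] failed to invalidate stale warm pods for ${agent_id}:`, err);
  });
}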
