Skip to content

Commit d681a33

Browse files
AlexsJonesclaude
andcommitted
fix: prevent reconcile race from overriding Succeeded AgentRuns as Failed
Resolves a race in the AgentRun controller where a run that had just been committed as phase=Succeeded could be flipped to phase=Failed with error="Job not found" by a stale reconcile that fired immediately after the Job was deleted (Jobs are deleted post-success to kill lingering sidecars). User-visible impact pre-fix: runs that actually succeeded — tokens counted, result markers parsed, response surfaced — would sometimes show phase=Failed in the UI with error="Job not found", despite being functionally successful. Timing-dependent and confusing to users. Root cause: the existing guard re-read the AgentRun via r.Client (the watch-based cached client) to check whether the phase had already been transitioned. When the concurrent reconcile fired immediately after succeedRun committed the status, the watch cache hadn't yet observed the status update, so the guard saw phase=Running and fell through to failRun("Job not found") — clobbering the Succeeded status. Fix: - Add APIReader client.Reader field to AgentRunReconciler, wired via mgr.GetAPIReader() in cmd/controller/main.go. APIReader bypasses the cache and reads directly from the apiserver. - The "Job not found" guard now reads fresh status via APIReader before deciding to fail the run. If the live status shows the run is already terminal (Succeeded, Failed, or in PostRunning), we don't override it. - Widened the guard to also preserve Failed phase, so a more-specific error (e.g. "agent container OOMKilled") isn't clobbered with the generic "Job not found". - Nil-safe fallback to r.Client keeps every existing test-file construction of &AgentRunReconciler{} working unchanged. Adds 4 targeted unit tests in internal/controller/agentrun_race_test.go: - DoesNotOverrideSucceeded: the exact regression - DoesNotOverrideFailed: preserves specific error messages - RequeuesForPostRunning: non-terminal PostRunning still requeues - FailsWhenReallyRunning: genuine stuck-Running still fails Adds a new Cypress spec (adhoc-lmstudio-deterministic-answer.cy.ts) that exposed this race: creates an ad-hoc LM Studio + qwen3.5-9b instance with k8s-ops, dispatches "How many namespaces are there?" via the UI, and strictly asserts phase=Succeeded + substantive response text (mentions "namespace", contains a digit, >60 chars, not a bare preamble). This was the first UI test doing a strict fresh-read phase assertion after a terminal transition and is what caught the bug. Supporting changes: - createLMStudioInstance Cypress helper now accepts an optional skills list (needed so the agent can actually execute kubectl). - schedule-create-via-ui.cy.ts: should("be.visible") → should("exist") for table-cell assertions that CSS-clip when many test rows are present. Verified regression-free: - go vet ./... clean - go test ./... all green (8 packages, including 4 new tests) - test-api-smoke.sh all green (9 endpoints) - test-lmstudio-response-regression.sh all green (3 scenarios) - Cypress: 19/19 specs, 24/24 tests green against sympozium serve Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ef44cc9 commit d681a33

File tree

6 files changed

+350
-18
lines changed

6 files changed

+350
-18
lines changed

cmd/controller/main.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -119,6 +119,7 @@ func main() {
119119

120120
agentRunReconciler := &controller.AgentRunReconciler{
121121
Client: mgr.GetClient(),
122+
APIReader: mgr.GetAPIReader(),
122123
Scheme: mgr.GetScheme(),
123124
Log: ctrl.Log.WithName("controllers").WithName("AgentRun"),
124125
PodBuilder: podBuilder,

internal/controller/agentrun_controller.go

Lines changed: 23 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,10 @@ const DefaultRunHistoryLimit = 50
7070
// It watches AgentRun CRDs and reconciles them into Kubernetes Jobs/Pods.
7171
type AgentRunReconciler struct {
7272
client.Client
73+
// APIReader bypasses the controller cache for reads — needed when we
74+
// must see status mutations committed by a concurrent reconcile that
75+
// the watch-based cache may not yet have observed.
76+
APIReader client.Reader
7377
Scheme *runtime.Scheme
7478
Log logr.Logger
7579
PodBuilder *orchestrator.PodBuilder
@@ -492,15 +496,26 @@ func (r *AgentRunReconciler) reconcileRunning(ctx context.Context, log logr.Logg
492496
}
493497
if err := r.Get(ctx, jobName, job); err != nil {
494498
if errors.IsNotFound(err) {
495-
// Guard against the race where the Job was already deleted (to kill
496-
// sidecars) and a concurrent reconcile of startPostRun has already
497-
// transitioned the phase to PostRunning or Succeeded. Do a fresh Get
498-
// from the API server (not the cache) to check the actual phase before
499-
// deciding to fail the run.
499+
// Guard against the race where the Job was already deleted (to
500+
// kill sidecars) and a concurrent reconcile has already
501+
// transitioned the phase to a terminal state. Read with the
502+
// non-cached APIReader (the watch cache may not have the
503+
// status update yet) — if the run is already terminal, don't
504+
// override it with "Job not found".
500505
fresh := &sympoziumv1alpha1.AgentRun{}
501-
if getErr := r.Get(ctx, client.ObjectKeyFromObject(agentRun), fresh); getErr == nil {
502-
if fresh.Status.Phase == sympoziumv1alpha1.AgentRunPhasePostRunning ||
503-
fresh.Status.Phase == sympoziumv1alpha1.AgentRunPhaseSucceeded {
506+
reader := client.Reader(r.APIReader)
507+
if reader == nil {
508+
reader = r.Client
509+
}
510+
if getErr := reader.Get(ctx, client.ObjectKeyFromObject(agentRun), fresh); getErr == nil {
511+
switch fresh.Status.Phase {
512+
case sympoziumv1alpha1.AgentRunPhaseSucceeded,
513+
sympoziumv1alpha1.AgentRunPhaseFailed:
514+
// Already terminal — don't override.
515+
return ctrl.Result{}, nil
516+
case sympoziumv1alpha1.AgentRunPhasePostRunning:
517+
// PostRun container is still executing — let the
518+
// PostRunning reconcile path handle it.
504519
return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
505520
}
506521
}
Lines changed: 189 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,189 @@
1+
package controller
2+
3+
import (
4+
"context"
5+
"testing"
6+
7+
"github.com/go-logr/logr"
8+
batchv1 "k8s.io/api/batch/v1"
9+
corev1 "k8s.io/api/core/v1"
10+
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
11+
"k8s.io/apimachinery/pkg/runtime"
12+
"sigs.k8s.io/controller-runtime/pkg/client"
13+
"sigs.k8s.io/controller-runtime/pkg/client/fake"
14+
15+
sympoziumv1alpha1 "github.com/sympozium-ai/sympozium/api/v1alpha1"
16+
)
17+
18+
// newAgentRunTestReconciler builds an AgentRunReconciler backed by a fake
19+
// client. Both Client and APIReader point at the same fake so tests can mutate
20+
// objects via either field.
21+
func newAgentRunTestReconciler(t *testing.T, objs ...client.Object) *AgentRunReconciler {
22+
t.Helper()
23+
24+
scheme := runtime.NewScheme()
25+
_ = corev1.AddToScheme(scheme)
26+
_ = batchv1.AddToScheme(scheme)
27+
if err := sympoziumv1alpha1.AddToScheme(scheme); err != nil {
28+
t.Fatalf("add sympozium scheme: %v", err)
29+
}
30+
31+
cl := fake.NewClientBuilder().
32+
WithScheme(scheme).
33+
WithObjects(objs...).
34+
WithStatusSubresource(&sympoziumv1alpha1.AgentRun{}).
35+
Build()
36+
37+
return &AgentRunReconciler{
38+
Client: cl,
39+
APIReader: cl,
40+
Scheme: scheme,
41+
Log: logr.Discard(),
42+
}
43+
}
44+
45+
// TestReconcileRunning_JobNotFoundGuard_DoesNotOverrideSucceeded is the
46+
// regression guard for the race in which `succeedRun` had already committed
47+
// status.phase=Succeeded (and deleted the Job to kill sidecars) when a stale
48+
// reconcile of the same AgentRun fires, can't find the Job, and would
49+
// previously flip the phase to Failed with "Job not found". With APIReader
50+
// in place, the guard sees the real phase and refuses to override.
51+
func TestReconcileRunning_JobNotFoundGuard_DoesNotOverrideSucceeded(t *testing.T) {
52+
run := &sympoziumv1alpha1.AgentRun{
53+
ObjectMeta: metav1.ObjectMeta{Name: "regression-run", Namespace: "default"},
54+
Spec: sympoziumv1alpha1.AgentRunSpec{
55+
InstanceRef: "regression-inst",
56+
},
57+
Status: sympoziumv1alpha1.AgentRunStatus{
58+
Phase: sympoziumv1alpha1.AgentRunPhaseSucceeded,
59+
JobName: "regression-run-job",
60+
},
61+
}
62+
// Note: no Job object is seeded — simulating the post-cleanup race.
63+
r := newAgentRunTestReconciler(t, run)
64+
65+
// The guard inside reconcileRunning runs via r.Get(Job) failing with
66+
// NotFound. We call the public Reconcile entrypoint using the run's
67+
// key via the phase dispatcher — but to keep the test focused we call
68+
// reconcileRunning directly with a pointer-copy of the object.
69+
running := run.DeepCopy()
70+
running.Status.Phase = sympoziumv1alpha1.AgentRunPhaseRunning // pretend this reconcile still sees Running
71+
res, err := r.reconcileRunning(context.Background(), logr.Discard(), running)
72+
if err != nil {
73+
t.Fatalf("reconcileRunning returned error: %v", err)
74+
}
75+
if res.RequeueAfter != 0 {
76+
t.Errorf("expected no requeue for terminal-phase guard; got RequeueAfter=%v", res.RequeueAfter)
77+
}
78+
79+
// Most important assertion: the stored AgentRun status MUST still be
80+
// Succeeded — not overridden to Failed.
81+
var stored sympoziumv1alpha1.AgentRun
82+
if err := r.Client.Get(context.Background(), client.ObjectKeyFromObject(run), &stored); err != nil {
83+
t.Fatalf("get stored: %v", err)
84+
}
85+
if stored.Status.Phase != sympoziumv1alpha1.AgentRunPhaseSucceeded {
86+
t.Errorf(
87+
"REGRESSION: stored phase overridden to %q (expected Succeeded)",
88+
stored.Status.Phase,
89+
)
90+
}
91+
if stored.Status.Error != "" {
92+
t.Errorf(
93+
"REGRESSION: stored error populated to %q (should stay empty on Succeeded run)",
94+
stored.Status.Error,
95+
)
96+
}
97+
}
98+
99+
// TestReconcileRunning_JobNotFoundGuard_DoesNotOverrideFailed: similarly,
100+
// a run that has already been marked Failed by another code path (with a
101+
// more specific error message) must not be clobbered with "Job not found".
102+
func TestReconcileRunning_JobNotFoundGuard_DoesNotOverrideFailed(t *testing.T) {
103+
run := &sympoziumv1alpha1.AgentRun{
104+
ObjectMeta: metav1.ObjectMeta{Name: "failed-run", Namespace: "default"},
105+
Spec: sympoziumv1alpha1.AgentRunSpec{InstanceRef: "x"},
106+
Status: sympoziumv1alpha1.AgentRunStatus{
107+
Phase: sympoziumv1alpha1.AgentRunPhaseFailed,
108+
Error: "agent container exited with code 137 (OOMKilled)",
109+
JobName: "failed-run-job",
110+
},
111+
}
112+
r := newAgentRunTestReconciler(t, run)
113+
114+
stale := run.DeepCopy()
115+
stale.Status.Phase = sympoziumv1alpha1.AgentRunPhaseRunning
116+
_, err := r.reconcileRunning(context.Background(), logr.Discard(), stale)
117+
if err != nil {
118+
t.Fatalf("reconcileRunning returned error: %v", err)
119+
}
120+
121+
var stored sympoziumv1alpha1.AgentRun
122+
if err := r.Client.Get(context.Background(), client.ObjectKeyFromObject(run), &stored); err != nil {
123+
t.Fatalf("get stored: %v", err)
124+
}
125+
if stored.Status.Error != "agent container exited with code 137 (OOMKilled)" {
126+
t.Errorf(
127+
"REGRESSION: stored error overridden to %q (expected OOMKilled message to be preserved)",
128+
stored.Status.Error,
129+
)
130+
}
131+
}
132+
133+
// TestReconcileRunning_JobNotFoundGuard_RequeuesForPostRunning: if the run
134+
// moved to PostRunning (the lifecycle hook is still executing), we should
135+
// requeue rather than silently drop — otherwise the postRun progress is
136+
// only driven by watches, which are best-effort.
137+
func TestReconcileRunning_JobNotFoundGuard_RequeuesForPostRunning(t *testing.T) {
138+
run := &sympoziumv1alpha1.AgentRun{
139+
ObjectMeta: metav1.ObjectMeta{Name: "postrun-run", Namespace: "default"},
140+
Spec: sympoziumv1alpha1.AgentRunSpec{InstanceRef: "x"},
141+
Status: sympoziumv1alpha1.AgentRunStatus{
142+
Phase: sympoziumv1alpha1.AgentRunPhasePostRunning,
143+
JobName: "postrun-run-job",
144+
},
145+
}
146+
r := newAgentRunTestReconciler(t, run)
147+
148+
stale := run.DeepCopy()
149+
stale.Status.Phase = sympoziumv1alpha1.AgentRunPhaseRunning
150+
res, err := r.reconcileRunning(context.Background(), logr.Discard(), stale)
151+
if err != nil {
152+
t.Fatalf("reconcileRunning returned error: %v", err)
153+
}
154+
if res.RequeueAfter == 0 {
155+
t.Errorf("expected RequeueAfter>0 for PostRunning guard; got %v", res.RequeueAfter)
156+
}
157+
}
158+
159+
// TestReconcileRunning_JobNotFoundGuard_FailsWhenReallyRunning: when the
160+
// phase really is still Running (not an overtaken race), the guard must
161+
// still call failRun with the "Job not found" message — otherwise stuck
162+
// runs would loop forever.
163+
func TestReconcileRunning_JobNotFoundGuard_FailsWhenReallyRunning(t *testing.T) {
164+
run := &sympoziumv1alpha1.AgentRun{
165+
ObjectMeta: metav1.ObjectMeta{Name: "stuck-run", Namespace: "default"},
166+
Spec: sympoziumv1alpha1.AgentRunSpec{InstanceRef: "x"},
167+
Status: sympoziumv1alpha1.AgentRunStatus{
168+
Phase: sympoziumv1alpha1.AgentRunPhaseRunning,
169+
JobName: "stuck-run-job",
170+
},
171+
}
172+
r := newAgentRunTestReconciler(t, run)
173+
174+
_, err := r.reconcileRunning(context.Background(), logr.Discard(), run.DeepCopy())
175+
if err != nil {
176+
t.Fatalf("reconcileRunning returned error: %v", err)
177+
}
178+
179+
var stored sympoziumv1alpha1.AgentRun
180+
if err := r.Client.Get(context.Background(), client.ObjectKeyFromObject(run), &stored); err != nil {
181+
t.Fatalf("get stored: %v", err)
182+
}
183+
if stored.Status.Phase != sympoziumv1alpha1.AgentRunPhaseFailed {
184+
t.Errorf("expected phase Failed, got %q", stored.Status.Phase)
185+
}
186+
if stored.Status.Error != "Job not found" {
187+
t.Errorf("expected error 'Job not found', got %q", stored.Status.Error)
188+
}
189+
}
Lines changed: 120 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,120 @@
1+
// Full regression: create an ad-hoc LM Studio instance, dispatch a
2+
// deterministic question via the "New Run" dialog, and verify the
3+
// run-detail page renders a substantive answer from qwen3.5-9b — not
4+
// just a preamble, not empty, and clearly mentioning the thing it was
5+
// asked about (namespaces).
6+
//
7+
// This is the primary UX guard against the class of regressions we
8+
// chased: reasoning models emitting content into non-standard fields,
9+
// terminal turns being empty, response never surfacing in the UI.
10+
11+
const INSTANCE = `cy-adhoc-nsq-${Date.now()}`;
12+
let RUN_NAME = "";
13+
14+
describe("Ad-hoc LM Studio — deterministic answer end to end", () => {
15+
before(() => {
16+
// Create the instance with the correct pod-reachable LM Studio URL.
17+
// (The wizard defaults to http://localhost:1234 which doesn't work
18+
// from inside kind pods; the wizard's node-mode flow is covered in
19+
// a separate spec.)
20+
cy.createLMStudioInstance(INSTANCE, { skills: ["k8s-ops", "memory"] });
21+
});
22+
23+
after(() => {
24+
if (RUN_NAME) cy.deleteRun(RUN_NAME);
25+
cy.deleteInstance(INSTANCE);
26+
});
27+
28+
it("asks 'how many namespaces' via the UI and renders the answer", () => {
29+
// ── Step 1: dispatch the question via the "New Run" dialog on /runs ───
30+
cy.visit("/runs");
31+
cy.contains("button", "New Run", { timeout: 20000 }).click();
32+
33+
// Select our instance in the dropdown.
34+
cy.get("[role='dialog']").find("button[role='combobox']").click({ force: true });
35+
cy.get("[data-radix-popper-content-wrapper]").contains(INSTANCE).click({ force: true });
36+
37+
// Fill in the task. k8s-ops + execute_command is one of the default
38+
// skills wired into the instance via createLMStudioInstance, so the
39+
// model can actually answer from real cluster state.
40+
cy.get("[role='dialog']")
41+
.find("textarea")
42+
.clear()
43+
.type(
44+
"How many namespaces are there in this Kubernetes cluster? " +
45+
"Use kubectl via execute_command to find out, then answer with " +
46+
"the count and list them.",
47+
);
48+
49+
// Give the run a generous timeout (local inference is slow).
50+
cy.get("[role='dialog']").find("input[placeholder='5m']").clear().type("6m");
51+
52+
// Submit.
53+
cy.get("[role='dialog']").contains("button", "Create Run").click({ force: true });
54+
cy.get("[role='dialog']").should("not.exist", { timeout: 20000 });
55+
56+
// ── Step 2: find the run we just created via its UI row ───────────────
57+
cy.contains("td", INSTANCE, { timeout: 20000 })
58+
.parents("tr")
59+
.within(() => {
60+
cy.get("a[href^='/runs/']")
61+
.first()
62+
.invoke("attr", "href")
63+
.then((href) => {
64+
const m = /\/runs\/([^/?#]+)/.exec(href || "");
65+
expect(m, `expected /runs/<name> in href: ${href}`).to.not.be.null;
66+
RUN_NAME = m![1];
67+
});
68+
});
69+
70+
// ── Step 3: wait for terminal phase + assert Succeeded with context ───
71+
cy.then(() => cy.waitForRunTerminal(RUN_NAME, 6 * 60 * 1000));
72+
cy.then(() =>
73+
cy.request({
74+
url: `/api/v1/runs/${RUN_NAME}?namespace=default`,
75+
headers: {
76+
Authorization: `Bearer ${Cypress.env("API_TOKEN") || ""}`,
77+
},
78+
}).then((resp) => {
79+
const phase = resp.body?.status?.phase as string;
80+
const err = resp.body?.status?.error as string | undefined;
81+
expect(
82+
phase,
83+
`run ${RUN_NAME} should have Succeeded (error: ${err || "n/a"})`,
84+
).to.eq("Succeeded");
85+
}),
86+
);
87+
88+
// ── Step 4: open the run detail and assert the answer is substantive ──
89+
cy.then(() => cy.visit(`/runs/${RUN_NAME}`));
90+
cy.contains("Succeeded", { timeout: 20000 }).should("be.visible");
91+
cy.contains("button", "Result", { timeout: 20000 }).click({ force: true });
92+
93+
// Structural assertions — qwen3.5 paraphrases freely, so we don't
94+
// match an exact string:
95+
// - response is substantive (not just a preamble)
96+
// - it mentions "namespace" (the thing we asked about)
97+
// - it contains at least one digit (the count)
98+
// - "No result available" MUST NOT be shown
99+
cy.contains("No result available").should("not.exist");
100+
cy.get("[role='tabpanel']", { timeout: 20000 })
101+
.invoke("text")
102+
.then((raw) => {
103+
const text = raw.replace(/\s+/g, " ").trim();
104+
expect(
105+
text.length,
106+
`response should be substantive (>60 chars), got ${text.length}`,
107+
).to.be.greaterThan(60);
108+
expect(text, "response should mention namespaces").to.match(/namespace/i);
109+
expect(text, "response should contain a numeric count").to.match(/\d/);
110+
const isBarePreamble =
111+
/^(i'll|i will|let me|let's start|i'm going to)/i.test(text) && text.length < 120;
112+
expect(
113+
isBarePreamble,
114+
`response looks like a preamble only: ${text.slice(0, 140)}`,
115+
).to.be.false;
116+
});
117+
});
118+
});
119+
120+
export {};

web/cypress/e2e/schedule-create-via-ui.cy.ts

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -58,8 +58,8 @@ describe("Schedule — create via UI", () => {
5858
});
5959

6060
cy.visit("/schedules");
61-
cy.contains(SCHEDULE, { timeout: 20000 }).should("be.visible");
62-
cy.contains(/\*\/5/).should("be.visible");
61+
cy.contains(SCHEDULE, { timeout: 20000 }).should("exist");
62+
cy.contains(/\*\/5/).should("exist");
6363
});
6464
});
6565

0 commit comments

Comments
 (0)