🐛 Fix stuck namespace finalizers + improve E2E reliability and observability#285
clubanderson wants to merge 5 commits into main from
Conversation
The dual-pods controller adds a `dual-pods.llm-d.ai/provider` finalizer to launcher pods, but this finalizer is not removed when the controller is scaled down or uninstalled. This leaves completed pods blocking namespace deletion indefinitely — the "Create namespace" step spins forever on "Namespace still terminating...".

Three fixes:

1. Strip dual-pods finalizers from all pods before deleting the namespace (both pre-test cleanup and post-test cleanup)
2. Add a 60s timeout to the namespace termination wait loop with automatic finalizer removal as a fallback, plus a hard 120s error exit
3. Strip finalizers in the "Scale down on failure" step so the namespace can be cleaned up on the next run

This was observed on pok-prod-sa, where 3 FMA namespaces were stuck Terminating for up to 7 days due to this finalizer.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Pull request overview
Updates the OpenShift E2E GitHub Actions workflow to avoid hangs when re-creating PR test namespaces that are stuck in Terminating due to orphaned pod finalizers left behind by the dual-pods controller.
Changes:
- Adds pod-finalizer stripping prior to namespace deletion in both pre-test and post-test cleanup paths.
- Adds a bounded wait loop (60s remediate, 120s fail) when waiting for a previously-used namespace to fully delete.
- Adds finalizer stripping in the “scale down on failure” path to reduce the chance of the next run being blocked by orphaned finalizers.
```
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
```
The patch sets metadata.finalizers to null, which removes all pod finalizers, not just dual-pods.llm-d.ai/provider. That can unintentionally bypass other controllers' cleanup logic. Prefer removing only the specific finalizer (e.g., JSON patch the finalizers list after filtering out dual-pods.llm-d.ai/provider, or patch only pods that actually have that finalizer).
Suggested change:

```diff
-kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
+kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o json 2>/dev/null \
+  | jq 'if .metadata.finalizers then .metadata.finalizers |= map(select(. != "dual-pods.llm-d.ai/provider")) else . end' 2>/dev/null \
+  | kubectl replace -f - 2>/dev/null || true
```
I think that Copilot has raised a valid concern. Finalizers are a general-purpose feature of Kubernetes, and we do not know what other things might put finalizers on these Pods for cleanup purposes that need to be respected.
A little more broadly than just the one finalizer, I think that it would be valid to remove all the finalizers that start with "dual-pods.llm-d.ai".
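A prefix-based filter along the lines Mike describes could look like the sketch below. This is a hedged illustration, not the workflow's actual code: the sample JSON is made up, `jq` is assumed to be available, and in the real step the input would come from `kubectl get pod -o json` and be piped back through `kubectl replace -f -`.

```shell
# Drop every finalizer whose name starts with "dual-pods.llm-d.ai",
# keeping any other controller's finalizers intact. The guard avoids
# erroring when .metadata.finalizers is absent.
echo '{"metadata":{"finalizers":["dual-pods.llm-d.ai/provider","example.io/keep"]}}' \
  | jq -c 'if .metadata.finalizers
           then .metadata.finalizers |= map(select(startswith("dual-pods.llm-d.ai") | not))
           else . end'
# → {"metadata":{"finalizers":["example.io/keep"]}}
```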
```
if [ "$ELAPSED" -ge 60 ]; then
  echo "Namespace stuck terminating after 60s — force-removing finalizers..."
  for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
    kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
      -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
  done
fi
```
After 60s this loop will patch all pods every 5s until the namespace is deleted or 120s is reached. That can generate a lot of API traffic for namespaces with many pods. Consider ensuring the finalizer-stripping runs only once (e.g., a boolean flag) and/or only targeting pods that contain the dual-pods.llm-d.ai/provider finalizer.
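The "run it only once" idea can be sketched with a simple boolean guard. All names here are made up for illustration; the `echo` stands in for the actual `kubectl patch` loop:

```shell
# Guard so the finalizer strip runs a single time, even though the
# surrounding wait loop keeps iterating every 5 seconds.
STRIPPED=false
strip_finalizers_once() {
  if [ "$STRIPPED" = false ]; then
    STRIPPED=true
    echo "stripping dual-pods finalizers"   # stand-in for the kubectl patch loop
  fi
}
strip_finalizers_once
strip_finalizers_once   # second call is a no-op
```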
> That can generate a lot of API traffic for namespaces with many pods.
Bogus concern. This namespace is used just for this test, we know that it will have few Pods (unless something has gone horribly wrong in an FMA controller, and in this case they still all need to have finalizer(s) removed).
```
fi
if [ "$ELAPSED" -ge 120 ]; then
  echo "::error::Namespace $FMA_NAMESPACE still terminating after 120s"
  kubectl get namespace "$FMA_NAMESPACE" -o json | jq '.status'
```
This uses jq for debugging output, but jq is not installed in the earlier “Install tools” step. Because Actions’ default bash runs with -e -o pipefail, a missing jq will abort the step before the intended exit 1, and you’ll lose the namespace status output. Either ensure jq is installed on the runner or avoid it (e.g., use kubectl get namespace ... -o yaml/jsonpath).
Suggested change:

```diff
-kubectl get namespace "$FMA_NAMESPACE" -o json | jq '.status'
+kubectl get namespace "$FMA_NAMESPACE" -o yaml || true
```
I do not know how to check what is installed on the runner used here.
```
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
```
Same as earlier: patching {"metadata":{"finalizers":null}} removes all pod finalizers, not just the dual-pods one. Narrow this to removing only dual-pods.llm-d.ai/provider (or only patch pods that contain that finalizer) to avoid bypassing unrelated finalizers.
Suggested change:

```diff
-kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
+# Only patch pods that actually have the dual-pods.llm-d.ai/provider finalizer
+if kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers}' 2>/dev/null | grep -q 'dual-pods.llm-d.ai/provider'; then
+  kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
+    -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
+fi
```
```
# Strip dual-pods finalizers so the namespace can be cleaned up on the next run.
# Without this, scaling down the controller leaves orphaned finalizers that block
# namespace deletion indefinitely.
echo "Stripping dual-pods finalizers from pods (prevents stuck namespace on next cleanup)..."
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
  kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
    -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
done
```
This finalizer stripping uses metadata.finalizers: null, which removes all finalizers from all pods. If the goal is only to prevent orphaned dual-pods.llm-d.ai/provider finalizers, it’s safer to remove only that entry from the finalizers list (and/or only for pods that have it).
```
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
```
I do not see how repeating the same remediation done before the wait is going to help. I think that this can be greatly simplified. Instead of a custom loop, just do a kubectl wait --for=delete with the desired total timeout, and handle failure of that one command.
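The `kubectl wait --for=delete` approach could be sketched as below. The function name and the 120s figure are illustrative, not from the workflow; the delete is issued without waiting, then kubectl itself does the bounded wait:

```shell
# Let kubectl do the bounded wait instead of a hand-rolled polling loop.
wait_for_namespace_gone() {
  ns="$1"
  kubectl delete namespace "$ns" --ignore-not-found --wait=false
  if kubectl wait --for=delete "namespace/$ns" --timeout=120s 2>/dev/null; then
    return 0
  fi
  # A single remediation pass (strip dual-pods.llm-d.ai/* finalizers) would go here.
  echo "::error::Namespace $ns still terminating after 120s"
  return 1
}
```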
- Only remove dual-pods.llm-d.ai/* finalizers instead of nulling all finalizers, to avoid bypassing other controllers' cleanup logic (Mike + Copilot feedback)
- Replace custom polling loop with kubectl wait --for=delete + single remediation pass (Mike's suggestion)
- Replace jq with kubectl -o yaml for namespace status output since jq may not be installed on the runner (Copilot feedback)

Signed-off-by: Andy Anderson <andy@clubanderson.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
```
echo "Test failed - scaling down controller to free resources while preserving for debugging..."
kubectl scale deployment "$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" --replicas=0 || true

# Strip dual-pods.llm-d.ai/* finalizers so the namespace can be cleaned up on the next run.
```
I do not see the need to do this here, in the "Scale down controller on failure" job --- which does NOT delete the Namespace and DOES do state-dumping of the Pods.
```diff
 while kubectl get namespace "$FMA_NAMESPACE" &>/dev/null; do
-  echo "Namespace still terminating..."
-  sleep 2
+  echo "Namespace $FMA_NAMESPACE still exists — stripping dual-pods.llm-d.ai finalizers..."
```
@copilot: I do not see why this finalizer stripping should be done again. It was already done in the earlier job step named "Clean up resources for this PR".
```
echo "  Stripping dual-pods.llm-d.ai finalizers from pods..."
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
  FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
  if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
```
@copilot: The comment above had the right filter. The regexp here should be ^dual-pods\.llm-d\.ai/.
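The difference the anchor makes can be seen with two illustrative strings (the finalizer names below are hypothetical, not from the workflow): an unanchored pattern also matches finalizers that merely *contain* the string, while `^dual-pods\.llm-d\.ai/` only matches names under the dual-pods prefix.

```shell
# Unanchored: matches a finalizer that only embeds the string (false positive).
echo 'vendor.io/embeds-dual-pods.llm-d.ai' | grep -q 'dual-pods.llm-d.ai' && echo "unanchored: match"
# Anchored: no false positive on the embedded string...
echo 'vendor.io/embeds-dual-pods.llm-d.ai' | grep -q '^dual-pods\.llm-d\.ai/' || echo "anchored: no match"
# ...but still matches the real dual-pods finalizer.
echo 'dual-pods.llm-d.ai/provider' | grep -q '^dual-pods\.llm-d\.ai/' && echo "anchored: match"
```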
```
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
  FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
  if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
    KEEP=$(echo "$FINALIZERS" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai' | paste -sd, -)
```
@copilot: The `paste` here and the `tr ',' '\n'` later cancel each other out; it would be simpler to omit them.
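One way to realize the simplification: keep the surviving finalizers as a newline-separated list and build the patch JSON directly, with no `paste`/`tr` round-trip. This is a hedged sketch, not the workflow's code: the `$FINALIZERS` value below is a made-up example, and `jq` is assumed to be available (Copilot noted it may not be on the runner).

```shell
# Example input: space-separated finalizer names, as jsonpath prints them.
FINALIZERS='dual-pods.llm-d.ai/provider example.io/keep'
# Split on spaces, drop the dual-pods prefix, keep the rest newline-separated.
KEEP=$(echo "$FINALIZERS" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai/' || true)
# Build the merge-patch JSON straight from the newline-separated list.
PATCH=$(printf '%s' "$KEEP" | jq -R . | jq -cs '{metadata:{finalizers:.}}')
echo "$PATCH"
# → {"metadata":{"finalizers":["example.io/keep"]}}
```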
MikeSpreitzer left a comment
@copilot: I have finished a review.
- Simplify grep/paste/tr pipeline per Mike's feedback (paste+tr cancel out)
- Use proper regex ^dual-pods\.llm-d\.ai/ for finalizer matching
- Remove duplicate finalizer stripping from "Create namespace" step (already done in cleanup step above, just kubectl wait here)
- Remove finalizer stripping from "Scale down on failure" step (preserves state for debugging, cleanup handles it on next run)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
- Add workflow summary step showing gate decision
- Hardcode Go 1.24.x instead of extracting from go.mod
- Replace kubectl get crd with kubectl wait --for=condition=Established
- Stop deleting CRDs in cleanup (cluster-scoped, shared across runs)
- Add always-run "Dump cluster state" diagnostics step
- Separate scale-down from diagnostics dump

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
```diff
 uses: actions/setup-go@7a3fe6cf4cb3a834922a1244abfce67bcef6a0c5 # v6.2.0
 with:
-  go-version: "${{ env.GO_VERSION }}"
+  go-version: "1.24.x"
```
@copilot: I would, in general, build with the latest Go available (while not insisting that this has to be tracked closely) unless that surfaces a problem. At the moment, according to https://go.dev/, that is 1.26.0.
Some people recommend that ordinary users should let other people take the inevitable arrows from patch level zero. That leaves us with 1.25.7 at the moment.
```
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
  FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
  if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
    KEEP=$(echo "$FINALIZERS" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai' | paste -sd, -)
```
@copilot: the `paste -sd, -` here and the `tr ',' '\n'` below cancel each other out. It would be simpler to omit them.
```diff
-- name: Scale down controller on failure
-  if: failure()
+- name: Dump cluster state
```
@copilot: the state dumping must come before the state is deleted!
Also, it is best if each kind of state is dumped in a distinct job step, that makes the voluminous output easier to consume.
```
kubectl logs deployment/"$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" --previous 2>/dev/null || true
kubectl logs deployment/"$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" 2>/dev/null || true

- name: Scale down controller on failure
```
@copilot: this is moot, the whole namespace gets deleted.
MikeSpreitzer left a comment
@copilot: I have finished a review.
- Bump Go from 1.24.x to 1.25.x (current stable)
- Remove paste/tr roundtrip in post-test finalizer cleanup
- Move diagnostics dump steps before cleanup so state is captured
- Split dump into separate steps for easier log consumption
- Remove moot "scale down on failure" step (namespace gets deleted)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Closing this one because the work has been done in other PRs.
Summary
Consolidates #282 and fixes the root cause of stuck-Terminating namespaces in the OpenShift E2E workflow.
Finalizer fix (root cause)
- Strip `dual-pods.llm-d.ai/*` finalizers from pods before namespace deletion in both pre-test and post-test cleanup
- Replace the `while` polling loop with `kubectl wait --for=delete --timeout=120s` and a hard error exit

Reliability improvements (from #282)
- Hardcode the Go version instead of `sed` extraction from go.mod
- Replace `kubectl get crd` verification with `kubectl wait --for=condition=Established` (proper CRD readiness)

Observability improvements (from #282)
- Add a workflow summary step showing the gate decision
- Add an always-run "Dump cluster state" diagnostics step
Problem
The dual-pods controller adds a `dual-pods.llm-d.ai/provider` finalizer to launcher pods. When the controller is uninstalled during cleanup, this finalizer is never removed. Completed launcher pods block namespace deletion indefinitely.

On pok-prod-sa, 3 FMA namespaces were stuck Terminating:
- `fma-e2e-pr-282` — 21 hours
- `fma-e2e-pr-262` — 7 days
- `fma-e2e-pr-22367943591` — 19 hours

All had the same root cause: a single completed `launcher` pod with a `dual-pods.llm-d.ai/provider` finalizer.

Test plan
Closes #282.