
🐛 Fix stuck namespace finalizers + improve E2E reliability and observability#285

Closed
clubanderson wants to merge 5 commits into main from fix/force-delete-stuck-namespaces

Conversation

@clubanderson
Contributor

@clubanderson clubanderson commented Feb 25, 2026

Summary

Consolidates #282 and fixes the root cause of stuck-Terminating namespaces in the OpenShift E2E workflow.

Finalizer fix (root cause)

  • Strip dual-pods.llm-d.ai/* finalizers from pods before namespace deletion in both pre-test and post-test cleanup
  • Replace infinite while polling loop with kubectl wait --for=delete --timeout=120s and hard error exit
  • Only remove dual-pods finalizers — other controllers' finalizers are preserved
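A minimal sketch of the targeted strip described above, assuming jq is available on the runner (the `filter_dual_pods` helper name is illustrative; `$pod` and `$FMA_NAMESPACE` come from the workflow):

```shell
#!/bin/sh
# Targeted finalizer strip: remove only dual-pods.llm-d.ai/* finalizers,
# leaving finalizers owned by other controllers untouched. The filter is
# pure jq (-c for compact output), so it can be exercised without a cluster.
filter_dual_pods() {
  jq -c 'if .metadata.finalizers
         then .metadata.finalizers |= map(select(startswith("dual-pods.llm-d.ai") | not))
         else . end'
}

# Cluster usage (sketch only):
#   kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o json \
#     | filter_dual_pods \
#     | kubectl replace -f - 2>/dev/null || true
```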

Reliability improvements (from #282)

  • Hardcode Go 1.24.x instead of fragile sed extraction from go.mod
  • Replace kubectl get crd verification with kubectl wait --for=condition=Established (proper CRD readiness)
  • Stop deleting CRDs in cleanup — they are cluster-scoped and shared across concurrent test runs
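The CRD readiness check has roughly the following shape (a sketch; the CRD name below is illustrative, since the workflow's actual CRD names are not shown in this thread):

```shell
# Wait until the CRD is actually usable, not merely present. The
# "Established" condition means the API server is serving the new resource
# type; a plain `kubectl get crd` succeeds as soon as the object exists,
# which can race with the first `kubectl apply` of a custom resource.
kubectl wait --for=condition=Established \
  crd/dualpods.llm-d.ai --timeout=60s
```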

Observability improvements (from #282)

  • Add workflow summary step showing gate decision (run vs skip)
  • Add always-run "Dump cluster state" diagnostics step (namespace resources, pod details, events, controller logs)
  • Separate diagnostics dump from scale-down so logs are captured on both success and failure

Problem

The dual-pods controller adds a dual-pods.llm-d.ai/provider finalizer to launcher pods. When the controller is uninstalled during cleanup, this finalizer is never removed. Completed launcher pods block namespace deletion indefinitely.

On pok-prod-sa, 3 FMA namespaces were stuck Terminating:

  • fma-e2e-pr-282 — 21 hours
  • fma-e2e-pr-26 — 7 days
  • fma-e2e-pr-22367943591 — 19 hours

All had the same root cause: a single completed launcher pod with dual-pods.llm-d.ai/provider finalizer.
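A quick check for this root cause, operating on the space-separated list that `kubectl get pod ... -o jsonpath='{.metadata.finalizers[*]}'` prints (the helper name is illustrative and needs no cluster to run):

```shell
#!/bin/sh
# Return success iff the given finalizer list contains a
# dual-pods.llm-d.ai/* finalizer. The anchored, escaped pattern matches
# only that finalizer family, not substrings elsewhere in a name.
has_dual_pods_finalizer() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep -q '^dual-pods\.llm-d\.ai/'
}
```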

Test plan

  • Verify workflow YAML is valid
  • E2E run passed on OpenShift (pok-prod-sa)
  • Confirm namespace cleanup no longer hangs on subsequent runs

Closes #282.

The dual-pods controller adds a `dual-pods.llm-d.ai/provider` finalizer to
launcher pods, but this finalizer is not removed when the controller is
scaled down or uninstalled. This leaves completed pods blocking namespace
deletion indefinitely — the "Create namespace" step spins forever on
"Namespace still terminating..."

Three fixes:
1. Strip dual-pods finalizers from all pods before deleting the namespace
   (both pre-test cleanup and post-test cleanup)
2. Add a 60s timeout to the namespace termination wait loop with automatic
   finalizer removal as a fallback, plus a hard 120s error exit
3. Strip finalizers in the "Scale down on failure" step so the namespace
   can be cleaned up on the next run

This was observed on pok-prod-sa where 3 FMA namespaces were stuck
Terminating for up to 7 days due to this finalizer.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Copilot AI review requested due to automatic review settings February 25, 2026 16:02
Contributor

Copilot AI left a comment


Pull request overview

Updates the OpenShift E2E GitHub Actions workflow to avoid hangs when re-creating PR test namespaces that are stuck in Terminating due to orphaned pod finalizers left behind by the dual-pods controller.

Changes:

  • Adds pod-finalizer stripping prior to namespace deletion in both pre-test and post-test cleanup paths.
  • Adds a bounded wait loop (60s remediate, 120s fail) when waiting for a previously-used namespace to fully delete.
  • Adds finalizer stripping in the “scale down on failure” path to reduce the chance of the next run being blocked by orphaned finalizers.

Comment on lines +384 to +385
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true

Copilot AI Feb 25, 2026


The patch sets metadata.finalizers to null, which removes all pod finalizers, not just dual-pods.llm-d.ai/provider. That can unintentionally bypass other controllers' cleanup logic. Prefer removing only the specific finalizer (e.g., JSON patch the finalizers list after filtering out dual-pods.llm-d.ai/provider, or patch only pods that actually have that finalizer).

Suggested change
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o json 2>/dev/null \
| jq 'if .metadata.finalizers then .metadata.finalizers |= map(select(. != "dual-pods.llm-d.ai/provider")) else . end' 2>/dev/null \
| kubectl replace -f - 2>/dev/null || true

Collaborator


I think that Copilot has raised a valid concern. Finalizers are a general-purpose feature of Kubernetes, and we do not know what other things might put finalizers on these Pods for cleanup purposes that need to be respected.

A little more broadly than just the one finalizer, I think that it would be valid to remove all the finalizers that start with "dual-pods.llm-d.ai".
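One jq-free way to build that prefix filter, working on the space-separated output of `jsonpath='{.metadata.finalizers[*]}'` so it needs only tr and grep on the runner (helper name illustrative):

```shell
#!/bin/sh
# Keep every finalizer except those starting with "dual-pods.llm-d.ai",
# one per line. `grep -v` exits nonzero when nothing survives, so `|| true`
# keeps the function from failing under `set -e` in that case.
keep_foreign_finalizers() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai' || true
}
```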

Comment on lines +414 to +420
if [ "$ELAPSED" -ge 60 ]; then
echo "Namespace stuck terminating after 60s — force-removing finalizers..."
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
done
fi

Copilot AI Feb 25, 2026


After 60s this loop will patch all pods every 5s until the namespace is deleted or 120s is reached. That can generate a lot of API traffic for namespaces with many pods. Consider ensuring the finalizer-stripping runs only once (e.g., a boolean flag) and/or only targeting pods that contain the dual-pods.llm-d.ai/provider finalizer.
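A flag-guarded version of the loop might look like this sketch; `namespace_exists` is a stub standing in for `kubectl get namespace "$FMA_NAMESPACE"`, the sleep is replaced by arithmetic so the control flow runs without a cluster, and the thresholds are scaled down from the workflow's 60s/120s:

```shell
#!/bin/sh
# The finalizer strip runs at most once, however long the namespace takes
# to disappear, because STRIPPED latches after the first pass.
STRIP_COUNT=0
ELAPSED=0
STRIPPED=false

namespace_exists() { [ "$ELAPSED" -lt 20 ]; }  # stub: namespace gone at t=20

while namespace_exists; do
  if [ "$ELAPSED" -ge 10 ] && [ "$STRIPPED" = false ]; then
    # In the real workflow: patch dual-pods finalizers off the pods here, once.
    STRIP_COUNT=$((STRIP_COUNT + 1))
    STRIPPED=true
  fi
  ELAPSED=$((ELAPSED + 5))  # stands in for `sleep 5`
done
```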

Collaborator


That can generate a lot of API traffic for namespaces with many pods.

Bogus concern. This namespace is used just for this test, we know that it will have few Pods (unless something has gone horribly wrong in an FMA controller, and in this case they still all need to have finalizer(s) removed).

fi
if [ "$ELAPSED" -ge 120 ]; then
echo "::error::Namespace $FMA_NAMESPACE still terminating after 120s"
kubectl get namespace "$FMA_NAMESPACE" -o json | jq '.status'

Copilot AI Feb 25, 2026


This uses jq for debugging output, but jq is not installed in the earlier “Install tools” step. Because Actions’ default bash runs with -e -o pipefail, a missing jq will abort the step before the intended exit 1, and you’ll lose the namespace status output. Either ensure jq is installed on the runner or avoid it (e.g., use kubectl get namespace ... -o yaml/jsonpath).

Suggested change
kubectl get namespace "$FMA_NAMESPACE" -o json | jq '.status'
kubectl get namespace "$FMA_NAMESPACE" -o yaml || true

Collaborator


I do not know how to check what is installed on the runner used here.

Comment on lines +885 to +886
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true

Copilot AI Feb 25, 2026


Same as earlier: patching {"metadata":{"finalizers":null}} removes all pod finalizers, not just the dual-pods one. Narrow this to removing only dual-pods.llm-d.ai/provider (or only patch pods that contain that finalizer) to avoid bypassing unrelated finalizers.

Suggested change
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
# Only patch pods that actually have the dual-pods.llm-d.ai/provider finalizer
if kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers}' 2>/dev/null | grep -q 'dual-pods.llm-d.ai/provider'; then
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
fi

Comment on lines +913 to +920
# Strip dual-pods finalizers so the namespace can be cleaned up on the next run.
# Without this, scaling down the controller leaves orphaned finalizers that block
# namespace deletion indefinitely.
echo "Stripping dual-pods finalizers from pods (prevents stuck namespace on next cleanup)..."
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
done

Copilot AI Feb 25, 2026


This finalizer stripping uses metadata.finalizers: null, which removes all finalizers from all pods. If the goal is only to prevent orphaned dual-pods.llm-d.ai/provider finalizers, it’s safer to remove only that entry from the finalizers list (and/or only for pods that have it).

Comment on lines +417 to +418
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
Collaborator


I do not see how repeating the same remediation done before the wait is going to help. I think that this can be greatly simplified. Instead of a custom loop, just do a kubectl wait --for=delete with the desired total timeout, and handle failure of that one command.
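The shape Mike describes might look like the sketch below (timeouts and the remediation body are illustrative; `$FMA_NAMESPACE` is the workflow's env var):

```shell
# One bounded wait, one remediation pass, one final bounded wait.
# No custom polling loop is needed.
if ! kubectl wait --for=delete "namespace/$FMA_NAMESPACE" --timeout=60s; then
  echo "Namespace stuck terminating, stripping dual-pods finalizers once..."
  # ...single pass over the pods, removing only dual-pods.llm-d.ai/* finalizers...
  kubectl wait --for=delete "namespace/$FMA_NAMESPACE" --timeout=60s || {
    echo "::error::Namespace $FMA_NAMESPACE still terminating"
    exit 1
  }
fi
```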

- Only remove dual-pods.llm-d.ai/* finalizers instead of nulling all
  finalizers, to avoid bypassing other controllers' cleanup logic
  (Mike + Copilot feedback)
- Replace custom polling loop with kubectl wait --for=delete + single
  remediation pass (Mike's suggestion)
- Replace jq with kubectl -o yaml for namespace status output since
  jq may not be installed on the runner (Copilot feedback)

Signed-off-by: Andy Anderson <andy@clubanderson.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
echo "Test failed - scaling down controller to free resources while preserving for debugging..."
kubectl scale deployment "$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" --replicas=0 || true

# Strip dual-pods.llm-d.ai/* finalizers so the namespace can be cleaned up on the next run.
Collaborator


I do not see the need to do this here, in the "Scale down controller on failure" job --- which does NOT delete the Namespace and DOES do state-dumping of the Pods.

while kubectl get namespace "$FMA_NAMESPACE" &>/dev/null; do
echo "Namespace still terminating..."
sleep 2
echo "Namespace $FMA_NAMESPACE still exists — stripping dual-pods.llm-d.ai finalizers..."
Collaborator

@MikeSpreitzer MikeSpreitzer Feb 25, 2026


@copilot: I do not see why this finalizer stripping should be done again. It was already done in the earlier job step named "Clean up resources for this PR".

echo " Stripping dual-pods.llm-d.ai finalizers from pods..."
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
Collaborator


@copilot: The comment above had the right filter. The regexp here should be ^dual-pods\.llm-d\.ai/.

for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
KEEP=$(echo "$FINALIZERS" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai' | paste -sd, -)
Collaborator


@copilot: The paste here and the tr ',' '\n' later cancel each other out; it would be simpler to omit them.

Collaborator

@MikeSpreitzer MikeSpreitzer left a comment


@copilot: I have finished a review.

- Simplify grep/paste/tr pipeline per Mike's feedback (paste+tr cancel out)
- Use proper regex ^dual-pods\.llm-d\.ai/ for finalizer matching
- Remove duplicate finalizer stripping from "Create namespace" step
  (already done in cleanup step above, just kubectl wait here)
- Remove finalizer stripping from "Scale down on failure" step
  (preserves state for debugging, cleanup handles it on next run)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
- Add workflow summary step showing gate decision
- Hardcode Go 1.24.x instead of extracting from go.mod
- Replace kubectl get crd with kubectl wait --for=condition=Established
- Stop deleting CRDs in cleanup (cluster-scoped, shared across runs)
- Add always-run "Dump cluster state" diagnostics step
- Separate scale-down from diagnostics dump

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
@clubanderson clubanderson changed the title 🐛 Fix namespace stuck terminating due to orphaned dual-pods finalizers 🐛 Fix stuck namespace finalizers + improve E2E reliability and observability Feb 26, 2026
uses: actions/setup-go@7a3fe6cf4cb3a834922a1244abfce67bcef6a0c5 # v6.2.0
with:
go-version: "${{ env.GO_VERSION }}"
go-version: "1.24.x"
Collaborator


@copilot: I would, in general, build with the latest Go available (while not insisting that this has to be tracked closely) unless that surfaces a problem. At the moment, according to https://go.dev/ , that is 1.26.0.

Some people recommend that ordinary users should let other people take the inevitable arrows from patch level zero. That leaves us with 1.25.7 at the moment.

for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
KEEP=$(echo "$FINALIZERS" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai' | paste -sd, -)
Collaborator


@copilot: the paste -sd, - here and the tr ',' '\n' below cancel each other out. It would be simpler to omit them.


- name: Scale down controller on failure
if: failure()
- name: Dump cluster state
Collaborator


@copilot: the state dumping must come before the state is deleted!

Also, it is best if each kind of state is dumped in a distinct job step, that makes the voluminous output easier to consume.

kubectl logs deployment/"$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" --previous 2>/dev/null || true
kubectl logs deployment/"$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" 2>/dev/null || true

- name: Scale down controller on failure
Collaborator


@copilot: this is moot, the whole namespace gets deleted.

Collaborator

@MikeSpreitzer MikeSpreitzer left a comment


@copilot: I have finished a review.

- Bump Go from 1.24.x to 1.25.x (current stable)
- Remove paste/tr roundtrip in post-test finalizer cleanup
- Move diagnostics dump steps before cleanup so state is captured
- Split dump into separate steps for easier log consumption
- Remove moot "scale down on failure" step (namespace gets deleted)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
@MikeSpreitzer
Collaborator

Closing this one because the work has been done in other PRs.
