🐛 Fix stuck namespace finalizers + improve E2E reliability and observability#285
clubanderson wants to merge 5 commits into main from
Conversation
The dual-pods controller adds a `dual-pods.llm-d.ai/provider` finalizer to launcher pods, but this finalizer is not removed when the controller is scaled down or uninstalled. This leaves completed pods blocking namespace deletion indefinitely — the "Create namespace" step spins forever on "Namespace still terminating...".

Three fixes:

1. Strip dual-pods finalizers from all pods before deleting the namespace (both pre-test cleanup and post-test cleanup)
2. Add a 60s timeout to the namespace termination wait loop with automatic finalizer removal as a fallback, plus a hard 120s error exit
3. Strip finalizers in the "Scale down on failure" step so the namespace can be cleaned up on the next run

This was observed on pok-prod-sa, where 3 FMA namespaces were stuck Terminating for up to 7 days due to this finalizer.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Pull request overview
Updates the OpenShift E2E GitHub Actions workflow to avoid hangs when re-creating PR test namespaces that are stuck in Terminating due to orphaned pod finalizers left behind by the dual-pods controller.
Changes:
- Adds pod-finalizer stripping prior to namespace deletion in both pre-test and post-test cleanup paths.
- Adds a bounded wait loop (60s remediate, 120s fail) when waiting for a previously-used namespace to fully delete.
- Adds finalizer stripping in the “scale down on failure” path to reduce the chance of the next run being blocked by orphaned finalizers.
```
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
```
The patch sets metadata.finalizers to null, which removes all pod finalizers, not just dual-pods.llm-d.ai/provider. That can unintentionally bypass other controllers' cleanup logic. Prefer removing only the specific finalizer (e.g., JSON patch the finalizers list after filtering out dual-pods.llm-d.ai/provider, or patch only pods that actually have that finalizer).
Suggested change:

```diff
-kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
+kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o json 2>/dev/null \
+  | jq 'if .metadata.finalizers then .metadata.finalizers |= map(select(. != "dual-pods.llm-d.ai/provider")) else . end' 2>/dev/null \
+  | kubectl replace -f - 2>/dev/null || true
```
I think that Copilot has raised a valid concern. Finalizers are a general-purpose feature of Kubernetes, and we do not know what other things might put finalizers on these Pods for cleanup purposes that need to be respected.
A little more broadly than just the one finalizer, I think that it would be valid to remove all the finalizers that start with "dual-pods.llm-d.ai".
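A prefix-based filter along the lines Mike describes could look like the sketch below. This is a hedged illustration, not the workflow's actual code: the sample JSON is made up, `jq` is assumed to be available, and in the real step the input would come from `kubectl get pod -o json` and be piped back through `kubectl replace -f -`.

```shell
# Drop every finalizer whose name starts with "dual-pods.llm-d.ai",
# keeping any other controller's finalizers intact. The guard avoids
# erroring when .metadata.finalizers is absent.
echo '{"metadata":{"finalizers":["dual-pods.llm-d.ai/provider","example.io/keep"]}}' \
  | jq -c 'if .metadata.finalizers
           then .metadata.finalizers |= map(select(startswith("dual-pods.llm-d.ai") | not))
           else . end'
# → {"metadata":{"finalizers":["example.io/keep"]}}
```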
```
if [ "$ELAPSED" -ge 60 ]; then
  echo "Namespace stuck terminating after 60s — force-removing finalizers..."
  for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
    kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
      -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
  done
fi
```
After 60s this loop will patch all pods every 5s until the namespace is deleted or 120s is reached. That can generate a lot of API traffic for namespaces with many pods. Consider ensuring the finalizer-stripping runs only once (e.g., a boolean flag) and/or only targeting pods that contain the dual-pods.llm-d.ai/provider finalizer.
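The "run it only once" idea can be sketched with a simple boolean guard. All names here are made up for illustration; the `echo` stands in for the actual `kubectl patch` loop:

```shell
# Guard so the finalizer strip runs a single time, even though the
# surrounding wait loop keeps iterating every 5 seconds.
STRIPPED=false
strip_finalizers_once() {
  if [ "$STRIPPED" = false ]; then
    STRIPPED=true
    echo "stripping dual-pods finalizers"   # stand-in for the kubectl patch loop
  fi
}
strip_finalizers_once
strip_finalizers_once   # second call is a no-op
```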
> That can generate a lot of API traffic for namespaces with many pods.
Bogus concern. This namespace is used just for this test, we know that it will have few Pods (unless something has gone horribly wrong in an FMA controller, and in this case they still all need to have finalizer(s) removed).
```
fi
if [ "$ELAPSED" -ge 120 ]; then
  echo "::error::Namespace $FMA_NAMESPACE still terminating after 120s"
  kubectl get namespace "$FMA_NAMESPACE" -o json | jq '.status'
```
This uses jq for debugging output, but jq is not installed in the earlier “Install tools” step. Because Actions’ default bash runs with -e -o pipefail, a missing jq will abort the step before the intended exit 1, and you’ll lose the namespace status output. Either ensure jq is installed on the runner or avoid it (e.g., use kubectl get namespace ... -o yaml/jsonpath).
Suggested change:

```diff
-kubectl get namespace "$FMA_NAMESPACE" -o json | jq '.status'
+kubectl get namespace "$FMA_NAMESPACE" -o yaml || true
```
I do not know how to check what is installed on the runner used here.
```
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
```
Same as earlier: patching {"metadata":{"finalizers":null}} removes all pod finalizers, not just the dual-pods one. Narrow this to removing only dual-pods.llm-d.ai/provider (or only patch pods that contain that finalizer) to avoid bypassing unrelated finalizers.
Suggested change:

```diff
-kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
-  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
+# Only patch pods that actually have the dual-pods.llm-d.ai/provider finalizer
+if kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers}' 2>/dev/null | grep -q 'dual-pods.llm-d.ai/provider'; then
+  kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
+    -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
+fi
```
```
# Strip dual-pods finalizers so the namespace can be cleaned up on the next run.
# Without this, scaling down the controller leaves orphaned finalizers that block
# namespace deletion indefinitely.
echo "Stripping dual-pods finalizers from pods (prevents stuck namespace on next cleanup)..."
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
  kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
    -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
done
```
This finalizer stripping uses metadata.finalizers: null, which removes all finalizers from all pods. If the goal is only to prevent orphaned dual-pods.llm-d.ai/provider finalizers, it’s safer to remove only that entry from the finalizers list (and/or only for pods that have it).
```
kubectl patch pod "$pod" -n "$FMA_NAMESPACE" \
  -p '{"metadata":{"finalizers":null}}' --type=merge 2>/dev/null || true
```
I do not see how repeating the same remediation done before the wait is going to help. I think that this can be greatly simplified. Instead of a custom loop, just do a kubectl wait --for=delete with the desired total timeout, and handle failure of that one command.
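The `kubectl wait --for=delete` approach could be sketched as below. The function name and the 120s figure are illustrative, not from the workflow; the delete is issued without waiting, then kubectl itself does the bounded wait:

```shell
# Let kubectl do the bounded wait instead of a hand-rolled polling loop.
wait_for_namespace_gone() {
  ns="$1"
  kubectl delete namespace "$ns" --ignore-not-found --wait=false
  if kubectl wait --for=delete "namespace/$ns" --timeout=120s 2>/dev/null; then
    return 0
  fi
  # A single remediation pass (strip dual-pods.llm-d.ai/* finalizers) would go here.
  echo "::error::Namespace $ns still terminating after 120s"
  return 1
}
```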
- Only remove dual-pods.llm-d.ai/* finalizers instead of nulling all finalizers, to avoid bypassing other controllers' cleanup logic (Mike + Copilot feedback)
- Replace custom polling loop with kubectl wait --for=delete + single remediation pass (Mike's suggestion)
- Replace jq with kubectl -o yaml for namespace status output since jq may not be installed on the runner (Copilot feedback)

Signed-off-by: Andy Anderson <andy@clubanderson.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
```
echo "Test failed - scaling down controller to free resources while preserving for debugging..."
kubectl scale deployment "$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" --replicas=0 || true

# Strip dual-pods.llm-d.ai/* finalizers so the namespace can be cleaned up on the next run.
```
I do not see the need to do this here, in the "Scale down controller on failure" job --- which does NOT delete the Namespace and DOES do state-dumping of the Pods.
```diff
 while kubectl get namespace "$FMA_NAMESPACE" &>/dev/null; do
-  echo "Namespace still terminating..."
-  sleep 2
+  echo "Namespace $FMA_NAMESPACE still exists — stripping dual-pods.llm-d.ai finalizers..."
```
@copilot: I do not see why this finalizer stripping should be done again. It was already done in the earlier job step named "Clean up resources for this PR".
```
echo "  Stripping dual-pods.llm-d.ai finalizers from pods..."
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
  FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
  if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
```
@copilot: The comment above had the right filter. The regexp here should be ^dual-pods\.llm-d\.ai/.
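The difference the anchor makes can be seen with two illustrative strings (the finalizer names below are hypothetical, not from the workflow): an unanchored pattern also matches finalizers that merely *contain* the string, while `^dual-pods\.llm-d\.ai/` only matches names under the dual-pods prefix.

```shell
# Unanchored: matches a finalizer that only embeds the string (false positive).
echo 'vendor.io/embeds-dual-pods.llm-d.ai' | grep -q 'dual-pods.llm-d.ai' && echo "unanchored: match"
# Anchored: no false positive on the embedded string...
echo 'vendor.io/embeds-dual-pods.llm-d.ai' | grep -q '^dual-pods\.llm-d\.ai/' || echo "anchored: no match"
# ...but still matches the real dual-pods finalizer.
echo 'dual-pods.llm-d.ai/provider' | grep -q '^dual-pods\.llm-d\.ai/' && echo "anchored: match"
```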
```
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
  FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
  if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
    KEEP=$(echo "$FINALIZERS" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai' | paste -sd, -)
```
@copilot: The `paste` here and the `tr ',' '\n'` later cancel each other out; it would be simpler to omit them.
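One way to realize the simplification: keep the surviving finalizers as a newline-separated list and build the patch JSON directly, with no `paste`/`tr` round-trip. This is a hedged sketch, not the workflow's code: the `$FINALIZERS` value below is a made-up example, and `jq` is assumed to be available (Copilot noted it may not be on the runner).

```shell
# Example input: space-separated finalizer names, as jsonpath prints them.
FINALIZERS='dual-pods.llm-d.ai/provider example.io/keep'
# Split on spaces, drop the dual-pods prefix, keep the rest newline-separated.
KEEP=$(echo "$FINALIZERS" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai/' || true)
# Build the merge-patch JSON straight from the newline-separated list.
PATCH=$(printf '%s' "$KEEP" | jq -R . | jq -cs '{metadata:{finalizers:.}}')
echo "$PATCH"
# → {"metadata":{"finalizers":["example.io/keep"]}}
```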
MikeSpreitzer left a comment
@copilot: I have finished a review.
- Simplify grep/paste/tr pipeline per Mike's feedback (paste+tr cancel out)
- Use proper regex ^dual-pods\.llm-d\.ai/ for finalizer matching
- Remove duplicate finalizer stripping from "Create namespace" step (already done in cleanup step above, just kubectl wait here)
- Remove finalizer stripping from "Scale down on failure" step (preserves state for debugging, cleanup handles it on next run)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
- Add workflow summary step showing gate decision
- Hardcode Go 1.24.x instead of extracting from go.mod
- Replace kubectl get crd with kubectl wait --for=condition=Established
- Stop deleting CRDs in cleanup (cluster-scoped, shared across runs)
- Add always-run "Dump cluster state" diagnostics step
- Separate scale-down from diagnostics dump

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
```diff
 uses: actions/setup-go@7a3fe6cf4cb3a834922a1244abfce67bcef6a0c5 # v6.2.0
 with:
-  go-version: "${{ env.GO_VERSION }}"
+  go-version: "1.24.x"
```
@copilot: I would, in general, build with the latest Go available (while not insisting that this has to be tracked closely) unless that surfaces a problem. At the moment, according to https://go.dev/, that is 1.26.0.
Some people recommend that ordinary users should let other people take the inevitable arrows from patch level zero. That leaves us with 1.25.7 at the moment.
```
for pod in $(kubectl get pods -n "$FMA_NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
  FINALIZERS=$(kubectl get pod "$pod" -n "$FMA_NAMESPACE" -o jsonpath='{.metadata.finalizers[*]}' 2>/dev/null) || true
  if echo "$FINALIZERS" | grep -q 'dual-pods.llm-d.ai'; then
    KEEP=$(echo "$FINALIZERS" | tr ' ' '\n' | grep -v '^dual-pods\.llm-d\.ai' | paste -sd, -)
```
@copilot: the `paste -sd, -` here and the `tr ',' '\n'` below cancel each other out. It would be simpler to omit them.
```diff
-- name: Scale down controller on failure
-  if: failure()
+- name: Dump cluster state
```
@copilot: the state dumping must come before the state is deleted!
Also, it is best if each kind of state is dumped in a distinct job step, that makes the voluminous output easier to consume.
```
kubectl logs deployment/"$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" --previous 2>/dev/null || true
kubectl logs deployment/"$FMA_RELEASE_NAME-dual-pods-controller" -n "$FMA_NAMESPACE" 2>/dev/null || true

- name: Scale down controller on failure
```
@copilot: this is moot, the whole namespace gets deleted.
MikeSpreitzer left a comment
@copilot: I have finished a review.
- Bump Go from 1.24.x to 1.25.x (current stable)
- Remove paste/tr roundtrip in post-test finalizer cleanup
- Move diagnostics dump steps before cleanup so state is captured
- Split dump into separate steps for easier log consumption
- Remove moot "scale down on failure" step (namespace gets deleted)

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Closing this one because the work has been done in other PRs.
Summary
Consolidates #282 and fixes the root cause of stuck-Terminating namespaces in the OpenShift E2E workflow.
Finalizer fix (root cause)
- Strip `dual-pods.llm-d.ai/*` finalizers from pods before namespace deletion in both pre-test and post-test cleanup
- Replace the `while` polling loop with `kubectl wait --for=delete --timeout=120s` and a hard error exit

Reliability improvements (from #282)
- Hardcode the Go version instead of `sed` extraction from go.mod
- Replace `kubectl get crd` verification with `kubectl wait --for=condition=Established` (proper CRD readiness)

Observability improvements (from #282)
- Add a workflow summary step showing the gate decision
- Add an always-run "Dump cluster state" diagnostics step
Problem
The dual-pods controller adds a `dual-pods.llm-d.ai/provider` finalizer to launcher pods. When the controller is uninstalled during cleanup, this finalizer is never removed. Completed launcher pods block namespace deletion indefinitely.

On pok-prod-sa, 3 FMA namespaces were stuck Terminating:
- `fma-e2e-pr-282` — 21 hours
- `fma-e2e-pr-262` — 7 days
- `fma-e2e-pr-22367943591` — 19 hours

All had the same root cause: a single completed `launcher` pod with a `dual-pods.llm-d.ai/provider` finalizer.

Test plan
Closes #282.