OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval #517

anuragthehatter · 2025-12-05T22:30:15Z

The current logic failed on a prow CI run: a node did have a core dump present, but the dump was not collected by the oc adm must-gather -- gather_core_dumps command that runs as part of the prow chain. We may have been missing core dump collection for a long time because of this race.

Failed logs on prod builds here

I saw a race condition in which the script tried to copy the dump after the debug pod had already been removed.

Analysis:

Old approach (main branch):
debugPod=$(oc debug --to-namespace="default" node/"$1" -o jsonpath='{.metadata.name}')
oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' > /dev/null 2>&1 &
sleep 2
oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s

Tried to get pod name first (but pod doesn't exist yet - race condition)
Then started the debug pod
This was broken

New approach:

local tmpfile=$(mktemp)
oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
local debug_pid=$!

sleep 2
debugPod=$(grep -oP "(?<=pod/)[^ ]*" "$tmpfile" 2>/dev/null | head -1)
rm -f "$tmpfile"

if [ -n "$debugPod" ]; then
  oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s > /dev/null 2>&1
fi

Capture output: Creates a temp file to capture oc debug output
Extract pod name: Uses grep -oP to parse the pod name from the debug command's output (e.g., "Starting pod/xyz-debug-abc...")
Store PID: Saves the background process ID with debug_pid=$!
Conditional wait: Only waits for pod readiness if a pod name was found
Cleanup: Removes the temp file after extracting the pod name
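
A rough sketch of the flow above, not the PR's actual code. fake_oc_debug is a hypothetical stand-in for oc debug so the snippet runs without a cluster, and the line it prints is assumed to follow the "Starting pod/..." example. The lookbehind is replaced here by grep -o plus a prefix strip so the sketch does not depend on PCRE support.

```shell
# Hypothetical stand-in for `oc debug`: prints an assumed "Starting pod/<name> ..."
# line, then keeps running in the same way the real command would.
fake_oc_debug() {
    echo "Starting pod/$1-debug-abc ..."
    sleep 1
}

# Start the (fake) debug command in the background, capture its output in a
# temp file, and print the pod name parsed from that output.
start_debug_and_get_pod() {
    local tmpfile pod
    tmpfile=$(mktemp) || return 1
    fake_oc_debug "$1" > "$tmpfile" 2>&1 &
    sleep 0.5    # fixed pause, as in the first revision of the PR
    pod=$(grep -o 'pod/[^ ]*' "$tmpfile" 2>/dev/null | head -n 1)
    pod=${pod#pod/}    # drop the "pod/" prefix
    rm -f "$tmpfile"
    [ -n "$pod" ] && echo "$pod"
}
```

Calling start_debug_and_get_pod with a node name prints the parsed pod name and returns non-zero if nothing was found, so a caller can skip the readiness wait in that case.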

Point to note:

Some conformance tests set FAIL_ON_CORE_DUMP: "true" in their prow CI workflows, and they appear to have been passing only because dumps were never collected, so the prow CI flows completed successfully. The flow logic itself seems fine, but it looks like oc adm must-gather never collected dumps even when there were some?
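
The prow step behind FAIL_ON_CORE_DUMP is not shown in this thread. Purely as a hypothetical illustration of why missed collection would let such a check pass, assume the check only looks for files under the gathered node_core_dumps directory (no_core_dumps_found is an invented name):

```shell
# Hypothetical check: succeeds only when no regular file exists under the
# given directory. An empty or missing directory therefore counts as "clean".
no_core_dumps_found() {
    [ -z "$(find "$1" -type f 2>/dev/null | head -n 1)" ]
}
```

With a check of this shape, an empty node_core_dumps directory is indistinguishable from a cluster that produced no core dumps.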

Reproduced the error locally with a simulated core dump on an OCP cluster node, using the default must-gather image

anusaxen@anusaxen-thinkpadp1gen3:~/git/must-gather$ oc adm must-gather -- gather_core_dumps
[must-gather      ] OUT 2025-12-05T22:05:42.184927525Z Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:698aa2e4879ec5cd5587bf1d5343931436659f7201f6f131d6d352966cdd5071
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1c57e459-65cb-4a20-aed6-8912204401b3
ClientVersion: 4.20.0-ec.6
ClusterVersion: Stable at "4.21.0-0.nightly-2025-11-22-193140"
ClusterOperators:
	All healthy and stable


[must-gather      ] OUT 2025-12-05T22:05:42.424328163Z namespace/openshift-must-gather-2vkq4 created
[must-gather      ] OUT 2025-12-05T22:05:42.493340186Z clusterrolebinding.rbac.authorization.k8s.io/must-gather-chbgr created
Warning: spec.nodeSelector[node-role.kubernetes.io/master]: use "node-role.kubernetes.io/control-plane" instead
[must-gather      ] OUT 2025-12-05T22:05:42.692109137Z pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:698aa2e4879ec5cd5587bf1d5343931436659f7201f6f131d6d352966cdd5071 created
[must-gather-689bp] POD 2025-12-05T22:05:44.685980005Z volume percentage checker started.....
[must-gather-689bp] POD 2025-12-05T22:05:44.691389054Z WARNING: Collecting core dumps on ALL linux nodes in your cluster. This could take a long time.
[must-gather-689bp] POD 2025-12-05T22:05:44.693073765Z volume usage percentage 0
[must-gather-689bp] POD 2025-12-05T22:05:45.290317650Z INFO: Waiting for node core dump collection to complete ...
[must-gather-689bp] POD 2025-12-05T22:05:47.874867514Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-0-debug-gc7zx" not found
[must-gather-689bp] POD 2025-12-05T22:05:47.878701942Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-0
[must-gather-689bp] POD 2025-12-05T22:05:47.932576410Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-a-5hj4l-debug-jtlnw" not found
[must-gather-689bp] POD 2025-12-05T22:05:47.935967326Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-a-5hj4l
[must-gather-689bp] POD 2025-12-05T22:05:47.945819392Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-2-debug-dzwf7" not found
[must-gather-689bp] POD 2025-12-05T22:05:47.951535878Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-2
[must-gather-689bp] POD 2025-12-05T22:05:48.011711673Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-c-zb7kw-debug-qlcmq" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.047361398Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-c-zb7kw
[must-gather-689bp] POD 2025-12-05T22:05:48.086458741Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-b-tfwtk-debug-75jbp" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.089404394Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-b-tfwtk
[must-gather-689bp] POD 2025-12-05T22:05:48.195009139Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-1-debug-xx6c8" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.198119460Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-1
[must-gather-689bp] POD 2025-12-05T22:05:48.567670702Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-a-5hj4l-debug-jtlnw" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.600954876Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-0-debug-gc7zx" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.671432198Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-2-debug-dzwf7" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.718137621Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-b-tfwtk-debug-75jbp" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.729841592Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-c-zb7kw-debug-qlcmq" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.761514176Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-1-debug-xx6c8" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.763972377Z INFO: Node core dump collection to complete.
[must-gather-689bp] OUT 2025-12-05T22:05:52.906114764Z waiting for gather to complete
[must-gather-689bp] OUT 2025-12-05T22:05:52.97167166Z downloading gather output
[must-gather-689bp] OUT 2025-12-05T22:05:54.186234169Z receiving incremental file list
[must-gather-689bp] OUT 2025-12-05T22:05:54.304961876Z ./
[must-gather-689bp] OUT 2025-12-05T22:05:54.304995241Z node_core_dumps/
[must-gather-689bp] OUT 2025-12-05T22:05:54.365683865Z 
[must-gather-689bp] OUT 2025-12-05T22:05:54.365737064Z sent 31 bytes  received 84 bytes  76.67 bytes/sec
[must-gather-689bp] OUT 2025-12-05T22:05:54.365748216Z total size is 0  speedup is 0.00
[must-gather      ] OUT 2025-12-05T22:05:54.518452504Z namespace/openshift-must-gather-2vkq4 deleted


Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1c57e459-65cb-4a20-aed6-8912204401b3
ClientVersion: 4.20.0-ec.6
ClusterVersion: Stable at "4.21.0-0.nightly-2025-11-22-193140"
ClusterOperators:
	All healthy and stable


anusaxen@anusaxen-thinkpadp1gen3:~/git/must-gather$ tree must-gather.local.4137063052829339390/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-698aa2e4879ec5cd5587bf1d5343931436659f7201f6f131d6d352966cdd5071/node_core_dumps/
must-gather.local.4137063052829339390/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-698aa2e4879ec5cd5587bf1d5343931436659f7201f6f131d6d352966cdd5071/node_core_dumps/

0 directories, 0 files      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Ran the test image on the same cluster

anusaxen@anusaxen-thinkpadp1gen3:~/git/must-gather$ oc adm must-gather --image=quay.io/anusaxen/mg_test -- gather_core_dumps
[must-gather      ] OUT 2025-12-05T22:00:27.825455875Z Using must-gather plug-in image: quay.io/anusaxen/mg_test
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1c57e459-65cb-4a20-aed6-8912204401b3
ClientVersion: 4.20.0-ec.6
ClusterVersion: Stable at "4.21.0-0.nightly-2025-11-22-193140"
ClusterOperators:
	All healthy and stable


[must-gather      ] OUT 2025-12-05T22:00:28.211471554Z namespace/openshift-must-gather-428qq created
[must-gather      ] OUT 2025-12-05T22:00:28.282437293Z clusterrolebinding.rbac.authorization.k8s.io/must-gather-gx8d5 created
Warning: spec.nodeSelector[node-role.kubernetes.io/master]: use "node-role.kubernetes.io/control-plane" instead
[must-gather      ] OUT 2025-12-05T22:00:28.462286466Z pod for plug-in image quay.io/anusaxen/mg_test created
[must-gather-96w5w] POD 2025-12-05T22:00:35.421103446Z volume percentage checker started.....
[must-gather-96w5w] POD 2025-12-05T22:00:35.426285409Z WARNING: Collecting core dumps on ALL linux nodes in your cluster. This could take a long time.
[must-gather-96w5w] POD 2025-12-05T22:00:35.430453639Z volume usage percentage 0
[must-gather-96w5w] POD 2025-12-05T22:00:35.627603969Z INFO: Waiting for node core dump collection to complete ...
[must-gather-96w5w] POD 2025-12-05T22:00:40.445203536Z volume usage percentage 0
[must-gather-96w5w] POD 2025-12-05T22:00:43.856625685Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-a-5hj4l
[must-gather-96w5w] POD 2025-12-05T22:00:44.252720553Z pod "anurag-gcp1a4-mvtmh-worker-a-5hj4l-debug-4g4nc" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:44.253016466Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-2
[must-gather-96w5w] POD 2025-12-05T22:00:44.683275254Z pod "anurag-gcp1a4-mvtmh-master-2-debug-7sjcp" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:44.821808049Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-c-zb7kw
[must-gather-96w5w] POD 2025-12-05T22:00:45.316557598Z pod "anurag-gcp1a4-mvtmh-worker-c-zb7kw-debug-njqn7" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:45.455118582Z volume usage percentage 0
[must-gather-96w5w] POD 2025-12-05T22:00:45.738901649Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-1
[must-gather-96w5w] POD 2025-12-05T22:00:46.455348492Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-b-tfwtk
[must-gather-96w5w] POD 2025-12-05T22:00:46.517540176Z pod "anurag-gcp1a4-mvtmh-master-1-debug-7vcc5" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:46.845129180Z pod "anurag-gcp1a4-mvtmh-worker-b-tfwtk-debug-x55jh" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:47.403784945Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-0
[must-gather-96w5w] POD 2025-12-05T22:00:47.740623299Z pod "anurag-gcp1a4-mvtmh-master-0-debug-h9pvt" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:49.781600356Z INFO: Node core dump collection to complete.
[must-gather-96w5w] POD 2025-12-05T22:00:50.462420052Z volume usage percentage 0
[must-gather-96w5w] OUT 2025-12-05T22:00:52.566126217Z waiting for gather to complete
[must-gather-96w5w] OUT 2025-12-05T22:00:52.63025287Z downloading gather output
[must-gather-96w5w] OUT 2025-12-05T22:00:55.153665712Z receiving incremental file list
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339664307Z ./
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339727486Z node_core_dumps/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339744636Z node_core_dumps/anurag-gcp1a4-mvtmh-master-0_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339758243Z node_core_dumps/anurag-gcp1a4-mvtmh-master-1_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339771313Z node_core_dumps/anurag-gcp1a4-mvtmh-master-2_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339826053Z node_core_dumps/anurag-gcp1a4-mvtmh-worker-a-5hj4l_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.340055164Z node_core_dumps/anurag-gcp1a4-mvtmh-worker-a-5hj4l_core_dump/core.ovn-northd.0.f11f3120fc0c483ea86980a8b0c6fe0b.3294.1764972013000000.zst
[must-gather-96w5w] OUT 2025-12-05T22:00:55.842680287Z node_core_dumps/anurag-gcp1a4-mvtmh-worker-b-tfwtk_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.842702723Z node_core_dumps/anurag-gcp1a4-mvtmh-worker-c-zb7kw_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.968925919Z 
[must-gather-96w5w] OUT 2025-12-05T22:00:55.968954801Z sent 78 bytes  received 1,731,937 bytes  494,861.43 bytes/sec
[must-gather-96w5w] OUT 2025-12-05T22:00:55.968963964Z total size is 1,731,012  speedup is 1.00
[must-gather      ] OUT 2025-12-05T22:00:56.119849962Z namespace/openshift-must-gather-428qq deleted


Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1c57e459-65cb-4a20-aed6-8912204401b3
ClientVersion: 4.20.0-ec.6
ClusterVersion: Stable at "4.21.0-0.nightly-2025-11-22-193140"
ClusterOperators:
	All healthy and stable

anusaxen@anusaxen-thinkpadp1gen3:~/git/must-gather$ tree must-gather.local.4794133176849791442/quay-io-anusaxen-mg-test-sha256-373d8976064216ab7f8209810fd9b314a2133505a696dbd3eb2d9220777919ff/node_core_dumps/
must-gather.local.4794133176849791442/quay-io-anusaxen-mg-test-sha256-373d8976064216ab7f8209810fd9b314a2133505a696dbd3eb2d9220777919ff/node_core_dumps/
├── anurag-gcp1a4-mvtmh-master-0_core_dump
├── anurag-gcp1a4-mvtmh-master-1_core_dump
├── anurag-gcp1a4-mvtmh-master-2_core_dump
├── anurag-gcp1a4-mvtmh-worker-a-5hj4l_core_dump
│   └── core.ovn-northd.0.f11f3120fc0c483ea86980a8b0c6fe0b.3294.1764972013000000.zst
├── anurag-gcp1a4-mvtmh-worker-b-tfwtk_core_dump
└── anurag-gcp1a4-mvtmh-worker-c-zb7kw_core_dump

7 directories, 1 file

openshift-ci · 2025-12-05T22:30:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anuragthehatter
Once this PR has been reviewed and has the lgtm label, please assign sferich888 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

collection-scripts/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

anuragthehatter · 2025-12-05T22:32:09Z

@sferich888 @chrisdolphy can you help review?

anuragthehatter · 2025-12-05T22:32:58Z

/assign @sferich888 if I may. Thanks

sferich888

I think if we merge this it is likely to break some of our 'threading' that backgrounds processes and tracks their completion.
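
For context, a minimal sketch of the kind of background-and-track pattern this refers to. It is an illustration only: collect_node is a hypothetical stand-in for the per-node work, and the PIDs are kept in a plain string rather than the PIDS array used in the script so that the snippet stays small.

```shell
# Hypothetical stand-in for the per-node work; appends one line per node.
collect_node() {
    sleep 0.1
    echo "done $1" >> "$2"
}

# Start one background job per node, remember each PID, then wait for all.
# Returns the number of jobs that exited non-zero (0 means all succeeded).
run_all() {
    local results="$1" pids="" failed=0 node pid
    shift
    for node in "$@"; do
        collect_node "$node" "$results" &
        pids="$pids $!"    # $! must be read right after the backgrounded command
    done
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}
```

The point to watch is that $! only refers to the most recent backgrounded command, so any change to which commands run with & changes what ends up being tracked.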

sferich888 · 2025-12-08T18:34:47Z

collection-scripts/gather_core_dumps


-    #Mimic a normal oc call, i.e pause between two successive calls to allow pod to register
+    #Wait for the debug pod to be created and extract its name
    sleep 2


This shouldn't be a sleep but a retry loop that checks whether the pod has started, using exponential backoff between checks and a limited number of attempts (3-5; 10 at most).

Agreed. I had meant to leave it unchanged from the original code.

Moved to exponential backoff retry logic
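
A minimal sketch of what such a retry loop can look like, assuming a generic helper rather than the code that landed in the PR. retry_with_backoff and its argument order are invented for this illustration; the cap mirrors the idea of the max_delay variable in the diff.

```shell
# Retry a command until it succeeds, doubling the delay after each failure.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY MAX_DELAY command [args...]
retry_with_backoff() {
    local max_attempts="$1" delay="$2" max_delay="$3" attempt=1
    shift 3
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max_attempts" ] && return 1
        sleep "$delay"    # fractional seconds assume a GNU or BSD sleep
        # Double the delay but never exceed the cap; awk handles the fractions.
        delay=$(awk -v d="$delay" -v m="$max_delay" 'BEGIN { d *= 2; if (d > m) d = m; print d }')
        attempt=$((attempt + 1))
    done
}
```

For example, retry_with_backoff 5 0.5 2.0 test -s "$tmpfile" would check up to five times whether the capture file has any content, with the wait between checks doubling from 0.5 seconds up to the 2.0 second cap.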


anuragthehatter · 2025-12-09T20:42:04Z

cc @TrilokGeer PTAL as well

openshift-ci-robot · 2025-12-18T20:49:45Z

@anuragthehatter: An error was encountered adding this pull request to the external tracker bugs for bug OCPBUGS-66983 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message.


failed to add remote link: failed to add link: No Link Issue Permission for issue 'OCPBUGS-66983'.: request failed. Please analyze the request body for more details. Status code: 403:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.


anuragthehatter · 2025-12-19T19:29:13Z

/jira refresh

openshift-ci-robot · 2025-12-19T19:29:19Z

@anuragthehatter: This pull request references Jira Issue OCPBUGS-66983, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


TrilokGeer · 2025-12-20T14:37:40Z

collection-scripts/gather_core_dumps

+    #Start Debug pod in background and capture output to get pod name
+    local tmpfile=$(mktemp)
+    oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
+    local debug_pid=$!


Unused variable

TrilokGeer · 2025-12-20T14:43:12Z

collection-scripts/gather_core_dumps

+    local max_delay=2.0   # Cap the maximum delay
+
+    while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
+        debugPod=$(grep -oP "(?<=pod/)[^ ]*" "$tmpfile" 2>/dev/null | head -1)


POSIX grep does not guarantee the -P option, and relying on it hurts portability to other grep variants. For example, the script fails to run with BSD grep (macOS) during local development. Would it be possible to switch to POSIX-compliant options?

Addressed it
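
The replacement itself is not shown in this thread. One possible POSIX-friendly equivalent, as a sketch only, uses sed with a basic regular expression (extract_pod_name is a name made up for this example, and the sample line is assumed from the "Starting pod/..." example in the description):

```shell
# Print the first pod name that follows "pod/" in the given file.
# Uses only POSIX sed features: -n, the s command with the p flag, and a BRE group.
extract_pod_name() {
    sed -n 's|.*pod/\([^ ]*\).*|\1|p' "$1" | head -n 1
}
```

Like the original pattern, this takes everything after pod/ up to the next space, so for a line such as "Starting pod/xyz-debug-abc ..." it prints xyz-debug-abc.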

TrilokGeer · 2025-12-20T14:48:13Z

collection-scripts/gather_core_dumps

      #Copy Core Dumps out of Nodes suppress Stdout
      echo "Copying core dumps on node ""$1"""
-      oc cp  --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump > /dev/null 2>&1 && PIDS+=($!)
+      oc cp  --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump > /dev/null 2>&1


What happens if oc fails to copy the core dump?

Added error handling for oc cp failures

Now checks the exit status and prints a warning if the copy fails

Provides visibility into failures instead of silently ignoring them
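
The kind of check described above can be sketched as a small wrapper that runs a command quietly and warns when it fails. This is only an illustration under assumptions: the helper name `run_or_warn` and the warning text are hypothetical and may differ from what the PR actually added.

```shell
#!/bin/sh
# Hypothetical helper: run a command with its output suppressed and print a
# warning on stderr if it exits non-zero.
# Usage: run_or_warn "<description>" <command> [args...]
run_or_warn() {
    rw_description=$1
    shift
    if "$@" > /dev/null 2>&1; then
        return 0
    else
        # In the else branch, $? still holds the failed command's status.
        rw_status=$?
        echo "WARNING: ${rw_description} failed (exit status ${rw_status})" >&2
        return "$rw_status"
    fi
}
```

With the copy command from the diff, a call might look like `run_or_warn "core dump copy from node $1" oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump`. Returning the original exit status lets the caller decide whether a failed copy should abort the run or only be reported.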

TrilokGeer · 2025-12-20T15:34:27Z

Thanks for the PR, @anuragthehatter. I dropped in some review comments; hope they help. Borrowing more help from @praveencodes, @shivprakashmuley, and @swghosh to get the PR prioritized.

anuragthehatter · 2026-01-06T03:35:31Z

Addressed the comments above and re-ran the script on an 18-node cluster:

anurag@fedora:~$ ./test.sh 
WARNING: Collecting core dumps on ALL linux nodes in your cluster. This could take a long time.
INFO: Waiting for node core dump collection to complete ...
Copying core dumps on node testmg1-tglrx-worker-a-g6whz
pod "testmg1-tglrx-worker-a-g6whz-debug-lkd5v" deleted
Copying core dumps on node testmg1-tglrx-worker-a-k554f
Copying core dumps on node testmg1-tglrx-worker-a-f98h2
Copying core dumps on node testmg1-tglrx-worker-a-tl7lj
Copying core dumps on node testmg1-tglrx-worker-a-wtsgc
Copying core dumps on node testmg1-tglrx-worker-c-nvj64
pod "testmg1-tglrx-worker-a-k554f-debug-vkjvp" deleted
Copying core dumps on node testmg1-tglrx-worker-c-b9kdr
Copying core dumps on node testmg1-tglrx-worker-b-6qlk7
Copying core dumps on node testmg1-tglrx-worker-c-6bj8r
Copying core dumps on node testmg1-tglrx-worker-b-ltsvq
Copying core dumps on node testmg1-tglrx-worker-b-fhzlr
Copying core dumps on node testmg1-tglrx-worker-c-fw8mb
Copying core dumps on node testmg1-tglrx-worker-b-rb84h
pod "testmg1-tglrx-worker-a-f98h2-debug-c7phc" deleted
pod "testmg1-tglrx-worker-a-tl7lj-debug-5nldz" deleted
pod "testmg1-tglrx-worker-a-wtsgc-debug-b6zrh" deleted
Copying core dumps on node testmg1-tglrx-worker-b-fk57v
Copying core dumps on node testmg1-tglrx-worker-c-mcts9
Copying core dumps on node testmg1-tglrx-master-0.us-central1-a.c.openshift-qe.internal
pod "testmg1-tglrx-worker-c-nvj64-debug-knvvq" deleted
pod "testmg1-tglrx-worker-c-b9kdr-debug-gktvw" deleted
pod "testmg1-tglrx-worker-b-6qlk7-debug-ggmfd" deleted
Copying core dumps on node testmg1-tglrx-master-2.us-central1-c.c.openshift-qe.internal
pod "testmg1-tglrx-worker-c-6bj8r-debug-bnxwz" deleted
pod "testmg1-tglrx-worker-c-fw8mb-debug-dtjgq" deleted
pod "testmg1-tglrx-worker-b-ltsvq-debug-crrxq" deleted
pod "testmg1-tglrx-worker-b-fhzlr-debug-52q2v" deleted
pod "testmg1-tglrx-worker-b-rb84h-debug-qh4d6" deleted
Copying core dumps on node testmg1-tglrx-master-1.us-central1-b.c.openshift-qe.internal
pod "testmg1-tglrx-worker-c-mcts9-debug-44g7n" deleted
pod "testmg1-tglrx-master-0us-central1-acopenshift-qeinternal-debug-4rfgd" deleted
pod "testmg1-tglrx-master-2us-central1-ccopenshift-qeinternal-debug-9f2pz" deleted
pod "testmg1-tglrx-worker-b-fk57v-debug-4snhd" deleted
pod "testmg1-tglrx-master-1us-central1-bcopenshift-qeinternal-debug-xp8zz" deleted
INFO: Node core dump collection to complete.
anurag@fedora:~$ tree must-gather/
must-gather/
└── node_core_dumps
    ├── testmg1-tglrx-master-0.us-central1-a.c.openshift-qe.internal_core_dump
    ├── testmg1-tglrx-master-1.us-central1-b.c.openshift-qe.internal_core_dump
    ├── testmg1-tglrx-master-2.us-central1-c.c.openshift-qe.internal_core_dump
    ├── testmg1-tglrx-worker-a-f98h2_core_dump
    ├── testmg1-tglrx-worker-a-g6whz_core_dump
    │   └── core.ovn-northd.0.90a06998f5e846579d89b563f4948faa.3218.1767669799000000.zst     <<<<<<<<<<<<<<<<
    ├── testmg1-tglrx-worker-a-k554f_core_dump
    ├── testmg1-tglrx-worker-a-tl7lj_core_dump
    ├── testmg1-tglrx-worker-a-wtsgc_core_dump
    ├── testmg1-tglrx-worker-b-6qlk7_core_dump
    ├── testmg1-tglrx-worker-b-fhzlr_core_dump
    ├── testmg1-tglrx-worker-b-fk57v_core_dump
    ├── testmg1-tglrx-worker-b-ltsvq_core_dump
    ├── testmg1-tglrx-worker-b-rb84h_core_dump
    ├── testmg1-tglrx-worker-c-6bj8r_core_dump
    ├── testmg1-tglrx-worker-c-b9kdr_core_dump
    ├── testmg1-tglrx-worker-c-fw8mb_core_dump
    ├── testmg1-tglrx-worker-c-mcts9_core_dump
    └── testmg1-tglrx-worker-c-nvj64_core_dump

20 directories, 1 file

@TrilokGeer PTAL again. Thanks.

openshift-ci · 2026-01-06T06:26:11Z

@anuragthehatter: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-images | `8d1bdfa` | link | true | `/test okd-scos-images` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Fix race condition in gather_core_dumps pod name retrieval

c9e8847

openshift-ci bot requested review from ingvagabund and sferich888 December 5, 2025 22:30

sferich888 suggested changes Dec 8, 2025

added logic retry loop for pod readiness using an exponential backoff

71f1bd8
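
A retry loop with capped exponential backoff, as this commit message describes, might look roughly like the sketch below. It is an assumption-laden illustration rather than the code in commit 71f1bd8: the helper name `retry_with_backoff` and its argument order are hypothetical, and it uses whole-second delays to stay within POSIX shell integer arithmetic, whereas the diff under review uses a fractional cap (`max_delay=2.0`).

```shell
#!/bin/sh
# Hypothetical helper: retry a command with capped exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay> <max_delay> <command> [args...]
# Delays are whole seconds so the doubling works with POSIX integer arithmetic.
retry_with_backoff() {
    rb_max_attempts=$1
    rb_delay=$2
    rb_max_delay=$3
    shift 3
    rb_attempt=1
    while [ "$rb_attempt" -le "$rb_max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep only between attempts, never after the final failure.
        if [ "$rb_attempt" -lt "$rb_max_attempts" ]; then
            sleep "$rb_delay"
            # Double the delay, then clamp it to the configured maximum.
            rb_delay=$((rb_delay * 2))
            if [ "$rb_delay" -gt "$rb_max_delay" ]; then
                rb_delay=$rb_max_delay
            fi
        fi
        rb_attempt=$((rb_attempt + 1))
    done
    return 1
}
```

For pod readiness, a call could be `retry_with_backoff 5 1 2 oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s`. Capping the delay keeps the total wait bounded even when many nodes are processed in parallel.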

anuragthehatter changed the title ~~Fix race condition in gather_core_dumps pod name retrieval~~ OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval Dec 18, 2025

openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels Dec 19, 2025

TrilokGeer reviewed Dec 20, 2025

anuragthehatter mentioned this pull request Dec 23, 2025

CORENET-6356: Bump OVN to 25.09 and 25.09 for OKD openshift/ovn-kubernetes#2909

Merged

5 tasks

Addressed comments

8d1bdfa
