
Conversation

@anuragthehatter commented Dec 5, 2025

The current logic failed on one of the Prow CI runs: the node did have a core dump present, but the dump was not collected by the oc adm must-gather -- gather_core_dumps command run underneath a Prow chain. It seems we may have been missing core dump collection for a long time because of this race issue.

Failed logs from the prod builds are here.

We saw a race condition where the dump copy was attempted but the debug pod was already gone.

Analysis:

Old approach (main branch):
debugPod=$(oc debug --to-namespace="default" node/"$1" -o jsonpath='{.metadata.name}')
oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' > /dev/null 2>&1 &
sleep 2
oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s

  • Tried to capture the pod name first, but the pod did not exist yet (race condition)
  • Only then started the debug pod in the background
  • This was broken: the name captured up front never matched the debug pod that was actually created, so the subsequent wait (and copy) targeted a non-existent pod

New approach:

local tmpfile=$(mktemp)
oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
local debug_pid=$!

sleep 2
debugPod=$(grep -oP "(?<=pod/)[^ ]*" "$tmpfile" 2>/dev/null | head -1)
rm -f "$tmpfile"

if [ -n "$debugPod" ]; then
  oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s > /dev/null 2>&1
fi
  1. Capture output: Creates a temp file to capture oc debug output
  2. Extract pod name: Uses grep -oP to parse the pod name from the debug command's output (e.g., "Starting pod/xyz-debug-abc...")
  3. Store PID: Saves the background process ID with debug_pid=$!
  4. Conditional wait: Only waits for pod readiness if a pod name was found
  5. Cleanup: Removes the temp file after extracting the pod name

Point to note:

  • Some conformance tests use FAIL_ON_CORE_DUMP: "true" in their Prow CI workflows, and they appeared to pass only because dumps were never collected, which resulted in successful Prow CI executions. The workflow logic itself looks fine, but it seems oc adm must-gather never collected dumps even when they were present.

Reproduced the error locally with the help of a simulated core dump on an OCP cluster node,
using the default must-gather image:

anusaxen@anusaxen-thinkpadp1gen3:~/git/must-gather$ oc adm must-gather -- gather_core_dumps
[must-gather      ] OUT 2025-12-05T22:05:42.184927525Z Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:698aa2e4879ec5cd5587bf1d5343931436659f7201f6f131d6d352966cdd5071
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1c57e459-65cb-4a20-aed6-8912204401b3
ClientVersion: 4.20.0-ec.6
ClusterVersion: Stable at "4.21.0-0.nightly-2025-11-22-193140"
ClusterOperators:
	All healthy and stable


[must-gather      ] OUT 2025-12-05T22:05:42.424328163Z namespace/openshift-must-gather-2vkq4 created
[must-gather      ] OUT 2025-12-05T22:05:42.493340186Z clusterrolebinding.rbac.authorization.k8s.io/must-gather-chbgr created
Warning: spec.nodeSelector[node-role.kubernetes.io/master]: use "node-role.kubernetes.io/control-plane" instead
[must-gather      ] OUT 2025-12-05T22:05:42.692109137Z pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:698aa2e4879ec5cd5587bf1d5343931436659f7201f6f131d6d352966cdd5071 created
[must-gather-689bp] POD 2025-12-05T22:05:44.685980005Z volume percentage checker started.....
[must-gather-689bp] POD 2025-12-05T22:05:44.691389054Z WARNING: Collecting core dumps on ALL linux nodes in your cluster. This could take a long time.
[must-gather-689bp] POD 2025-12-05T22:05:44.693073765Z volume usage percentage 0
[must-gather-689bp] POD 2025-12-05T22:05:45.290317650Z INFO: Waiting for node core dump collection to complete ...
[must-gather-689bp] POD 2025-12-05T22:05:47.874867514Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-0-debug-gc7zx" not found
[must-gather-689bp] POD 2025-12-05T22:05:47.878701942Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-0
[must-gather-689bp] POD 2025-12-05T22:05:47.932576410Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-a-5hj4l-debug-jtlnw" not found
[must-gather-689bp] POD 2025-12-05T22:05:47.935967326Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-a-5hj4l
[must-gather-689bp] POD 2025-12-05T22:05:47.945819392Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-2-debug-dzwf7" not found
[must-gather-689bp] POD 2025-12-05T22:05:47.951535878Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-2
[must-gather-689bp] POD 2025-12-05T22:05:48.011711673Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-c-zb7kw-debug-qlcmq" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.047361398Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-c-zb7kw
[must-gather-689bp] POD 2025-12-05T22:05:48.086458741Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-b-tfwtk-debug-75jbp" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.089404394Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-b-tfwtk
[must-gather-689bp] POD 2025-12-05T22:05:48.195009139Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-1-debug-xx6c8" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.198119460Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-1
[must-gather-689bp] POD 2025-12-05T22:05:48.567670702Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-a-5hj4l-debug-jtlnw" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.600954876Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-0-debug-gc7zx" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.671432198Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-2-debug-dzwf7" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.718137621Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-b-tfwtk-debug-75jbp" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.729841592Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-worker-c-zb7kw-debug-qlcmq" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.761514176Z Error from server (NotFound): pods "anurag-gcp1a4-mvtmh-master-1-debug-xx6c8" not found
[must-gather-689bp] POD 2025-12-05T22:05:48.763972377Z INFO: Node core dump collection to complete.
[must-gather-689bp] OUT 2025-12-05T22:05:52.906114764Z waiting for gather to complete
[must-gather-689bp] OUT 2025-12-05T22:05:52.97167166Z downloading gather output
[must-gather-689bp] OUT 2025-12-05T22:05:54.186234169Z receiving incremental file list
[must-gather-689bp] OUT 2025-12-05T22:05:54.304961876Z ./
[must-gather-689bp] OUT 2025-12-05T22:05:54.304995241Z node_core_dumps/
[must-gather-689bp] OUT 2025-12-05T22:05:54.365683865Z 
[must-gather-689bp] OUT 2025-12-05T22:05:54.365737064Z sent 31 bytes  received 84 bytes  76.67 bytes/sec
[must-gather-689bp] OUT 2025-12-05T22:05:54.365748216Z total size is 0  speedup is 0.00
[must-gather      ] OUT 2025-12-05T22:05:54.518452504Z namespace/openshift-must-gather-2vkq4 deleted


Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1c57e459-65cb-4a20-aed6-8912204401b3
ClientVersion: 4.20.0-ec.6
ClusterVersion: Stable at "4.21.0-0.nightly-2025-11-22-193140"
ClusterOperators:
	All healthy and stable


anusaxen@anusaxen-thinkpadp1gen3:~/git/must-gather$ tree must-gather.local.4137063052829339390/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-698aa2e4879ec5cd5587bf1d5343931436659f7201f6f131d6d352966cdd5071/node_core_dumps/
must-gather.local.4137063052829339390/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-698aa2e4879ec5cd5587bf1d5343931436659f7201f6f131d6d352966cdd5071/node_core_dumps/

0 directories, 0 files      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Ran the test image on the same cluster:

anusaxen@anusaxen-thinkpadp1gen3:~/git/must-gather$ oc adm must-gather --image=quay.io/anusaxen/mg_test -- gather_core_dumps
[must-gather      ] OUT 2025-12-05T22:00:27.825455875Z Using must-gather plug-in image: quay.io/anusaxen/mg_test
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1c57e459-65cb-4a20-aed6-8912204401b3
ClientVersion: 4.20.0-ec.6
ClusterVersion: Stable at "4.21.0-0.nightly-2025-11-22-193140"
ClusterOperators:
	All healthy and stable


[must-gather      ] OUT 2025-12-05T22:00:28.211471554Z namespace/openshift-must-gather-428qq created
[must-gather      ] OUT 2025-12-05T22:00:28.282437293Z clusterrolebinding.rbac.authorization.k8s.io/must-gather-gx8d5 created
Warning: spec.nodeSelector[node-role.kubernetes.io/master]: use "node-role.kubernetes.io/control-plane" instead
[must-gather      ] OUT 2025-12-05T22:00:28.462286466Z pod for plug-in image quay.io/anusaxen/mg_test created
[must-gather-96w5w] POD 2025-12-05T22:00:35.421103446Z volume percentage checker started.....
[must-gather-96w5w] POD 2025-12-05T22:00:35.426285409Z WARNING: Collecting core dumps on ALL linux nodes in your cluster. This could take a long time.
[must-gather-96w5w] POD 2025-12-05T22:00:35.430453639Z volume usage percentage 0
[must-gather-96w5w] POD 2025-12-05T22:00:35.627603969Z INFO: Waiting for node core dump collection to complete ...
[must-gather-96w5w] POD 2025-12-05T22:00:40.445203536Z volume usage percentage 0
[must-gather-96w5w] POD 2025-12-05T22:00:43.856625685Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-a-5hj4l
[must-gather-96w5w] POD 2025-12-05T22:00:44.252720553Z pod "anurag-gcp1a4-mvtmh-worker-a-5hj4l-debug-4g4nc" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:44.253016466Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-2
[must-gather-96w5w] POD 2025-12-05T22:00:44.683275254Z pod "anurag-gcp1a4-mvtmh-master-2-debug-7sjcp" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:44.821808049Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-c-zb7kw
[must-gather-96w5w] POD 2025-12-05T22:00:45.316557598Z pod "anurag-gcp1a4-mvtmh-worker-c-zb7kw-debug-njqn7" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:45.455118582Z volume usage percentage 0
[must-gather-96w5w] POD 2025-12-05T22:00:45.738901649Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-1
[must-gather-96w5w] POD 2025-12-05T22:00:46.455348492Z Copying core dumps on node anurag-gcp1a4-mvtmh-worker-b-tfwtk
[must-gather-96w5w] POD 2025-12-05T22:00:46.517540176Z pod "anurag-gcp1a4-mvtmh-master-1-debug-7vcc5" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:46.845129180Z pod "anurag-gcp1a4-mvtmh-worker-b-tfwtk-debug-x55jh" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:47.403784945Z Copying core dumps on node anurag-gcp1a4-mvtmh-master-0
[must-gather-96w5w] POD 2025-12-05T22:00:47.740623299Z pod "anurag-gcp1a4-mvtmh-master-0-debug-h9pvt" deleted from default namespace
[must-gather-96w5w] POD 2025-12-05T22:00:49.781600356Z INFO: Node core dump collection to complete.
[must-gather-96w5w] POD 2025-12-05T22:00:50.462420052Z volume usage percentage 0
[must-gather-96w5w] OUT 2025-12-05T22:00:52.566126217Z waiting for gather to complete
[must-gather-96w5w] OUT 2025-12-05T22:00:52.63025287Z downloading gather output
[must-gather-96w5w] OUT 2025-12-05T22:00:55.153665712Z receiving incremental file list
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339664307Z ./
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339727486Z node_core_dumps/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339744636Z node_core_dumps/anurag-gcp1a4-mvtmh-master-0_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339758243Z node_core_dumps/anurag-gcp1a4-mvtmh-master-1_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339771313Z node_core_dumps/anurag-gcp1a4-mvtmh-master-2_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.339826053Z node_core_dumps/anurag-gcp1a4-mvtmh-worker-a-5hj4l_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.340055164Z node_core_dumps/anurag-gcp1a4-mvtmh-worker-a-5hj4l_core_dump/core.ovn-northd.0.f11f3120fc0c483ea86980a8b0c6fe0b.3294.1764972013000000.zst
[must-gather-96w5w] OUT 2025-12-05T22:00:55.842680287Z node_core_dumps/anurag-gcp1a4-mvtmh-worker-b-tfwtk_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.842702723Z node_core_dumps/anurag-gcp1a4-mvtmh-worker-c-zb7kw_core_dump/
[must-gather-96w5w] OUT 2025-12-05T22:00:55.968925919Z 
[must-gather-96w5w] OUT 2025-12-05T22:00:55.968954801Z sent 78 bytes  received 1,731,937 bytes  494,861.43 bytes/sec
[must-gather-96w5w] OUT 2025-12-05T22:00:55.968963964Z total size is 1,731,012  speedup is 1.00
[must-gather      ] OUT 2025-12-05T22:00:56.119849962Z namespace/openshift-must-gather-428qq deleted


Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1c57e459-65cb-4a20-aed6-8912204401b3
ClientVersion: 4.20.0-ec.6
ClusterVersion: Stable at "4.21.0-0.nightly-2025-11-22-193140"
ClusterOperators:
	All healthy and stable

anusaxen@anusaxen-thinkpadp1gen3:~/git/must-gather$ tree must-gather.local.4794133176849791442/quay-io-anusaxen-mg-test-sha256-373d8976064216ab7f8209810fd9b314a2133505a696dbd3eb2d9220777919ff/node_core_dumps/
must-gather.local.4794133176849791442/quay-io-anusaxen-mg-test-sha256-373d8976064216ab7f8209810fd9b314a2133505a696dbd3eb2d9220777919ff/node_core_dumps/
├── anurag-gcp1a4-mvtmh-master-0_core_dump
├── anurag-gcp1a4-mvtmh-master-1_core_dump
├── anurag-gcp1a4-mvtmh-master-2_core_dump
├── anurag-gcp1a4-mvtmh-worker-a-5hj4l_core_dump
│   └── core.ovn-northd.0.f11f3120fc0c483ea86980a8b0c6fe0b.3294.1764972013000000.zst
├── anurag-gcp1a4-mvtmh-worker-b-tfwtk_core_dump
└── anurag-gcp1a4-mvtmh-worker-c-zb7kw_core_dump

7 directories, 1 file

openshift-ci bot commented Dec 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anuragthehatter
Once this PR has been reviewed and has the lgtm label, please assign sferich888 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@anuragthehatter (Author)

@sferich888 @chrisdolphy can you help review?

@anuragthehatter (Author)

/assign @sferich888 if I may. Thanks

@sferich888 (Contributor) left a comment


I think if we merge this it is likely to break some of our 'threading' that backgrounds processes and tracks their completion.


#Mimic a normal oc call, i.e pause between two successive calls to allow pod to register
#Wait for the debug pod to be created and extract its name
sleep 2
Contributor:

This shouldn't be a sleep but a retry loop that checks whether the pod has started, using exponential backoff for the check, with a limited number of attempts (3-5, 10 max).

Author:

Agree. I had intended to keep it untouched from the original code.

Author:

Moved to exponential backoff retry logic
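
For illustration, a minimal sketch of the kind of backoff loop intended here (variable names and limits are illustrative assumptions, not necessarily the exact code that landed):

# Sketch only: retry pod-name extraction with exponential backoff instead of a fixed sleep
attempt=0
max_attempts=5
delay=0.2
max_delay=2.0
debugPod=""
while [ -z "$debugPod" ] && [ "$attempt" -lt "$max_attempts" ]; do
  # Look for "pod/<name>" in the captured oc debug output
  debugPod=$(grep -oP "(?<=pod/)[^ ]*" "$tmpfile" 2>/dev/null | head -1)
  if [ -z "$debugPod" ]; then
    sleep "$delay"
    # Double the delay on each attempt, capped at max_delay
    delay=$(awk -v d="$delay" -v m="$max_delay" 'BEGIN { d *= 2; if (d > m) d = m; print d }')
    attempt=$((attempt + 1))
  fi
done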

@anuragthehatter (Author)

cc @TrilokGeer PTAL as well

@anuragthehatter changed the title from "Fix race condition in gather_core_dumps pod name retrieval" to "OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval" Dec 18, 2025
@openshift-ci-robot

@anuragthehatter: An error was encountered adding this pull request to the external tracker bugs for bug OCPBUGS-66983 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message. failed to add remote link: failed to add link: No Link Issue Permission for issue 'OCPBUGS-66983'.: request failed. Please analyze the request body for more details. Status code: 403:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

[PR description quoted in full — see above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@anuragthehatter (Author)

/jira refresh

@openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Dec 19, 2025
@openshift-ci-robot

@anuragthehatter: This pull request references Jira Issue OCPBUGS-66983, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

#Start Debug pod in background and capture output to get pod name
local tmpfile=$(mktemp)
oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
local debug_pid=$!


Unused variable

local max_delay=2.0 # Cap the maximum delay

while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
debugPod=$(grep -oP "(?<=pod/)[^ ]*" "$tmpfile" 2>/dev/null | head -1)


POSIX grep does not guarantee the -P option, and it also affects portability across other grep variants. For example, the script fails to run on BSD grep (macOS) during local development. Would it be possible to update this to use POSIX-compliant options?

Author:

Addressed it
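
For reference, a POSIX-friendly extraction of the pod name (without grep -P) could look like this; a sketch only, not necessarily the exact change that landed:

# Sketch: pull "<name>" from a line like "Starting pod/<name> ..." using POSIX sed (BRE),
# avoiding the PCRE-only grep -P flag so BSD/macOS grep is not an issue
debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" | head -1)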

#Copy Core Dumps out of Nodes suppress Stdout
echo "Copying core dumps on node ""$1"""
oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump > /dev/null 2>&1 && PIDS+=($!)
oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump > /dev/null 2>&1


What happens if oc fails to copy the core dump?

Author:

Added error handling for oc cp failures (see the sketch below):

  • Now checks exit status and prints warning if copy fails
  • Provides visibility into failures instead of silently ignoring them
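
A minimal sketch of that kind of check (the warning text and structure are illustrative, not necessarily the PR's final wording):

# Sketch: check the oc cp exit status instead of silently discarding failures
echo "Copying core dumps on node $1"
if ! oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}/$1_core_dump" > /dev/null 2>&1; then
  echo "WARNING: failed to copy core dumps from node $1" >&2
fi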

@TrilokGeer

Thanks for the PR @anuragthehatter, dropped in some reviews, hope it helps. Borrowing more help from @praveencodes @shivprakashmuley @swghosh to get attention on the PR on priority.

@anuragthehatter (Author)

Addressed the above comments and re-ran the script on an 18-node cluster:

anurag@fedora:~$ ./test.sh 
WARNING: Collecting core dumps on ALL linux nodes in your cluster. This could take a long time.
INFO: Waiting for node core dump collection to complete ...
Copying core dumps on node testmg1-tglrx-worker-a-g6whz
pod "testmg1-tglrx-worker-a-g6whz-debug-lkd5v" deleted
Copying core dumps on node testmg1-tglrx-worker-a-k554f
Copying core dumps on node testmg1-tglrx-worker-a-f98h2
Copying core dumps on node testmg1-tglrx-worker-a-tl7lj
Copying core dumps on node testmg1-tglrx-worker-a-wtsgc
Copying core dumps on node testmg1-tglrx-worker-c-nvj64
pod "testmg1-tglrx-worker-a-k554f-debug-vkjvp" deleted
Copying core dumps on node testmg1-tglrx-worker-c-b9kdr
Copying core dumps on node testmg1-tglrx-worker-b-6qlk7
Copying core dumps on node testmg1-tglrx-worker-c-6bj8r
Copying core dumps on node testmg1-tglrx-worker-b-ltsvq
Copying core dumps on node testmg1-tglrx-worker-b-fhzlr
Copying core dumps on node testmg1-tglrx-worker-c-fw8mb
Copying core dumps on node testmg1-tglrx-worker-b-rb84h
pod "testmg1-tglrx-worker-a-f98h2-debug-c7phc" deleted
pod "testmg1-tglrx-worker-a-tl7lj-debug-5nldz" deleted
pod "testmg1-tglrx-worker-a-wtsgc-debug-b6zrh" deleted
Copying core dumps on node testmg1-tglrx-worker-b-fk57v
Copying core dumps on node testmg1-tglrx-worker-c-mcts9
Copying core dumps on node testmg1-tglrx-master-0.us-central1-a.c.openshift-qe.internal
pod "testmg1-tglrx-worker-c-nvj64-debug-knvvq" deleted
pod "testmg1-tglrx-worker-c-b9kdr-debug-gktvw" deleted
pod "testmg1-tglrx-worker-b-6qlk7-debug-ggmfd" deleted
Copying core dumps on node testmg1-tglrx-master-2.us-central1-c.c.openshift-qe.internal
pod "testmg1-tglrx-worker-c-6bj8r-debug-bnxwz" deleted
pod "testmg1-tglrx-worker-c-fw8mb-debug-dtjgq" deleted
pod "testmg1-tglrx-worker-b-ltsvq-debug-crrxq" deleted
pod "testmg1-tglrx-worker-b-fhzlr-debug-52q2v" deleted
pod "testmg1-tglrx-worker-b-rb84h-debug-qh4d6" deleted
Copying core dumps on node testmg1-tglrx-master-1.us-central1-b.c.openshift-qe.internal
pod "testmg1-tglrx-worker-c-mcts9-debug-44g7n" deleted
pod "testmg1-tglrx-master-0us-central1-acopenshift-qeinternal-debug-4rfgd" deleted
pod "testmg1-tglrx-master-2us-central1-ccopenshift-qeinternal-debug-9f2pz" deleted
pod "testmg1-tglrx-worker-b-fk57v-debug-4snhd" deleted
pod "testmg1-tglrx-master-1us-central1-bcopenshift-qeinternal-debug-xp8zz" deleted
INFO: Node core dump collection to complete.
anurag@fedora:~$ tree must-gather/
must-gather/
└── node_core_dumps
    ├── testmg1-tglrx-master-0.us-central1-a.c.openshift-qe.internal_core_dump
    ├── testmg1-tglrx-master-1.us-central1-b.c.openshift-qe.internal_core_dump
    ├── testmg1-tglrx-master-2.us-central1-c.c.openshift-qe.internal_core_dump
    ├── testmg1-tglrx-worker-a-f98h2_core_dump
    ├── testmg1-tglrx-worker-a-g6whz_core_dump
    │   └── core.ovn-northd.0.90a06998f5e846579d89b563f4948faa.3218.1767669799000000.zst     <<<<<<<<<<<<<<<<
    ├── testmg1-tglrx-worker-a-k554f_core_dump
    ├── testmg1-tglrx-worker-a-tl7lj_core_dump
    ├── testmg1-tglrx-worker-a-wtsgc_core_dump
    ├── testmg1-tglrx-worker-b-6qlk7_core_dump
    ├── testmg1-tglrx-worker-b-fhzlr_core_dump
    ├── testmg1-tglrx-worker-b-fk57v_core_dump
    ├── testmg1-tglrx-worker-b-ltsvq_core_dump
    ├── testmg1-tglrx-worker-b-rb84h_core_dump
    ├── testmg1-tglrx-worker-c-6bj8r_core_dump
    ├── testmg1-tglrx-worker-c-b9kdr_core_dump
    ├── testmg1-tglrx-worker-c-fw8mb_core_dump
    ├── testmg1-tglrx-worker-c-mcts9_core_dump
    └── testmg1-tglrx-worker-c-nvj64_core_dump

20 directories, 1 file

@TrilokGeer PTAL again. Thanks

openshift-ci bot commented Jan 6, 2026

@anuragthehatter: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/okd-scos-images
Commit: 8d1bdfa
Details: link
Required: true
Rerun command: /test okd-scos-images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

