-
Notifications
You must be signed in to change notification settings - Fork 216
OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval #517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: anuragthehatter The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@sferich888 @chrisdolphy can you help review? |
|
/assign @sferich888 if i may. Thnaks |
sferich888
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we merge this it is likely to break some of our 'threading' that backgrounds processes and tracks their completion.
collection-scripts/gather_core_dumps
Outdated
|
|
||
| #Mimic a normal oc call, i.e pause between two successive calls to allow pod to register | ||
| #Wait for the debug pod to be created and extract its name | ||
| sleep 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be a sleep but a retry loop checking if the pod was started; using an exponential backoff (for the check); with a limited number of attempts 3-5; 10 max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I thought to keep it untouched from original code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to exponential backoff retry logic
|
cc @TrilokGeer PTAL as well |
|
@anuragthehatter: An error was encountered adding this pull request to the external tracker bugs for bug OCPBUGS-66983 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details. Full error message.
failed to add remote link: failed to add link: No Link Issue Permission for issue 'OCPBUGS-66983'.: request failed. Please analyze the request body for more details. Status code: 403:
Please contact an administrator to resolve this issue, then request a bug refresh with DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@anuragthehatter: This pull request references Jira Issue OCPBUGS-66983, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
collection-scripts/gather_core_dumps
Outdated
| #Start Debug pod in background and capture output to get pod name | ||
| local tmpfile=$(mktemp) | ||
| oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 & | ||
| local debug_pid=$! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused variable
collection-scripts/gather_core_dumps
Outdated
| local max_delay=2.0 # Cap the maximum delay | ||
|
|
||
| while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do | ||
| debugPod=$(grep -oP "(?<=pod/)[^ ]*" "$tmpfile" 2>/dev/null | head -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
POSIX grep does not gurantee -P option and also, it is affects portability for other variants. For example, the script fails to run BSD/Linux (mac os) for local development. Would it be possible to update to accomodate the posix-compliant options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed it
collection-scripts/gather_core_dumps
Outdated
| #Copy Core Dumps out of Nodes suppress Stdout | ||
| echo "Copying core dumps on node ""$1""" | ||
| oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump > /dev/null 2>&1 && PIDS+=($!) | ||
| oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump > /dev/null 2>&1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the oc fails to copy the core dump?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added error handling for oc cp failures
- Now checks exit status and prints warning if copy fails
- Provides visibility into failures instead of silently ignoring them
|
Thanks for the PR @anuragthehatter, dropped in some reviews, hope it helps. Borrowing more help from @praveencodes @shivprakashmuley @swghosh to get attention on the PR on priority. |
|
Addressed above comments and re-ran script on an 18 node cluster @TrilokGeer PTAL again .Thanks |
|
@anuragthehatter: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Current logic found failed on one of prow CI run where node indeed had a core dump present but dump was not collected by
oc adm must-gather -- gather_core_dumpscommand underneath a prow chain. It seems we may have missed core dumps collection since long time due to this race issue.Failed logs on prod builds here
Saw a race condition where dump was tried to be copied but debug pod was already removed.
Analysis:
Old approach (main branch):
debugPod=$(oc debug --to-namespace="default" node/"$1" -o jsonpath='{.metadata.name}')
oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' > /dev/null 2>&1 &
sleep 2
oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s
New approach:
Point to note:
FAIL_ON_CORE_DUMP: "true"under prow CI workflows and they seemed to be passing as dumps were never collected resulted in successful prow CI flows executions. The flow logic seems ok but seems like oc adm must-gather never collected dumps if there were any?Reproduced error locally with help of a simulated core dump on an OCP cluster node
using default must-gather image
Ran test image on same cluster