
SRE-3755 ci: Add NFS mount retry to handle transient server readiness #18244

Merged: ryon-jensen merged 1 commit into release/2.8 from ryon-jensen/2.8/SRE-3775_mount_nfs on Jun 12, 2026

@ryon-jensen ryon-jensen commented May 13, 2026

Contributor

The NFS mount in test_main_prep_node.sh can fail with "access denied" when the NFS server on FIRST_NODE hasn't fully registered its exports before client nodes attempt to mount. This is a race between setup_nfs.sh completing on the server and clush launching test_main_prep_node.sh on all nodes simultaneously.

Add a retry loop (3 attempts, 5s apart) around the mount call, and on final failure print showmount/getent diagnostics to aid debugging. Also tighten the /proc/mounts grep to avoid false substring matches.
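The approach described above can be sketched roughly as follows. This is not the code from test_main_prep_node.sh: the function names (`is_mounted`, `mount_nfs_with_retry`), the argument order, and the choice of `awk` for the exact-match check are assumptions made for illustration. Only the retry count (3), the delay (5s), and the use of showmount/getent for diagnostics come from the description.

```shell
#!/bin/bash
# Hypothetical sketch; names and arguments are illustrative only.

# Succeed only if $1 appears as a mount point (field 2) in the mounts file.
# An exact field comparison avoids substring hits such as /mnt/share
# matching /mnt/share2.
is_mounted() {
    local mount_point="$1"
    local mounts_file="${2:-/proc/mounts}"
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' \
        "$mounts_file"
}

# Try the NFS mount up to $4 times (default 3), $5 seconds apart (default 5).
mount_nfs_with_retry() {
    local server="$1"
    local export_path="$2"
    local mount_point="$3"
    local attempts="${4:-3}"
    local delay="${5:-5}"
    local i=1

    # Nothing to do if the mount point is already mounted.
    is_mounted "$mount_point" && return 0

    while [ "$i" -le "$attempts" ]; do
        if mount -t nfs "${server}:${export_path}" "$mount_point"; then
            return 0
        fi
        echo "NFS mount attempt ${i}/${attempts} failed" >&2
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done

    # Final failure: show what the server exports and how its name resolves.
    showmount -e "$server" >&2 || true
    getent hosts "$server" >&2 || true
    return 1
}
```

With the defaults, a server that needs a few seconds to register its exports is given up to two further chances, about 10 seconds in total, before the step fails.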

backport of … #18118

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions


Error: unable to load ticket data for
https://daosio.atlassian.net/browse/SRE-3755

…#18118)

The NFS mount in test_main_prep_node.sh can fail with "access denied"
when the NFS server on FIRST_NODE hasn't fully registered its exports
before client nodes attempt to mount. This is a race between
setup_nfs.sh completing on the server and clush launching
test_main_prep_node.sh on all nodes simultaneously.

Add a retry loop (3 attempts, 5s apart) around the mount call, and
on final failure print showmount/getent diagnostics to aid debugging.
Also tighten the /proc/mounts grep to avoid false substring matches.

Signed-off-by: Ryon Jensen <ryon.jensen@hpe.com>
@ryon-jensen ryon-jensen force-pushed the ryon-jensen/2.8/SRE-3775_mount_nfs branch from 17cc3ff to 039669c Compare June 4, 2026 00:11
@daosbuild3

Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18244/2/testReport/

@JohnMalmberg JohnMalmberg left a comment

Contributor

We will probably want to use a global "retry" function in the future like we do for post provisioning, but we can optimize later...
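For illustration, a shared helper of the kind suggested here might look like the sketch below. It is not the existing post-provisioning function, which this thread does not show; the name `retry` and its argument order are assumptions.

```shell
#!/bin/bash
# Hypothetical generic helper; name and signature are illustrative only.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local attempts="$1"
    local delay="$2"
    local i=1
    shift 2

    while true; do
        # Run the command exactly as given; stop on the first success.
        "$@" && return 0
        if [ "$i" -ge "$attempts" ]; then
            echo "retry: '$*' failed after ${attempts} attempts" >&2
            return 1
        fi
        echo "retry: attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
        sleep "$delay"
        i=$((i + 1))
    done
}

# Example call (the export path and mount point are placeholders):
#   retry 3 5 mount -t nfs "${FIRST_NODE}:/export/share" /mnt/share
```

A helper like this keeps the attempt count and delay at the call site, so each caller can tune them without duplicating the loop.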

@ryon-jensen ryon-jensen requested a review from a team June 8, 2026 21:35
@daltonbohning daltonbohning added the approved-to-merge (PR has received release branch merge approval) and forced-landing (The PR has known failures or has intentionally reduced testing, but should still be landed.) labels Jun 8, 2026
@ryon-jensen

Contributor Author

ping @daos-stack/daos-gatekeeper - can we land this one?

@ryon-jensen ryon-jensen merged commit 4529b62 into release/2.8 Jun 12, 2026
63 of 65 checks passed
@ryon-jensen ryon-jensen deleted the ryon-jensen/2.8/SRE-3775_mount_nfs branch June 12, 2026 12:54

Labels

approved-to-merge: PR has received release branch merge approval
forced-landing: The PR has known failures or has intentionally reduced testing, but should still be landed.

5 participants