Chef recipes force lockd to port 32768, which leads to occasional NFS mount failures. #6949

@gwolski

APC (AWS ParallelCluster) 3.11.2 and 3.13.2
I've been struggling with lockd failures when NFS-mounting disks at boot time on my ParallelCluster machines. Roughly one machine in a few hundred fails to mount its NFS v3 disks.

I get the following error messages in /var/log/messages:

Aug 25 10:31:49 ip-10-6-12-13 kernel: lockd_up: makesock failed, error=-98
Aug 25 10:31:50 ip-10-6-12-13 92_tsi_mount_disks_withdiags.sh[4215]: mount.nfs: mount(2): Address already in use
Aug 25 10:31:50 ip-10-6-12-13 92_tsi_mount_disks_withdiags.sh[4215]: mount.nfs: Address already in use

Some Chef recipe, which I cannot find on my AMI, is forcing lockd to use port 32768.
You can see that in /etc/nfs.conf:

cat /etc/nfs.conf
# Generated by Chef for [hostname redacted]
# Local modifications will be overwritten.
#
# This is a general configuration for the
# NFS daemons and tools
[general]
pipefs-directory=/var/lib/nfs/rpc_pipefs

[gssd]
use-gss-proxy=1

[lockd]
port=32768
udp-port=32768

The port is also set in /etc/modprobe.d/options_lockd.conf and in two files under /etc/sysctl.d/:
99-chef-fs.nfs.nlm_tcpport.conf:fs.nfs.nlm_tcpport = 32768
99-chef-fs.nfs.nlm_udpport.conf:fs.nfs.nlm_udpport = 32768
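
For anyone checking their own AMI, here is a rough sketch that lists every place the static port shows up. The function name `find_lockd_pins` and its optional root-prefix argument are my own invention for illustration. The `port=` pattern is loose, so it may also match `port=` lines in other sections of nfs.conf.

```shell
#!/usr/bin/env bash
# List lines that may pin lockd to a static port.
# $1 is an optional (hypothetical) root prefix, so the function can be
# pointed at a copy of the config files instead of the live system.
find_lockd_pins() {
    local root="${1:-}"
    grep -HnE 'nlm_(tcp|udp)port|^(udp-)?port[[:space:]]*=' \
        "$root/etc/nfs.conf" \
        "$root"/etc/modprobe.d/*.conf \
        "$root"/etc/sysctl.d/*.conf 2>/dev/null
}

# Scan the live system; no match is not an error here.
find_lockd_pins || true

# Values the running kernel is using right now (0 means dynamic).
# These sysctls only exist while the lockd module is loaded.
sysctl fs.nfs.nlm_tcpport fs.nfs.nlm_udpport 2>/dev/null || true
```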

Red Hat has an article on this problem: https://access.redhat.com/solutions/2200331. It says it applies to RHEL 7, so it is old, but it was updated in 2024 and smells exactly like my problem.

I have not been able to figure out what is actually holding port 32768, so I have modified my machines' boot script to switch everything to dynamic port assignment, which is the default RHEL/Rocky behavior:

# Make sure lockd uses a dynamic port to avoid address collisions.
# Keep a one-time backup of the original module options.
[[ ! -e /etc/modprobe.d/options_lockd.conf.fcs ]] && sudo cp /etc/modprobe.d/options_lockd.conf /etc/modprobe.d/options_lockd.conf.fcs
# Comment out the static ports in the [lockd] section of nfs.conf.
sudo sed -i '/^\[lockd\]/,/^$/s/^port=/#port=/' /etc/nfs.conf
sudo sed -i '/^\[lockd\]/,/^$/s/^udp-port=/#udp-port=/' /etc/nfs.conf
# Set the module options to 0 (dynamic).
sudo sed -i -e 's/nlm_tcpport=[0-9]*/nlm_tcpport=0/g' -e 's/nlm_udpport=[0-9]*/nlm_udpport=0/g' /etc/modprobe.d/options_lockd.conf
# Apply the sysctl changes to the running kernel.
sudo sysctl fs.nfs.nlm_tcpport=0
sudo sysctl fs.nfs.nlm_udpport=0
# Reload lockd so it picks up the new module options.
sudo modprobe -v -r lockd
sudo modprobe -v lockd
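
On the open question of what grabs port 32768: one possibility, which the logs above do not confirm, is that 32768 sits inside the kernel's ephemeral port range (on many Linux systems the default range is 32768 to 60999), so any outgoing connection could land on it before lockd starts. The sketch below checks that and lists any socket currently bound to the port. The helper name `port_in_range` is made up for illustration.

```shell
#!/usr/bin/env bash
# Succeed if PORT lies inside the inclusive range [LOW, HIGH].
port_in_range() {
    local port="$1" low="$2" high="$3"
    [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}

# The kernel's ephemeral (local) port range, e.g. "32768 60999".
range_file=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$range_file" ] && read -r low high < "$range_file"; then
    if port_in_range 32768 "$low" "$high"; then
        echo "32768 is inside the ephemeral range $low-$high"
    else
        echo "32768 is outside the ephemeral range $low-$high"
    fi
fi

# Show any TCP/UDP socket currently using local port 32768.
# Process names need root; kernel-owned sockets show no process.
ss -tuanp 'sport = :32768' 2>/dev/null || true
```

Running this on a machine right after a failed mount would be the most useful, since the socket holding the port may be short-lived.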

I would ask that you remove this static port assignment.

I am running with this new setup and will report back on whether it works.
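
To check that the change took effect, something like the following can be run after an NFS v3 mount has brought lockd up. It is only a sketch: it assumes `rpcinfo` is installed and that lockd has registered with rpcbind under the service name `nlockmgr`. The helper name `nlockmgr_ports` is hypothetical.

```shell
#!/usr/bin/env bash
# Read `rpcinfo -p` output on stdin and print the distinct ports
# registered for nlockmgr (columns: program vers proto port service).
nlockmgr_ports() {
    awk '$5 == "nlockmgr" { print $4 }' | sort -un
}

# With dynamic assignment, these should no longer be pinned to 32768.
rpcinfo -p 2>/dev/null | nlockmgr_ports || true

# The configured values should now read 0 (dynamic).
sysctl fs.nfs.nlm_tcpport fs.nfs.nlm_udpport 2>/dev/null || true
```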
