Unable to ping or mount FSx ONTAP file systems occasionally at startup. #7016

@gwolski

APC 3.11.1 and APC 3.13.2 both exhibit this problem. I don't think this is an APC problem, but I'm posting here in case someone recognizes it. I have also raised a ticket with AWS support.

Problem statement: Some of the machines I start each day cannot reach my FSx ONTAP filers, so I'm unable to NFS mount the file systems. I can't even ping the filers, so of course NFS fails.

Background:
I'm using AWS ParallelCluster (APC) 3.11.1 and 3.13.2 to start machines. I start/terminate hundreds of machines per day, both spot and on-demand. I've built a customized AMI on Rocky 8.10, upon which I overlay the ParallelCluster config with pcluster build-image. I am using the Rocky 8.9 marketplace image with updates in my 3.11.1 cluster and the community edition 8.10 in my 3.13.2 instances. This issue has been ongoing since I started using ParallelCluster in Nov 2024. It has just taken me a while to track it down, so I think it has existed since day one.

I'm in us-west-2a, and my availability zone ID is usw2-az2. I have a custom VPC.

Details: The machine starts up just fine. DNS works, networking works, I can update packages on the machine, I can access S3 files, and I can mount NFSv4 exports from the APC head node. The problem is that a small number of machines can neither ping nor mount file systems from my FSx ONTAP filers, nor can they reach key NFS ports on the FSx file servers, e.g. port 2049.

Sometimes they can ping the file server after 10, 20, or 30 seconds; other times they time out after 60 seconds on this initial test in my machine startup code (run from the prolog script). Once the initial ping test fails, a subsequent mount attempt may or may not fail. I have even captured a failing machine: I logged into it, enabled termination protection, and tried to ping the filers. I could not. After some amount of time, while I was poking around, I found I could ping the filer and mount the disks!
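For anyone wanting to reproduce the measurement, here is a minimal sketch of a reachability probe, not the actual prolog code. It retries a TCP connect to the NFS port (2049) until it succeeds or a 60-second deadline passes, and reports how long the filer took to become reachable. The function name, the per-attempt timeout, and the pause between attempts are assumptions made for illustration.

```python
import socket
import time


def wait_for_tcp(host, port=2049, deadline_s=60.0, attempt_timeout_s=3.0, pause_s=2.0):
    """Retry a TCP connect until it succeeds or deadline_s elapses.

    Returns the seconds elapsed before the first successful connect,
    or None if the deadline passed without one.
    (Hypothetical helper; the parameter values are illustrative.)
    """
    start = time.monotonic()
    while True:
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            return None
        try:
            # Cap each attempt so a single hung connect cannot exceed the deadline.
            timeout = min(attempt_timeout_s, remaining)
            with socket.create_connection((host, port), timeout=timeout):
                return time.monotonic() - start
        except OSError:
            # Refused, unreachable, or timed out: pause briefly and retry.
            remaining = deadline_s - (time.monotonic() - start)
            time.sleep(min(pause_s, max(0.0, remaining)))


# Example use (the address is a placeholder, not a real filer):
#   elapsed = wait_for_tcp("198.51.100.10")
#   print("unreachable" if elapsed is None else f"reachable after {elapsed:.1f}s")
```

Logging the returned delay per filer and per instance would show whether the 10/20/30-second recoveries cluster around particular values, which a plain pass/fail ping test hides.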

I've enabled a ton of debugging during machine boot. Routes look good, the VPC is correct, security groups are correct, everything looks fine. Everything else is working, just not access to the FSx filers.

It is multiple filers, not just a single filer that these machines cannot ping at times. I have a "failing" case where I was able to mount from one filer, but then an attempt to mount from a second filer failed.

I have tried a mount entry in /etc/fstab as well as the automounter. It doesn't matter; some mounts will still fail. I wait until well after cloud-init is done before attempting any communication with the filers. By then, network connectivity is well established: I've grabbed startup files from S3, I've updated packages, I've pinged other machines, and I've started slurmd and it has communicated with the head node. The key issue is that these "failing" machines cannot even ping the filers when they fail.

I have turned on VPC flow logging to CloudWatch, yet I can't see anything odd, i.e. no rejections. That said, I have never really used this before; I have just been using Amazon Q for help.
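As a rough aid for reading those logs, the sketch below parses flow log records and keeps only the ones that involve a given filer address. It assumes the records use the default (version 2) flow log format with its 14 space-separated fields; a custom log format would need a different field list. The function names are hypothetical.

```python
# Field order of the default (version 2) VPC flow log record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_flow_record(line):
    """Split one default-format record into a dict, or return None if malformed."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def records_for_filer(lines, filer_ip, action=None):
    """Return parsed records whose source or destination is filer_ip.

    If action is given ("ACCEPT" or "REJECT"), keep only matching records.
    """
    matches = []
    for line in lines:
        rec = parse_flow_record(line)
        if rec is None:
            continue
        if filer_ip not in (rec["srcaddr"], rec["dstaddr"]):
            continue
        if action is not None and rec["action"] != action:
            continue
        matches.append(rec)
    return matches
```

If a failing instance shows ACCEPT records toward the filer (protocol 1 for ICMP, protocol 6 with dstport 2049 for NFS) but no records coming back from the filer during the failure window, that would suggest the traffic is being lost beyond the security group and network ACL checks that flow logs report on.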

I have gone through the CloudWatch logs (as far as I understand things), and I see nothing odd.

I see the security groups are properly applied at the instance and ENI level. I enable all outbound traffic.

I temporarily increased the FSx ONTAP filer bandwidth to 512 MB/s to see if that would fix things. It did not.

I start hundreds of machines a day, and most of them work just fine. It's just some number on the order of 10-50 that fail. On Sep 16, 47 machines failed to reach the filers and the instances were killed. Sometimes they fail in batches, i.e. 5-10 machines fail at once. Sometimes it's just a random one or two. I never really start more than 50 at one time, so I don't think we're overwhelming the filers. And even then, I have put "jitter" code in my startup so I don't hit the filers all at once, i.e. I wait 0-30 seconds, based on the IP address of the instance, before I attempt a ping or NFS connection.
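To make the jitter idea concrete, here is a small sketch of one way to derive a deterministic 0-30 second delay from an instance's IP address. The exact mapping in my startup code is not shown here; the modulo scheme and the function name below are assumptions chosen for illustration.

```python
import ipaddress


def startup_jitter_seconds(ip, max_delay_s=30):
    """Map an IP address to a deterministic delay in [0, max_delay_s] seconds.

    Hypothetical scheme: take the address as an integer and reduce it
    modulo (max_delay_s + 1), so neighbouring addresses get different delays.
    """
    return int(ipaddress.ip_address(ip)) % (max_delay_s + 1)


# Example use in a startup script (address is a placeholder):
#   import time
#   time.sleep(startup_jitter_seconds("10.0.1.5"))
```

Because the delay depends only on the address, a given instance always waits the same amount, which makes failures easier to correlate with start times than a random sleep would.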

It is as if there is some connectivity issue from these failing machines to the FSx ONTAP filers.

All my 100s of other machines that start up in a day work just fine.

Thanks for reading this far, hope someone has some insights.
