@agrawalkhushi18 commented Jan 16, 2026

This PR addresses transient "Aperture devices not found" errors on A3-Mega nodes.
The hardware that manages these devices boots at the same time as the compute node. If the node finishes booting first, it looks for the devices before they are ready, leading to a "not found" error.

A move to udev rules was made to automate mounting, but the system's "readiness check" was too rigid: it failed immediately if the device was even a second late.

These changes improve the reliability of Aperture device initialization:

  • Implements Polling in ExecCondition: Replaces the static directory check with a 60-second polling loop (30×2s) to wait for device readiness.
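
The polling described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name wait_for_devices and its arguments are hypothetical, and treating "the directory exists" as the readiness signal is an assumption. The defaults mirror the 30 attempts, 2 seconds apart, mentioned in the description.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the PR):
#   wait_for_devices DIR [ATTEMPTS] [INTERVAL_SECONDS]
# Returns 0 as soon as DIR exists, or 1 if it never shows up.
wait_for_devices() {
  local dir="$1"
  local attempts="${2:-30}"  # default: 30 checks ...
  local interval="${3:-2}"   # ... 2 seconds apart, about 60 seconds in total
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # Assumption: the device directory existing means the devices are ready.
    if [ -d "$dir" ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  # The directory never appeared within the polling window.
  return 1
}
```

A call such as wait_for_devices /dev/aperture_devices 30 2 would then stand in for the one-shot directory test. When used from ExecCondition, systemd treats an exit code from 1 through 254 as "skip the unit without marking it failed", so returning 1 on timeout skips the service rather than failing it.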

@agrawalkhushi18 added the release-improvements label (added to release notes under the "Improvements" heading) Jan 16, 2026
@gemini-code-assist commented

Summary of Changes

Hello @agrawalkhushi18, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the reliability of Aperture device initialization on A3-Mega nodes by addressing transient "device not found" errors. It ensures that critical hardware components are fully operational before being accessed, preventing premature system failures and enhancing overall system stability.

Highlights

  • NVIDIA Fabric Manager Activation: Explicitly enables the nvidia-fabricmanager service within the Slurm blueprint to ensure proper GPU fabric functionality on A3-Mega nodes.
  • Robust Device Readiness Polling: Replaces the immediate ExecCondition check for /dev/aperture_devices/ with a 60-second polling loop (30 checks every 2 seconds) to gracefully wait for device readiness during node boot-up, preventing premature failures.

@agrawalkhushi18 changed the title from "Fix transient Aperture device initialization with Fabric Manager and polling in A3M slurm yaml" to "Fix transient aperture device error in A3M slurm yaml" Jan 16, 2026
@gemini-code-assist (bot) left a comment

Code Review

This pull request improves the reliability of Aperture device initialization on A3-Mega nodes by enabling the nvidia-fabricmanager service and implementing a polling mechanism to wait for the devices to be ready. The changes are logical and directly address the transient errors described. I've suggested one improvement for maintainability: refactoring the complex inline shell script used for polling into a separate, more readable script file within the Ansible playbook. This also addresses potential parsing issues with command substitution in YAML multiline strings, making the logic easier to understand and manage in the future.

@agrawalkhushi18 marked this pull request as ready for review January 16, 2026 13:55
@agrawalkhushi18 requested review from a team and samskillman as code owners January 16, 2026 13:55
@agrawalkhushi18 added the release-chore (not included in release notes) and release-bugfix (added to release notes under the "Bug fixes" heading) labels and removed the release-improvements label Jan 23, 2026
@agrawalkhushi18 merged commit 22a079a into GoogleCloudPlatform:develop Jan 23, 2026
20 of 80 checks passed
