@sarthakag
Contributor

@sarthakag sarthakag commented Dec 16, 2025

Adds R&R API integration for Slurm

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist
Contributor

Summary of Changes

Hello @sarthakag, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Google Cloud's Repair and Recovery (R&R) functionality directly into the Slurm scheduler. It introduces automated detection of faulty Slurm nodes, initiates repairs via the GCP API, and manages the entire repair lifecycle. This includes tracking repair progress and dynamically updating Slurm node states to reflect successful recovery or persistent failures, thereby enhancing the resilience and self-healing capabilities of Slurm clusters on GCP.
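
The lifecycle described above can be sketched as two small decisions: whether a node qualifies for repair, and which Slurm state to apply once the repair finishes. This is only an illustration, not the code from the PR. The reason prefix `REPAIR:`, the status strings `SUCCEEDED` and `FAILED`, and both function names are assumptions made for the example.

```python
from typing import Optional

# Hypothetical marker; the actual "repair reason" strings used by the PR
# are not shown in this conversation.
REPAIR_REASON_PREFIX = "REPAIR:"


def needs_repair(state: str, reason: str) -> bool:
    """Return True if the node carries the DRAIN flag and a repair reason.

    `state` is assumed to be in the "BASE+FLAG" form, e.g. "IDLE+DRAIN".
    """
    flags = state.upper().split("+")
    has_repair_reason = reason.strip().upper().startswith(REPAIR_REASON_PREFIX)
    return "DRAIN" in flags and has_repair_reason


def next_slurm_state(repair_status: str) -> Optional[str]:
    """Map a repair outcome to the Slurm state to set, or None to keep waiting.

    The status names are assumptions, not values taken from the GCP API.
    """
    status = repair_status.upper()
    if status == "SUCCEEDED":
        return "resume"  # node recovered, return it to service
    if status == "FAILED":
        return "down"    # persistent failure, keep it out of scheduling
    return None          # still in progress, check again on the next sync
```

Keeping both decisions as pure functions makes them easy to unit test without a running Slurm controller or any GCP access.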

Highlights

  • New Repair Module: A new Python script, repair.py, has been introduced to manage the lifecycle of node repair operations, including initiating repairs and tracking their status.
  • GCP R&R API Integration: The repair.py script invokes the gcloud compute instances report-host-as-faulty command to trigger repair actions on Google Cloud instances.
  • Slurm Node State Management: The slurmsync.py script is enhanced to detect Slurm nodes in a DRAIN state with specific repair reasons, trigger the R&R process, and manage their state (down, resume) based on the repair outcomes.
  • Persistent Repair Tracking: Repair operations are now tracked persistently in a JSON file (repair_operations.json) to maintain state across multiple runs of the slurmsync process.
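
A minimal sketch of how persistent tracking in a JSON file might look, assuming a simple mapping from node name to an operation record. The function names and the record layout are hypothetical; only the file name repair_operations.json comes from the summary above.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict


def load_operations(path: Path) -> Dict[str, dict]:
    """Load tracked repair operations; a missing or corrupt file yields {}."""
    try:
        with path.open() as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def store_operations(path: Path, operations: Dict[str, dict]) -> None:
    """Write operations atomically so a crash never leaves a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(operations, f, indent=2, sort_keys=True)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing to a temporary file in the same directory and then renaming it is one common way to keep state intact across repeated runs. Treating a corrupt file as empty is a design choice for this sketch; a real implementation might prefer to log the problem and stop.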

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an integration with a Repair and Recovery (R&R) mechanism for Slurm nodes on GCP. A new repair.py script is added to handle the lifecycle of a node repair, from calling the GCP API to report a faulty host to monitoring the operation and bringing the node back online. The slurmsync.py script is modified to detect nodes that need repair based on their DRAIN state and reason, and to trigger the repair process. The changes look good overall, but I have a few suggestions for improving the robustness and maintainability of the new repair.py script, mainly around file handling, exception handling, and command-line argument safety. Additionally, the new functionality is not covered by unit tests, although the project's contribution guidelines ask for tests that cover code changes.
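
To illustrate the point about command-line argument safety, here is a hedged sketch that builds the gcloud invocation as an argument list and validates inputs first. It is not the PR's code. The function name, the validation regex, and the set of behavior values are assumptions, and the flag names should be checked against the gcloud reference before use.

```python
import re
from typing import List

# Assumed shape of valid instance and zone names: lowercase letters, digits
# and hyphens, starting with a letter. This also rejects values that begin
# with "-" and could be mistaken for an option.
_NAME_RE = re.compile(r"[a-z]([-a-z0-9]{0,61}[a-z0-9])?")

# Assumed fault behaviors; verify against the gcloud documentation.
_ALLOWED_BEHAVIORS = {
    "PERFORMANCE",
    "SILENT_DATA_CORRUPTION",
    "UNRECOVERABLE_GPU_ERROR",
}


def build_report_faulty_cmd(instance: str, zone: str, behavior: str) -> List[str]:
    """Return an argv list for reporting a faulty host, or raise ValueError."""
    for label, value in (("instance", instance), ("zone", zone)):
        if not _NAME_RE.fullmatch(value):
            raise ValueError(f"invalid {label}: {value!r}")
    if behavior not in _ALLOWED_BEHAVIORS:
        raise ValueError(f"unsupported behavior: {behavior!r}")
    # A list (rather than a shell string) means no shell ever parses the
    # values, so quoting and injection problems do not arise.
    return [
        "gcloud", "compute", "instances", "report-host-as-faulty",
        instance,
        f"--zone={zone}",
        "--disruption-schedule=IMMEDIATE",  # assumed flag and value
        f"--fault-reasons=behavior={behavior}",  # assumed flag format
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True, capture_output=True, text=True, timeout=...)`, catching `subprocess.CalledProcessError` and `subprocess.TimeoutExpired` specifically instead of a bare `except`, which also addresses the exception-handling concern.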

@sarthakag sarthakag force-pushed the rrslurmm branch 2 times, most recently from 6f334c0 to 13fe9fd December 16, 2025 13:51
@sarthakag sarthakag force-pushed the rrslurmm branch 2 times, most recently from 052ccce to 3740b21 January 5, 2026 04:11
@sarthakag sarthakag marked this pull request as ready for review January 5, 2026 04:13
@sarthakag sarthakag requested review from a team and samskillman as code owners January 5, 2026 04:13
@sarthakag sarthakag added the release-improvements Added to release notes under the "Improvements" heading. label Jan 5, 2026
@sarthakag sarthakag force-pushed the rrslurmm branch 2 times, most recently from 6489ee6 to e6e4ea3 January 5, 2026 06:08
Contributor

@Neelabh94 Neelabh94 left a comment

Some changes requested

Collaborator

@bytetwin bytetwin left a comment

Please address the comments

@sarthakag
Contributor Author

Addressed all comments. PTAL @bytetwin @Neelabh94

bytetwin previously approved these changes Jan 21, 2026
@sarthakag sarthakag enabled auto-merge (squash) January 23, 2026 06:15
@sarthakag sarthakag merged commit 59f4df2 into GoogleCloudPlatform:develop Jan 23, 2026
37 of 73 checks passed
Labels

release-improvements Added to release notes under the "Improvements" heading.

4 participants