Support for TPUv5 and v6e #4109

Open: wants to merge 5 commits into base branch develop

Conversation

Contributor

@wiktorn wiktorn commented May 12, 2025

Add support for TPU v5e and v6e.

  • add support for startup scripts on TPU nodes
  • breaking change: rename tf_version to runtime_version
  • add support for spot TPU nodes
  • reduce the number of TPU API calls made by slurmsync.py
  • include the startup script in slurm-gcp-devel.zip; otherwise only the version from the image is used
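
For readers updating existing settings, the breaking rename above can be illustrated with a minimal sketch. Only the key names `tf_version` and `runtime_version` come from this description; the helper `migrate_tpu_settings` and the dictionary shape are hypothetical and are not part of this PR.

```python
def migrate_tpu_settings(settings: dict) -> dict:
    """Return a copy of `settings` with tf_version renamed to runtime_version.

    Hypothetical helper for illustration; the PR itself does not ship it.
    """
    migrated = dict(settings)  # shallow copy so the caller's dict is untouched
    if "tf_version" in migrated:
        if "runtime_version" in migrated:
            # Refuse to guess which of the two values should win.
            raise ValueError("both tf_version and runtime_version are set")
        migrated["runtime_version"] = migrated.pop("tf_version")
    return migrated


# The value below is a placeholder, not a real TPU runtime version.
old = {"node_type": "example-type", "tf_version": "example-runtime"}
new = migrate_tpu_settings(old)
```

Raising when both keys are present is a design choice in this sketch: a silent preference for either key could hide a stale value after the rename.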

TODO:

  • align on the naming of Docker containers in the public repository

Linked PR: GoogleCloudPlatform/slurm-gcp#265
Obsoletes: #3927

@wiktorn wiktorn requested review from samskillman and a team as code owners May 12, 2025 13:34
@wiktorn wiktorn added the release-breaking-changes (prevents "smooth" re-deploy across versions) and enhancement (new feature or request) labels May 12, 2025
@wiktorn wiktorn force-pushed the tpu-v5-6-pr branch 2 times, most recently from dcfcf4d to fb65009 Compare May 12, 2025 14:03
@@ -7,7 +7,7 @@ google-cloud-bigquery==3.11.3
 google-cloud-core==2.3.3
 google-cloud-secret-manager~=2.22
 google-cloud-storage==2.10.0
-google-cloud-tpu==1.10.0
+google-cloud-tpu==1.21.0
Collaborator

Sadly, that will not be enough: we need to change requirements.txt in the slurm-gcp repo and roll an image first.

Contributor Author

This was bumped already, though I need to sync the versions and pin 1.23.0 here.

Contributor Author

Just a note: the slurm-gcp repo uses lower versions of the following packages:

requests==2.31.0
PyYAML==6.0.1
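
A pin mismatch like the one noted above could be spotted mechanically. The following is a rough sketch, assuming both files contain simple `name==version` or `name~=version` lines; the function names are hypothetical, and the "ours" versions in the example are made up for illustration (only the slurm-gcp pins are taken from this comment).

```python
import re

# Matches simple pins such as "requests==2.31.0" or "pkg~=2.22".
PIN_RE = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*(==|~=)\s*([^\s#;]+)")


def parse_pins(text: str) -> dict:
    """Map lower-cased package name -> (operator, version) for each pinned line."""
    pins = {}
    for line in text.splitlines():
        match = PIN_RE.match(line)
        if match:  # comments, blank lines and options are skipped
            pins[match.group(1).lower()] = (match.group(2), match.group(3))
    return pins


def diff_pins(ours: str, theirs: str) -> dict:
    """Return packages pinned in both files whose pins differ."""
    a, b = parse_pins(ours), parse_pins(theirs)
    return {
        name: (a[name], b[name])
        for name in sorted(a.keys() & b.keys())
        if a[name] != b[name]
    }


ours = "requests==2.99.0\nPyYAML==6.0.1\n"       # made-up local pins
theirs = "requests==2.31.0\nPyYAML==6.0.1\n"     # pins quoted for slurm-gcp
mismatches = diff_pins(ours, theirs)
```

The comparison is purely textual; it does not evaluate whether two different specifiers would resolve to the same installed version.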

@mr0re1 mr0re1 self-assigned this May 16, 2025
wiktorn added 5 commits May 16, 2025 22:01
Bumped versions to the versions defined in slurm-gcp repository.

Did not downgrade:
requests==2.31.0
PyYAML==6.0.1
* use separate startup script for TPU nodes
* fix container references in README
* document list_nodes and fix caching of multi-page responses
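
The last commit mentions fixing the caching of multi-page responses. The PR's actual `list_nodes` code is not shown here, so the following is only a sketch of the general pattern: drain every page before storing the result, so that the cache never holds just the first page. `fetch_all_pages`, `NodeListCache`, and the `fetch_page` callback are hypothetical names, and the fake backend stands in for a real paginated API.

```python
from typing import Callable, List, Optional, Tuple

# A page fetcher takes a page token (None for the first page) and returns
# the page's items together with the token of the next page (None when done).
PageFetcher = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def fetch_all_pages(fetch_page: PageFetcher) -> List[str]:
    """Follow next-page tokens until they run out and return every item."""
    items: List[str] = []
    token: Optional[str] = None
    while True:
        page_items, token = fetch_page(token)
        items.extend(page_items)
        if not token:
            return items


class NodeListCache:
    """Caches the fully materialized node list, not a single page."""

    def __init__(self, fetch_page: PageFetcher) -> None:
        self._fetch_page = fetch_page
        self._nodes: Optional[List[str]] = None

    def list_nodes(self) -> List[str]:
        if self._nodes is None:
            self._nodes = fetch_all_pages(self._fetch_page)
        return list(self._nodes)  # copy so callers cannot mutate the cache


# Fake two-page backend used only to exercise the sketch.
calls = []
pages = {None: (["node-a", "node-b"], "t1"), "t1": (["node-c"], None)}


def fake_fetch(token):
    calls.append(token)
    return pages[token]


cache = NodeListCache(fake_fetch)
first = cache.list_nodes()
second = cache.list_nodes()
```

With this shape, the second call to `list_nodes` is served from memory, which also matches the stated goal of reducing the number of API calls.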