Commit 1f86318

Add filestore integration to TPU v6e as a storage support (#5058)
1 parent 06e0f44 commit 1f86318

2 files changed, +88 -4 lines changed

examples/gke-tpu-v6/README.md

Lines changed: 44 additions & 3 deletions
@@ -96,7 +96,9 @@ This repository also includes an advanced blueprint, `gke-tpu-v6-advanced.yaml`,
 * **Dedicated Service Accounts** for nodes and workloads, following security best practices.
 * **Automatic creation of two GCS buckets** for training data and checkpoints.
 * **Performance-tuned GCS FUSE mounts** pre-configured in the cluster as Persistent Volumes.
-* **Optional High-Performance Storage: [Managed Lustre](https://cloud.google.com/managed-lustre/docs/overview)** for high-performance, fully managed parallel file system optimized for heavy AI and HPC workloads. For details of configuring Managed Lustre, please refer to the [appendix](#understanding-managed-lustre-integration)
+* **Optional** High-Performance Storage: [Managed Lustre](https://cloud.google.com/managed-lustre/docs/overview) for high-performance, fully managed parallel file system optimized for heavy AI and HPC workloads. For details of configuring Managed Lustre, please refer to the [appendix](#understanding-managed-lustre-integration)
+* **Optional** High-Performance Storage: [Hyperdisk Balanced](https://docs.cloud.google.com/compute/docs/disks/hyperdisks) support for highly available and consistent performance across GKE nodes. For details of configuring Hyperdisk Balanced, please refer to the [appendix](#understanding-hyperdisk-balanced-integration).
+* **Optional** Shared File Storage: [Filestore](https://docs.cloud.google.com/filestore/docs/overview) for managed NFS capabilities allowing multiple TPU hosts to share logs, code, or datasets. For details, refer to the [appendix](#understanding-filestore-integration).

 ### Deploying the Advanced Blueprint

@@ -141,7 +143,7 @@ The [tpu-multislice.yaml](https://github.com/GoogleCloudPlatform/cluster-toolkit
 1. Connect to your cluster:

 ```sh
-gcloud container clusters get-credentials gke-tpu-v6 --region=REGION --project_id=PROJECT_ID
+gcloud container clusters get-credentials gke-tpu-v6 --region=REGION --project=PROJECT_ID
 ```

 Replace the `REGION` and `PROJECT_ID` with the ones used in the blueprint.
@@ -286,7 +288,7 @@ After making these changes, run the `gcluster deploy` command as usual.
 1. Connect to your cluster:

 ```sh
-gcloud container clusters get-credentials DEPLOYMENT_NAME --region=REGION --project_id=PROJECT_ID
+gcloud container clusters get-credentials DEPLOYMENT_NAME --region=REGION --project=PROJECT_ID
 ```

 Replace the `DEPLOYMENT_NAME`, `REGION` and `PROJECT_ID` with the ones used in the blueprint.
@@ -312,3 +314,42 @@ After making these changes, run the `gcluster deploy` command as usual.
 ```

 The pod logs confirm that the disk is mounted successfully and show the results of a mixed I/O test that validates the disk's provisioned performance.
+
+### Understanding Filestore integration
+
+To enable Filestore integration, perform the following steps before deploying:
+
+1. In the `gke-tpu-v6-cluster` module settings, ensure `enable_filestore_csi: true` is set.
+2. Find the section commented `--- FILESTORE ADDITIONS ---`. Uncomment the following modules:
+* `filestore`: Provisions the Filestore instance and specifies the `local_mount` point.
+* `shared-filestore-pv`: Creates the Kubernetes Persistent Volume and Claim.
+* `shared-fs-job`: (Optional) A test job template to verify multi-node shared writing.
+
+#### Testing the Shared Filestore Mount
+The blueprint includes a sample job (`shared-fs-job`) that demonstrates how two different pods can write to and read from the same file simultaneously.
+
+1. Connect to your cluster:
+
+```sh
+gcloud container clusters get-credentials DEPLOYMENT_NAME --region=REGION --project=PROJECT_ID
+```
+
+Replace the `DEPLOYMENT_NAME`, `REGION` and `PROJECT_ID` with the ones used in the blueprint.
+
+2. Apply the Filestore test manifest, whose path is provided in the final deployment instructions:
+
+```sh
+kubectl apply -f <path/to/shared-fs-job.yaml>
+```
+
+3. Verify the shared output. Once the pods are running, check the logs of the first pod to see it reading data written by the second pod:
+
+```sh
+# Get pod names
+kubectl get pods
+
+# Check logs for the first pod
+kubectl logs <pod-name-0>
+```
+
+The logs will display content from `shared_output.txt`, showing timestamps and hostnames from both pods, confirming that the filesystem is truly shared.
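
One way to check the logs for evidence of sharing is to count how many distinct pod hostnames appear in the `writing data` lines. The following is a minimal sketch and is not part of the commit: the helper name `count_writers` is hypothetical, and it assumes the log lines follow the `Pod <hostname> writing data at ...` format produced by the `shared-fs-job` script.

```sh
# count_writers (hypothetical helper): read pod log text on stdin and print
# the number of distinct hostnames found in "Pod <name> writing ..." lines.
count_writers() {
  awk '$1 == "Pod" && $3 == "writing" { seen[$2] = 1 }
       END { n = 0; for (h in seen) n++; print n }'
}

# Usage against the first pod of the test job (replace the placeholder):
# kubectl logs <pod-name-0> | count_writers
```

If a single pod's logs yield a count of 2, that pod has read a line written by the other pod, which is what a shared mount should produce.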

examples/gke-tpu-v6/gke-tpu-v6-advanced.yaml

Lines changed: 44 additions & 1 deletion
@@ -33,7 +33,7 @@ vars:
 num_slices:

 # Machine type
-machine_type:
+machine_type: ct6e-standard-4t

 # The TPU placement topology for pod slice node pool.
 tpu_topology:
@@ -171,6 +171,7 @@ deployment_groups:
 system_node_pool_taints: []
 enable_private_endpoint: false # Allows access from authorized public IPs
 configure_workload_identity_sa: true
+enable_filestore_csi: true
 enable_gcsfuse_csi: true
 enable_managed_lustre_csi: true
 enable_persistent_disk_csi: true # enable Hyperdisk for the cluster
@@ -465,3 +466,45 @@ deployment_groups:
 #     node_count: 1

 #   outputs: [instructions]
+
+# # --- FILESTORE ADDITIONS ---
+# - id: filestore
+#   source: modules/file-system/filestore
+#   use: [gke-tpu-v6-net-0]
+#   settings: {local_mount: /mnt/v6e-filestore}
+
+# - id: shared-filestore-pv
+#   source: modules/file-system/gke-persistent-volume
+#   use: [gke-tpu-v6-cluster, filestore]
+
+# # Shared Filestore Job
+# - id: shared-fs-job
+#   source: modules/compute/gke-job-template
+#   use:
+#   - gke-tpu-v6-cluster
+#   - gke-tpu-v6-pool
+#   - shared-filestore-pv
+#   settings:
+#     image: alpine/git:latest
+#     command:
+#     - sh
+#     - -c
+#     - |
+#       echo "Pod ${HOSTNAME} is starting..."
+#       SHARED_DIR=/mnt/v6e-filestore
+#       mkdir -p $SHARED_DIR
+#       cd $SHARED_DIR
+#
+#       # Simulate some work and write to a shared file
+#       cmd="date +%s"
+#       TIMESTAMP=`$cmd`
+#       echo "Pod ${HOSTNAME} writing data at $$TIMESTAMP" >> shared_output.txt
+#       sleep 5
+#       echo "Displaying content of shared_output.txt:"
+#       echo "---"
+#       cat shared_output.txt # Read the content to show it's shared
+#       echo "---"
+#       sleep 20
+#       echo "Pod ${HOSTNAME} finished."
+#     node_count: 2
+#   outputs: [instructions]
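
The shared-write pattern that the job exercises can be tried locally before deploying. This sketch is an illustration and is not part of the commit: it assumes a temporary directory standing in for the `/mnt/v6e-filestore` mount and two background shell processes standing in for the two pods. The names `writer`, `pod-0`, and `pod-1` are hypothetical.

```sh
# Temporary directory standing in for the shared Filestore mount (assumption).
SHARED_DIR=$(mktemp -d)

# writer NAME: append one line to the shared file, as each pod does in the job.
writer() {
  echo "Pod $1 writing data at $(date +%s)" >> "$SHARED_DIR/shared_output.txt"
}

# Run two writers concurrently, then wait for both to finish.
writer pod-0 &
writer pod-1 &
wait

# Either process can now read both lines back from the same file.
cat "$SHARED_DIR/shared_output.txt"
```

On the cluster, the same effect depends on both pods mounting the Persistent Volume created by `shared-filestore-pv`; the local run only shows the append-then-read pattern.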
