Commit 06e0f44

Add filestore as a storage support option for TPU 7x (#5059)
1 parent 0bbe6e5 commit 06e0f44

2 files changed: +85 -2 lines changed

examples/gke-tpu-7x/README.md

Lines changed: 42 additions & 2 deletions
@@ -114,6 +114,7 @@ This repository also includes an advanced blueprint, `gke-tpu-7x-advanced.yaml`,
 - Performance-tuned GCS FUSE mounts pre-configured in the cluster as Persistent Volumes.
 - **Optional** High-Performance Storage: [Hyperdisk Balanced](https://docs.cloud.google.com/compute/docs/disks/hyperdisks) support for highly available and consistent performance across GKE nodes. For details of configuring Hyperdisk Balanced, please refer to the [appendix](#understanding-hyperdisk-balanced-integration).
 - **Optional** High-Performance Storage: [Managed Lustre](https://cloud.google.com/managed-lustre/docs/overview) for high-performance, fully managed parallel file system optimized for heavy AI and HPC workloads. For details of configuring Managed Lustre, please refer to the [appendix](#understanding-managed-lustre-integration).
+- **Optional** Shared File Storage: [Filestore](https://docs.cloud.google.com/filestore/docs/overview) for managed NFS capabilities allowing multiple TPU hosts to share logs, code, or datasets. For details, refer to the [appendix](#understanding-filestore-integration).
 
 ### Deploying the Advanced Blueprint
 
@@ -295,7 +296,7 @@ Once deployed, the `Lustre` filesystem is available to the cluster as a `Persist
 1. Connect to your cluster:
 
 ```sh
-gcloud container clusters get-credentials DEPLOYMENT_NAME --region=REGION --project_id=PROJECT_ID
+gcloud container clusters get-credentials DEPLOYMENT_NAME --region=REGION --project=PROJECT_ID
 ```
 
 Replace the `DEPLOYMENT_NAME`,`REGION` and `PROJECT_ID` with the ones used in the blueprint.
@@ -366,7 +367,7 @@ After making these changes, run the `gcluster deploy` command as usual.
 1. Connect to your cluster:
 
 ```sh
-gcloud container clusters get-credentials DEPLOYMENT_NAME --region=REGION --project_id=PROJECT_ID
+gcloud container clusters get-credentials DEPLOYMENT_NAME --region=REGION --project=PROJECT_ID
 ```
 
 Replace the `DEPLOYMENT_NAME`,`REGION` and `PROJECT_ID` with the ones used in the blueprint.
@@ -391,3 +392,42 @@ After making these changes, run the `gcluster deploy` command as usual.
 ```
 
 The logs of the pod verifies the disk is mounted successfully and performs a mixed I/O test to validate the disk's provisioned performance.
+
+### Understanding Filestore integration
+
+To enable Filestore integration, perform the following steps before deploying:
+
+1. In the `gke-tpu-7x-cluster` module settings, ensure `enable_filestore_csi: true` is set.
+2. Find the section commented `--- FILESTORE ADDITIONS ---`. Uncomment the following modules:
+- `filestore`: Provisions the Filestore instance and specifies the `local_mount` point.
+- `shared-filestore-pv`: Creates the Kubernetes Persistent Volume and Claim.
+- `shared-fs-job`: (Optional) A test job template to verify multi-node shared writing.
+
+#### Testing the Shared Filestore Mount
+The blueprint includes a sample job (`shared-fs-job`) that demonstrates how two different pods can write to and read from the same file simultaneously.
+
+1. Connect to your cluster:
+
+```sh
+gcloud container clusters get-credentials DEPLOYMENT_NAME --region=REGION --project=PROJECT_ID
+```
+
+Replace the `DEPLOYMENT_NAME`,`REGION` and `PROJECT_ID` with the ones used in the blueprint.
+
+2. Apply the Filestore test manifest,whose path is provided in the final deployment instructions:
+
+```sh
+kubectl apply -f <path/to/shared-fs-job.yaml>
+```
+
+3. Verify the Shared Output: Once the pods are running, check the logs of the first pod to see it reading data written by the second pod:
+
+```sh
+# Get pod names
+kubectl get pods
+
+# Check logs for the first pod
+kubectl logs <pod-name-0>
+```
+
+The logs will display content from `shared_output.txt`, showing timestamps and hostnames from both pods, confirming that the filesystem is truly shared.
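
Reviewer note: the log check described in step 3 can be made less manual. The following is a minimal sketch, not part of this commit. It assumes every write appears in the log as a line of the form `Pod <hostname> writing data at <timestamp>`, which is what the `echo` in the sample job suggests. The helper name `count_writers` is hypothetical.

```shell
# Hypothetical helper: count the distinct pod hostnames that appended to
# shared_output.txt, reading pod log text from stdin.
# Assumption: each write is logged as "Pod <hostname> writing data at <timestamp>".
count_writers() {
  grep -E '^Pod [^ ]+ writing data at ' \
    | awk '{print $2}' \
    | sort -u \
    | wc -l \
    | tr -d ' '
}

# Usage against a real cluster (the pod name is a placeholder):
#   kubectl logs <pod-name-0> | count_writers
```

A result of 2 or more would indicate that the first pod read lines written by another pod. A result of 1 may only mean that the second pod had not yet written when the first pod printed the file.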

examples/gke-tpu-7x/gke-tpu-7x-advanced.yaml

Lines changed: 43 additions & 0 deletions
@@ -167,6 +167,7 @@ deployment_groups:
 enable_gcsfuse_csi: true # enable GCS Fuse for the cluster
 enable_persistent_disk_csi: true # enable Hyperdisk for the cluster
 enable_managed_lustre_csi: true # enable Managed Lustre for the cluster
+enable_filestore_csi: true # enable Filestore for the cluster
 configure_workload_identity_sa: true
 master_authorized_networks:
 - cidr_block: $(vars.authorized_cidr) # Allows your machine to run the kubectl command. Required for multi network setup.
@@ -454,3 +455,45 @@ deployment_groups:
 # node_count: 1
 
 # outputs: [instructions]
+
+# # --- FILESTORE ADDITIONS ---
+# - id: filestore
+# source: modules/file-system/filestore
+# use: [gke-tpu-7x-net-0]
+# settings: {local_mount: /mnt/7x-filestore}
+
+# - id: shared-filestore-pv
+# source: modules/file-system/gke-persistent-volume
+# use: [gke-tpu-7x-cluster, filestore]
+
+# # Shared Filestore Job
+# - id: shared-fs-job
+# source: modules/compute/gke-job-template
+# use:
+# - gke-tpu-7x-cluster
+# - gke-tpu-7x-pool
+# - shared-filestore-pv
+# settings:
+# image: alpine/git:latest
+# command:
+# - sh
+# - -c
+# - |
+# echo "Pod ${HOSTNAME} is starting..."
+# SHARED_DIR=/mnt/7x-filestore
+# mkdir -p $SHARED_DIR
+# cd $SHARED_DIR
+
+# # Simulate some work and write to a shared file
+# cmd="date +%s"
+# TIMESTAMP=`$cmd`
+# echo "Pod ${HOSTNAME} writing data at $$TIMESTAMP" >> shared_output.txt
+# sleep 5
+# echo "Displaying content of shared_output.txt:"
+# echo "---"
+# cat shared_output.txt # Read the content to show it's shared
+# echo "---"
+# sleep 20
+# echo "Pod ${HOSTNAME} finished."
+# node_count: 2
+# outputs: [instructions]
