examples/gke-tpu-v6/README.md
Lines changed: 44 additions & 3 deletions
@@ -96,7 +96,9 @@ This repository also includes an advanced blueprint, `gke-tpu-v6-advanced.yaml`,
 * **Dedicated Service Accounts** for nodes and workloads, following security best practices.
 * **Automatic creation of two GCS buckets** for training data and checkpoints.
 * **Performance-tuned GCS FUSE mounts** pre-configured in the cluster as Persistent Volumes.
-* **Optional High-Performance Storage: [Managed Lustre](https://cloud.google.com/managed-lustre/docs/overview)** for high-performance, fully managed parallel file system optimized for heavy AI and HPC workloads. For details of configuring Managed Lustre, please refer to the [appendix](#understanding-managed-lustre-integration)
+* **Optional** High-Performance Storage: [Managed Lustre](https://cloud.google.com/managed-lustre/docs/overview), a high-performance, fully managed parallel file system optimized for heavy AI and HPC workloads. For configuration details, refer to the [appendix](#understanding-managed-lustre-integration).
+* **Optional** High-Performance Storage: [Hyperdisk Balanced](https://docs.cloud.google.com/compute/docs/disks/hyperdisks) support for highly available, consistent performance across GKE nodes. For configuration details, refer to the [appendix](#understanding-hyperdisk-balanced-integration).
+* **Optional** Shared File Storage: [Filestore](https://docs.cloud.google.com/filestore/docs/overview) for managed NFS that lets multiple TPU hosts share logs, code, or datasets. For details, refer to the [appendix](#understanding-filestore-integration).

 ### Deploying the Advanced Blueprint

@@ -141,7 +143,7 @@ The [tpu-multislice.yaml](https://github.com/GoogleCloudPlatform/cluster-toolkit
Replace `DEPLOYMENT_NAME`, `REGION`, and `PROJECT_ID` with the values used in the blueprint.
@@ -312,3 +314,42 @@ After making these changes, run the `gcluster deploy` command as usual.
 ```

 The pod's logs confirm that the disk is mounted successfully and show a mixed I/O test that validates the disk's provisioned performance.
+
+### Understanding Filestore integration
+
+To enable Filestore integration, perform the following steps before deploying:
+
+1. In the `gke-tpu-v6-cluster` module settings, ensure `enable_filestore_csi: true` is set.
+2. Find the section marked with the comment `--- FILESTORE ADDITIONS ---` and uncomment the following modules:
+   * `filestore`: Provisions the Filestore instance and specifies the `local_mount` point.
+   * `shared-filestore-pv`: Creates the Kubernetes Persistent Volume and Persistent Volume Claim.
+   * `shared-fs-job`: (Optional) A test job template to verify multi-node shared writing.
+
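The two blueprint edits above can be sanity-checked from the shell before deploying. The following is a minimal sketch, not part of the blueprint: `check_filestore_blueprint` is a hypothetical helper, and it assumes that the setting appears on its own line as `enable_filestore_csi: true`, that each module is declared with a line of the form `- id: <name>`, and that commented-out lines start with `#`. Adjust the patterns if the blueprint is laid out differently.

```sh
# Hypothetical helper: checks a blueprint file for the Filestore edits described above.
# Assumption: modules are declared as "- id: <name>" and disabled lines begin with "#".
check_filestore_blueprint() {
  blueprint="$1"
  status=0

  # The CSI driver setting must be present and not commented out.
  if ! grep -Eq '^[[:space:]]*enable_filestore_csi:[[:space:]]*true' "$blueprint"; then
    echo "missing: enable_filestore_csi: true"
    status=1
  fi

  # shared-fs-job is optional, so only the two required modules are checked.
  for id in filestore shared-filestore-pv; do
    if ! grep -Eq "^[[:space:]]*-[[:space:]]*id:[[:space:]]*${id}[[:space:]]*\$" "$blueprint"; then
      echo "missing or still commented out: ${id}"
      status=1
    fi
  done

  return "$status"
}
```

For example, `check_filestore_blueprint gke-tpu-v6-advanced.yaml` (the path is an assumption) prints nothing and returns 0 when both edits are in place.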
+#### Testing the Shared Filestore Mount
+The blueprint includes a sample job (`shared-fs-job`) that demonstrates how two different pods can write to and read from the same file simultaneously.
+Replace `DEPLOYMENT_NAME`, `REGION`, and `PROJECT_ID` with the values used in the blueprint.
+
+2. Apply the Filestore test manifest, whose path is provided in the final deployment instructions:
+
+```sh
+kubectl apply -f <path/to/shared-fs-job.yaml>
+```
+
+3. Verify the shared output. Once the pods are running, check the logs of the first pod to see it reading data written by the second pod:
+
+```sh
+# Get pod names
+kubectl get pods
+
+# Check logs for the first pod
+kubectl logs <pod-name-0>
+```
+
+The logs display the contents of `shared_output.txt`, including timestamps and hostnames from both pods, which confirms that the filesystem is shared.
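The check on the log output can also be done mechanically. The sketch below is illustrative only: `count_writers` is a hypothetical helper, and it assumes each line of the output has the form `<timestamp> <hostname> ...` with whitespace-separated fields. The README states only that timestamps and hostnames appear, so the field position is an assumption to adjust against the real output.

```sh
# Hypothetical helper: reads log text on stdin and prints the number of distinct hostnames.
# Assumption: the hostname is the second whitespace-separated field on each line.
count_writers() {
  awk 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d '[:space:]'
}

# Usage (with the pod name taken from `kubectl get pods`):
#   kubectl logs <pod-name-0> | count_writers
```

Under these assumptions, a result of 2 would indicate that lines from both pods are present in the shared file.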