Extends the **tuning** package with baked-in H100 and GB200 configs for GKE Container-Optimized OS (COS). You supply only `accelerator` and `intent`; the package selects the matching sysctl (and optional containerd drop-in) and runs the base tuning apply. GRUB is not used, because GKE nodes do not use GRUB. Note: this is a limited subset of nvidia-tuned, because COS is mostly read-only. For non-COS GKE setups, consider updating nvidia-tuned to support GKE and using the base profiles.

**Capabilities:**
- Sysctl and service drop-ins derived from [nvidia-tuned](./nvidia-tuned/)
- ConfigMap: `accelerator` (h100, gb200) and `intent` (inference, multiNodeTraining)
- Baked-in profiles under `profiles/{accelerator}/{intent}/`

**Key features:**
- No manual sysctl.conf authoring; profile content is fixed in the image
- See [nvidia-tuning-gke README](./nvidia-tuning-gke/README.md)

## Package Structure
Each package follows the standard skyhook package structure:
A Skyhook package that extends the base **tuning** package with baked-in H100 and GB200 tuning configs for GKE. It mirrors the sysctl settings (and optional containerd drop-in) from [nvidia-tuned](../nvidia-tuned/). **GRUB/kernel cmdline is not used**: GKE nodes do not use GRUB, so only sysctl and service drop-ins are applied. This package is required instead of nvidia-tuned because Container-Optimized OS does not include tuned, and tuned cannot be installed on it.

## Overview

- **Inherits from:** [tuning](../tuning/) (same pattern as nvidia-tuned inheriting from tuned).
- **ConfigMap:** You supply only `accelerator` and `intent`; the package fills in `sysctl.conf` and, for GB200, `service_containerd.conf` from baked-in profiles, then runs the base tuning package to apply them.

Profiles are selected by the pair `{accelerator}/{intent}` and live under `profiles/{accelerator}/{intent}/` (e.g. `profiles/h100/inference/`, `profiles/gb200/multiNodeTraining/`). The prepare step discovers available accelerators and intents from the filesystem, so new profiles can be added without changing the scripts.
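
The filesystem-based discovery described above can be illustrated with a short sketch. This is not the package's prepare script, which is not shown here. It is a minimal Python illustration, and the function name `discover_profiles` is hypothetical. It assumes only the `profiles/{accelerator}/{intent}/` layout.

```python
from pathlib import Path


def discover_profiles(profiles_root: Path) -> dict[str, list[str]]:
    """Map each accelerator directory to the sorted intents found beneath it.

    Hypothetical helper: assumes the layout profiles/{accelerator}/{intent}/.
    """
    found: dict[str, list[str]] = {}
    if not profiles_root.is_dir():
        return found
    # Each first-level directory is an accelerator (e.g. h100, gb200).
    for accel_dir in sorted(p for p in profiles_root.iterdir() if p.is_dir()):
        # Each second-level directory is an intent (e.g. inference).
        intents = sorted(p.name for p in accel_dir.iterdir() if p.is_dir())
        if intents:
            found[accel_dir.name] = intents
    return found
```

Because the result is derived from directory names alone, adding a directory such as `profiles/<new-accelerator>/<new-intent>/` would make that pair appear without any code change.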
## Interrupts
Use the **restart_all_services** interrupt so sysctl changes take effect. Do not use the reboot interrupt: Skyhook has to re-apply all changes after every reboot, so a reboot interrupt causes an infinite loop. Example:
Profiles are grouped by accelerator, then intent: `profiles/{accelerator}/{intent}/`. Each profile directory contains `sysctl.conf` and, optionally, `service_containerd.conf`. There is no GRUB configuration, because GKE does not use GRUB. The content matches the sysctl settings (and, for GB200, service_containerd) in [tuning/examples/](../tuning/examples/):
- **profiles/h100/inference/** – Base ARP + sched (sysctl).
- **profiles/h100/multiNodeTraining/** – Base ARP + net/tcp/bbr/fq (sysctl).

Adding a new accelerator or intent is done by adding a new directory under `profiles/`; the prepare script discovers them at runtime.
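
To make the profile layout concrete, here is a hedged sketch of how a chosen `{accelerator}/{intent}` pair could be resolved to its files. The function name `resolve_profile` and its error handling are assumptions for illustration, not the package's actual code. The only facts taken from this README are that `sysctl.conf` is always present in a profile and `service_containerd.conf` is optional.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_profile(
    profiles_root: Path, accelerator: str, intent: str
) -> Tuple[Path, Optional[Path]]:
    """Return (sysctl.conf path, service_containerd.conf path or None).

    Hypothetical helper: raises FileNotFoundError when the requested
    profile has no sysctl.conf, i.e. the pair is not a baked-in profile.
    """
    profile_dir = profiles_root / accelerator / intent
    sysctl = profile_dir / "sysctl.conf"
    if not sysctl.is_file():
        raise FileNotFoundError(
            f"no sysctl.conf for profile {accelerator}/{intent}"
        )
    # The containerd drop-in is optional; report None when it is absent.
    containerd = profile_dir / "service_containerd.conf"
    return sysctl, (containerd if containerd.is_file() else None)
```

Failing early on an unknown pair keeps a mistyped `accelerator` or `intent` from silently applying nothing.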
## What is not applied
Because of Container-Optimized OS restrictions, the following are not applied: CPU governor settings, kernel module loading, and dynamic `isolcpus` (add a concrete `isolcpus=` line to the profile and rebuild if needed).