Merge pull request #30 from NVIDIA/feat/nvidia-tuning-gke

ayuskauskas · web-flow · commit 3a974e732f7a · 2026-03-11T10:11:16.000-07:00
feat: add nvidia-tuning-gke to support GKE container optimized OS
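
For reviewers, the profile selection that the new README describes (pick `profiles/{accelerator}/{intent}/`, discovering the valid choices from the filesystem) can be pictured roughly as below. This is a minimal sketch and not the contents of `prepare_nvidia_configs.sh`, whose body is not shown in this diff. The function names `list_subdirs` and `resolve_profile`, and the choice to treat the presence of `sysctl.conf` as the marker of a valid profile, are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical sketch: resolve a baked-in profile directory from the
# accelerator and intent values. Names here are illustrative only and
# are not taken from the package scripts.

# Print the names of the immediate subdirectories of $1, one per line.
list_subdirs() {
    for d in "$1"/*/; do
        # If the glob matched nothing, "$d" is the literal pattern and is skipped.
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

# Print the profile directory for <root> <accelerator> <intent>.
# Returns 1 and lists the discovered accelerators when no profile exists.
resolve_profile() {
    root=$1
    accelerator=$2
    intent=$3
    profile_dir="$root/$accelerator/$intent"

    # Assumption: every valid profile ships a sysctl.conf.
    if [ ! -f "$profile_dir/sysctl.conf" ]; then
        echo "ERROR: no profile for '$accelerator/$intent' under $root" >&2
        echo "Available accelerators: $(list_subdirs "$root" | tr '\n' ' ')" >&2
        return 1
    fi
    echo "$profile_dir"
}
```

Because the lookup is driven by the directory layout, adding a new `profiles/<accelerator>/<intent>/` directory would be enough for a script shaped like this to accept the new pair.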
diff --git a/README.md b/README.md
@@ -91,6 +91,18 @@ A package that applies the same node setup steps as the dgxcloud_aws_eks VMI for
 - ConfigMap: `service` and `accelerator` only; versions baked in `defaults/*.conf`
 - No OFI, hardening, or system-node-settings; see [nvidia-setup README](./nvidia-setup/README.md)
 
+### 6. NVIDIA Tuning GKE Package (`nvidia-tuning-gke/`)
+Extends the **tuning** package with baked-in H100 and GB200 configs for GKE Container-Optimized OS (COS). You supply only `accelerator` and `intent`; the package selects the matching sysctl settings (and optional containerd drop-in) and runs the base tuning apply. No GRUB settings are applied because GKE nodes do not use GRUB. Note: this is a limited subset of nvidia-tuned, constrained by the mostly read-only OS. For non-COS GKE setups, consider updating nvidia-tuned to support GKE and using the base profiles.
+
+**Capabilities:**
+- Sysctl and service drop-ins derived from [nvidia-tuned](./nvidia-tuned/)
+- ConfigMap: `accelerator` (h100, gb200) and `intent` (inference, multiNodeTraining)
+- Baked-in profiles under `profiles/{accelerator}/{intent}/`
+
+**Key features:**
+- No manual sysctl.conf authoring; profile content is fixed in the image
+- See [nvidia-tuning-gke README](./nvidia-tuning-gke/README.md)
+
 ## Package Structure
 
 Each package follows the standard skyhook package structure:
diff --git a/nvidia-tuning-gke/Dockerfile b/nvidia-tuning-gke/Dockerfile
@@ -0,0 +1,18 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Extends the tuning package with baked-in H100/GB200 configs for GKE.
+# Config step: prepare_nvidia_configs.sh (populate configmaps from profile)
+#              then update_settings.sh (base tuning apply).
+
+ARG TUNING_VERSION=1.1.4
+FROM ghcr.io/nvidia/skyhook-packages/tuning:${TUNING_VERSION}
+
+COPY profiles/ /skyhook-package/profiles/
+COPY skyhook_dir/prepare_nvidia_configs.sh /skyhook-package/skyhook_dir/
+COPY skyhook_dir/prepare_nvidia_configs_check.sh /skyhook-package/skyhook_dir/
+COPY config.json /skyhook-package/
+
+RUN chmod +x /skyhook-package/skyhook_dir/prepare_nvidia_configs.sh \
+             /skyhook-package/skyhook_dir/prepare_nvidia_configs_check.sh \
+             /skyhook-package/skyhook_dir/*.sh
diff --git a/nvidia-tuning-gke/README.md b/nvidia-tuning-gke/README.md
@@ -0,0 +1,54 @@
+# NVIDIA Tuning GKE Package
+
+A Skyhook package that extends the base **tuning** package with baked-in H100 and GB200 tuning configs for GKE. It mirrors the sysctl settings (and optional containerd drop-in) from the [nvidia-tuned](../nvidia-tuned/) package. **GRUB/kernel cmdline is not used**: GKE nodes do not use GRUB, so only sysctl and service drop-ins are applied. This package is required instead of nvidia-tuned because Container-Optimized OS does not include tuned and tuned cannot be installed on it.
+
+## Overview
+
+- **Inherits from:** [tuning](../tuning/) (same pattern as nvidia-tuned inheriting from tuned).
+- **ConfigMap:** You supply only `accelerator` and `intent`; the package fills in `sysctl.conf` and, for GB200, `service_containerd.conf` from baked-in profiles, then runs the base tuning package to apply them.
+
+## ConfigMap (required)
+
+| Key           | Values                           | Description           |
+|---------------|----------------------------------|-----------------------|
+| `accelerator` | `h100`, `gb200`                  | GPU/accelerator type. |
+| `intent`      | `inference`, `multiNodeTraining` | Workload intent.      |
+
+Profiles are selected by the pair `{accelerator}/{intent}` and live under `profiles/{accelerator}/{intent}/` (e.g. `profiles/h100/inference/`, `profiles/gb200/multiNodeTraining/`). The prepare step discovers available accelerators and intents from the filesystem, so new profiles can be added without changing the scripts.
+
+## Interrupts
+
+Use the **restart_all_services** interrupt so the sysctl changes take effect. Do not use the reboot interrupt: Skyhook has to re-apply all changes after every reboot, so a reboot interrupt causes an infinite loop. Example:
+
+```yaml
+packages:
+  nvidia-tuning-gke:
+    image: ghcr.io/nvidia/skyhook-packages/nvidia-tuning-gke
+    version: 0.1.0
+    interrupt:
+      type: restart_all_services
+    configMap:
+      accelerator: gb200
+      intent: inference
+```
+
+## Baked-in profiles
+
+Profiles are grouped by accelerator, then intent: `profiles/{accelerator}/{intent}/`. Each profile directory contains `sysctl.conf` and optionally `service_containerd.conf`. There is no GRUB config because GKE does not use GRUB. The content matches the sysctl files in [tuning/examples/](../tuning/examples/) (plus `service_containerd.conf` for GB200):
+
+- **profiles/h100/inference/** – Base ARP + sched (sysctl).
+- **profiles/h100/multiNodeTraining/** – Base ARP + net/tcp/bbr/fq (sysctl).
+- **profiles/gb200/inference/** – Base + gb200-perf + sched (sysctl); containerd LimitSTACK.
+- **profiles/gb200/multiNodeTraining/** – Base + gb200-perf + net/tcp (sysctl); containerd LimitSTACK.
+
+Adding a new accelerator or intent is done by adding a new directory under `profiles/`; the prepare script discovers them at runtime.
+
+## What is not applied
+
+Because of Container-Optimized OS limitations, the following are not applied: CPU governor settings, kernel module loading, and dynamic `isolcpus` (add a concrete `isolcpus=` line to the profile and rebuild if needed).
+
+## Version
+
+- **Package version:** 0.1.0
+- **Base package:** tuning (1.1.4)
+- **Schema version:** v1
diff --git a/nvidia-tuning-gke/config.json b/nvidia-tuning-gke/config.json
@@ -0,0 +1,98 @@
+{
+    "schema_version": "v1",
+    "package_name": "nvidia_tuning_gke",
+    "package_version": "0.1.0",
+    "expected_config_files": ["accelerator", "intent"],
+    "modes": {
+        "config": [
+            {
+                "name": "prepare",
+                "path": "prepare_nvidia_configs.sh",
+                "arguments": [],
+                "returncodes": [0],
+                "on_host": true,
+                "env": {},
+                "idempotence": false,
+                "upgrade_step": false
+            },
+            {
+                "name": "config",
+                "path": "update_settings.sh",
+                "arguments": [],
+                "returncodes": [0],
+                "on_host": true,
+                "env": {},
+                "idempotence": true,
+                "upgrade_step": false
+            }
+        ],
+        "config-check": [
+            {
+                "name": "prepare-check",
+                "path": "prepare_nvidia_configs_check.sh",
+                "arguments": [],
+                "returncodes": [0],
+                "on_host": true,
+                "env": {},
+                "idempotence": true,
+                "upgrade_step": false
+            },
+            {
+                "name": "config-check",
+                "path": "update_settings_check.sh",
+                "arguments": [],
+                "returncodes": [0],
+                "on_host": true,
+                "env": {},
+                "idempotence": true,
+                "upgrade_step": false
+            }
+        ],
+        "post-interrupt-check": [
+            {
+                "name": "prepare",
+                "path": "prepare_nvidia_configs.sh",
+                "arguments": [],
+                "returncodes": [0],
+                "on_host": true,
+                "env": {},
+                "idempotence": false,
+                "upgrade_step": false
+            },
+            {
+                "name": "post-interrupt-check",
+                "path": "update_settings_post_check.sh",
+                "arguments": [],
+                "returncodes": [0],
+                "on_host": true,
+                "env": {},
+                "idempotence": true,
+                "upgrade_step": false
+            }
+        ],
+        "uninstall": [
+            {
+                "name": "uninstall",
+                "path": "update_settings_uninstall.sh",
+                "arguments": [],
+                "returncodes": [0],
+                "on_host": true,
+                "env": {},
+                "idempotence": true,
+                "upgrade_step": false
+            }
+        ],
+        "uninstall-check": [
+            {
+                "name": "uninstall-check",
+                "path": "update_settings_uninstall_check.sh",
+                "arguments": [],
+                "returncodes": [0],
+                "on_host": true,
+                "env": {},
+                "idempotence": true,
+                "upgrade_step": false
+            }
+        ]
+    }
+}
diff --git a/nvidia-tuning-gke/preprocess.sh b/nvidia-tuning-gke/preprocess.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Preprocess script for the nvidia_tuning_gke package
+# Reads the tuning package version from PACKAGE_VERSIONS and outputs it as TUNING_VERSION
+#
+# This script emits a GitHub Actions output in the format:
+#   BUILD_ARGS=TUNING_VERSION=<version>
+#
+# Usage: PACKAGE_VERSIONS='{"tuning": "<version>"}' ./preprocess.sh
+# Environment variables (PACKAGE_VERSIONS is required: JSON map of package name to version):
+#   GITHUB_OUTPUT - If set, outputs are written to this file (GitHub Actions)
+
+set -e
+
+# Check if PACKAGE_VERSIONS is set
+if [ -z "${PACKAGE_VERSIONS:-}" ]; then
+    echo "ERROR: PACKAGE_VERSIONS environment variable is not set"
+    exit 1
+fi
+
+# Extract the tuning version from the JSON
+latest_version=$(jq -r '.tuning' <<< "${PACKAGE_VERSIONS}")
+
+# Check if the version was found
+if [ -z "${latest_version}" ] || [ "${latest_version}" = "null" ]; then
+    echo "ERROR: Could not find 'tuning' package version in PACKAGE_VERSIONS: ${PACKAGE_VERSIONS}"
+    exit 1
+fi
+
+echo "Found tuning version: ${latest_version}"
+
+# Output the build args
+# If running in GitHub Actions, write to GITHUB_OUTPUT
+if [ -n "${GITHUB_OUTPUT:-}" ]; then
+    echo "BUILD_ARGS=TUNING_VERSION=${latest_version}" >> "$GITHUB_OUTPUT"
+else
+    # For local testing, output to stdout
+    echo "BUILD_ARGS=TUNING_VERSION=${latest_version}"
+fi
diff --git a/nvidia-tuning-gke/profiles/gb200/inference/service_containerd.conf b/nvidia-tuning-gke/profiles/gb200/inference/service_containerd.conf
@@ -0,0 +1,2 @@
+[Service]
+LimitSTACK=67108864
diff --git a/nvidia-tuning-gke/profiles/gb200/inference/sysctl.conf b/nvidia-tuning-gke/profiles/gb200/inference/sysctl.conf
@@ -0,0 +1,14 @@
+# GB200 inference – sysctl.
+net.ipv4.conf.all.arp_announce = 2
+net.ipv4.conf.default.arp_announce = 2
+net.ipv4.conf.all.arp_ignore = 1
+net.ipv4.conf.default.arp_ignore = 1
+fs.inotify.max_user_instances=65535
+fs.inotify.max_user_watches=524288
+kernel.threads-max=16512444
+vm.max_map_count=262144
+vm.min_free_kbytes=65536
+vm.overcommit_memory=1
+vm.swappiness=1
+kernel.sched_latency_ns=1000000
+kernel.sched_min_granularity_ns=100000
diff --git a/nvidia-tuning-gke/profiles/gb200/multiNodeTraining/service_containerd.conf b/nvidia-tuning-gke/profiles/gb200/multiNodeTraining/service_containerd.conf
@@ -0,0 +1,2 @@
+[Service]
+LimitSTACK=67108864
diff --git a/nvidia-tuning-gke/profiles/gb200/multiNodeTraining/sysctl.conf b/nvidia-tuning-gke/profiles/gb200/multiNodeTraining/sysctl.conf
@@ -0,0 +1,21 @@
+# GB200 multiNodeTraining – sysctl.
+net.ipv4.conf.all.arp_announce = 2
+net.ipv4.conf.default.arp_announce = 2
+net.ipv4.conf.all.arp_ignore = 1
+net.ipv4.conf.default.arp_ignore = 1
+fs.inotify.max_user_instances=65535
+fs.inotify.max_user_watches=524288
+kernel.threads-max=16512444
+vm.max_map_count=262144
+vm.min_free_kbytes=65536
+vm.overcommit_memory=1
+net.core.rmem_max=536870912
+net.core.wmem_max=536870912
+net.core.rmem_default=134217728
+net.core.wmem_default=134217728
+net.ipv4.tcp_rmem=4096 87380 268435456
+net.ipv4.tcp_wmem=4096 65536 268435456
+net.core.netdev_max_backlog=10000
+net.ipv4.tcp_max_syn_backlog=8192
+net.ipv4.tcp_congestion_control=bbr
+net.core.default_qdisc=fq
diff --git a/nvidia-tuning-gke/profiles/h100/inference/sysctl.conf b/nvidia-tuning-gke/profiles/h100/inference/sysctl.conf
@@ -0,0 +1,8 @@
+# H100 inference – sysctl. Mirrors nvidia-base + nvidia-h100-inference [sysctl].
+net.ipv4.conf.all.arp_announce = 2
+net.ipv4.conf.default.arp_announce = 2
+net.ipv4.conf.all.arp_ignore = 1
+net.ipv4.conf.default.arp_ignore = 1
+vm.swappiness=1
+kernel.sched_latency_ns=1000000
+kernel.sched_min_granularity_ns=100000
diff --git a/nvidia-tuning-gke/profiles/h100/multiNodeTraining/sysctl.conf b/nvidia-tuning-gke/profiles/h100/multiNodeTraining/sysctl.conf
@@ -0,0 +1,15 @@
+# H100 multiNodeTraining – sysctl.
+net.ipv4.conf.all.arp_announce = 2
+net.ipv4.conf.default.arp_announce = 2
+net.ipv4.conf.all.arp_ignore = 1
+net.ipv4.conf.default.arp_ignore = 1
+net.core.rmem_max=536870912
+net.core.wmem_max=536870912
+net.core.rmem_default=134217728
+net.core.wmem_default=134217728
+net.ipv4.tcp_rmem=4096 87380 268435456
+net.ipv4.tcp_wmem=4096 65536 268435456
+net.core.netdev_max_backlog=10000
+net.ipv4.tcp_max_syn_backlog=8192
+net.ipv4.tcp_congestion_control=bbr
+net.core.default_qdisc=fq
diff --git a/nvidia-tuning-gke/skyhook_dir/prepare_nvidia_configs.sh b/nvidia-tuning-gke/skyhook_dir/prepare_nvidia_configs.sh
diff --git a/nvidia-tuning-gke/skyhook_dir/prepare_nvidia_configs_check.sh b/nvidia-tuning-gke/skyhook_dir/prepare_nvidia_configs_check.sh
diff --git a/tests/integration/nvidia_tuning_gke/__init__.py b/tests/integration/nvidia_tuning_gke/__init__.py
diff --git a/tests/integration/nvidia_tuning_gke/test_prepare_nvidia_configs.py b/tests/integration/nvidia_tuning_gke/test_prepare_nvidia_configs.py