Merged
5 changes: 3 additions & 2 deletions nvidia-setup/README.md
@@ -34,7 +34,8 @@ Defaults are defined in `skyhook_dir/defaults/eks-h100.conf` and `eks-gb200.conf

 Set these on the package spec in the Skyhook Custom Resource (`spec.packages.<name>.env`):

-- `NVIDIA_SETUP_INSTALL_KERNEL` – `true` or `false` (default: `false`). If `true`, apply **only** installs the exact kernel from the defaults file (via `downgrade_kernel.sh`) and then exits; a reboot is required. After reboot, the **post-interrupt-check** verifies the running kernel matches the expected version. If `false`, apply verifies the current kernel is >= the required version and errors otherwise, then continues with the full apply
+- `NVIDIA_SETUP_INSTALL_KERNEL` – `true` or `false` (default: `false`). If `true`, apply **only** installs the exact kernel from the defaults file (via `downgrade_kernel.sh`) and then exits; a reboot is required. After reboot, the **post-interrupt-check** verifies the running kernel matches the expected version. If `false`, apply verifies the current kernel meets the requirement (see `NVIDIA_SETUP_KERNEL_ALLOW_NEWER`) and errors otherwise, then continues with the full apply.
+- `NVIDIA_SETUP_KERNEL_ALLOW_NEWER` – `true` or `false` (default: `false`). When `NVIDIA_SETUP_INSTALL_KERNEL=false`, this controls the kernel check: if `false`, the running kernel must match the required upstream version exactly; if `true`, the running kernel may be newer (current >= required).
 - `NVIDIA_PIN_KERNEL` - `true` or `false` (defaults: `false`). If `true`, pin the kernel to the exact version in the package so that it will not upgrade in future.
 - `NVIDIA_KERNEL` – kernel version (overrides default from defaults file)
 - `NVIDIA_EFA` – EFA installer version
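
The two kernel flags described above combine into a three-way decision. The sketch below only illustrates that decision; `kernel_check_mode` is a hypothetical helper name, not a function from the package, and it assumes both flags default to `false` as documented.

```shell
#!/bin/bash
# Hypothetical helper (not part of nvidia-setup): prints which kernel action
# the documented flags select. Both flags are treated as "false" when unset.
kernel_check_mode() {
    local install="${NVIDIA_SETUP_INSTALL_KERNEL:-false}"
    local allow_newer="${NVIDIA_SETUP_KERNEL_ALLOW_NEWER:-false}"

    if [ "${install}" = "true" ]; then
        echo "install"      # install the exact kernel, exit, reboot required
    elif [ "${allow_newer}" = "true" ]; then
        echo "at_least"     # running kernel may be newer (current >= required)
    else
        echo "exact"        # running upstream version must match exactly
    fi
}

kernel_check_mode
```

Note that `NVIDIA_SETUP_KERNEL_ALLOW_NEWER` has no effect in this sketch when `NVIDIA_SETUP_INSTALL_KERNEL=true`, which matches the README wording that it applies only when the install flag is `false`.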
@@ -43,7 +44,7 @@ Set these on the package spec in the Skyhook Custom Resource (`spec.packages.<na

 For `service=eks` the apply step currently runs, in order:

-1. **ensure_kernel** – if `NVIDIA_SETUP_INSTALL_KERNEL=false`: verify running kernel is >= required; if `true`: install exact kernel only (then exit; reboot required).
+1. **ensure_kernel** – if `NVIDIA_SETUP_INSTALL_KERNEL=false`: verify running kernel meets requirement (exact match by default; allow newer if `NVIDIA_SETUP_KERNEL_ALLOW_NEWER=true`); if `true`: install exact kernel only (then exit; reboot required).
 2. **upgrade** – `apt-get update && apt-get upgrade -y`
 3. **install-efa-driver** – download and run AWS EFA installer

5 changes: 5 additions & 0 deletions nvidia-setup/skyhook_dir/apply_check.sh
@@ -5,6 +5,11 @@ STEPS_CHECK_DIR="${SKYHOOK_DIR}/skyhook_dir/steps_check"
 # shellcheck source=load_defaults.sh
 . "${SKYHOOK_DIR}/skyhook_dir/load_defaults.sh"

+# Skip checks if only installing kernel as we need to reboot before any check would work
+if [ "${NVIDIA_SETUP_INSTALL_KERNEL}" = "true" ]; then
+    exit 0
+fi
+
 check_eks_h100() {
     "${STEPS_CHECK_DIR}/upgrade_check.sh"
     "${STEPS_CHECK_DIR}/install_efa_driver_check.sh"
31 changes: 26 additions & 5 deletions nvidia-setup/skyhook_dir/steps/ensure_kernel.sh
@@ -1,7 +1,11 @@
 #!/bin/bash
 # ensure_kernel.sh: install exact kernel (if NVIDIA_SETUP_INSTALL_KERNEL=true) or
-# verify current kernel is >= required (if false).
+# verify current kernel meets requirement (see NVIDIA_SETUP_KERNEL_ALLOW_NEWER).
 set -e
+#
+# NVIDIA_SETUP_KERNEL_ALLOW_NEWER (default: false). When false, the running kernel
+# must match the required upstream version exactly. When true, the running kernel
+# may be newer (current >= required).

 STEPS_DIR="${SKYHOOK_DIR}/skyhook_dir/steps"

@@ -37,18 +41,35 @@ check_kernel_at_least() {
     return 1
 }

+# Returns 0 if current upstream version equals required upstream version (exact match).
+check_kernel_exact() {
+    local required="$1"
+    local current
+    current=$(uname -r)
+    local required_upstream="${required%%-*}"
+    local current_upstream="${current%%-*}"
+    [ "${current_upstream}" = "${required_upstream}" ]
+}
+
 # When TEST_CHECK_KERNEL_AT_LEAST is set, skip normal execution so tests can source this file and call check_kernel_at_least.
 if [ -z "${TEST_CHECK_KERNEL_AT_LEAST:-}" ]; then
     if [ "${NVIDIA_SETUP_INSTALL_KERNEL:-false}" = "true" ]; then
         install_kernel
         exit 0
     fi

-    # Check current kernel is >= required
+    # Check current kernel meets requirement (exact or at-least depending on env)
     required_full="$(resolve_full_kernel "${KERNEL}")"
-    if ! check_kernel_at_least "${required_full}"; then
-        echo "Error: current kernel $(uname -r) is not >= required ${required_full}. Set NVIDIA_SETUP_INSTALL_KERNEL=true to install the exact kernel, or boot with a compatible kernel." >&2
-        exit 1
+    if [ "${NVIDIA_SETUP_KERNEL_ALLOW_NEWER:-false}" = "true" ]; then
+        if ! check_kernel_at_least "${required_full}"; then
+            echo "Error: current kernel $(uname -r) is not >= required ${required_full}. Set NVIDIA_SETUP_INSTALL_KERNEL=true to install the exact kernel, or boot with a compatible kernel." >&2
+            exit 1
+        fi
+    else
+        if ! check_kernel_exact "${required_full}"; then
+            echo "Error: current kernel $(uname -r) does not match required ${required_full} (exact match required). Set NVIDIA_SETUP_KERNEL_ALLOW_NEWER=true to allow a newer kernel, or NVIDIA_SETUP_INSTALL_KERNEL=true to install the exact kernel." >&2
+            exit 1
+        fi
     fi
 fi
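
One consequence of `check_kernel_exact` is worth spelling out: "exact" compares only the upstream part of the version (everything before the first `-`), so two kernels that differ only in their ABI or flavor suffix still count as a match. The sketch below reproduces that comparison in isolation; `upstream_matches` is a hypothetical variant that takes the current kernel as an argument instead of reading `uname -r`.

```shell
#!/bin/bash
# Sketch only: same comparison as check_kernel_exact, but the "current" kernel
# is passed as an argument (hypothetical variant) instead of read from uname -r.
upstream_matches() {
    local current="$1" required="$2"
    # ${var%%-*} removes the longest suffix starting at the first '-',
    # so "6.14.0-1018-aws" becomes "6.14.0".
    [ "${current%%-*}" = "${required%%-*}" ]
}

# Same upstream (6.14.0), different suffix: treated as a match.
upstream_matches "6.14.0-1000-aws" "6.14.0-1018-aws" && echo "match"
# Different upstream (6.17.0 vs 6.14.0): rejected.
upstream_matches "6.17.0-1007-aws" "6.14.0-1018-aws" || echo "mismatch"
```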

@@ -1,11 +1,13 @@
 #!/bin/bash
-# Test harness for check_kernel_at_least. Set CURRENT_KERNEL and REQUIRED_KERNEL,
-# then source ensure_kernel.sh (with uname mocked) and run the check. Exit code
-# 0 = current >= required, 1 = current < required.
+# Test harness for check_kernel_at_least and check_kernel_exact. Set CURRENT_KERNEL,
+# REQUIRED_KERNEL, and optionally KERNEL_CHECK_MODE=at_least|exact (default: at_least).
+# Source ensure_kernel.sh (with uname mocked) and run the chosen check.
+# Exit code 0 = pass, 1 = fail.
 set -e

 CURRENT_KERNEL="${CURRENT_KERNEL:?CURRENT_KERNEL must be set}"
 REQUIRED_KERNEL="${REQUIRED_KERNEL:?REQUIRED_KERNEL must be set}"
+KERNEL_CHECK_MODE="${KERNEL_CHECK_MODE:-at_least}"
 [ -n "${SKYHOOK_DIR:-}" ] || { echo "SKYHOOK_DIR must be set" >&2; exit 1; }

 uname() {
@@ -19,5 +21,9 @@ export TEST_CHECK_KERNEL_AT_LEAST=1
 # shellcheck source=ensure_kernel.sh
 . "${SKYHOOK_DIR}/skyhook_dir/steps/ensure_kernel.sh"

-check_kernel_at_least "$REQUIRED_KERNEL"
+case "${KERNEL_CHECK_MODE}" in
+    at_least) check_kernel_at_least "$REQUIRED_KERNEL" ;;
+    exact) check_kernel_exact "$REQUIRED_KERNEL" ;;
+    *) echo "KERNEL_CHECK_MODE must be at_least or exact" >&2; exit 1 ;;
+esac
 exit $?
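
The harness relies on a shell function named `uname` shadowing the real binary, so the sourced checks see a chosen kernel string. The body of the harness's `uname()` is not visible in this diff, so the version below is an assumption about how such a mock could look, and `running_upstream` is a hypothetical stand-in for code under test.

```shell
#!/bin/bash
# Sketch: a shell function named uname takes precedence over the external
# uname binary, so code that calls `uname -r` sees the value we choose.
CURRENT_KERNEL="${CURRENT_KERNEL:-6.14.0-1000-aws}"   # example value (assumption)

uname() {
    if [ "$1" = "-r" ]; then
        echo "${CURRENT_KERNEL}"
    else
        command uname "$@"   # fall through to the real binary for other flags
    fi
}

# Hypothetical stand-in for code under test that reads the running kernel.
running_upstream() {
    local current
    current=$(uname -r)
    echo "${current%%-*}"
}

running_upstream
```

Using `command uname` for every other flag keeps the mock narrow, so unrelated callers of `uname` in the sourced script are not affected.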
1 change: 1 addition & 0 deletions tests/integration/nvidia_setup/test_apply.py
@@ -69,6 +69,7 @@ def test_apply_with_env_overrides(base_image):
         configmaps={"service": "eks", "accelerator": "h100"},
         env_vars={
             "NVIDIA_KERNEL": "6.8.0",
+            "NVIDIA_SETUP_KERNEL_ALLOW_NEWER": "true",  # container kernel may be newer than override
             "NVIDIA_EFA": "1.31.0",
             "NVIDIA_LUSTRE": "aws"
         },
88 changes: 71 additions & 17 deletions tests/integration/nvidia_setup/test_check_kernel_at_least.py
@@ -1,9 +1,9 @@
 #!/usr/bin/env python3
 """
-Tests for check_kernel_at_least in ensure_kernel.sh.
+Tests for check_kernel_at_least and check_kernel_exact in ensure_kernel.sh.

-The check compares upstream kernel versions (before first '-') so that
-e.g. 6.17.0-1007-aws is correctly considered >= 6.14.0-1018-aws (6.17 >= 6.14).
+- at_least: compares upstream versions (before first '-'); current >= required passes.
+- exact: current upstream must equal required upstream (NVIDIA_SETUP_KERNEL_ALLOW_NEWER=false behavior).
 """

 from pathlib import Path
@@ -15,55 +15,109 @@
 _CHECK_SCRIPT_DEST = "skyhook_dir/steps/run_check_kernel_at_least_test.sh"


-def _run_check(runner: DockerTestRunner, current_kernel: str, required_kernel: str) -> int:
-    """Run the check script; return exit code."""
+def _run_check(
+    runner: DockerTestRunner,
+    current_kernel: str,
+    required_kernel: str,
+    mode: str = "at_least",
+) -> int:
+    """Run the check script; return exit code. mode is 'at_least' or 'exact'."""
+    env = {
+        "CURRENT_KERNEL": current_kernel,
+        "REQUIRED_KERNEL": required_kernel,
+    }
+    if mode != "at_least":
+        env["KERNEL_CHECK_MODE"] = mode
     result = runner.run_script(
         script="steps/run_check_kernel_at_least_test.sh",
         configmaps={},
-        env_vars={
-            "CURRENT_KERNEL": current_kernel,
-            "REQUIRED_KERNEL": required_kernel,
-        },
+        env_vars=env,
         extra_files=[(_CHECK_SCRIPT_SOURCE, _CHECK_SCRIPT_DEST)],
     )
     return result.exit_code


-def test_current_newer_upstream_passes():
+# --- check_kernel_at_least (allow newer: current >= required) ---
+
+
+def test_at_least_current_newer_upstream_passes():
     """6.17.0-1007-aws >= 6.14.0-1018-aws (upstream 6.17 >= 6.14); was previously failing with sort -V on full string."""
     runner = DockerTestRunner(package="nvidia-setup")
     try:
-        exit_code = _run_check(runner, "6.17.0-1007-aws", "6.14.0-1018-aws")
+        exit_code = _run_check(runner, "6.17.0-1007-aws", "6.14.0-1018-aws", mode="at_least")
         assert exit_code == 0
     finally:
         runner.cleanup()


-def test_current_same_upstream_passes():
+def test_at_least_current_same_upstream_passes():
     """6.14.0-1000-aws >= 6.14.0-1018-aws (same upstream)."""
     runner = DockerTestRunner(package="nvidia-setup")
     try:
-        exit_code = _run_check(runner, "6.14.0-1000-aws", "6.14.0-1018-aws")
+        exit_code = _run_check(runner, "6.14.0-1000-aws", "6.14.0-1018-aws", mode="at_least")
         assert exit_code == 0
     finally:
         runner.cleanup()


-def test_current_older_upstream_fails():
+def test_at_least_current_older_upstream_fails():
     """6.13.0-1000-aws < 6.14.0-1018-aws (upstream 6.13 < 6.14)."""
     runner = DockerTestRunner(package="nvidia-setup")
     try:
-        exit_code = _run_check(runner, "6.13.0-1000-aws", "6.14.0-1018-aws")
+        exit_code = _run_check(runner, "6.13.0-1000-aws", "6.14.0-1018-aws", mode="at_least")
         assert exit_code == 1
     finally:
         runner.cleanup()


-def test_current_exact_required_passes():
+def test_at_least_current_exact_required_passes():
     """6.14.0-1018-aws >= 6.14.0-1018-aws (equal)."""
     runner = DockerTestRunner(package="nvidia-setup")
     try:
-        exit_code = _run_check(runner, "6.14.0-1018-aws", "6.14.0-1018-aws")
+        exit_code = _run_check(runner, "6.14.0-1018-aws", "6.14.0-1018-aws", mode="at_least")
         assert exit_code == 0
     finally:
         runner.cleanup()
+
+
+# --- check_kernel_exact (exact upstream match; NVIDIA_SETUP_KERNEL_ALLOW_NEWER=false) ---
+
+
+def test_exact_current_newer_upstream_fails():
+    """6.17.0-1007-aws vs 6.14.0-1018-aws: exact requires same upstream, so fails."""
+    runner = DockerTestRunner(package="nvidia-setup")
+    try:
+        exit_code = _run_check(runner, "6.17.0-1007-aws", "6.14.0-1018-aws", mode="exact")
+        assert exit_code == 1
+    finally:
+        runner.cleanup()
+
+
+def test_exact_current_same_upstream_passes():
+    """6.14.0-1000-aws vs 6.14.0-1018-aws: same upstream 6.14.0, exact passes."""
+    runner = DockerTestRunner(package="nvidia-setup")
+    try:
+        exit_code = _run_check(runner, "6.14.0-1000-aws", "6.14.0-1018-aws", mode="exact")
+        assert exit_code == 0
+    finally:
+        runner.cleanup()
+
+
+def test_exact_current_older_upstream_fails():
+    """6.13.0-1000-aws vs 6.14.0-1018-aws: different upstream, exact fails."""
+    runner = DockerTestRunner(package="nvidia-setup")
+    try:
+        exit_code = _run_check(runner, "6.13.0-1000-aws", "6.14.0-1018-aws", mode="exact")
+        assert exit_code == 1
+    finally:
+        runner.cleanup()
+
+
+def test_exact_current_exact_required_passes():
+    """6.14.0-1018-aws vs 6.14.0-1018-aws: exact match passes."""
+    runner = DockerTestRunner(package="nvidia-setup")
+    try:
+        exit_code = _run_check(runner, "6.14.0-1018-aws", "6.14.0-1018-aws", mode="exact")
+        assert exit_code == 0
+    finally:
+        runner.cleanup()
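
The body of `check_kernel_at_least` is not shown in this diff, so the following is only a sketch of one way to get the `at_least` behavior these tests expect: compare the upstream parts component by component as numbers. `upstream_at_least` is a hypothetical name, it takes the current kernel as an argument instead of calling `uname -r`, and it assumes purely numeric `major.minor.patch` components.

```shell
#!/bin/bash
# Hypothetical sketch (not the package's implementation): returns 0 when the
# upstream version of $1 (current) is >= the upstream version of $2 (required).
upstream_at_least() {
    local IFS=.            # split version strings on dots for `read -a`
    local -a cur req
    local i c r
    # ${var%%-*} keeps only the part before the first '-', e.g. "6.14.0".
    read -r -a cur <<< "${1%%-*}"
    read -r -a req <<< "${2%%-*}"
    for i in 0 1 2; do
        c="${cur[i]:-0}"   # missing components count as 0
        r="${req[i]:-0}"
        # 10# forces base 10 so a component such as "08" is not read as octal.
        if (( 10#$c > 10#$r )); then return 0; fi
        if (( 10#$c < 10#$r )); then return 1; fi
    done
    return 0               # all components equal
}

upstream_at_least "6.17.0-1007-aws" "6.14.0-1018-aws" && echo "newer upstream passes"
upstream_at_least "6.13.0-1000-aws" "6.14.0-1018-aws" || echo "older upstream fails"
```

Comparing components numerically avoids the string-ordering trap where `6.8.0` would sort after `6.14.0`.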