| `--workload-gate` | | string | Taint for skyhook-operator runtime required (format: key=value:effect or key:effect). This is a day 2 option for cluster scaling operations. |
| `--workload-selector` | | string[] | Label selector for skyhook-customizations to prevent eviction of running training jobs (format: key=value, repeatable). Required when skyhook-customizations is enabled with training intent. |

**Behavior:**

- All components from the recipe are bundled automatically

The `--workload-gate` and `--workload-selector` flags are day 2 operational options for cluster scaling operations:

- **`--workload-gate`**: Specifies a taint for skyhook-operator's runtime required feature. This ensures nodes are properly configured before workloads can schedule on them during cluster scaling. The taint is configured in the skyhook-operator Helm values file at `controllerManager.manager.env.runtimeRequiredTaint`. For more information about runtime required, see the [skyhook documentation](https://github.com/NVIDIA/skyhook/blob/main/docs/runtime_required.md).

- **`--workload-selector`**: Specifies a label selector for skyhook-customizations to prevent skyhook from evicting running training jobs. This is critical for training workloads where job eviction would cause significant disruption. The selector is set in the Skyhook CR manifest (tuning.yaml) in the `spec.workloadSelector.matchLabels` field.
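
The two flag formats described above can be sketched as a small parser. This is only an illustration: the `Taint` type and the `parseTaint` and `parseSelector` helpers are hypothetical names, not taken from the project, and the sketch does not check that the effect is a valid Kubernetes taint effect.

```go
package main

import (
	"fmt"
	"strings"
)

// Taint is a hypothetical holder for a parsed --workload-gate value.
type Taint struct {
	Key    string
	Value  string // empty for the key:effect form
	Effect string
}

// parseTaint accepts "key=value:effect" or "key:effect".
func parseTaint(s string) (Taint, error) {
	// The effect is whatever follows the last colon.
	i := strings.LastIndex(s, ":")
	if i <= 0 || i == len(s)-1 {
		return Taint{}, fmt.Errorf("invalid taint %q: want key=value:effect or key:effect", s)
	}
	keyValue, effect := s[:i], s[i+1:]
	// The value is optional, so a missing "=" is not an error.
	key, value, _ := strings.Cut(keyValue, "=")
	if key == "" {
		return Taint{}, fmt.Errorf("invalid taint %q: empty key", s)
	}
	return Taint{Key: key, Value: value, Effect: effect}, nil
}

// parseSelector turns repeated "key=value" flag values into a label map,
// the shape expected under a matchLabels field.
func parseSelector(pairs []string) (map[string]string, error) {
	labels := make(map[string]string, len(pairs))
	for _, p := range pairs {
		key, value, ok := strings.Cut(p, "=")
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid selector %q: want key=value", p)
		}
		labels[key] = value
	}
	return labels, nil
}

func main() {
	// The key and labels below are made-up example values.
	taint, err := parseTaint("example.com/gate=true:NoSchedule")
	fmt.Printf("%+v %v\n", taint, err)

	labels, err := parseSelector([]string{"app=trainer", "team=ml"})
	fmt.Println(labels, err)
}
```

Splitting on the last colon keeps keys that contain `/` or `.` intact and lets the same code path handle both the `key=value:effect` and `key:effect` forms.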

**Validation Warnings:**

When generating bundles with skyhook-customizations enabled, validation warnings are displayed for missing configuration:

1. **Workload Selector Warning**: When skyhook-customizations is enabled with training intent, if `--workload-selector` is not set, a warning will be displayed:

   ```
   Warning: skyhook-customizations is enabled with training intent but --workload-selector is not set.
   This may cause skyhook to evict running training jobs. Consider setting --workload-selector to prevent eviction.
   ```

2. **Accelerated Selector Warning**: When skyhook-customizations is enabled with training or inference intent, if `--accelerated-node-selector` is not set, a warning will be displayed:

   ```
   Warning: skyhook-customizations is enabled with {training|inference} intent but --accelerated-node-selector is not set.
   Without this selector, the customization will run on all nodes. Consider setting --accelerated-node-selector to target specific nodes.
   ```
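
The two checks can be summarized in a short sketch. The `bundleOptions` struct and `collectWarnings` function are hypothetical stand-ins, not the project's actual bundler types; the messages mirror the warnings shown above, and the assumption that no warning is produced when skyhook-customizations is disabled follows from the description rather than from the code.

```go
package main

import "fmt"

// bundleOptions is a hypothetical stand-in for the bundler's inputs.
type bundleOptions struct {
	SkyhookCustomizations   bool     // whether the component is enabled
	Intent                  string   // "training" or "inference"
	WorkloadSelector        []string // values of --workload-selector
	AcceleratedNodeSelector []string // values of --accelerated-node-selector
}

// collectWarnings returns the validation warnings for the given options.
// It only warns; it never rejects the options.
func collectWarnings(o bundleOptions) []string {
	var warnings []string
	if !o.SkyhookCustomizations {
		return warnings
	}
	// Check 1: training intent without a workload selector.
	if o.Intent == "training" && len(o.WorkloadSelector) == 0 {
		warnings = append(warnings, "Warning: skyhook-customizations is enabled with training intent but --workload-selector is not set. This may cause skyhook to evict running training jobs. Consider setting --workload-selector to prevent eviction.")
	}
	// Check 2: training or inference intent without an accelerated node selector.
	if (o.Intent == "training" || o.Intent == "inference") && len(o.AcceleratedNodeSelector) == 0 {
		warnings = append(warnings, fmt.Sprintf("Warning: skyhook-customizations is enabled with %s intent but --accelerated-node-selector is not set. Without this selector, the customization will run on all nodes. Consider setting --accelerated-node-selector to target specific nodes.", o.Intent))
	}
	return warnings
}

func main() {
	opts := bundleOptions{SkyhookCustomizations: true, Intent: "training"}
	for _, w := range collectWarnings(opts) {
		fmt.Println(w)
	}
}
```

With training intent and neither selector set, both warnings are returned; with inference intent, only the accelerated selector warning applies.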

**Examples:**

```shell
# Generate bundle with day 2 options for training workloads
```
		slog.Warn("skyhook-customizations is enabled with training intent but --workload-selector is not set",
			"component", "skyhook-customizations",
			"intent", "training",
		)
		// Store warning to be added to deployment notes
		b.warnings = append(b.warnings, "Warning: skyhook-customizations is enabled with training intent but --workload-selector is not set. This may cause skyhook to evict running training jobs. Consider setting --workload-selector to prevent eviction.")
	}
}

// validateAcceleratedSelector validates that accelerated-node-selector is set when skyhook-customizations
		slog.Warn("skyhook-customizations is enabled with training/inference intent but --accelerated-node-selector is not set",
			"component", "skyhook-customizations",
			"intent", intent,
		)
		// Store warning to be added to deployment notes
		warningMsg := fmt.Sprintf("Warning: skyhook-customizations is enabled with %s intent but --accelerated-node-selector is not set. Without this selector, the customization will run on all nodes. Consider setting --accelerated-node-selector to target specific nodes.", intent)
		b.warnings = append(b.warnings, warningMsg)
	}
}

// writeRecipeFile serializes the recipe to the bundle directory.