18 changes: 18 additions & 0 deletions tests/common/support/dscInitialization.go
@@ -47,6 +47,24 @@ func GetApplicationsNamespace(test Test) (string, error) {
	return GetApplicationsNamespaceFromDSCI(test, DefaultDSCIName)
}

// GetRHOAIVersionFromDSCI returns status.release.version from the default DSCI.
// It logs the reason and returns an empty string if the DSCI cannot be fetched
// or the field is missing or not a string.
func GetRHOAIVersionFromDSCI(test Test) string {
	dsci, err := GetDSCI(test, DefaultDSCIName)
	if err != nil {
		test.T().Logf("Failed to get DSCI for version: %v", err)
		return ""
	}
	version, found, err := unstructured.NestedString(dsci.Object, "status", "release", "version")
	if err != nil {
		test.T().Logf("Failed to read status.release.version from DSCI %s: %v", DefaultDSCIName, err)
		return ""
	}
	if !found {
		test.T().Logf("DSCI %s is missing status.release.version", DefaultDSCIName)
		return ""
	}
	return version
}

func GetApplicationsNamespaceFromDSCI(test Test, dsciName string) (string, error) {
	dsci, err := GetDSCI(test, dsciName)
	if err != nil {
36 changes: 36 additions & 0 deletions tests/trainer/README.md
@@ -63,6 +63,42 @@ go test ./tests/trainer/ -v
go test ./tests/trainer -run TestCustomTrainingRuntimesAvailable -v
```

## Upgrade Tests

Upgrade tests validate that Trainer v2 resources survive an RHOAI upgrade. They run in two phases controlled by `TEST_TIER`:

```bash
# Pre-upgrade: create resources and store baselines
TEST_TIER=Pre-Upgrade go test -v -timeout 10m ./tests/trainer/

# ... perform RHOAI upgrade ...

# Post-upgrade: verify resources survived and complete workloads
TEST_TIER=Post-Upgrade go test -v -timeout 10m ./tests/trainer/
```
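
As a rough illustration of how a test could gate itself on `TEST_TIER`, the sketch below compares the environment value against the tier a test belongs to. The helper `shouldRun` and the two constants are hypothetical names introduced here; the repository's actual gating logic is not shown in this README and may behave differently (for example, when `TEST_TIER` is unset).

```go
package main

import (
	"fmt"
	"os"
)

// Tier values as used in the commands above.
const (
	tierPreUpgrade  = "Pre-Upgrade"
	tierPostUpgrade = "Post-Upgrade"
)

// shouldRun reports whether a test that belongs to wantTier should run for
// the given TEST_TIER value. Hypothetical helper, not the repository's API.
func shouldRun(wantTier, testTier string) bool {
	return testTier == wantTier
}

func main() {
	current := os.Getenv("TEST_TIER")
	for _, tier := range []string{tierPreUpgrade, tierPostUpgrade} {
		fmt.Printf("%s tests run: %v\n", tier, shouldRun(tier, current))
	}
}
```

Inside a real test, a false result would typically lead to a `t.Skipf(...)` call so the other phase's tests are reported as skipped rather than failed.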

### Test Coverage

| Test Pair | What it validates |
|-----------|-------------------|
| `TestSetupSleepTrainJob` / `TestVerifySleepTrainJob` | Running TrainJob survives upgrade with zero pod restarts |
| `TestSetupTrainingRuntime` / `TestVerifyTrainingRuntime` | Custom namespace-scoped TrainingRuntime persists, spec unchanged |
| `TestSetupCustomRuntimeUpgradeTrainJob` / `TestRunCustomRuntimeUpgradeTrainJob` | Custom ClusterTrainingRuntime + Kueue suspend/resume lifecycle |

### Spec Integrity Checks

Post-upgrade tests compare each resource's `metadata.generation` against the pre-upgrade baseline stored in ConfigMaps. A changed generation indicates a spec mutation, in which case the before and after specs are logged as JSON for analysis. The assertion is version-aware: an explicit allowlist in [`utils/utils_upgrade.go`](utils/utils_upgrade.go) defines the upgrade paths where spec mutations are expected (e.g., API changes across minor versions). The RHOAI version is read from the DSCI's `status.release.version`.
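
The version-aware check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `utils/utils_upgrade.go`: the `upgradePath` type, the `expectedMutationPaths` map, the helper names, and the version numbers are all made up for the example, and the real allowlist may be keyed differently.

```go
package main

import (
	"fmt"
	"strings"
)

// upgradePath is a hypothetical allowlist key using "major.minor" versions.
type upgradePath struct{ from, to string }

// expectedMutationPaths is an illustrative allowlist; the entry is a placeholder.
var expectedMutationPaths = map[upgradePath]bool{
	{from: "3.2", to: "3.3"}: true,
}

// minorOf trims a version such as "3.3.1" down to "3.3".
func minorOf(version string) string {
	parts := strings.SplitN(version, ".", 3)
	if len(parts) < 2 {
		return version
	}
	return parts[0] + "." + parts[1]
}

// checkSpecIntegrity returns an error only when metadata.generation changed
// on an upgrade path that is not in the allowlist.
func checkSpecIntegrity(baselineGen, currentGen int64, fromVersion, toVersion string) error {
	if baselineGen == currentGen {
		return nil // spec untouched
	}
	path := upgradePath{from: minorOf(fromVersion), to: minorOf(toVersion)}
	if expectedMutationPaths[path] {
		return nil // mutation is expected on this path
	}
	return fmt.Errorf("generation changed %d -> %d on unexpected upgrade path %s -> %s",
		baselineGen, currentGen, path.from, path.to)
}

func main() {
	fmt.Println(checkSpecIntegrity(1, 1, "3.2.0", "3.3.0"))
	fmt.Println(checkSpecIntegrity(1, 2, "3.2.0", "3.3.0"))
	fmt.Println(checkSpecIntegrity(1, 2, "3.3.0", "3.4.0"))
}
```

Keying the allowlist on minor versions keeps patch releases from needing their own entries; whether the actual implementation makes the same choice is not stated here.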

### Known Limitations

- **RHOAIENG-48867**: Four Kueue suspend/resume tests are skipped because the Trainer controller fails to update the immutable JobSet `spec.replicatedJobs` field when built-in ClusterTrainingRuntime specs change during upgrade. This only affects suspended jobs that reference default or versioned runtimes; running jobs and custom runtimes are not impacted.
- Tests are version-agnostic: the upgrade path under test depends on the Jenkins pipeline deployment configuration.

### Maintenance

- When Trainer API changes introduce spec mutations during upgrade, add the version pair to `specMutationExpectedPaths` in [`utils/utils_upgrade.go`](utils/utils_upgrade.go).
- When RHOAIENG-48867 is fixed upstream, remove the `t.Skip` calls in `trainer_kueue_upgrade_training_test.go` to enable the default and specific runtime Kueue tests.

## GPU Requirements

> **Note:** The TrainingHub SDK tests (`TestOsftTrainingHubMultiNodeMultiGPU`, `TestLoraTrainingHubMultiNodeMultiGPU`, `TestSftTrainingHubMultiNodeMultiGPU`) require **NVIDIA Ampere or newer GPUs** (e.g. A100, H100). The training runtime image (`odh-training-cuda128-torch29-py312-rhel9`, referenced as `DefaultTrainingHubRuntimeCUDA` in [`tests/trainer/utils/utils_runtimes.go`](utils/utils_runtimes.go)) ships with `flash_attn==2.8.3`, which requires compute capability >= 8.0. These tests will not work on pre-Ampere GPUs such as T4 or V100.
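
A pre-flight check for the compute capability threshold in the note above could look like the sketch below. It assumes the capability is already available as a string such as `"8.0"`; how that string is obtained from the cluster is left out, and the function and constant names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minMajorCapability reflects the ">= 8.0" requirement stated in the note.
const minMajorCapability = 8

// meetsFlashAttnRequirement reports whether a compute capability string such
// as "8.0" or "7.5" satisfies the threshold. Only the major version matters
// because the threshold is exactly 8.0.
func meetsFlashAttnRequirement(computeCap string) (bool, error) {
	majorStr := strings.SplitN(strings.TrimSpace(computeCap), ".", 2)[0]
	major, err := strconv.Atoi(majorStr)
	if err != nil {
		return false, fmt.Errorf("invalid compute capability %q: %w", computeCap, err)
	}
	return major >= minMajorCapability, nil
}

func main() {
	gpus := []struct{ name, computeCap string }{
		{"T4", "7.5"},
		{"V100", "7.0"},
		{"A100", "8.0"},
		{"H100", "9.0"},
	}
	for _, g := range gpus {
		ok, err := meetsFlashAttnRequirement(g.computeCap)
		fmt.Printf("%s (%s): supported=%v err=%v\n", g.name, g.computeCap, ok, err)
	}
}
```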