
[CP 1295] feat(tests/k8s-e2e): add GPU Operator e2e test suite with DME verific…#509

Open
ci-penbot-01 wants to merge 1 commit into ROCm:main from ci-penbot-01:CP.O2O.pensando.gpu-operator.1295.rocm.gpu-operator.main

Conversation

@ci-penbot-01
Contributor

Cherry-pick of pensando/gpu-operator#1295


Source PR Description (pensando/gpu-operator#1295):

…ation

Adds a containerized GPU Operator e2e test suite (Op000-Op900) that covers the full operator lifecycle with DME integration:

  • Op000: pre-flight checks
  • Op010-Op020: operator install and readiness
  • Op030: KMM controller verification
  • Op040: DME DaemonSet verification
  • Op050: GPU health check
  • Op060: metrics endpoint validation
  • Op065: partition configuration validation
  • Op070: GPU workload (PyTorch matmul)
  • Op900: operator teardown

Includes Dockerfile for containerized test runner.

  2026/04/06 01:07:03 Op000: adding cert-manager helm repository
  2026/04/06 01:07:03 Op000: installing cert-manager
  2026/04/06 01:08:46 Op000: cert-manager installed
  2026/04/06 01:08:46 Op001: installing GPU Operator main-78a1f8d9 (DME: 1.5.0-rocm7.12-caa3d592)
  2026/04/06 01:08:52 Op001: GPU Operator installed
  2026/04/06 01:08:52 Op010: waiting for NFD to label node feature.node.kubernetes.io/amd-gpu=true
  2026/04/06 01:09:03 Op010: 1 node(s) have amd-gpu label
  2026/04/06 01:09:03 Op010: node galena3836 label amd.com/gpu.device-id=75a8
  2026/04/06 01:09:03 Op010: node galena3836 label amd.com/gpu.product-name=AMD_Radeon_Graphics
  2026/04/06 01:09:03 Op010: node galena3836 label amd.com/gpu.family=AI
  2026/04/06 01:09:03 Op010: node galena3836 label amd.com/gpu.driver-version=6.18.8
  2026/04/06 01:09:03 Op010: node galena3836 label amd.com/gpu.vram=144G
  2026/04/06 01:09:03 Op010: node galena3836 label amd.com/gpu.simd-count=512
  2026/04/06 01:09:03 Op010: node galena3836 label amd.com/gpu.cu-count=128
  2026/04/06 01:09:03 Op020: node galena3836 allocatable amd.com/gpu = 1
  2026/04/06 01:09:03 Op030: found 1 KMM controller pod(s); phase=Running
  2026/04/06 01:09:03 Op040: using DME pod default-metrics-exporter-w52mr (phase=Running)
  2026/04/06 01:09:08 Op040: DME returned 98 metric families; all required metrics present
  2026/04/06 01:09:13 Op050: gpu_health — healthy=1 unhealthy=0
  2026/04/06 01:09:19 Op060: category "ecc" OK
  2026/04/06 01:09:19 Op060: category "clock" OK
  2026/04/06 01:09:19 Op060: category "vram" OK
  2026/04/06 01:09:19 Op060: category "pcie" OK
  2026/04/06 01:09:19 Op060: category "power" OK
  2026/04/06 01:09:19 Op060: category "temperature" OK
  2026/04/06 01:09:19 Op060: 98 total metric families verified
  2026/04/06 01:09:24 Op065: detected 1 partitioned GPU instance(s)
  2026/04/06 01:09:24 Op065: GPU 0 partition_id=0 compute_type=SPX memory_type=NPS1
  2026/04/06 01:09:24 Op065: per-partition metric "gpu_gfx_busy_instantaneous" present for all 1 partition(s)
  2026/04/06 01:09:24 Op065: per-partition metric "gpu_total_vram" present for all 1 partition(s)
  2026/04/06 01:09:24 Op065: per-partition metric "gpu_used_vram" present for all 1 partition(s)
  2026/04/06 01:09:24 Op065: partition validation passed for 1 GPU partition(s)
  2026/04/06 01:09:25 Op070: submitting GPU workload pod op-e2e-gpu-workload
  2026/04/06 01:09:35 Op070: workload logs:
                     ROCm available: True
                     DONE
  2026/04/06 01:10:14 Op900: teardown complete
  OK: 11 passed
  --- PASS: Test (211.66s)
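
The Op040 and Op065 lines in the log above report that required metrics are present in the DME output. A minimal sketch of such a check, assuming the exporter serves Prometheus text exposition format, could look like the following. `MetricNames` and `Missing` are hypothetical helpers, not functions from this PR. The sample input is made up for illustration, and the `gpu_id` label is an assumed name. The sketch matches on sample names, so histogram or summary suffixes such as `_bucket` would need extra handling.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// MetricNames collects the sample names found in a Prometheus text-format body.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
func MetricNames(body string) map[string]bool {
	names := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The name ends at the label block or at the whitespace before the value.
		end := strings.IndexAny(line, "{ \t")
		if end < 0 {
			end = len(line)
		}
		names[line[:end]] = true
	}
	return names
}

// Missing returns the required metric names that do not appear in body.
func Missing(body string, required []string) []string {
	names := MetricNames(body)
	var missing []string
	for _, r := range required {
		if !names[r] {
			missing = append(missing, r)
		}
	}
	return missing
}

func main() {
	// Illustrative input only; not real exporter output.
	sample := `# TYPE gpu_total_vram gauge
gpu_total_vram{gpu_id="0"} 1
gpu_used_vram{gpu_id="0"} 1
`
	required := []string{"gpu_total_vram", "gpu_used_vram", "gpu_gfx_busy_instantaneous"}
	fmt.Println(Missing(sample, required)) // prints: [gpu_gfx_busy_instantaneous]
}
```

An empty result from `Missing` would correspond to the "all required metrics present" line in the log.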

Cherrypick triggered by: ACP-Automation

…ation (#1295)

(cherry picked from commit 65785ed10aef1211116af75cf1b4f9484c63d658)
Contributor

@spraveenio spraveenio left a comment

Lgtm

Member

@yansun1996 yansun1996 left a comment

LGTM


4 participants