
feat(validator): add Kubeflow Trainer to robust-controller and skip inference-gateway on training clusters#349

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from
yuanchen8911:feat/robust-controller-kubeflow
Mar 12, 2026


@yuanchen8911 yuanchen8911 commented Mar 11, 2026

Summary

  • Add Kubeflow Trainer as an alternative target for the robust-controller CNCF conformance check
  • Skip inference-gateway check on training clusters (no kgateway)
  • Add Kubeflow Trainer support to evidence collection script (collect-evidence.sh)
  • Skip inference gateway evidence collection when kgateway is not installed

Problem

  1. robust-controller: Only validated the Dynamo operator, so the check was skipped on all training clusters (EKS and GKE)
  2. inference-gateway: Failed with "GatewayClass 'kgateway' not found" on training clusters instead of skipping gracefully

Changes

robust-controller (Go + evidence script)

Check selects target based on recipe component presence:

  • dynamo-platform in recipe → validate Dynamo operator (existing)
  • kubeflow-trainer in recipe → validate Kubeflow Trainer (new)
  • neither → skip
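The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Target` type, the `selectTarget` function, the slice-based recipe representation, and the signature of `recipeHasComponent` are all assumptions made for the example, as is checking Dynamo first when both components are present.

```go
package main

import "fmt"

// Target names the operator the robust-controller check would validate.
// Hypothetical type, introduced only for this sketch.
type Target string

const (
	TargetDynamo   Target = "dynamo"
	TargetKubeflow Target = "kubeflow-trainer"
	TargetSkip     Target = "skip"
)

// recipeHasComponent reports whether name appears in the component list.
// The real helper's signature is not shown in the PR; this one is assumed.
func recipeHasComponent(components []string, name string) bool {
	for _, c := range components {
		if c == name {
			return true
		}
	}
	return false
}

// selectTarget applies the routing rule from the list above.
// Checking dynamo-platform first is an assumption about precedence.
func selectTarget(components []string) Target {
	switch {
	case recipeHasComponent(components, "dynamo-platform"):
		return TargetDynamo
	case recipeHasComponent(components, "kubeflow-trainer"):
		return TargetKubeflow
	default:
		return TargetSkip
	}
}

func main() {
	// A training recipe with Kubeflow Trainer, then one with neither component.
	fmt.Println(selectTarget([]string{"gpu-operator", "kubeflow-trainer"}))
	fmt.Println(selectTarget([]string{"gpu-operator"}))
}
```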

Kubeflow Trainer validation:

  1. Controller deployment running (kubeflow-trainer-controller-manager)
  2. Validating webhook operational — exact match on validator.trainer.kubeflow.org
  3. TrainJob CRD exists (trainjobs.trainer.kubeflow.org)
  4. Webhook rejects invalid TrainJob (non-existent runtimeRef, proves webhook logic)

Evidence script (collect-evidence.sh):

  • Auto-detects Dynamo vs Kubeflow Trainer
  • Collects deployment health, CRDs, webhook verification, rejection test

inference-gateway (Go + evidence script)

  • Skip when kgateway component is not in recipe (training clusters)
  • Evidence script skips when kgateway deployment not found

Test plan

  • Unit tests: TestRecipeHasComponent (5 cases), TestCheckRobustControllerRouting (6 cases)
  • go build, go vet, gofmt clean
  • E2E: GKE training cluster — robust-controller PASSED (Kubeflow Trainer)
  • E2E: EKS training cluster — robust-controller PASSED (Kubeflow Trainer)
  • E2E: EKS inference cluster — robust-controller PASSED (Dynamo)
  • Evidence collection: robust-operator.md generated with Kubeflow Trainer evidence
  • No regressions on Dynamo path

@yuanchen8911 yuanchen8911 requested review from a team as code owners March 11, 2026 21:29
@yuanchen8911 yuanchen8911 added the enhancement and area/validator labels Mar 11, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch from 1ae1119 to 60c6076 Compare March 11, 2026 21:31
@yuanchen8911 yuanchen8911 requested review from dims and mchmarny March 11, 2026 21:39
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch from 60c6076 to 0972e98 Compare March 11, 2026 21:44
dims previously approved these changes Mar 11, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch from 0972e98 to 1f0f531 Compare March 11, 2026 21:51
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch from 1f0f531 to 73f5395 Compare March 11, 2026 22:23
@github-actions github-actions bot added size/XL and removed size/L labels Mar 11, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch from 73f5395 to 2b9392a Compare March 11, 2026 22:40
@yuanchen8911 yuanchen8911 changed the title feat(validator): add Kubeflow Trainer support to robust-controller conformance check feat(validator): add Kubeflow Trainer to robust-controller and skip inference-gateway on training clusters Mar 11, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch 5 times, most recently from 51d0372 to 7e21445 Compare March 11, 2026 23:03
@yuanchen8911 yuanchen8911 requested a review from dims March 11, 2026 23:04
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch from 7e21445 to 3e0586b Compare March 11, 2026 23:26
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch from 3e0586b to d2b47f9 Compare March 11, 2026 23:49
The robust-controller conformance check previously only validated the
Dynamo operator, causing it to skip on all training clusters. This adds
Kubeflow Trainer as an alternative target, selected based on recipe
component presence:

- dynamo-platform in recipe → validate Dynamo operator
- kubeflow-trainer in recipe → validate Kubeflow Trainer
- neither → skip

Kubeflow Trainer validation checks:
1. Controller deployment running (kubeflow-trainer-controller-manager)
2. Validating webhook operational with reachable endpoint
3. TrainJob CRD exists (trainjobs.trainer.kubeflow.org)
4. Webhook rejects invalid TrainJob (behavioral test)

Refactored the original Dynamo validation into checkRobustDynamo() and
renamed validateWebhookRejects to validateDynamoWebhookRejects for
clarity.
@yuanchen8911 yuanchen8911 force-pushed the feat/robust-controller-kubeflow branch from d2b47f9 to 3a0c795 Compare March 12, 2026 00:05
@yuanchen8911 yuanchen8911 merged commit 8312960 into NVIDIA:main Mar 12, 2026
39 checks passed