@shmuel-runai
Contributor

/kind feature

ref: GREP-270

Implement ComputeDomain lifecycle management in the PodCliqueSet controller to enable automatic Multi-Node NVLink support for GPU workloads.

  • Add ComputeDomain component with Sync/Delete operations
  • Handle scale-out (create CDs) and scale-in (remove CDs)
  • Use finalizers to protect CDs from accidental deletion
  • Register the component in PCS reconcile order before PodClique
  • Add comprehensive unit tests with fake client support
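The scale-out and scale-in handling listed above can be pictured as a diff between the ComputeDomains that should exist and those that do. The following is only a minimal sketch of that idea, not the code in this PR: the `cdName` naming scheme, the `diffComputeDomains` helper, and the one-CD-per-replica assumption are all hypothetical, and the sketch works on plain names rather than Kubernetes objects.

```go
package main

import (
	"fmt"
	"sort"
)

// cdName builds a per-replica ComputeDomain name.
// The naming scheme is an assumption made for this sketch.
func cdName(pcsName string, replica int) string {
	return fmt.Sprintf("%s-%d-cd", pcsName, replica)
}

// diffComputeDomains compares the names expected for the desired replica
// count with the names that currently exist. It returns the names to
// create (scale-out) and the names to delete (scale-in), both sorted.
func diffComputeDomains(pcsName string, replicas int, existing []string) (toCreate, toDelete []string) {
	desired := make(map[string]bool, replicas)
	for i := 0; i < replicas; i++ {
		desired[cdName(pcsName, i)] = true
	}

	have := make(map[string]bool, len(existing))
	for _, name := range existing {
		have[name] = true
		if !desired[name] {
			toDelete = append(toDelete, name)
		}
	}

	for name := range desired {
		if !have[name] {
			toCreate = append(toCreate, name)
		}
	}

	// Map iteration order is random, so sort for deterministic output.
	sort.Strings(toCreate)
	sort.Strings(toDelete)
	return toCreate, toDelete
}

func main() {
	create, remove := diffComputeDomains("demo", 3, []string{"demo-0-cd", "demo-3-cd"})
	fmt.Println("create:", create) // create: [demo-1-cd demo-2-cd]
	fmt.Println("delete:", remove) // delete: [demo-3-cd]
}
```

In a real controller the delete list would also need its finalizers removed before the objects can go away, which this sketch leaves out.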

danbar2 previously approved these changes Jan 22, 2026
@gflarity
Contributor

gflarity commented Jan 22, 2026

I'm looking into why these E2E tests are failing. One thing already stands out: the early tests fail to clean up their resources, which causes the remaining tests to be skipped. Try rebasing on main; I added a PR that improves cleanup, which should help.

@gflarity
Contributor

From the logs, you can see the controller continuously retrying:

Error deleting managed resources...
FailedTasks: [delete-ComputeDomain]
error: no matches for kind "ComputeDomain" in version "resource.nvidia.com/v1beta1"
