Commit 050349f

Reconcile PodClique TopologyConstraints (#302)
* Changes for setting Topology aware scheduling constraints in PodGangs.
* Renamed ClusterTopologyConfiguration to TopologyAwareSchedulingConfiguration in operator config.
* Introduced a new condition TopologyLevelsUnavailable on PCS
* PackDomain field in corev1alpha1 TopologyConstraint is now required.
* When creating ClusterTopology, if host topology level is not defined in TopologyAwareSchedulingConfiguration then the operator will set this level in ClusterTopology as this is a required level.
* Adapted PodGang component to set pack constraints at all hierarchy levels.
* Introduced Conditions in PCS status.
* Introduced a new condition TopologyLevelsUnavailable in PCS status.
* Added reconciliation code to update the PCS status condition.
* feat: implement topology-aware scheduling and add annotations for PodGangs
* Added constants for TopologyLevelsUnavailable condition reason
* Added code to update or remove the condition on PCS.
* Create utility function for cluster topology with unit test
* Made PodCliqueSetStatus.Conditions optional
* Upgraded KAI scheduler version dependency for e2e test to v0.12.0
* Changed polling timeout for e2e tests to 4 mins due to repeated timeouts on GHA
* Removed NVIDIA GPU operator to be installed as its not required.
* Fixed constant values for condition reasons and reduced the no of control plane servers for e2e from 3 to 1
* Moved `synchronizeTopology` in main to clustertopology package.
* Added restartPolicy to Always for Grove operator deployment.
* Removed defaulting preferred constraint to Host topology domain. This will be set later after requirements are clear.
* Moved GetClusterTopologyLevels to clustertopology package.
* Added docstrings to computeTopologyLevelsUnavailableCondition and mutateTopologyLevelUnavailableConditions functions.
* Changed the docstring for TopologyPackConstraint Preferred and Required fields.
* Reworded docstring for PodGang TopologyConstraint.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>

---------

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Co-authored-by: Ron Kahn <rkahn@nvidia.com>
1 parent 907000e commit 050349f

41 files changed: +2002 -848 lines changed

docs/api-reference/operator-api.md

Lines changed: 19 additions & 18 deletions
@@ -424,6 +424,7 @@ _Appears in:_
 | Field | Description | Default | Validation |
 | --- | --- | --- | --- |
 | `observedGeneration` _integer_ | ObservedGeneration is the most recent generation observed by the controller. | | |
+| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.33/#condition-v1-meta) array_ | Conditions represents the latest available observations of the PodCliqueSet by its controller. | | |
 | `lastErrors` _[LastError](#lasterror) array_ | LastErrors captures the last errors observed by the controller when reconciling the PodCliqueSet. | | |
 | `replicas` _integer_ | Replicas is the total number of PodCliqueSet replicas created. | | |
 | `updatedReplicas` _integer_ | UpdatedReplicas is the number of replicas that have been updated to the desired revision of the PodCliqueSet. | 0 | |
@@ -638,8 +639,8 @@ allowing workload operators a consistent way to reference topology levels when d


 _Appears in:_
-- [ClusterTopologyConfiguration](#clustertopologyconfiguration)
 - [ClusterTopologySpec](#clustertopologyspec)
+- [TopologyAwareSchedulingConfiguration](#topologyawareschedulingconfiguration)

 | Field | Description | Default | Validation |
 | --- | --- | --- | --- |
@@ -689,23 +690,6 @@ _Appears in:_
 | `acceptContentTypes` _string_ | AcceptContentTypes defines the Accept header sent by clients when connecting to the server,<br />overriding the default value of 'application/json'. This field will control all connections<br />to the server used by a particular client. | | |


-#### ClusterTopologyConfiguration
-
-
-
-ClusterTopologyConfiguration defines the configuration for topology-aware scheduling.
-
-
-
-_Appears in:_
-- [OperatorConfiguration](#operatorconfiguration)
-
-| Field | Description | Default | Validation |
-| --- | --- | --- | --- |
-| `enabled` _boolean_ | Enabled indicates whether topology-aware scheduling is enabled. | | |
-| `levels` _[TopologyLevel](#topologylevel) array_ | Levels is an ordered list of topology levels from broadest to narrowest scope.<br />Used to create/update the ClusterTopology CR at operator startup. | | |
-
-
 #### ControllerConfiguration


@@ -883,6 +867,23 @@ _Appears in:_
 | `metrics` _[Server](#server)_ | Metrics is the configuration for serving the metrics endpoint. | | |


+#### TopologyAwareSchedulingConfiguration
+
+
+
+TopologyAwareSchedulingConfiguration defines the configuration for topology-aware scheduling.
+
+
+
+_Appears in:_
+- [OperatorConfiguration](#operatorconfiguration)
+
+| Field | Description | Default | Validation |
+| --- | --- | --- | --- |
+| `enabled` _boolean_ | Enabled indicates whether topology-aware scheduling is enabled. | | |
+| `levels` _[TopologyLevel](#topologylevel) array_ | Levels is an ordered list of topology levels from broadest to narrowest scope.<br />Used to create/update the TopologyAwareScheduling CR at operator startup. | | |
+
+
 #### WebhookServer


docs/api-reference/scheduler-api.md

Lines changed: 6 additions & 5 deletions
@@ -86,7 +86,7 @@ _Appears in:_
 | --- | --- | --- | --- |
 | `podgroups` _[PodGroup](#podgroup) array_ | PodGroups is a list of member pod groups in the PodGang. | | |
 | `topologyConstraint` _[TopologyConstraint](#topologyconstraint)_ | TopologyConstraint defines topology packing constraints for entire pod gang.<br />Translated from PodCliqueSet.TopologyConstraint.<br />Updated by operator on each reconciliation when PodCliqueSet topology constraints change. | | |
-| `topologyConstraintGroupConfigs` _[TopologyConstraintGroupConfig](#topologyconstraintgroupconfig) array_ | TopologyConstraintGroupConfigs defines groups of PodGroups for topology-aware placement.<br />Enhanced with topology constraints for PCSG-level packing.<br />Updated by operator on each reconciliation when PCSG topology constraints change. | | |
+| `topologyConstraintGroupConfigs` _[TopologyConstraintGroupConfig](#topologyconstraintgroupconfig) array_ | TopologyConstraintGroupConfigs defines TopologyConstraints for a group of PodGroups when it is a strict subset<br />of total number of PodGroups for topology-aware placement. | | |
 | `priorityClassName` _string_ | PriorityClassName is the name of the PriorityClass for the PodGang. | | |
 | `reuseReservationRef` _[NamespacedName](#namespacedname)_ | ReuseReservationRef holds the reference to another PodGang resource scheduled previously.<br />During updates, an operator can suggest to reuse the reservation of the previous PodGang for a newer version of the<br />PodGang resource. This is a suggestion for the scheduler and not a requirement that must be met. If the scheduler plugin<br />finds that the reservation done previously was network optimised and there are no better alternatives available, then it<br />will reuse the reservation. If there are better alternatives available, then the scheduler will ignore this suggestion. | | |

@@ -159,16 +159,17 @@ _Appears in:_

 | Field | Description | Default | Validation |
 | --- | --- | --- | --- |
-| `name` _string_ | Name is the name of the topology constraint group.<br />It will drive from the corresponding PCSG name. | | |
+| `name` _string_ | Name is the name of the topology constraint group. | | |
 | `podGroupNames` _string array_ | PodGroupNames is the list of PodGroup names in the topology constraint group. | | |
-| `topologyConstraint` _[TopologyConstraint](#topologyconstraint)_ | TopologyConstraint defines topology packing constraints for this group.<br />Enables PCSG-level topology constraints.<br />Updated by operator when PodCliqueScalingGroup topology constraints change. | | |
+| `topologyConstraint` _[TopologyConstraint](#topologyconstraint)_ | TopologyConstraint defines topology packing constraints for this group. | | |


 #### TopologyPackConstraint



 TopologyPackConstraint defines a topology packing constraint.
+Each of Required and Preferred fields hold a topologyKey, e.g. "kubernetes.io/hostname" ( these are key of labels added on nodes).



@@ -177,7 +178,7 @@ _Appears in:_

 | Field | Description | Default | Validation |
 | --- | --- | --- | --- |
-| `required` _string_ | Required defines topology constraint that must be satisfied.<br />Holds topologyKey (not level name) translated from user's packLevel specification.<br />Example: "topology.kubernetes.io/rack" | | |
-| `preferred` _string_ | Preferred defines best-effort topology constraint.<br />Auto-generated by operator using strictest level topologyKey for optimization.<br />Scheduler can fallback to less strict levels if preferred cannot be satisfied.<br />Example: "kubernetes.io/hostname" | | |
+| `required` _string_ | Required defines a topology constraint that must be satisfied as a hard requirement. The workload will not be<br />scheduled if this constraint cannot be satisfied. Generally, it is easier for the scheduler to satisfy constraints<br />on topology domains with larger compute capacity, (e.g. zone or datacenter), than smaller domains, (e.g. host or<br />numa). Holds topologyKey (not level name) translated from user's packLevel specification.<br />Example: "topology.kubernetes.io/rack" | | |
+| `preferred` _string_ | Preferred defines best-effort topology constraint. Topology domains that provide the most optimized performance<br />with dense packing (e.g. host or numa) are typically used as preferred constraints for topology packing. It might be<br />harder to satisfy these constraints if the topology domains are limited in compute capacity. Since it is preferred<br />constraint, it is therefore not binding on the scheduler to mandatorily satisfy this packing constraint. Scheduler<br />can fall back to higher topology levels (upto Required constraint) if preferred cannot be satisfied.<br />Example: "kubernetes.io/hostname" | | |

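The Required/Preferred semantics documented in this hunk can be illustrated with a small Go sketch. It is not the KAI scheduler's algorithm: `pickPackKey`, the `keys` ordering (broadest to narrowest) and the `fits` capacity callback are all hypothetical names introduced here only to show the fallback direction, from the preferred key upward and never past the required key.

```go
package main

import "fmt"

// TopologyPackConstraint mirrors the two documented fields. Both hold
// topology keys (node label keys), not level names.
type TopologyPackConstraint struct {
	Required  string
	Preferred string
}

// pickPackKey is a hypothetical helper, not the scheduler's real logic.
// keys is ordered from broadest to narrowest domain. fits stands in for the
// scheduler's check that the workload can be packed into one domain of that key.
func pickPackKey(keys []string, c TopologyPackConstraint, fits func(key string) bool) (string, bool) {
	indexOf := func(want string) int {
		for i, k := range keys {
			if k == want {
				return i
			}
		}
		return -1
	}
	required := indexOf(c.Required)
	if required < 0 {
		return "", false // unknown required key: nothing can satisfy it
	}
	preferred := indexOf(c.Preferred)
	if preferred < required {
		preferred = required // no usable preference: only the hard requirement applies
	}
	// Try the narrowest (preferred) domain first, then widen up to Required.
	for i := preferred; i >= required; i-- {
		if fits(keys[i]) {
			return keys[i], true
		}
	}
	return "", false // even the required domain cannot hold the workload
}

func main() {
	keys := []string{"topology.kubernetes.io/zone", "topology.kubernetes.io/rack", "kubernetes.io/hostname"}
	c := TopologyPackConstraint{Required: "topology.kubernetes.io/rack", Preferred: "kubernetes.io/hostname"}
	// Pretend no single host has enough capacity, but a rack does.
	fits := func(key string) bool { return key != "kubernetes.io/hostname" }
	key, ok := pickPackKey(keys, c, fits)
	fmt.Println(key, ok)
}
```

In this sketch the workload is reported as unschedulable when even the required domain does not fit, which matches the documented statement that Required is a hard requirement.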

operator/api/common/constants/constants.go

Lines changed: 13 additions & 0 deletions
@@ -43,6 +43,8 @@ const (
 	// AnnotationDisableManagedResourceProtection is an annotation set by an operator on a PodCliqueSet to explicitly
 	// disable protection of managed resources for a PodCliqueSet.
 	AnnotationDisableManagedResourceProtection = "grove.io/disable-managed-resource-protection"
+	// AnnotationTopologyName is an annotation set on PodGang to allow KAI scheduler to discover which topology to use.
+	AnnotationTopologyName = "grove.io/topology-name"
 )

 // Constants for Grove environment variables
@@ -87,6 +89,9 @@ const (
 	// ConditionTypePodCliqueScheduled indicates that the PodClique has been successfully scheduled.
 	// This condition is set to true when number of scheduled pods in the PodClique is greater than or equal to PodCliqueSpec.MinAvailable.
 	ConditionTypePodCliqueScheduled = "PodCliqueScheduled"
+	// ConditionTopologyLevelsUnavailable indicates that the required topology levels defined on a PodCliqueSet for topology-aware scheduling are no longer available.
+	// This can happen when the ClusterTopology resource is modified which removes one or more levels required by the PodCliqueSet.
+	ConditionTopologyLevelsUnavailable = "TopologyLevelsUnavailable"
 )

 // Constants for Condition Reasons.
@@ -107,6 +112,14 @@ const (
 	ConditionReasonSufficientAvailablePCSGReplicas = "SufficientAvailablePodCliqueScalingGroupReplicas"
 	// ConditionReasonUpdateInProgress indicates that the resource is undergoing rolling update.
 	ConditionReasonUpdateInProgress = "UpdateInProgress"
+	// ConditionReasonClusterTopologyNotFound indicates that the ClusterTopology resource required for topology-aware scheduling was not found.
+	ConditionReasonClusterTopologyNotFound = "ClusterTopologyNotFound"
+	// ConditionReasonTopologyLevelsUnavailable indicates that the one or more required topology levels defined on a
+	// PodCliqueSet for topology-aware scheduling are no longer defined in the ClusterTopology resource.
+	ConditionReasonTopologyLevelsUnavailable = "ClusterTopologyLevelsUnavailable"
+	// ConditionReasonAllTopologyLevelsAvailable indicates that all required topology levels defined on a
+	// PodCliqueSet for topology-aware scheduling are defined in the ClusterTopology resource.
+	ConditionReasonAllTopologyLevelsAvailable = "AllClusterTopologyLevelsAvailable"
 )

 const (
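To show how the new condition type and its three reasons might fit together, here is a minimal Go sketch. It does not reproduce the commit's `computeTopologyLevelsUnavailableCondition`, whose body is not shown on this page. The `Condition` struct is a trimmed stand-in for `metav1.Condition`, `sketchCondition` is a hypothetical name, and the choice of status `Unknown` when the ClusterTopology is missing is an assumption; only the constant values are taken from the diff above.

```go
package main

import (
	"fmt"
	"strings"
)

// Values copied from the constants added in this commit.
const (
	ConditionTopologyLevelsUnavailable        = "TopologyLevelsUnavailable"
	ConditionReasonClusterTopologyNotFound    = "ClusterTopologyNotFound"
	ConditionReasonTopologyLevelsUnavailable  = "ClusterTopologyLevelsUnavailable"
	ConditionReasonAllTopologyLevelsAvailable = "AllClusterTopologyLevelsAvailable"
)

// Condition is a trimmed, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Reason  string
	Message string
}

// sketchCondition derives the condition from the levels a PodCliqueSet requires
// and the levels the ClusterTopology currently defines.
func sketchCondition(required, available []string, topologyFound bool) Condition {
	cond := Condition{Type: ConditionTopologyLevelsUnavailable}
	if !topologyFound {
		// Assumption: availability cannot be determined without the resource.
		cond.Status = "Unknown"
		cond.Reason = ConditionReasonClusterTopologyNotFound
		cond.Message = "ClusterTopology resource not found"
		return cond
	}
	have := make(map[string]bool, len(available))
	for _, level := range available {
		have[level] = true
	}
	var missing []string
	for _, level := range required {
		if !have[level] {
			missing = append(missing, level)
		}
	}
	if len(missing) > 0 {
		cond.Status = "True"
		cond.Reason = ConditionReasonTopologyLevelsUnavailable
		cond.Message = "unavailable topology levels: " + strings.Join(missing, ", ")
		return cond
	}
	cond.Status = "False"
	cond.Reason = ConditionReasonAllTopologyLevelsAvailable
	cond.Message = "all required topology levels are available"
	return cond
}

func main() {
	fmt.Printf("%+v\n", sketchCondition([]string{"rack", "host"}, []string{"zone", "host"}, true))
}
```

Because the condition type is phrased negatively ("Unavailable"), status `True` signals a problem and `False` signals the healthy state in this sketch.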

operator/api/config/v1alpha1/types.go

Lines changed: 13 additions & 13 deletions
@@ -55,16 +55,16 @@ var (

 // OperatorConfiguration defines the configuration for the Grove operator.
 type OperatorConfiguration struct {
-	metav1.TypeMeta  `json:",inline"`
-	ClientConnection ClientConnectionConfiguration `json:"runtimeClientConnection"`
-	LeaderElection   LeaderElectionConfiguration   `json:"leaderElection"`
-	Server           ServerConfiguration           `json:"server"`
-	Debugging        *DebuggingConfiguration       `json:"debugging,omitempty"`
-	Controllers      ControllerConfiguration       `json:"controllers"`
-	LogLevel         LogLevel                      `json:"logLevel"`
-	LogFormat        LogFormat                     `json:"logFormat"`
-	Authorizer       AuthorizerConfig              `json:"authorizer"`
-	ClusterTopology  ClusterTopologyConfiguration  `json:"clusterTopology"`
+	metav1.TypeMeta         `json:",inline"`
+	ClientConnection        ClientConnectionConfiguration        `json:"runtimeClientConnection"`
+	LeaderElection          LeaderElectionConfiguration          `json:"leaderElection"`
+	Server                  ServerConfiguration                  `json:"server"`
+	Debugging               *DebuggingConfiguration              `json:"debugging,omitempty"`
+	Controllers             ControllerConfiguration              `json:"controllers"`
+	LogLevel                LogLevel                             `json:"logLevel"`
+	LogFormat               LogFormat                            `json:"logFormat"`
+	Authorizer              AuthorizerConfig                     `json:"authorizer"`
+	TopologyAwareScheduling TopologyAwareSchedulingConfiguration `json:"topologyAwareScheduling"`
 }

 // LeaderElectionConfiguration defines the configuration for the leader election.
@@ -191,12 +191,12 @@ type AuthorizerConfig struct {
 	ExemptServiceAccountUserNames []string `json:"exemptServiceAccountUserNames,omitempty"`
 }

-// ClusterTopologyConfiguration defines the configuration for topology-aware scheduling.
-type ClusterTopologyConfiguration struct {
+// TopologyAwareSchedulingConfiguration defines the configuration for topology-aware scheduling.
+type TopologyAwareSchedulingConfiguration struct {
 	// Enabled indicates whether topology-aware scheduling is enabled.
 	Enabled bool `json:"enabled"`
 	// Levels is an ordered list of topology levels from broadest to narrowest scope.
-	// Used to create/update the ClusterTopology CR at operator startup.
+	// Used to create/update the TopologyAwareScheduling CR at operator startup.
 	// +optional
 	Levels []corev1alpha1.TopologyLevel `json:"levels,omitempty"`
 }
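The renamed config key and the host-level rule from the commit message ("if host topology level is not defined ... the operator will set this level") can be sketched together in Go. Everything beyond the `topologyAwareScheduling`, `enabled` and `levels` JSON keys is an assumption made for illustration: the `domain`/`key` field names of `TopologyLevel`, the domain value `host`, the key `kubernetes.io/hostname`, the helper name `withHostLevel`, and the use of a JSON document instead of the operator's actual config file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TopologyLevel is a hypothetical two-field stand-in for corev1alpha1.TopologyLevel.
type TopologyLevel struct {
	Domain string `json:"domain"`
	Key    string `json:"key"`
}

// TopologyAwareSchedulingConfiguration mirrors the struct shown in the diff.
type TopologyAwareSchedulingConfiguration struct {
	Enabled bool            `json:"enabled"`
	Levels  []TopologyLevel `json:"levels,omitempty"`
}

// operatorConfig keeps only the field that this commit renamed.
type operatorConfig struct {
	TopologyAwareScheduling TopologyAwareSchedulingConfiguration `json:"topologyAwareScheduling"`
}

// withHostLevel returns the levels with a host level appended when none is
// configured. It appends at the end on the assumption that host is the
// narrowest level in a broadest-to-narrowest list.
func withHostLevel(levels []TopologyLevel) []TopologyLevel {
	for _, l := range levels {
		if l.Domain == "host" {
			return levels
		}
	}
	out := append([]TopologyLevel{}, levels...) // copy so the input is not mutated
	return append(out, TopologyLevel{Domain: "host", Key: "kubernetes.io/hostname"})
}

func main() {
	raw := []byte(`{"topologyAwareScheduling": {"enabled": true, "levels": [{"domain": "zone", "key": "topology.kubernetes.io/zone"}]}}`)
	var cfg operatorConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if cfg.TopologyAwareScheduling.Enabled {
		for _, l := range withHostLevel(cfg.TopologyAwareScheduling.Levels) {
			fmt.Printf("%s -> %s\n", l.Domain, l.Key)
		}
	}
}
```

A config that still uses the old `clusterTopology` key would decode into a zero-valued `TopologyAwareScheduling` field in this sketch, leaving the feature disabled, so existing operator configs presumably need the key renamed.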

operator/api/config/v1alpha1/zz_generated.deepcopy.go

Lines changed: 22 additions & 22 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

operator/api/config/validation/validation.go

Lines changed: 3 additions & 3 deletions
@@ -37,7 +37,7 @@ func ValidateOperatorConfiguration(config *configv1alpha1.OperatorConfiguration)
 	allErrs = append(allErrs, validateLeaderElectionConfiguration(config.LeaderElection, field.NewPath("leaderElection"))...)
 	allErrs = append(allErrs, validateClientConnectionConfiguration(config.ClientConnection, field.NewPath("clientConnection"))...)
 	allErrs = append(allErrs, validateControllerConfiguration(config.Controllers, field.NewPath("controllers"))...)
-	allErrs = append(allErrs, validateClusterTopologyConfiguration(config.ClusterTopology, field.NewPath("clusterTopology"))...)
+	allErrs = append(allErrs, validateTopologyAwareSchedulingConfig(config.TopologyAwareScheduling, field.NewPath("topologyAwareScheduling"))...)
 	return allErrs
 }

@@ -116,10 +116,10 @@ func mustBeGreaterThanZeroDuration(duration metav1.Duration, fldPath *field.Path
 	return allErrs
 }

-// validateClusterTopologyConfiguration validates the cluster topology configuration.
+// validateTopologyAwareSchedulingConfig validates the cluster topology configuration.
 // When cluster topology is enabled, it ensures the topology name and levels are provided,
 // and validates domain and key uniqueness.
-func validateClusterTopologyConfiguration(clusterTopologyCfg configv1alpha1.ClusterTopologyConfiguration, fldPath *field.Path) field.ErrorList {
+func validateTopologyAwareSchedulingConfig(clusterTopologyCfg configv1alpha1.TopologyAwareSchedulingConfiguration, fldPath *field.Path) field.ErrorList {
 	allErrs := field.ErrorList{}
 	if !clusterTopologyCfg.Enabled {
 		return allErrs
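The hunk is cut off after the early return, so the rest of the function is not visible here. Going only by its doc comment (levels must be provided when enabled, domains and keys must be unique), the checks might look roughly like the Go sketch below. It uses plain `error` values instead of the `field.ErrorList` the real function returns, and the `TopologyLevel` field names, the `Config` type and the `validateSketch` name are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// TopologyLevel is a hypothetical stand-in for corev1alpha1.TopologyLevel.
type TopologyLevel struct {
	Domain string
	Key    string
}

// Config is a hypothetical stand-in for TopologyAwareSchedulingConfiguration.
type Config struct {
	Enabled bool
	Levels  []TopologyLevel
}

// validateSketch returns one error per violated rule. A disabled config is
// always valid, matching the early return visible in the diff.
func validateSketch(cfg Config) []error {
	var errs []error
	if !cfg.Enabled {
		return errs
	}
	if len(cfg.Levels) == 0 {
		return append(errs, errors.New("levels: at least one topology level is required when enabled"))
	}
	seenDomains := map[string]int{} // domain -> index of first occurrence
	seenKeys := map[string]int{}    // key -> index of first occurrence
	for i, level := range cfg.Levels {
		if first, dup := seenDomains[level.Domain]; dup {
			errs = append(errs, fmt.Errorf("levels[%d].domain: %q duplicates levels[%d]", i, level.Domain, first))
		} else {
			seenDomains[level.Domain] = i
		}
		if first, dup := seenKeys[level.Key]; dup {
			errs = append(errs, fmt.Errorf("levels[%d].key: %q duplicates levels[%d]", i, level.Key, first))
		} else {
			seenKeys[level.Key] = i
		}
	}
	return errs
}

func main() {
	cfg := Config{Enabled: true, Levels: []TopologyLevel{
		{Domain: "zone", Key: "topology.kubernetes.io/zone"},
		{Domain: "zone", Key: "topology.kubernetes.io/zone"},
	}}
	for _, err := range validateSketch(cfg) {
		fmt.Println(err)
	}
}
```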
