Commit ea20550

kaovilai authored and blackpiglet committed
Implement priority class name retrieval and validation for Velero components
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
1 parent 2d99679 commit ea20550

1 file changed

Lines changed: 193 additions & 7 deletions

design/priority-class-name-support_design.md

@@ -260,7 +260,7 @@ func (e *csiSnapshotExposer) createBackupPod(
260260
The call to `createBackupPod` in the `Expose` method will be updated to pass the priority class name retrieved from the node-agent-configmap:
261261

262262
```go
263-
priorityClassName, _ := veleroutil.GetDataMoverPriorityClassName(ctx, namespace, kubeClient, configMapName)
263+
priorityClassName, _ := kube.GetDataMoverPriorityClassName(ctx, namespace, kubeClient, configMapName)
264264
backupPod, err := e.createBackupPod(
265265
ctx,
266266
ownerObject,
@@ -277,9 +277,12 @@ backupPod, err := e.createBackupPod(
277277
)
278278
```
279279

280-
A new function, `GetDataMoverPriorityClassName`, will be added to the `veleroutil` package to retrieve the priority class name for data mover pods:
280+
A new function, `GetDataMoverPriorityClassName`, will be added to the `pkg/util/kube` package (in the same file as `ValidatePriorityClass`) to retrieve the priority class name for data mover pods:
281281

282282
```go
283+
// In pkg/util/kube/priority_class.go
284+
285+
// GetDataMoverPriorityClassName retrieves the priority class name for data mover pods from the node-agent-configmap
283286
func GetDataMoverPriorityClassName(ctx context.Context, namespace string, kubeClient kubernetes.Interface, configName string) (string, error) {
284287
// Get from node-agent-configmap
285288
configs, err := nodeagent.GetConfigs(ctx, namespace, kubeClient, configName)
@@ -294,6 +297,95 @@ func GetDataMoverPriorityClassName(ctx context.Context, namespace string, kubeCl
294297

295298
This function will get the priority class name from the node-agent-configmap. If it's not found, it will return an empty string.
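
The lookup behavior described above can be sketched as a standalone Go program that parses a node-agent ConfigMap payload and returns an empty string when no priority class is set. This is a sketch under stated assumptions: the `nodeAgentConfigs` type and the `priorityClassFromConfig` helper are hypothetical stand-ins, not Velero's `nodeagent.Configs` or `GetDataMoverPriorityClassName`. Only the `priorityClassName` JSON key is taken from this design.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeAgentConfigs is a hypothetical, trimmed-down stand-in for the node-agent
// configuration. Only the field relevant to this design is modeled.
type nodeAgentConfigs struct {
	PriorityClassName string `json:"priorityClassName,omitempty"`
}

// priorityClassFromConfig (hypothetical helper) extracts the priority class
// name from a raw ConfigMap payload. A missing field yields "" and no error.
func priorityClassFromConfig(raw []byte) (string, error) {
	var cfg nodeAgentConfigs
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", fmt.Errorf("parsing node-agent config: %w", err)
	}
	return cfg.PriorityClassName, nil
}

func main() {
	// Priority class configured.
	name, err := priorityClassFromConfig([]byte(`{"priorityClassName": "velero-low"}`))
	fmt.Printf("%q %v\n", name, err)

	// Priority class absent: the caller gets an empty string, not an error.
	name, err = priorityClassFromConfig([]byte(`{}`))
	fmt.Printf("%q %v\n", name, err)
}
```

Treating an absent field as an empty string lets callers pass the result straight into a pod spec, where an empty `priorityClassName` means no priority class is requested.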
296299

300+
### Validation and Logging
301+
302+
To improve observability and help with troubleshooting, the implementation will include:
303+
304+
1. **Optional Priority Class Validation**: A helper function to check if a priority class exists in the cluster. This function will be added to the `pkg/util/kube` package alongside other Kubernetes utility functions:
305+
306+
```go
307+
// In pkg/util/kube/priority_class.go
308+
309+
// ValidatePriorityClass checks if the specified priority class exists in the cluster
310+
// Logs at info level if the priority class exists; does nothing if priorityClassName is empty
311+
// Logs a warning (and returns no error) if the priority class doesn't exist or cannot be checked
312+
func ValidatePriorityClass(ctx context.Context, kubeClient kubernetes.Interface, priorityClassName string, logger logrus.FieldLogger) {
313+
if priorityClassName == "" {
314+
return
315+
}
316+
317+
_, err := kubeClient.SchedulingV1().PriorityClasses().Get(ctx, priorityClassName, metav1.GetOptions{})
318+
if err != nil {
319+
if apierrors.IsNotFound(err) {
320+
logger.Warnf("Priority class %q not found in cluster. Pod creation may fail if the priority class doesn't exist when pods are scheduled.", priorityClassName)
321+
} else {
322+
logger.WithError(err).Warnf("Failed to validate priority class %q", priorityClassName)
323+
}
324+
} else {
325+
logger.Infof("Validated priority class %q exists in cluster", priorityClassName)
326+
}
327+
}
328+
```
329+
330+
2. **Debug Logging**: Add debug logs when priority classes are applied:
331+
332+
```go
333+
// In deployment creation
334+
if c.priorityClassName != "" {
335+
logger.Debugf("Setting priority class %q for Velero server deployment", c.priorityClassName)
336+
}
337+
338+
// In daemonset creation
339+
if c.priorityClassName != "" {
340+
logger.Debugf("Setting priority class %q for node agent daemonset", c.priorityClassName)
341+
}
342+
343+
// In maintenance job creation
344+
if priorityClassName != "" {
345+
logger.Debugf("Setting priority class %q for maintenance job %s", priorityClassName, job.Name)
346+
}
347+
348+
// In data mover pod creation
349+
if priorityClassName != "" {
350+
logger.Debugf("Setting priority class %q for data mover pod %s", priorityClassName, pod.Name)
351+
}
352+
```
353+
354+
These validation and logging features will help administrators:
355+
- Identify configuration issues early (validation warnings)
356+
- Troubleshoot priority class application issues
357+
- Verify that priority classes are being applied as expected
358+
359+
The `ValidatePriorityClass` function should be called at the following points:
360+
361+
1. **During `velero install`**: Validate the priority classes specified via CLI flags:
362+
- After parsing `--server-priority-class-name` flag
363+
- After parsing `--node-agent-priority-class-name` flag
364+
365+
2. **When reading from ConfigMaps**: Validate priority classes when loading configurations:
366+
- In `GetDataMoverPriorityClassName` when reading from node-agent-configmap
367+
- In maintenance job controller when reading from repo-maintenance-job-configmap
368+
369+
3. **During pod/job creation** (optional, for runtime validation):
370+
- Before creating data mover pods (PVB/PVR/CSI snapshot data movement)
371+
- Before creating maintenance jobs
372+
373+
Example usage:
374+
```go
375+
// During velero install
376+
if o.ServerPriorityClassName != "" {
377+
kube.ValidatePriorityClass(ctx, kubeClient, o.ServerPriorityClassName, logger.WithField("component", "server"))
378+
}
379+
380+
// When reading from ConfigMap
381+
priorityClassName, err := kube.GetDataMoverPriorityClassName(ctx, namespace, kubeClient, configMapName)
382+
if err == nil && priorityClassName != "" {
383+
kube.ValidatePriorityClass(ctx, kubeClient, priorityClassName, logger.WithField("component", "data-mover"))
384+
}
385+
```
386+
387+
Note: Since validation only logs warnings (not errors), it won't block operations if a priority class doesn't exist. This allows for scenarios where priority classes might be created after Velero installation.
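
The warn-only contract in this note can be illustrated without a cluster. In the sketch below, `errNotFound`, `lookupFunc`, and `validatePriorityClass` are hypothetical names introduced for illustration. The real helper would call the Kubernetes API and write to a logger, whereas this version returns the message it would log so the behavior is easy to inspect.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a Kubernetes
// NotFound API error.
var errNotFound = errors.New("priority class not found")

// lookupFunc (hypothetical) abstracts the cluster lookup so the sketch runs
// without a cluster.
type lookupFunc func(name string) error

// validatePriorityClass mirrors the warn-only contract: it reports the outcome
// as a log-style message and never returns an error that could block the caller.
func validatePriorityClass(name string, lookup lookupFunc) string {
	if name == "" {
		return "" // nothing configured, nothing to validate
	}
	err := lookup(name)
	switch {
	case err == nil:
		return fmt.Sprintf("info: validated priority class %q exists in cluster", name)
	case errors.Is(err, errNotFound):
		return fmt.Sprintf("warning: priority class %q not found in cluster", name)
	default:
		return fmt.Sprintf("warning: failed to validate priority class %q: %v", name, err)
	}
}

func main() {
	existing := map[string]bool{"velero-critical": true}
	lookup := func(name string) error {
		if existing[name] {
			return nil
		}
		return errNotFound
	}
	fmt.Println(validatePriorityClass("velero-critical", lookup))
	fmt.Println(validatePriorityClass("velero-low", lookup))
}
```

Because the function has no error return, a caller cannot accidentally turn a missing priority class into a failed install.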
388+
297389
## Alternatives Considered
298390

299391
1. **Using a single flag for all components**: We could have used a single flag for all components, but this would not allow for different priority classes for different components. Since maintenance jobs and data movers typically require lower priority than the Velero server, separate flags provide more flexibility.
@@ -323,8 +415,24 @@ The implementation will involve the following steps:
323415
7. Update the `buildJob` function in maintenance job to use the priority class name from JobConfigs (global config only)
324416
8. Update the `Configs` struct in node agent to include `PriorityClassName` field for data mover pods
325417
9. Update the data mover pod creation to use the priority class name from node-agent-configmap
326-
10. Add the priority class name flags for server and node agent to the `velero install` command
327-
11. Update documentation to explain how to configure priority classes
418+
10. Update the PodVolumeBackup controller to retrieve and apply priority class name from node-agent-configmap
419+
11. Update the PodVolumeRestore controller to retrieve and apply priority class name from node-agent-configmap
420+
12. Add the `GetDataMoverPriorityClassName` utility function to retrieve priority class from configmap
421+
13. Add the priority class name flags for server and node agent to the `velero install` command
422+
14. Add unit tests for:
423+
- `WithPriorityClassName` function
424+
- `GetDataMoverPriorityClassName` function
425+
- Priority class application in deployment, daemonset, and job specs
426+
15. Add integration tests to verify:
427+
- Priority class is correctly applied to all component pods
428+
- ConfigMap updates are reflected in new pods
429+
- Empty/missing priority class names are handled gracefully
430+
16. Update user documentation to include:
431+
- How to configure priority classes for each component
432+
- Examples of creating ConfigMaps before installation
433+
- Expected priority class hierarchy recommendations
434+
- Troubleshooting guide for priority class issues
435+
17. Update CLI documentation for new flags (`--server-priority-class-name` and `--node-agent-priority-class-name`)
328436

329437
Note: The server deployment and node agent daemonset will have CLI flags for priority class. Data mover pods and maintenance jobs will use their respective ConfigMaps for priority class configuration.
330438

@@ -347,6 +455,84 @@ Priority class names are configured through different mechanisms:
347455

348456
4. **Maintenance Jobs**: Will use the repository maintenance job ConfigMap (specified via the `--repo-maintenance-job-configmap` flag). Users should create this ConfigMap before running `velero install` with the desired priority class configuration. The ConfigMap can be updated after installation to change priority classes for future maintenance jobs. While the ConfigMap structure supports per-repository configuration for resources and affinity, priority class is intentionally only read from the global configuration to ensure all maintenance jobs have the same priority.
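
A minimal sketch of the global-only rule for maintenance jobs, assuming the ConfigMap payload is a JSON object keyed by `global` plus per-repository entries. The `jobConfig` type, the `maintenancePriorityClass` helper, and the `some-repository` key are hypothetical and only illustrate that per-repository values are ignored for priority class.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobConfig is a hypothetical, trimmed-down maintenance job config entry.
type jobConfig struct {
	PriorityClassName string `json:"priorityClassName,omitempty"`
}

// maintenancePriorityClass (hypothetical helper) reads the priority class from
// the "global" entry only. Per-repository entries are deliberately ignored.
func maintenancePriorityClass(raw []byte) (string, error) {
	entries := map[string]jobConfig{}
	if err := json.Unmarshal(raw, &entries); err != nil {
		return "", fmt.Errorf("parsing repo maintenance job config: %w", err)
	}
	// A missing "global" key yields the zero value, i.e. an empty string.
	return entries["global"].PriorityClassName, nil
}

func main() {
	raw := []byte(`{
		"global": {"priorityClassName": "velero-low"},
		"some-repository": {"priorityClassName": "ignored-for-priority"}
	}`)
	name, err := maintenancePriorityClass(raw)
	fmt.Printf("%q %v\n", name, err)
}
```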
349457

458+
#### ConfigMap Pre-Creation Guide
459+
460+
For components that use ConfigMaps for priority class configuration, the ConfigMaps must be created before running `velero install`. Here's the recommended workflow:
461+
462+
```bash
463+
# Step 1: Create priority classes in your cluster (if not already existing)
464+
kubectl apply -f - <<EOF
465+
apiVersion: scheduling.k8s.io/v1
466+
kind: PriorityClass
467+
metadata:
468+
name: velero-critical
469+
value: 100
470+
globalDefault: false
471+
description: "Critical priority for Velero server"
472+
---
473+
apiVersion: scheduling.k8s.io/v1
474+
kind: PriorityClass
475+
metadata:
476+
name: velero-standard
477+
value: 50
478+
globalDefault: false
479+
description: "Standard priority for Velero node agent"
480+
---
481+
apiVersion: scheduling.k8s.io/v1
482+
kind: PriorityClass
483+
metadata:
484+
name: velero-low
485+
value: 10
486+
globalDefault: false
487+
description: "Low priority for Velero data movers and maintenance jobs"
488+
EOF
489+
490+
# Step 2: Create the namespace
491+
kubectl create namespace velero
492+
493+
# Step 3: Create ConfigMaps for data movers and maintenance jobs
494+
kubectl create configmap node-agent-config -n velero --from-file=config.json=/dev/stdin <<EOF
495+
{
496+
"priorityClassName": "velero-low"
497+
}
498+
EOF
499+
500+
kubectl create configmap repo-maintenance-job-config -n velero --from-file=config.json=/dev/stdin <<EOF
501+
{
502+
"global": {
503+
"priorityClassName": "velero-low"
504+
}
505+
}
506+
EOF
507+
508+
# Step 4: Install Velero with priority class configuration
509+
velero install \
510+
--provider aws \
511+
--server-priority-class-name velero-critical \
512+
--node-agent-priority-class-name velero-standard \
513+
--node-agent-configmap node-agent-config \
514+
--repo-maintenance-job-configmap repo-maintenance-job-config \
515+
--use-node-agent
516+
```
517+
518+
#### Recommended Priority Class Hierarchy
519+
520+
When configuring priority classes for Velero components, consider the following hierarchy based on component criticality:
521+
522+
1. **Velero Server (Highest Priority)**:
523+
- Example: `velero-critical` with value 100
524+
- Rationale: The server must remain running to coordinate backup/restore operations
525+
526+
2. **Node Agent DaemonSet (Medium Priority)**:
527+
- Example: `velero-standard` with value 50
528+
- Rationale: Node agents need to be available on nodes but are less critical than the server
529+
530+
3. **Data Mover Pods & Maintenance Jobs (Lower Priority)**:
531+
- Example: `velero-low` with value 10
532+
- Rationale: These are temporary workloads that can be delayed during resource contention
533+
534+
This hierarchy ensures that core Velero components remain operational even under resource pressure, while allowing less critical workloads to be preempted if necessary.
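
To make the ordering concrete, the sketch below encodes the example values from this hierarchy and checks which classes outrank which. Kubernetes treats a higher `value` as higher priority, and preemption only considers strictly lower-priority pods. The `priorityValues` map and `outranks` helper are hypothetical and only model that comparison, not the scheduler itself.

```go
package main

import "fmt"

// priorityValues holds the example values from the recommended hierarchy.
var priorityValues = map[string]int{
	"velero-critical": 100, // Velero server
	"velero-standard": 50,  // node agent daemonset
	"velero-low":      10,  // data mover pods and maintenance jobs
}

// outranks (hypothetical helper) reports whether class a has strictly higher
// priority than class b, which is the precondition for a pod of class a to
// preempt a pod of class b.
func outranks(a, b string) bool {
	return priorityValues[a] > priorityValues[b]
}

func main() {
	fmt.Println(outranks("velero-critical", "velero-low"))     // server vs data mover
	fmt.Println(outranks("velero-low", "velero-standard"))     // data mover vs node agent
	fmt.Println(outranks("velero-standard", "velero-standard")) // equal priority
}
```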
535+
350536
This approach has several advantages:
351537

352538
- Leverages existing configuration mechanisms, minimizing new CLI flags
@@ -362,7 +548,7 @@ For PVB and PVR pods specifically, the controllers will need to be updated to re
362548

363549
```go
364550
// In pkg/controller/pod_volume_backup_controller.go
365-
priorityClassName, _ := veleroutil.GetDataMoverPriorityClassName(ctx, namespace, kubeClient, configMapName)
551+
priorityClassName, _ := kube.GetDataMoverPriorityClassName(ctx, namespace, kubeClient, configMapName)
366552

367553
// Add priorityClassName to the pod spec
368554
pod := &corev1api.Pod{
@@ -378,7 +564,7 @@ Similarly, in the PodVolumeRestore controller:
378564

379565
```go
380566
// In pkg/controller/pod_volume_restore_controller.go
381-
priorityClassName, _ := veleroutil.GetDataMoverPriorityClassName(ctx, namespace, kubeClient, configMapName)
567+
priorityClassName, _ := kube.GetDataMoverPriorityClassName(ctx, namespace, kubeClient, configMapName)
382568

383569
// Add priorityClassName to the pod spec
384570
pod := &corev1api.Pod{
@@ -394,7 +580,7 @@ pod := &corev1api.Pod{
394580

395581
With the introduction of VGDP micro-services (as described in the VGDP micro-service design), data mover pods are created as dedicated pods for volume snapshot data movement. These pods will also inherit the priority class configuration from the node-agent-configmap. Since VGDP-MS pods (backupPod/restorePod) inherit their configurations from the node-agent, they will automatically use the priority class name specified in the node-agent-configmap.
396582

397-
This ensures that all pods created by Velero for data movement operations (including VGDP micro-service pods, PVB, and PVR) use a consistent approach for priority class name configuration through the node-agent-configmap.
583+
This ensures that all pods created by Velero for data movement operations (CSI snapshot data movement, PVB, and PVR) use a consistent approach for priority class name configuration through the node-agent-configmap.
398584

399585
## Open Issues
400586
