This guide covers advanced features and configurations available in the SR-IOV Network Operator.
Feature gates enable or disable specific operator features. They are configured through the SriovOperatorConfig custom resource.
Description: Allows configuration of NICs in parallel, reducing network setup time.
Default: Disabled
Use Case: Large clusters with many SR-IOV devices requiring faster deployment.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  featureGates:
    parallelNicConfig: true

Impact:
- Faster node configuration updates
- Reduced maintenance windows
- Higher resource usage during configuration
Description: Switches webhook failure policy from "Ignore" to "Fail" using Kubernetes 1.30+ MatchConditions feature.
Default: Disabled
Requirements: Kubernetes 1.30+
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  featureGates:
    resourceInjectorMatchCondition: true

Benefits:
- Improved webhook reliability
- Only targets pods with SR-IOV network annotations
- Prevents webhook interference with other pods
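Because this gate requires Kubernetes 1.30 or later, it can help to check the server version before enabling it. The following is a minimal sketch, not part of the operator: the function name `supports_match_conditions` is hypothetical, and it assumes you pass in the API server's version string (for example the `gitVersion` reported by `kubectl version`).

```shell
# Sketch: succeed (exit 0) when a Kubernetes version string is at least 1.30.
# supports_match_conditions is a hypothetical helper, not an operator command.
supports_match_conditions() {
  version="${1#v}"            # strip a leading "v": v1.30.2 -> 1.30.2
  major="${version%%.*}"      # text before the first dot: 1
  rest="${version#*.}"        # text after the first dot: 30.2
  minor="${rest%%.*}"         # text before the next dot: 30
  minor="${minor%%[!0-9]*}"   # drop non-numeric suffixes such as "+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 30 ]; }
}

if supports_match_conditions "v1.30.2"; then
  echo "resourceInjectorMatchCondition can be enabled"
else
  echo "cluster is too old for resourceInjectorMatchCondition"
fi
```

The check compares only the major and minor components, so patch levels and vendor suffixes are ignored.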
Description: Enables metrics collection and export on config-daemon nodes.
Default: Disabled
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  featureGates:
    metricsExporter: true

Exposed Metrics:
- SR-IOV device utilization
- VF allocation status
- Configuration success/failure rates
- Hardware health indicators
Description: Allows the operator to manage software bridges automatically.
Default: Disabled
Use Case: Environments requiring automated bridge management for complex topologies.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  featureGates:
    manageSoftwareBridges: true

Description: Enables firmware reset via mstfwreset before system reboot for Mellanox devices.
Default: Disabled
Use Case: Environments with Mellanox devices requiring firmware reset during maintenance.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  featureGates:
    mellanoxFirmwareReset: true

Warning: This feature may extend reboot times and should be tested thoroughly.
- Test in Development: Always test feature gates in non-production environments
- Gradual Rollout: Enable features on a subset of nodes initially
- Monitor Impact: Track metrics and logs after enabling features
- Document Changes: Maintain records of enabled features and their purposes
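To keep rollouts gradual, it helps to toggle exactly one gate per change and review the patch before applying it. The sketch below builds a JSON merge patch for a single gate; the helper name `feature_gate_patch` is hypothetical, and the gate name is passed through unvalidated, so it is only as correct as the name you supply.

```shell
# Sketch: build a JSON merge patch that toggles exactly one feature gate.
# feature_gate_patch is a hypothetical helper, not an operator command.
feature_gate_patch() {
  gate="$1"    # gate name, e.g. parallelNicConfig
  state="$2"   # must be the literal true or false
  case "$state" in
    true|false) ;;
    *) echo "state must be true or false" >&2; return 1 ;;
  esac
  printf '{"spec":{"featureGates":{"%s":%s}}}\n' "$gate" "$state"
}

# Print the patch for review. Once it looks right, apply it with:
#   kubectl patch sriovoperatorconfig default -n sriov-network-operator \
#     --type=merge -p "$(feature_gate_patch parallelNicConfig true)"
feature_gate_patch parallelNicConfig true
```

Calling the helper with `false` produces the matching rollback patch, which is worth recording alongside the change.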
The SriovNetworkPoolConfig enables parallel node operations and advanced pool management.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: parallel-workers
  namespace: sriov-network-operator
spec:
  maxUnavailable: 3  # Allow 3 nodes to be updated simultaneously
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: percentage-pool
  namespace: sriov-network-operator
spec:
  maxUnavailable: "25%"  # Allow 25% of matching nodes
  nodeSelector:
    matchLabels:
      environment: "development"

# Production pool - conservative
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: production-pool
  namespace: sriov-network-operator
spec:
  maxUnavailable: 1
  nodeSelector:
    matchLabels:
      environment: "production"
      sriov-enabled: "true"
---
# Staging pool - more aggressive
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: staging-pool
  namespace: sriov-network-operator
spec:
  maxUnavailable: "50%"
  nodeSelector:
    matchLabels:
      environment: "staging"
      sriov-enabled: "true"

- Exclusive membership: Each node can only belong to one pool
- Conflict resolution: Nodes matching multiple pools will not be drained
- Default behavior: Nodes not in any pool use maxUnavailable: 1
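When sizing pools, it can be useful to see how many nodes a given maxUnavailable value allows for a pool of a known size. The sketch below resolves either form (integer or percentage) against a node count. It assumes percentages are rounded down, which you should verify against your operator version; the function name `resolve_max_unavailable` is hypothetical.

```shell
# Sketch: resolve a maxUnavailable value (integer or "N%") against a node count.
# Assumption: percentages round down; confirm this for your operator version.
resolve_max_unavailable() {
  total="$1"   # number of nodes matching the pool nodeSelector
  value="$2"   # maxUnavailable as written in the pool, e.g. 3 or 25%
  case "$value" in
    *%)
      pct="${value%\%}"               # strip the trailing percent sign
      echo $(( total * pct / 100 ))   # integer division rounds down
      ;;
    *)
      echo "$value"                   # plain integers are used as-is
      ;;
  esac
}

resolve_max_unavailable 10 "25%"   # prints 2
resolve_max_unavailable 10 3       # prints 3
```

Under the round-down assumption, a small pool with a low percentage can resolve to 0, so an absolute value may be the safer choice there.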
Disable specific plugins when their operation is not needed or conflicts with external tools.
Mellanox Plugin: Handles Mellanox-specific firmware configuration.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  disablePlugins:
  - mellanox

Use Cases:
- Firmware pre-configured during node provisioning
- External firmware management tools
- Environments requiring custom firmware settings
Use externallyManaged: true when VFs are created outside the operator:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: external-vfs-policy
  namespace: sriov-network-operator
spec:
  externallyManaged: true
  deviceType: vfio-pci
  nicSelector:
    pfName: ["ens1f0"]
  nodeSelector:
    kubernetes.io/hostname: "worker-1"
  numVfs: 8
  resourceName: external_vfs

# /etc/systemd/system/sriov-vfs.service
[Unit]
Description=Create SR-IOV VFs
Before=kubelet.service
After=network.target

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo 8 > /sys/class/net/ens1f0/device/sriov_numvfs'
ExecStop=/bin/bash -c 'echo 0 > /sys/class/net/ens1f0/device/sriov_numvfs'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: sriov-vfs
spec:
  nodeSelector:
    kubernetes.io/hostname: "worker-1"
  desiredState:
    interfaces:
    - name: ens1f0
      type: ethernet
      state: up
      ethernet:
        sr-iov:
          total-vfs: 8
          vfs:
          - id: 0
            spoof-check: false
          - id: 1
            spoof-check: false

Configure the resource injector webhook for enhanced functionality:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  featureGates:
    resourceInjectorMatchCondition: true
  webhookConfig:
    failurePolicy: "Fail"  # Strict mode
    timeoutSeconds: 30
    admissionReviewVersions: ["v1", "v1beta1"]

Control NetworkAttachmentDefinition generation:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: advanced-network
  namespace: sriov-network-operator
spec:
  resourceName: intel_sriov_netdevice
  networkNamespace: production
  capabilities: |
    {
      "ips": true,
      "mac": true
    }
  metaPluginsConfig: |
    {
      "type": "tuning",
      "capabilities": {
        "mac": true
      },
      "sysctl": {
        "net.core.somaxconn": "1024",
        "net.ipv4.tcp_congestion_control": "bbr"
      }
    },
    {
      "type": "bandwidth",
      "ingressRate": "100M",
      "egressRate": "100M"
    }

Configure for NUMA topology awareness:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: numa-optimized
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfName: ["ens1f0"]
  nodeSelector:
    numa-optimized: "true"
  numVfs: 8
  resourceName: numa_optimized_vfs

# Pod with CPU isolation
apiVersion: v1
kind: Pod
metadata:
  name: isolated-sriov-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: advanced-network
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
spec:
  containers:
  - name: app
    image: performance-app:latest
    resources:
      # Extended resources must be domain-prefixed; this assumes the
      # operator's default resource prefix, openshift.io.
      requests:
        openshift.io/numa_optimized_vfs: "1"
        cpu: "4"
        memory: "8Gi"
      limits:
        openshift.io/numa_optimized_vfs: "1"
        cpu: "4"
        memory: "8Gi"
  nodeSelector:
    node.alpha.kubernetes.io/isolated: "true"

Fine-grained RBAC for SR-IOV resources:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: sriov-network-operator
  name: sriov-network-manager
rules:
- apiGroups: ["sriovnetwork.openshift.io"]
  resources: ["sriovnetworks"]
  verbs: ["get", "list", "create", "update", "patch"]
- apiGroups: ["sriovnetwork.openshift.io"]
  resources: ["sriovnetworknodepolicies"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sriov-network-manager-binding
  namespace: sriov-network-operator
subjects:
- kind: User
  name: network-admin
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: sriov-network-manager
  apiGroup: rbac.authorization.k8s.io

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  featureGates:
    metricsExporter: true
  metricsConfig:
    port: 8080
    path: "/metrics"
    interval: "30s"

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sriov-operator-metrics
  namespace: sriov-network-operator
spec:
  selector:
    matchLabels:
      app: sriov-network-operator
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Check feature gate compatibility:
# Check current operator version
kubectl get deployment sriov-network-operator -n sriov-network-operator -o jsonpath='{.spec.template.spec.containers[0].image}'
# Verify feature gate support
kubectl explain sriovoperatorconfig.spec.featureGates

- Backup configurations:
  kubectl get sriovoperatorconfig -n sriov-network-operator -o yaml > sriov-config-backup.yaml
  kubectl get sriovnetworknodepolicy -A -o yaml > sriov-policies-backup.yaml
- Test feature gates:
  # Enable in test environment first
  kubectl patch sriovoperatorconfig default -n sriov-network-operator --type='merge' -p='{"spec":{"featureGates":{"parallelNicConfig":true}}}'
- Monitor rollout:
  # Watch operator logs
  kubectl logs deployment/sriov-network-operator -n sriov-network-operator -f
  # Check node state
  kubectl get sriovnetworknodestate -n sriov-network-operator

# Check operator config
kubectl get sriovoperatorconfig default -n sriov-network-operator -o yaml
# Verify feature gate application
kubectl logs deployment/sriov-network-operator -n sriov-network-operator | grep -i "feature"
# Check webhook configuration
kubectl get validatingwebhookconfiguration operator-webhook-config -o yaml

# Check pool membership
kubectl get nodes --show-labels | grep -E "(sriov|pool)"
# Verify pool selection
kubectl get sriovnetworkpoolconfig -n sriov-network-operator -o yaml
# Monitor pool operations
kubectl describe sriovnetworkpoolconfig <pool-name> -n sriov-network-operator

- Document Decisions: Maintain records of why features are enabled
- Environment Consistency: Keep feature gates consistent across environments
- Regular Review: Periodically review enabled features for relevance
- Testing Protocol: Establish testing procedures for new features
- Baseline Measurements: Record performance before enabling features
- Incremental Changes: Enable one feature at a time
- Resource Monitoring: Track CPU, memory, and network impact
- Rollback Plan: Prepare procedures to disable features if needed
- Principle of Least Privilege: Enable only necessary features
- Regular Audits: Review feature gate configurations periodically
- Access Control: Restrict who can modify feature gates
- Monitoring: Track configuration changes and their impact
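For periodic audits, a quick listing of the gates currently set to true in a saved configuration can be handy. The sketch below scans a YAML file, such as a backup written by `kubectl get sriovoperatorconfig ... -o yaml`, for entries under `featureGates`. It is an illustration only: the function name `list_enabled_gates` is hypothetical, and it assumes plain block-style YAML indented with spaces rather than performing real YAML parsing.

```shell
# Sketch: print the feature gates set to true in a saved SriovOperatorConfig.
# list_enabled_gates is a hypothetical helper. It assumes block-style YAML
# indented with spaces and is not a general YAML parser.
list_enabled_gates() {
  awk '
    # Indentation of the current line (-1 for blank lines).
    { indent = match($0, /[^ ]/) - 1 }
    # Remember how deep the featureGates key sits, then read its children.
    /^ *featureGates: *$/ { base = indent; inside = 1; next }
    inside && indent >= 0 {
      if (indent <= base) { inside = 0; next }   # left the featureGates map
      if ($2 == "true") { key = $1; sub(/:$/, "", key); print key }
    }
  ' "$1"
}

# Usage, assuming a backup file from an earlier kubectl get:
#   list_enabled_gates sriov-config-backup.yaml
```

Comparing this output between environments is one simple way to spot drift in enabled features.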
- Pool Configuration - Detailed pool management
- RDMA Configuration - RDMA-specific features
- Monitoring Guide - Comprehensive monitoring setup
- Troubleshooting - Advanced troubleshooting techniques