---
title: vsphere-configurable-maximum-allowed-number-of-block-volumes-per-node
authors:
  - "@rbednar"
reviewers:
  - "@jsafrane"
  - "@gnufied"
  - "@deads2k"
approvers:
  - "@jsafrane"
  - "@gnufied"
  - "@deads2k"
api-approvers:
  - "@deads2k"
creation-date: 2025-01-31
last-updated: 2025-01-31
tracking-link:
  - https://issues.redhat.com/browse/OCPSTRAT-1829
see-also:
  - "None"
replaces:
  - "None"
superseded-by:
  - "None"
---

# vSphere configurable maximum allowed number of block volumes per node

This document proposes an enhancement to the vSphere CSI driver to allow administrators to configure the maximum number
of block volumes that can be attached to a single vSphere node. It addresses a limitation of the current driver, which
relies on a static limit that cannot be changed by cluster administrators.

## Summary

The vSphere CSI driver for vSphere version 7 uses a constant to determine the maximum number of block volumes that can
be attached to a single node. This limit is influenced by the number of SCSI controllers available on the node.
By default, a node can have up to four SCSI controllers, each supporting up to 15 devices, allowing for a maximum of 60
volumes per node (59 + root volume).

However, vSphere version 8 increased the maximum number of volumes per node to 256 (255 + root volume). This
enhancement aims to leverage the increased limit and give administrators finer-grained control over volume allocation,
allowing them to configure the maximum number of block volumes that can be attached to a single node.

Details about configuration maximums: https://configmax.broadcom.com/guest?vmwareproduct=vSphere&release=vSphere%208.0&categories=3-0
Volume limit configuration for vSphere storage plug-in: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/container-storage-plugin/3-0/getting-started-with-vmware-vsphere-container-storage-plug-in-3-0/vsphere-container-storage-plug-in-concepts/configuration-maximums-for-vsphere-container-storage-plug-in.html
Knowledge base article with node requirements: https://knowledge.broadcom.com/external/article/301189/prerequisites-and-limitations-when-using.html

## Motivation

### User Stories

- As a vSphere administrator, I want to configure the maximum number of volumes that can be attached to a node, so that
  I can optimize resource utilization and prevent oversubscription.
- As a cluster administrator, I want to ensure that the vSphere CSI driver operates within the limits imposed by the
  underlying vSphere infrastructure.

### Goals

- Provide administrators with control over the volume allocation limit on vSphere nodes.
- Improve resource utilization and prevent oversubscription.
- Ensure compatibility with existing vSphere infrastructure limitations.
- Maintain backward compatibility with existing deployments.

### Non-Goals

- Support heterogeneous environments with different ESXi versions on the nodes that form an OpenShift cluster.
- Dynamically adjust the limit based on real-time resource usage.
- Implement per-namespace or per-workload volume limits.
- Modify the underlying vSphere VM configuration.

## Proposal

1. Driver Feature State Switch (FSS):

   - Use the vSphere driver's feature state switch (`max-pvscsi-targets-per-vm`) to control the activation of the
     maximum volume limit functionality.
   - No changes are needed; the feature is enabled by default.

2. API for Maximum Volume Limit:

   - Introduce a new field `spec.driverConfig.vSphere.maxAllowedBlockVolumesPerNode` in the ClusterCSIDriver API to
     allow administrators to configure the desired maximum number of volumes per node.
   - The field will not have an API default; the operator will default it to the current maximum of 59 volumes per
     node, which matches the vSphere 7 limit.
   - The API will not allow the value `0` to be set or the field to be unset, as either would lead to disabling the
     limit.
   - The allowed range of values is 1 to 255; the maximum matches the vSphere 8 limit. An illustrative configuration
     example follows this list.

3. Update CSI Pods with hooks:

   - After reading the new `maxAllowedBlockVolumesPerNode` API field from ClusterCSIDriver, the operator will inject
     the `MAX_VOLUMES_PER_NODE` environment variable into all pods using DaemonSet and Deployment hooks.
   - Any value that is statically set for the `MAX_VOLUMES_PER_NODE` environment variable in asset files will be
     overwritten. If the variable is omitted in the asset, the hooks will add it and set it to the value found in the
     `maxAllowedBlockVolumesPerNode` field of ClusterCSIDriver. If the field is not set, the default value will be 59
     to match the vSphere 7 limit.

4. Operator behavior:

   - The operator will check ESXi versions on all nodes in the cluster. Setting `maxAllowedBlockVolumesPerNode` to a
     value higher than 59 when not all nodes run ESXi version 8 or later will result in cluster degradation.

5. Driver Behavior:

   - The vSphere CSI driver needs to allow the higher limit via the feature state switch (`max-pvscsi-targets-per-vm`).
   - The switch is already enabled by default in the driver versions shipped in OpenShift 4.19.
   - The driver will report the volume limit as usual in response to `NodeGetInfo` calls.

6. Documentation:

   - Update the vSphere CSI driver documentation to include information about the new feature and how to configure it.
     However, at the time of writing there is no official vSphere documentation to refer to that explains how to
     configure vSphere to support 256 volumes per node.
   - Include a statement informing users of the current requirement of a homogeneous cluster with all nodes running
     ESXi 8 or later. Until this requirement is met, the limit set in `maxAllowedBlockVolumesPerNode` must not be
     increased above 59. If a higher value is set regardless of this requirement, the cluster will degrade.
   - Currently, there is no Distributed Resource Scheduler (DRS) validation in place in vSphere to ensure that multiple
     VMs with 256 disks do not land on the same host, so users might exceed the limit of 2048 virtual disks per host.
     This is a known limitation of vSphere, and we need to note it in the documentation to make users aware of this
     potential risk.
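
As referenced above, the following sketch shows a ClusterCSIDriver configuration raising the limit. It illustrates the
proposed API shape only; the value `255` assumes every host in the cluster runs ESXi 8 or later.

```yaml
apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: csi.vsphere.vmware.com
spec:
  managementState: Managed
  driverConfig:
    driverType: vSphere
    vSphere:
      # Values above 59 require ESXi 8 or later on all hosts;
      # otherwise the operator degrades the cluster.
      maxAllowedBlockVolumesPerNode: 255
```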

### Workflow Description

1. Administrator configures the limit:
   - The administrator creates or updates a ClusterCSIDriver object to specify the desired maximum number of volumes
     per node using the new `maxAllowedBlockVolumesPerNode` API field.
2. Operator reads configuration:
   - The vSphere CSI operator monitors the ClusterCSIDriver object for changes.
   - Upon detecting a change, the operator reads the configured limit value.
3. Operator updates the DaemonSet and Deployment with the new limit:
   - The operator updates the vSphere CSI driver pods, injecting the `MAX_VOLUMES_PER_NODE` environment variable with
     the configured limit value into the driver node pods on worker nodes (a sketch of the result follows this list).
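
A minimal sketch of the reconciled result, assuming an illustrative container name; the operator-rendered manifest may
differ in structure:

```yaml
# Fragment of the node DaemonSet after reconciliation (names illustrative).
spec:
  template:
    spec:
      containers:
        - name: csi-driver
          env:
            # Injected by the operator from
            # spec.driverConfig.vSphere.maxAllowedBlockVolumesPerNode.
            - name: MAX_VOLUMES_PER_NODE
              value: "255"
```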

### API Extensions

- New field in the ClusterCSIDriver CRD:
  - A new field will be introduced to represent the maximum volume limit configuration.
  - The CRD will gain a single new field (`spec.driverConfig.vSphere.maxAllowedBlockVolumesPerNode`) to define the
    desired limit.
  - The API will validate that the value fits within the defined range (1-255); a sketch of the validation follows
    this list.
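
The range validation could appear in the generated CRD schema roughly as follows; this is a sketch, and the exact
schema produced by the API tooling may differ.

```yaml
# Illustrative excerpt of the CRD OpenAPI v3 schema for the new field.
maxAllowedBlockVolumesPerNode:
  description: Maximum number of block volumes attachable to a single node.
  type: integer
  format: int32
  minimum: 1
  maximum: 255
```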

### Topology Considerations

#### Hypershift / Hosted Control Planes

No unique considerations for Hypershift. The configuration and behavior of the vSphere CSI driver with respect to the
maximum volume limit will remain consistent across standalone and managed clusters.

#### Standalone Clusters

This enhancement is fully applicable to standalone OpenShift clusters.

#### Single-node Deployments or MicroShift

No unique considerations for MicroShift. The configuration and behavior of the vSphere CSI driver with respect to the
maximum volume limit will remain consistent across standalone and SNO/MicroShift clusters.

### Implementation Details/Notes/Constraints

A possible future constraint is a further limit increase in newer vSphere versions. However, we expect the limit to
increase rather than decrease, and relaxing the API validation later is possible.

### Risks and Mitigations

- None.

### Drawbacks

- Increased complexity: Introducing a new CRD field and operator logic adds complexity to the vSphere CSI driver
  ecosystem.
- Missing vSphere documentation: At the time of writing we don't have a clear statement or documentation to refer to
  that would describe all the necessary details and limitations of this feature. See Documentation in the Proposal
  section for details.
- Limited granularity: The current proposal provides a global node-level limit. More fine-grained control
  (e.g., per-namespace or per-workload limits) would require further investigation and development.

## Open Questions [optional]

None.

## Test Plan

- E2E tests will be implemented to verify the correct propagation of the configured limit to the driver pods.
  These tests will be executed only on vSphere 8.

## Graduation Criteria

- TechPreview in 4.19.

### Dev Preview -> Tech Preview

- No Dev Preview phase.

### Tech Preview -> GA

- E2E test coverage demonstrating stability.
- Available by default.
- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/).
- We have to wait for VMware to GA this feature and document the configuration on the vCenter side.

### Removing a deprecated feature

- No.

## Upgrade / Downgrade Strategy

- **Upgrades:** During an upgrade, the operator will apply the new API field value and update the driver pods with
  the new `MAX_VOLUMES_PER_NODE` value. If the field is not set, the default value (59) is used to match the current
  vSphere 7 limit, so the limit will not change for existing deployments unless the user explicitly sets it.
- **Downgrades:** Downgrading to a version without this feature will result in the API field being ignored; the
  operator will revert to the previous hardcoded value configured in the DaemonSet (59). If more volumes are attached
  than the limit allows after a downgrade, the vSphere CSI driver will not be able to attach new volumes to nodes and
  users will need to manually detach the extra volumes.

## Version Skew Strategy

There are no version skew concerns for this enhancement.

## Operational Aspects of API Extensions

- The API extension does not pose any operational challenges.

## Support Procedures

* To check the status of the vSphere CSI operator, use the following command:
  `oc get deployments -n openshift-cluster-csi-drivers`. Ensure that the operator is running and healthy, and inspect
  its logs.
* To inspect the `ClusterCSIDriver` CR, use the following command: `oc get clustercsidriver/csi.vsphere.vmware.com -o yaml`.
  Examine the `spec.driverConfig.vSphere.maxAllowedBlockVolumesPerNode` field.
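
If the limit was raised on a cluster that still contains ESXi 7.x hosts, the operator is expected to report a Degraded
condition. The following example is illustrative only; the exact condition reason and message wording are assumptions.

```yaml
# Illustrative Degraded condition (reason and message wording assumed).
status:
  conditions:
    - type: Degraded
      status: "True"
      reason: UnsupportedESXiVersion
      message: >-
        maxAllowedBlockVolumesPerNode is set above 59 but not all nodes
        run ESXi 8 or later.
```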

## Alternatives

We considered several approaches to handle environments with mixed ESXi versions:

1. **Cluster Degradation (Selected Approach)**:
   - We will degrade the cluster if the user-specified limit exceeds what the underlying infrastructure supports.
   - This requires checking the ClusterCSIDriver configuration against actual node capabilities in the
     `check_nodes.go` implementation.
   - The error messages will be specific about the incompatibility.
   - Documentation will clearly state that increased limits are not supported in environments containing ESXi 7.x
     hosts.

2. **Warning-Only Approach**:
   - Allow any user-specified limit (up to 255) regardless of the ESXi versions in the cluster.
   - Emit metrics and alerts when incompatible configurations are detected.
   - This approach would result in application pods getting stuck in the ContainerCreating state when scheduled to
     ESXi 7.x nodes that exceed the 59-attachment limit.
   - This option was rejected as it would lead to a poor user experience with difficult-to-diagnose failures.

3. **Dynamic Limit Adjustment**:
   - Have the DaemonSet controller ignore user-specified limits that exceed cluster capabilities and automatically
     switch to a supportable limit.
   - This option is technically complex as it would require:
     - Delaying CSI driver startup until all version checks complete
     - Implementing a DaemonSet hook to perform full cluster scans for ESXi versions (expensive operation)
     - Duplicating node checks already being performed elsewhere
   - This approach was rejected due to implementation complexity.

4. **Driver-Level Detection**:
   - Add code to the DaemonSet pod that would detect limits from BIOS or OS and consider that when reporting
     attachment capabilities.
   - This would require modifications to the driver code itself, which would be better implemented by VMware.
   - This approach was rejected as it would depend on upstream changes that historically have been slow to implement.

## Infrastructure Needed [optional]

- The infrastructure needed to support this enhancement is available for testing vSphere version 8.
- To test the feature we need a nested vSphere environment with `pvscsiCtrlr256DiskSupportEnabled` set in the vCenter
  config to allow the higher volume attachment limit.