- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- [] (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- [] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
We propose combining the source code of the CSI Sidecars in a monorepo. Rather than only placing the code repositories side by side, we expect to consolidate the program entry points of all sidecars. This lets us:
- Improve the CSI Sidecar release process by reducing the number of components released
- Decrease the maintenance tasks the SIG Storage community maintainers do to maintain the Sidecars
- Propagate changes in common libraries used by CSI Sidecars immediately instead of through additional PRs
- Reduce the number of components CSI Driver authors and cluster administrators need to keep up to date in k8s clusters
As a side effect of combining the CSI Sidecars into a single component we also:
- Reduce the memory usage and the number of API Server calls made by the CSI Sidecars through the use of a shared informer.
- Reduce the cluster resources needed to run the CSI Sidecars
The SIG Storage community maintains many storage related projects, each on its own git repo including:
- CSI Drivers - SMB CSI Driver, NFS CSI Driver, Hostpath CSI Driver, ISCSI CSI Driver, NVMf CSI Driver
- CSI Sidecars
  - Typically deployed with the controller component of the CSI Driver: external-attacher, external-provisioner, external-resizer, external-snapshotter, external-health-monitor (alpha), livenessprobe
  - Typically deployed with the node component of the CSI Driver: node-driver-registrar, livenessprobe
- Controllers
  - snapshot-controller, volume-data-source-validator (beta)
- Webhooks
  - csi-snapshot-validation-webhook
- CSI libraries and utilities
  - csi-lib-utils, csi-release-tools, csi-test, lib-volume-populator (beta)
- Host binaries
  - CSI Proxy
As part of the maintenance work of these components the SIG Storage community:
- Bumps the Go runtime, which usually fixes vulnerabilities; the application binary is then rebuilt and a new image is released. This is done in csi-release-tools and propagated to the other repos (example). The effort is part of the third point below.
- Updates the dependencies to the latest version, since new releases usually fix vulnerabilities. The SIG Storage community reviewers/approvers look at every PR generated by a bot and LGTM/approve it. Because we have different repos the human effort is multiplied, e.g. reviewers handle # dependencies * # CSI Sidecars PRs (example).
- Propagates changes in CSI related dependencies across all the CSI Sidecars and CSI Drivers that need them. csi-release-tools has common build utilities used across all the repos; whenever there's a change in this component it needs to be propagated across all the repos (example). Because we have different repos the human effort is multiplied, e.g. (# updates in csi-release-tools + # new changes in csi-lib-utils) * # CSI Sidecars PRs.
To keep dependencies up to date the SIG Storage community uses https://github.com/dependabot, a bot that automatically creates a PR whenever a dependency publishes a new release. As a side effect, the number of PRs increased after enabling the bot. Also note that because each component lives in its own repo, a bump in a dependency (assuming that the dependency is shared among many CSI Sidecars) is multiplied across all of them.
Stats for dependency/vuln updates across CSI Sidecars as of Aug 11th, 2023.
CSI Sidecar \ PRs reviewed & merged | Dependabot dependency update | csi-release-tools propagation | csi-lib-utils |
---|---|---|---|
external-attacher | 14 (unreleased), 12 (release 4.3.0), 8 (release 4.2.0) | 2 (unreleased), ~71 (lifetime) | ~15 (lifetime) |
external-provisioner | 36 (unreleased), 30 (release 3.5.0), 11 (release 3.4.0) | 2 (unreleased), ~75 (lifetime) | ~19 (lifetime) |
external-resizer | 5 (release 1.8.0), 5 (release 1.7.0) | 2 (unreleased), ~62 (lifetime) | ~10 (lifetime) |
external-snapshotter | 14 (unreleased) | ~90 (lifetime) | ~19 (lifetime) |
node-driver-registrar | 13 (unreleased), 8 (release 2.8.0), 2 (release 2.7.0), 3 (release 2.6.0) | ~70 (lifetime) | ~7 (lifetime) |
livenessprobe | 9 (unreleased) | ~41 (lifetime) | ~9 (lifetime) |
Table: PR to CSI Sidecars related to vuln fixes and library propagation
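To make the multiplication concrete, the approximate lifetime counts in the table above can be summed per column. The following is a small sketch that only adds up the numbers already listed; the type and function names are hypothetical, and the source does not say how many distinct upstream changes these PRs correspond to.

```go
package main

import "fmt"

// lifetimePRs holds the approximate lifetime propagation PR counts of one
// sidecar, copied from the table above. The struct itself is illustrative.
type lifetimePRs struct {
	releaseTools int // csi-release-tools propagation PRs
	libUtils     int // csi-lib-utils propagation PRs
}

var stats = map[string]lifetimePRs{
	"external-attacher":     {71, 15},
	"external-provisioner":  {75, 19},
	"external-resizer":      {62, 10},
	"external-snapshotter":  {90, 19},
	"node-driver-registrar": {70, 7},
	"livenessprobe":         {41, 9},
}

// totals sums each column across all sidecars.
func totals(s map[string]lifetimePRs) (releaseTools, libUtils int) {
	for _, v := range s {
		releaseTools += v.releaseTools
		libUtils += v.libUtils
	}
	return releaseTools, libUtils
}

func main() {
	rt, lu := totals(stats)
	fmt.Printf("csi-release-tools propagation PRs: ~%d\n", rt) // ~409
	fmt.Printf("csi-lib-utils propagation PRs: ~%d\n", lu)     // ~79
}
```

Each of these PRs was reviewed and merged separately in its own repository, which is the per-repo overhead this proposal aims to remove.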
The CSI Drivers/CSI Sidecars have an indirect dependency on the k8s version. This could happen because of:
- A new CSI feature that touches both the CSI Sidecars and k8s components. For example, the ReadWriteOncePod feature needs changes in k8s components (kube-apiserver, kube-scheduler, kubelet) and in the CSI Sidecars.
Because of this indirect dependency the SIG Storage community creates a minor release of each CSI Sidecar for every k8s minor release. We use csi-hostpath (a CSI Driver used for testing purposes) to test the compatibility of the new releases with the latest k8s version.
We follow the instructions in SIDECAR_RELEASE_PROCESS.md in every CSI Sidecar repo to create a minor release.
Kubernetes and CSI are constantly evolving (see the section above on how CSI Sidecars evolve) and so are CSI Drivers. CSI Driver authors must keep their drivers up to date with the new features in k8s and CSI. A CSI Driver implementing most of the CSI features includes the following components:
A cluster administrator, in addition to keeping up with the latest k8s and CSI features, might need to manage other aspects of the integration too, such as security. CSI Sidecars depend on multiple libraries which might be susceptible to vulnerabilities. When a vulnerability is fixed in a new release of a dependency, the fix must be propagated all the way to the CSI Sidecar repository.
Usually the above is enough for the latest release. However, the vulnerability might also affect older releases of the CSI Sidecars, so the fix needs to be applied to older CSI Sidecar releases as well.
This increases the work not only for the SIG Storage community, which has to cherry-pick the fix, but also for cluster administrators, who have to update existing CSI Driver integrations in previous k8s releases by bumping the CSI Sidecars.
To avoid this propagation issue, cluster administrators have the following options:
- Use the same version of CSI Sidecars in previous k8s integrations
In some CSI Driver control plane deployment setups each sidecar is configured with a minimum memory request. Some examples of resource allocations in OSS CSI Driver deployments:
- Memory request
  - EBS CSI Driver
    - In a CP node, sets a 40Mi memory request for each CSI Sidecar (5 sidecars), a total of 200Mi per node
    - In a worker node, sets a 40Mi memory request for each CSI Sidecar (2 sidecars), a total of 80Mi per node
  - Azuredisk
    - In a CP node, sets a 20Mi memory request for each CSI Sidecar (5 sidecars), a total of 100Mi per node
    - In a worker node, sets a 20Mi memory request for each CSI Sidecar (2 sidecars), a total of 40Mi per node
  - AlibabaCloud Disk
    - In a CP node, sets a 16Mi memory request for each CSI Sidecar (average 4 sidecars), a total of 64Mi per node
    - In a worker node, sets a 16Mi memory request for each CSI Sidecar (1 sidecar), a total of 40Mi per node
The 5x memory request is additional overhead in the control plane nodes, and the 2x memory request is additional overhead in the worker nodes.
- To combine the source code of the CSI Sidecars in a monorepo.
- To combine the entry points of the CSI Sidecars in one binary.
  - If we only merge the source code, we won't be able to reuse resources and realize the advantages above.
  - To minimize the impact on users, we can't split the migration into two steps (merging the source code, then merging the entry points).
- The sidecars include the following:
- Retain the git history of the sidecars in the new monorepo.
- The sidecars do not include sig-storage-lib-external-provisioner.
  - Because it doesn't depend on release-tools or csi-lib-utils.
- release-tools and csi-lib-utils are not included in the monorepo.
  - We can start with the sidecars only and no utility libraries; after we see that it works in CI we can consider moving the utilities to the monorepo. We will open another KEP if we need to move them.
The proposal consists of creating a monorepo that produces a single artifact with the common sidecars combined in one binary:
- Combine the source code of all common CSI sidecars (external-attacher, external-provisioner, external-resizer, external-snapshotter, livenessprobe, node-driver-registrar), controllers (snapshot controller, volume-health-monitor controller) and webhooks (csi-snapshot-validation-webhook) in a single repository. A total of 7 repositories including 6 sidecars, 2 controllers and 1 webhook.
- Include the source code of helper utilities in the same repository (csi-release-tools, csi-lib-utils); sidecars/apps use the local modules through Go workspaces. A total of 1 release helper and 1 Go module.
- Create a new cmd/ entrypoint that enables sidecars selectively, similar to kube-controller-manager and the --controllers flag.
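The selective entrypoint described in the last item could be sketched as follows. This is a minimal illustration using only the Go standard library, not the actual implementation: the registry, the `startFunc` type and the behavior on an empty list are assumptions, and the controller names are taken from the manifests in this proposal.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"sort"
	"strings"
)

// startFunc starts one controller. The signature is hypothetical.
type startFunc func() error

// registry maps the names accepted by --controllers to their start functions.
// Real start functions would run the sidecar logic; these only log.
var registry = map[string]startFunc{
	"attacher":              func() error { fmt.Println("attacher started"); return nil },
	"provisioner":           func() error { fmt.Println("provisioner started"); return nil },
	"resizer":               func() error { fmt.Println("resizer started"); return nil },
	"snapshotter":           func() error { fmt.Println("snapshotter started"); return nil },
	"node-driver-registrar": func() error { fmt.Println("node-driver-registrar started"); return nil },
}

// parseControllers validates a comma-separated --controllers value and
// returns the selected names sorted and without duplicates.
func parseControllers(value string) ([]string, error) {
	seen := map[string]bool{}
	for _, name := range strings.Split(value, ",") {
		name = strings.TrimSpace(name)
		if name == "" {
			continue
		}
		if _, ok := registry[name]; !ok {
			return nil, fmt.Errorf("unknown controller %q", name)
		}
		seen[name] = true
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	fs := flag.NewFlagSet("csi-sidecars", flag.ContinueOnError)
	controllers := fs.String("controllers", "", "comma-separated list of controllers to enable")
	if err := fs.Parse(os.Args[1:]); err != nil {
		os.Exit(2)
	}
	names, err := parseControllers(*controllers)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if len(names) == 0 {
		fmt.Println("no controllers selected; nothing to do")
		return
	}
	for _, name := range names {
		if err := registry[name](); err != nil {
			fmt.Fprintf(os.Stderr, "%s failed: %v\n", name, err)
			os.Exit(1)
		}
	}
}
```

Rejecting unknown names at startup keeps a typo in the deployment manifest from silently disabling a controller.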
CSI Driver authors would include a single sidecar in their deployments (in both the control plane and node pools). While the artifact version is the same, the command and arguments will differ.
The CSI Driver deployment manifest would look like this in the control plane:

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: csi-driver-deployment
    spec:
      replicas: 1
      template:
        spec:
          containers:
            - name: csi-driver
              args:
                - "--v=5"
                - "--endpoint=unix:/csi/csi.sock"
            - name: csi-sidecars
              command:
                - csi-sidecars
                - "--csi-address=unix:/csi/csi.sock"
                # similar style as kube-controller-manager
                - "--controllers=attacher,provisioner,resizer,snapshotter"
                - "--feature-gates=Topology=true"
                # leader election flags for all the components as one
                - "--leader-election"
                - "--leader-election-namespace=kube-system"
                # global timeouts
                - "--timeout=30s"
                # per controller specific flags are prefixed with the component name
                - "--attacher-timeout=30s"
                - "--attacher-worker-thread=100"
                - "--provisioner-timeout=30s"
              volumeMounts:
                - mountPath: /csi
                  name: socket-dir

The CSI Driver deployment manifest would look like this in the worker node:

    kind: DaemonSet
    apiVersion: apps/v1
    metadata:
      name: csi-driver-deployment
    spec:
      template:
        spec:
          containers:
            - name: csi-driver
              args:
                - "--v=5"
                - "--endpoint=unix:/csi/csi.sock"
            - name: csi-sidecars
              command:
                - csi-sidecars
                - "--csi-address=unix:/csi/csi.sock"
                # similar style as kube-controller-manager
                - "--controllers=node-driver-registrar"
                - "--kubelet-registration-path=/var/lib/kubelet/plugins/<csi-driver>/csi.sock"
              volumeMounts:
                - name: registration-dir
                  mountPath: /registration
                - name: plugin-dir
                  mountPath: /csi
          volumes:
            - name: registration-dir
              hostPath:
                path: /var/lib/kubelet/plugins_registry/
                type: Directory
            - name: plugin-dir
              hostPath:
                path: /var/lib/kubelet/plugins/<csi-driver>/
                type: DirectoryOrCreate

Quantifiable characteristics of the current state and of the proposed state
Characteristics/State | Current state of CSI Sidecars (let #csi-sidecars=6) | CSI Sidecars in a single component |
---|---|---|
Human effort of propagating csi-release-tools | (#csi-release-tools changes * #csi-sidecars) | 0 (because csi-release-tools is part of the repo) |
Human effort of propagating csi-lib-utils | (#csi-lib-utils changes * #csi-sidecars) | 0 (because csi-lib-utils is part of the repo) |
go mod dependency bumps | (#dependency changes * #csi-sidecars) * CSI releases supported (unknown) | #dependency changes * releases supported (follow k8s release) |
Go runtime update | (#csi-release-tools changes related to Go runtime updates * #csi-sidecars) | #go runtime updates |
Number of CSI releases per k8s minor release | #csi-sidecars | 1 |
Additional properties of a single CSI Sidecar component without a quantifiable benefit:
Dimension | Pros | Cons |
---|---|---|
Releases | | |
Testability | | |
Performance & Reliability | | |
Simplicity | | |
Integration with CSI Drivers | | |
- Individual repository - An existing repository in the kubernetes-csi/ org in GitHub, e.g. the external-attacher repository.
- Individual component - An existing component of csi sidecars.
- AIO monorepo or monorepo - The monolithic repository where most of the code of the CSI Sidecars will be migrated.
- Monorepo component - The source code of an individual repository that is currently being migrated or already migrated to the monorepo.
We considered switching from semantic versioning to k8s versioning. The pros and cons are:
pros:
- We don't need to reinvent the wheel about what our dev process is going to look like, we follow the same docs as k8s https://kubernetes.io/releases/release/. This is tried and tested for many releases
- Cluster administrators would know which version to use to match their CSI Driver deployment e.g. for a k8s 1.27 cluster they'd use the 1.27 release of the CSI Sidecar.
cons:
- Breaking changes might happen in a minor release. Cluster administrators MUST read the sidecar release notes and check for breaking changes before moving to a new release.
- The version skew scenario becomes confusing for the cluster administrator, e.g. they deploy the CSI Sidecars at v1.x and the cluster is upgraded to v1.{x+3} (CP upgrade first, NP later); nodepools would then have the CSI sidecar at v1.{x+3} with kubelet at v1.x
- k/k at 1.27.5 - CSI 1.27.0 or (different mapping still)
After investigation, we found no clear advantage in switching to k8s versioning, so we chose to keep Semantic Versioning in the monorepo.
We designed the AIO repo's RBAC policy to mirror that of individual repos, where each controller maintains its own policy. Driver maintainers should apply proper RBAC when enabling specific controllers in AIO
more discuss info in
We plan to combine informer caches of different controllers in the future
We divide the command line flags into two types: generic flags, whose configuration is common to all controllers and is set only once, and controller-specific flags, whose configuration differs per controller. Each controller-specific flag gets a new unique name, prefixed with the controller name.

    - name: csi-sidecars
      command:
        - csi-sidecars
        - "--csi-address=unix:/csi/csi.sock"
        # similar style as kube-controller-manager
        - "--controllers=attacher,provisioner,resizer,snapshotter"
        - "--feature-gates=Topology=true"
        # leader election flags for all the components as one
        - "--leader-election"
        - "--leader-election-namespace=kube-system"
        # global timeouts
        - "--timeout=30s"
        # per controller specific flags are prefixed with the component name
        - "--attacher-timeout=30s"
        - "--attacher-worker-thread=100"
        - "--provisioner-timeout=30s"

example PR: kubernetes-csi/external-attacher#620
poc version: https://github.com/mauriciopoppe/csi-sidecars
monorepo attacher: https://github.com/mauriciopoppe/csi-sidecars/tree/main/pkg/attacher
After we see the monorepo component running fine in integration/e2e tests in k8s, we need to perform a hard cut so that new development goes into the monorepo component only.
- Design: current state of the AIO monorepo
- Alpha: all six sidecar repos have been integrated into the monorepo and all the e2e tests pass.
- Beta (production-verified): the six sidecars work with CSI hostpath, and three cloud vendors can use the monorepo component in their production environments.
- GA (released): officially released and open to PRs from SIG Storage developers.
- Standalone: code no longer needs to be synced from the individual repos; the AIO monorepo becomes the source of truth.
- Released: current state of the individual repos
- FeatureFreeze:
  - New feature PRs may not be filed against the master branch or the release-X branches (controlled by the individual repo maintainer, who categorizes each PR and rejects it if it is a feature).
  - SIG Storage developers file feature PRs against the AIO monorepo.
  - The exception is serious bugfix or CVE fix PRs (only from the individual repo maintainer), which can be merged in master and backported to the release-X branches.
- Deprecated:
  - The repository is no longer maintained.
  - Eventually the image for the individual repo goes away (although this would not be possible unless we migrate ALL the sidecars).
  - (future) Archive it, but not at the same time as the deprecation; this is a terminal state, so we can't undo it.
- Breaking changes in one component force the single release to be a breaking change.
- A vulnerability that affects one component affects all the other components.
see details in: https://docs.google.com/document/d/1SD4YRas_qXMP363L4j3WBTV_F9anq-5FM5gdGmJq7h0/edit?usp=sharing
- A panic in one component restarts the sidecar.
For each sidecar, define where in the stack a panic should be caught to possibly restart the controller.
List of fixed issues related to panics:
  - kubernetes-csi/external-provisioner#839
  - kubernetes-csi/external-provisioner#582
  - kubernetes-csi/external-attacher#502
A failure such as an OOM does not fall into this category (there may be no good way to reduce its blast radius).
- Keeping the monorepo and the existing sidecars repo up to date after the migration for X releases
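For the panic risk listed above, one possible way to contain a panic to a single controller is to recover at the top of that controller's run function and restart it. The sketch below is only an illustration under stated assumptions: `runRecovered`, `supervise` and the restart limit are hypothetical names and policies, not part of the proposal.

```go
package main

import "fmt"

// runRecovered runs one controller and converts a panic into an error so
// that other controllers in the same process keep running.
func runRecovered(name string, run func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("controller %s panicked: %v", name, r)
		}
	}()
	return run()
}

// supervise restarts a controller after a failure or panic, up to
// maxRestarts times. A nil return from run means a clean exit.
func supervise(name string, run func() error, maxRestarts int) error {
	var err error
	for attempt := 0; attempt <= maxRestarts; attempt++ {
		if err = runRecovered(name, run); err == nil {
			return nil
		}
		fmt.Printf("attempt %d: %v\n", attempt+1, err)
	}
	return fmt.Errorf("giving up on %s: %w", name, err)
}

func main() {
	calls := 0
	// flaky panics on its first run and succeeds afterwards.
	flaky := func() error {
		calls++
		if calls == 1 {
			panic("nil pointer in sync loop")
		}
		return nil
	}
	if err := supervise("attacher", flaky, 3); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("attacher recovered after", calls, "runs")
}
```

Note that `recover` only catches panics raised in the goroutine where the deferred function runs, so goroutines started inside a controller would need the same guard, and an out-of-memory kill cannot be handled this way at all.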
Develop a minimal proof of concept
POC: https://github.com/mauriciopoppe/csi-sidecars-aio-poc
Design phase
Alpha phase
Beta phase
GA phase
Standalone phase
It is not actually a feature, but it can be enabled by deploying the new version of the CSI driver and disabled by deleting the new version and redeploying the old version.
This won't make any changes to the default behavior of Kubernetes.
It is not actually a feature but an architectural change, so users can deploy the old version of the CSI driver to disable it.
Nothing happens; it will behave as usual.
Yes. We will add unit tests with and without the feature gate enabled.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
plugin_execution_duration_seconds{plugin="VolumeBinding",extension_point="Score"}
- [Optional] Aggregation method:
- Components exposing the metric:
- Metric name:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Nothing in particular.
No.