Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3751: Update release signoff Checklist before GA #5024

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sunnylovestiramisu
Copy link
Contributor

@sunnylovestiramisu sunnylovestiramisu commented Jan 7, 2025

  • One-line PR description: Update Release Signoff Checklist
  • Issue link:

/assign @msau42

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 7, 2025
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 7, 2025
@xing-yang
Copy link
Contributor

  • Can you change the title "Kubernetes Volume Provisioned IO" to "Kubernetes Modify Volume" or "Kubernetes Volume Attributes Class"? It is not just about "provisioned io" any more.

  • "Stress test before GA" is in the Beta section. Do we have stress test?

  • This is in the PRR section "TODO Upgrade and rollback will be tested when the feature gate will change to beta." Is this still a TODO?

@xing-yang
Copy link
Contributor

It looks like this PR does not contain this merged change: https://github.com/kubernetes/enhancements/pull/5028/files
Do you need a rebase?

@sftim
Copy link
Contributor

sftim commented Jan 22, 2025

/retitle KEP-3751: Update release signoff Checklist before GA

@k8s-ci-robot k8s-ci-robot changed the title Update Release Signoff Checklist for KEP-3751 before GA KEP-3751: Update release signoff Checklist before GA Jan 22, 2025
@sunnylovestiramisu
Copy link
Contributor Author

/assign johnbelamaric

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sunnylovestiramisu
Once this PR has been reviewed and has the lgtm label, please ask for approval from johnbelamaric. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 31, 2025
@sunnylovestiramisu
Copy link
Contributor Author

Stress tests still WIP: kubernetes/kubernetes#129918

It is set at a low rate though similar to our other stress tests(10 pods), and creating pod+volume, and then modify the volume.

The pod and volume are both up and running.

4. Turn on both feature flag and beta API in api-server again. ``kubectl get vac`` shows the VACs again. Change PVC back to VAC1, modify is applied.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, everything looks pretty good. The only thing I'd like to see is some updates to the sections below, for example:

  1. SLOs: Did beta confirm the stated SLOs are appropriate?

  2. For this one:

    Yes. The VAC protection controller will be expensive because it needs to LIST PVCs/PVs but the call volume
    should be low.

    Can you explain a bit further? I think when you say "volume should be low" what you mean is "volume is user-initiated, not triggered by any upstream controller, and is expected to be low in the typical use case".

  3. Was this done? Can you include / point to any results?

    Yes, the feature may impact CreateVolume. We will measure this impact during beta and provide feedback to operators.

  4. Was this done? Can you include / point to any results?

    Stress tests will determine increase in resource usage at varying amounts of concurrent volume modifications.

Copy link
Contributor Author

@sunnylovestiramisu sunnylovestiramisu Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes it is still appropriate and it is used by AWS's driver in production workloads right now
  2. Your understanding is correct, volume modify is triggered by the user changing a PVC's VAC, and most CO's has a rate limit(you can modify once per 4 hours etc.). So we are not expecting high call rate
  3. This one needs an update, because not many customers in AWS creating volume with VAC. But that being said, there is no impact to existing CreateVolume if VAC is not set.
  4. Please see the other comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated these in the actual KEP files as well

@@ -953,7 +969,7 @@ Using this feature may result in non-negligible increase of resource usage IF cu
- external-resizer CPU and memory will see a non-negligible increase if users increased the number of concurrent operations via the `--workers` flag. We follow the strategy of sharing that limit between `ControllerExpandVolume` and `ControllerModifyVolume` RPCs, similar to how external-provisioner functions.
- The API-Server may see a spike of CPU when processing relevant changes.

Stress tests will determine increase in resource usage at varying amounts of concurrent volume modifications.
Stress tests will determine increase in resource usage at varying amounts of concurrent volume modifications. Before promoting to Beta in 1.29, 250 modifications at a rate of 4 patches per second was tested on AWS and the bottle-neck is AWS limits.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a link for this stress test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No link, it is their internal tests. I have asked Drew to make a comment here with their numbers, they still need to test CreateVolume with VAC at a much higher rate than the stress tests added in the upstream PR.

Copy link
Contributor

@AndrewSirenko AndrewSirenko Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm not comfortable with putting this existing language about tests done on AWS before promoting to beta into the KEP itself.

AWS EBS did not perform official K8s stress tests when helping promote this KEP from alpha to Beta (because stress tests were required for GA, not beta).

I did perform a spot check of modifying 250 volumes concurrently with EBS CSI Driver before EKS flipped VAC feature gate on by default in EKS 1.31.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And ideally these stress tests would be performed with CSI Mock driver, so cloud provider and CSI Driver implementations do not affect test results, right?

Copy link
Contributor

@AndrewSirenko AndrewSirenko Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I just performed the following quick test:

Patched 250 PVCs with a new volumeAttributesClassName at 5 patches per second. Took 100 seconds total for all modifications to finish, bottle-neck being the slow patch rate due to AWS modification limits

EKS 1.32 (VolumeAttributesClass feature gate & v1beta1 storage API group enabled)
external-resizer:v1.12.0
AWS EBS CSI Driver v1.39.0

Resizer container peaked at 30 milli-vCPU and 34 MiB RAM (Negligible usage)

Here's a gist with commands and manifests: https://gist.github.com/AndrewSirenko/4ab8ec5a1eee51674b5d3ac18835f4e9

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's a WIP stress test mentioned here: #5024 (comment)

* E2E tests using mock driver to cause failure on create, update and recovering cases
* [K8s storage framework](https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage/testsuites)
* [csi-tes](https://github.com/kubernetes-csi/csi-test)
* E2E tests: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/volumeattributesclass.go
* Test coverage of quota usage with ResourceQuota and LimitRange
* Measure latency impact to CreateVolume during beta and provide feedback to operators
* Upgrade and rollback test when the feature gate changes to beta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has anyone used/tested quota features of this KEP before we move the whole KEP to GA? This seems like a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest quota change is this: kubernetes/kubernetes#124360

But it did not merged in yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. But it seems weird to merge that code directly with GA feature-gate when it was never tested/merged before.

@sunnylovestiramisu sunnylovestiramisu force-pushed the updateVAC branch 2 times, most recently from 863f877 to 0bfdef2 Compare February 6, 2025 15:57
@johnbelamaric
Copy link
Member

Ok, I see the stress test is in progress, I would expect that do be done as part of this GA. The other PRR answers seem good. Once there is SIG approval, I can add the approval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants