Add Short Rotation Period For Certificates #1670
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,187 @@ | ||
--- | ||
title: certificate-short-rotation | ||
authors: | ||
- vrutkovs | ||
reviewers: | ||
- deads2k | ||
approvers: | ||
- deads2k | ||
api-approvers: | ||
- deads2k | ||
creation-date: 2024-08-24 | ||
last-updated: 2024-08-24 | ||
tracking-link: | ||
- https://issues.redhat.com/browse/API-1688 | ||
--- | ||
|
||
# Short Rotation Period For Certificates | ||
|
||
## Summary | ||
|
||
Add a new feature gate to the DevPreview feature set so that components issue certificates with a | ||
shorter validity duration: hours instead of days. | ||
|
||
## Motivation | ||
|
||
OpenShift currently issues certificates with various validity durations, but always at least 15 | ||
days. This makes testing certificate rotation in CI complicated: we have to emulate the passage of | ||
time using time skewing. This method shows how the cluster recovers after certificates have expired, | ||
but it doesn't help us test the happy path, where certificates rotate during the standard cluster | ||
lifecycle. | ||
|
||
Some components (e.g. cluster-kube-apiserver-operator) issue certificates with a shorter lifetime in | ||
the development branch. This requires us to revert the change every time we branch for a new release. | ||
It also doesn't help us in CI, as it needs a similar change in the installer. | ||
In addition, most components don't do this, so we end up with some certificates valid for hours while | ||
most are valid for days. No test currently verifies that certificates have indeed been | ||
rotated and that the rotation didn't cause additional disruptions. | ||
|
||
Since reverting this setting requires a manual pull request, there is a chance that the | ||
setting will leak into supported releases. | ||
|
||
This enhancement describes a new feature gate that enables short-lived certificates for all components. | ||
Because it is implemented as a FeatureGate, stable releases cannot have it accidentally enabled. | ||
Additionally, the enhancement describes how certificate rotation will be tested in order to | ||
avoid disruptions during rotation. | ||
|
||
### User Stories | ||
|
||
> As an OpenShift developer, I want a setting that makes components issue shorter-lived | ||
> certificates so that I can verify that certificate rotation doesn't cause issues. | ||
|
||
Note that there are no customer user stories: this is a developer-only feature, and customers are | ||
not expected to use it. | ||
|
||
### Goals | ||
|
||
* Create a new FeatureGate in DevPreview featureset | ||
|
||
* Each component can decide the new duration for certificates separately. | ||
Comment: Is there a specific reason for leaving this up to the components? The dev flag already means an upper bound on the rotation time:
So within 6 hrs all of the components of interest should have cycled through a rotation. I'm guessing we likely don't want all components to rotate at the same time (although I'm not sure whether we expect to tolerate any API disruption from these rotations).

Comment: Cert durations depend on functionality, as some certs are relatively painless to replace and some require a kube-apiserver revision rollout. So setting all certs to, say, 30 mins would break our disruption tests. Also, we don't know yet how the certs interact and what the effects of rotations are, so I'd rather not dictate exact certificate durations in this enhancement. That said, we do want to set an upper limit: if the certificate is rotatable, it must be rotated during that test, so cert validity duration is capped at 8 hours (but the less the better, as it's useful to observe several rotations).

Comment: Fair enough. We can hold off on the configurability of the "short" durations until we feel like we actually need to tune that per test run. Or if bumping the hardcoded rotation periods per component proves to be too unwieldy when iterating in CI.
||
* Create e2e tests that enable this featuregate and check that certificates rotate correctly | ||
* Run the e2e tests periodically to ensure that a cluster with this featuregate enabled is functional | ||
|
||
### Non-Goals | ||
|
||
* Change the validity duration of existing certificates | ||
* Change the pre-RC.0 certificate rotation, which is 60x faster than normal | ||
|
||
## Proposal | ||
Comment: Details are scant. To achieve the same benefit, this featuregate needs to be enabled in Default during pre-RC builds and moved to TechPreview after RC.0. Can we get this spelled out?

Comment: There wasn't any benefit; "developer branch fast rotation" has never worked in CI - see https://github.com/openshift/enhancements/pull/1670/files#diff-1695a5e93f0f7e139919d0e0fac08ce0ea6d442932ff1cc7fab792ef1c616cf1R31-R35 Added more details on how this will be tested in CI and promotion criteria.

Comment: The point is not coverage in CI. The point is every pre-production cluster in existence that runs longer than a day. Which covers a wide variety of clusters internal to this group and external to this group including education, TAMs, test platform, and various test environments. Losing this capability is enormous. CI was an objective, but we gained tremendous benefit without ever having a rotation in CI.

Comment: It never materialized into bugs though (especially into availability tracking), has it? Testing fast rotation in CI, however, will have the same effect along with a better set of debugging information. In any case, this featuregate doesn't forbid specific components from doing pre-release shorter cert rotations.

Comment: Added clarification that the existing devrotation code can be used alongside the short rotation featuregate.
||
|
||
In the developer branch, cluster-kube-apiserver-operator carries a patch which makes it issue | ||
certificates with shorter durations on rotation (5 days instead of 30 days). This allows us to test | ||
certificate rotation on installations using pre-RC code, but has a few significant downsides (most | ||
notably, the patch needs to be reverted after branching, and initial certificates are still issued | ||
with a "long" duration). cluster-kube-apiserver-operator issues 60x shorter certificates in pre-RC | ||
code, which means a certificate with a 15-day duration becomes a 6-hour certificate. The | ||
implementation of this featuregate should not interfere with that behavior, so it should not issue | ||
certificates with a duration longer than 6 hours. | ||
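
The arithmetic behind the 60x figure can be checked with a short Go sketch. It is illustrative only: the names `rotationScale`, `maxShortValidity`, `days`, `scaleDown` and `withinCap` are assumptions made for this example and are not taken from any OpenShift component.

```go
package main

import (
	"fmt"
	"time"
)

// rotationScale is the 60x factor described above for pre-RC code.
const rotationScale = 60

// maxShortValidity is the 6-hour ceiling described above (hypothetical name).
const maxShortValidity = 6 * time.Hour

// days converts a whole number of days into a time.Duration.
func days(n int) time.Duration {
	return time.Duration(n) * 24 * time.Hour
}

// scaleDown divides a normal validity duration by the 60x factor.
func scaleDown(validity time.Duration) time.Duration {
	return validity / rotationScale
}

// withinCap reports whether a proposed short validity respects the 6-hour ceiling.
func withinCap(validity time.Duration) bool {
	return validity <= maxShortValidity
}

func main() {
	scaled := scaleDown(days(15)) // 15 days = 360 hours; 360 / 60 = 6 hours
	fmt.Println("scaled validity:", scaled, "within cap:", withinCap(scaled))
}
```

A 15-day certificate lands exactly on the 6-hour ceiling, which is why a duration chosen under the featuregate should not exceed that value.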
|
||
This enhancement proposes updating all OpenShift components to read the enabled FeatureGates and | ||
to adjust their certificate issuing code accordingly. | ||
|
||
The featuregate would make components generate certificates with a shorter duration (hours | ||
instead of days), so that we can verify that most certificates are rotated within the duration of | ||
an e2e test. This would allow developers to verify that certificates get rotated without breaking | ||
cluster features. This featuregate is independent of the short rotation code for the developer | ||
branch and can be used alongside it. | ||
|
||
So far we've identified the following components which should use this featuregate: | ||
* installer | ||
* cluster-kube-apiserver-operator | ||
* cluster-kube-controller-manager-operator | ||
* cluster-etcd-operator | ||
* cluster-network-operator | ||
* service-ca-operator | ||
* OLM | ||
|
||
Component developers may also exclude some certificates from rotation. | ||
For example, some signers are meant to last "indefinitely" (currently set to 10 years) | ||
to support specific cluster features, e.g. the CSR signer is not meant to expire so that | ||
new nodes can join. | ||
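
One way a component might choose a validity duration under this proposal is sketched below in Go. This is a minimal sketch under stated assumptions: the `certSpec` type, its fields, the `hours` helper and `effectiveValidity` are hypothetical names invented for illustration, and the way a component learns whether the featuregate is enabled is reduced to a plain boolean. The certificate names and the 2-hour value are example data only.

```go
package main

import (
	"fmt"
	"time"
)

// certSpec is a hypothetical per-certificate configuration.
type certSpec struct {
	Name          string
	Validity      time.Duration // validity used when the featuregate is disabled
	ShortValidity time.Duration // validity used when the featuregate is enabled; zero means "excluded"
}

// hours converts a whole number of hours into a time.Duration.
func hours(n int) time.Duration {
	return time.Duration(n) * time.Hour
}

// effectiveValidity returns the validity to use when issuing the certificate.
// Certificates without a short validity (e.g. long-lived signers) keep their
// normal duration even when the featuregate is enabled.
func effectiveValidity(spec certSpec, gateEnabled bool) time.Duration {
	if !gateEnabled || spec.ShortValidity <= 0 {
		return spec.Validity
	}
	return spec.ShortValidity
}

func main() {
	specs := []certSpec{
		{Name: "example-serving-cert", Validity: hours(30 * 24), ShortValidity: hours(2)},
		{Name: "example-long-lived-signer", Validity: hours(10 * 365 * 24)}, // excluded from short rotation
	}
	for _, s := range specs {
		fmt.Printf("%s: gate off=%v, gate on=%v\n",
			s.Name, effectiveValidity(s, false), effectiveValidity(s, true))
	}
}
```

Keeping the short duration per certificate matches the goal that each component decides its own durations, and a zero value gives component developers a simple way to opt a certificate out.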
|
||
To avoid additional disruptions caused by fast certificate rotation, a new set of | ||
periodics should be created. These periodics would start a cluster with this featuregate enabled, | ||
ensuring that the installer generates short-duration certificates on day 1. | ||
Afterwards a standard minimal conformance test runs to verify that the cluster is functional. Test | ||
monitors would record API / service disruptions along with certificate rotation events so that | ||
we can measure the effect of rotation on availability. | ||
|
||
The new test job should run for a reasonable amount of time (e.g. 8 hours) and finish with a test | ||
that makes sure the content has changed for every certificate (except "indefinite" certs, see above). | ||
This test will ensure that all certificates are rotated and that the effect on availability is | ||
measured. | ||
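
The final "content has changed" check could look roughly like the Go sketch below. It assumes the test has two snapshots of certificate bytes keyed by a certificate name, one taken at the start of the job and one at the end. How those snapshots are collected from the cluster is out of scope here, and the function names, map layout and sample names are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint returns the hex-encoded SHA-256 digest of the certificate bytes.
func fingerprint(certBytes []byte) string {
	sum := sha256.Sum256(certBytes)
	return hex.EncodeToString(sum[:])
}

// unrotated returns the sorted names of certificates whose content is identical
// in both snapshots, skipping names listed as known exceptions. Certificates
// missing from the second snapshot are not reported by this sketch.
func unrotated(before, after map[string][]byte, exceptions map[string]bool) []string {
	var stale []string
	for name, oldBytes := range before {
		if exceptions[name] {
			continue
		}
		newBytes, ok := after[name]
		if ok && fingerprint(newBytes) == fingerprint(oldBytes) {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Placeholder byte strings stand in for real PEM data.
	before := map[string][]byte{
		"example-serving-cert": []byte("cert-v1"),
		"example-client-cert":  []byte("cert-v1"),
		"example-csr-signer":   []byte("signer-v1"),
	}
	after := map[string][]byte{
		"example-serving-cert": []byte("cert-v2"),
		"example-client-cert":  []byte("cert-v1"), // not rotated
		"example-csr-signer":   []byte("signer-v1"),
	}
	exceptions := map[string]bool{"example-csr-signer": true}

	fmt.Println("unrotated certificates:", unrotated(before, after, exceptions))
}
```

A real test would fail if the returned list is non-empty, with the exception list holding the "indefinite" certificates.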
|
||
### Workflow Description | ||
|
||
N/A | ||
|
||
### API Extensions | ||
|
||
N/A | ||
|
||
### Topology Considerations | ||
|
||
#### Hypershift / Hosted Control Planes | ||
|
||
N/A | ||
|
||
#### Standalone Clusters | ||
|
||
N/A | ||
|
||
#### Single-node Deployments or MicroShift | ||
|
||
Not applicable to MicroShift - it doesn't issue certificates via operators | ||
|
||
### Implementation Details/Notes/Constraints | ||
|
||
|
||
### Risks and Mitigations | ||
|
||
|
||
### Drawbacks | ||
|
||
|
||
## Open Questions [optional] | ||
|
||
|
||
## Test Plan | ||
|
||
End-to-end testing of this feature would: | ||
* enable the ShortCertificateRotation featuregate | ||
* run a minimal test suite to ensure that main cluster functions are not affected | ||
|
||
* observe the cluster for 6-8 hours | ||
* create a new test which verifies that certificates have rotated | ||
Some certificates (e.g. the ingress or CSR signer certificates) are expected to remain unrotated, so | ||
the test would have a list of known exceptions. | ||
|
||
## Graduation Criteria | ||
|
||
This featuregate is not meant to be graduated to GA - it's intended to be a developer-only setting. | ||
|
||
### Dev Preview -> Tech Preview | ||
A new set of periodics would show whether the cluster is functional and adheres to availability | ||
requirements. Once this is achieved, the featuregate may be promoted to TechPreview. This will give | ||
us additional signal, as TechPreview jobs run during nightly verification. | ||
|
||
### Tech Preview -> GA | ||
N/A | ||
|
||
### Removing a deprecated feature | ||
|
||
|
||
## Upgrade / Downgrade Strategy | ||
|
||
Enabling the DevPreview feature set is permanent - there is no way to upgrade or downgrade such a cluster. | ||
|
||
## Version Skew Strategy | ||
|
||
N/A | ||
|
||
## Operational Aspects of API Extensions | ||
|
||
N/A | ||
|
||
## Support Procedures | ||
|
||
This setting is unsupported | ||
|
||
## Alternatives |