Add Short Rotation Period For Certificates #1670

187 changes: 187 additions & 0 deletions enhancements/certificate-short-rotation.md
@@ -0,0 +1,187 @@
---
title: certificate-short-rotation
authors:
- vrutkovs
reviewers:
- deads2k
approvers:
- deads2k
api-approvers:
- deads2k
creation-date: 2024-08-24
last-updated: 2024-08-24
tracking-link:
- https://issues.redhat.com/browse/API-1688
---

# Short Rotation Period For Certificates

## Summary

Add a new feature gate to the DevPreview feature set so that components issue certificates with a
shorter duration: hours instead of days.

## Motivation

Currently OpenShift issues certificates with various validity durations, all of at least 15
days. This makes testing certificate rotation in CI complicated: we have to emulate the passage of
time using time skewing. This method shows how a cluster recovers after certificates have expired,
but it doesn't help us test the happy path, where certificates rotate during the standard cluster lifecycle.

Some components (e.g. cluster-kube-apiserver-operator) issue certificates with a shorter lifetime in
the development branch. This requires us to revert the change every time we branch for a new release.
It also doesn't help us in CI, as it needs a similar change in the installer.
In addition, most components don't do this, so we end up with some certificates valid for hours while
most are valid for days. No test currently verifies that certificates have indeed been
rotated and that the rotation didn't cause additional disruptions.

Since reverting this setting requires a manual pull request, there is a chance that the
setting will leak into supported releases.

This enhancement describes a new feature gate which would enable this behavior for all components
and, because it is implemented as a FeatureGate, ensure that stable releases don't have it
accidentally enabled. Additionally, the enhancement describes how certificate rotation is tested
in order to avoid disruptions during rotation.

### User Stories

> As an OpenShift developer, I want a setting that makes components issue shorter-lived
> certificates so that I can verify that certificate rotation doesn't cause issues.

Note that there are no customer user stories: this is a developer-only feature, and customers are
not expected to use it.

### Goals

* Create a new FeatureGate in the DevPreview feature set
* Each component can decide the new duration for certificates separately.
Contributor

Is there a specific reason for leaving this up to the components?

The dev flag already means an upper bound on the rotation time, i.e.:

> Test Plan
> observe the cluster for 6-8 hours

So within 6 hrs all of the components of interest should have cycled through a rotation.
So why don't we just explicitly dictate the duration (e.g 3hrs) for the relevant components?

I'm guessing we likely don't want all components to rotate at the same time (although not sure if we expect to tolerate any API disruption from these rotations).
But if the motivation for letting components choose is to stagger the rotations, then rotations could still overlap if components end up choosing the same "shortened" duration or change it later for whatever reason.

Member Author

> So why don't we just explicitly dictate the duration (e.g 3hrs) for the relevant components?

Cert durations depend on functionality, as some certs are relatively painless to replace and some require a kube-apiserver revision rollout. So setting all certs to, say, 30 mins would break our disruption tests. Also, we don't know yet how the certs interact and what the effects of rotations are, so I'd rather not dictate exact certificate durations in this enhancement.

That said, we do want to set an upper limit: if the certificate is rotatable it must be rotated during that test, so cert validity duration is capped at 8 hours (but the less the better, as it's useful to observe several rotations).

Contributor

Fair enough. We can hold off on the configurability of the "short" durations until we feel like we actually need to tune that per test run. Or if bumping the hardcoded rotation periods per component proves to be too unwieldy when iterating in CI.

* Create e2e tests that enable this featuregate and check that certificates rotate correctly
* Run the e2e tests periodically to ensure that a cluster with this featuregate enabled is functional

### Non-Goals

* Change validity duration for existing certificates
* Change the existing pre-RC.0 certificate rotation, which is 60x faster than normal

## Proposal
Contributor

Details are scant. To achieve the same benefit, this featuregate needs to be enabled in Default during pre-RC builds and moved to TechPreview after RC.0. Can we get this spelled out?

Member Author

There wasn't any benefit, "developer branch fast rotation" has never worked in CI - see https://github.com/openshift/enhancements/pull/1670/files#diff-1695a5e93f0f7e139919d0e0fac08ce0ea6d442932ff1cc7fab792ef1c616cf1R31-R35

Added more details on how this will be tested in CI and promotion criteria

Contributor

> There wasn't any benefit, "developer branch fast rotation" has never worked in CI - see https://github.com/openshift/enhancements/pull/1670/files#diff-1695a5e93f0f7e139919d0e0fac08ce0ea6d442932ff1cc7fab792ef1c616cf1R31-R35
>
> Added more details on how this will be tested in CI and promotion criteria

The point is not coverage in CI. The point is every pre-production cluster in existence that runs longer than a day, which covers a wide variety of clusters internal and external to this group, including education, TAMs, test platform, and various test environments. Losing this capability would be an enormous loss. CI was an objective, but we gained tremendous benefit without ever having a rotation in CI.

Member Author

It never materialized into bugs though (especially into availability tracking), did it? Testing fast rotation in CI, however, will have the same effect along with a better set of debugging information.

In any case, this featuregate doesn't forbid specific components from doing pre-release shorter cert rotations.

Member Author

Added clarification that existing devrotation code can be used alongside short rotation featuregate


In the developer branch, cluster-kube-apiserver-operator carries a patch which makes it issue
certificates with shorter durations on rotation (5 days instead of 30 days). This allows us to test
certificate rotation on installations using pre-RC code, but has a few significant downsides: most
notably, the patch needs to be reverted after branching, and initial certificates are still issued
for a "long" duration. cluster-kube-apiserver-operator issues 60x shorter certificates in pre-RC
code, which means a 15-day certificate becomes a 6-hour certificate. The implementation of this
featuregate should not interfere with that behavior, so it should not issue certificates with a
duration longer than 6 hours.

This enhancement proposes to update components to read the enabled FeatureGates and
to update the certificate issuing code in all OpenShift components accordingly.

The featuregate would make components generate certificates which have a shorter duration (hours
instead of days) so that we can verify that most certificates can be rotated within the duration of
an e2e test. This would allow developers to verify that certificates get rotated without breaking
cluster features. This featuregate is independent of the short rotation code for the developer
branch and can be used alongside it.

So far we've identified the following components which should use this featuregate:
* installer
* cluster-kube-apiserver-operator
* cluster-kube-controller-manager-operator
* cluster-etcd-operator
* cluster-network-operator
* service-ca-operator
* OLM

Component developers may also exclude some certificates from rotation.
For example, some signers are meant to last "indefinitely" (currently set to 10 years)
to support specific cluster features, e.g. the CSR signer is not meant to expire so that
new nodes can join.

In order to avoid additional disruptions caused by fast certificate rotation, a new set of
periodics should be created. These periodics would start a cluster with this featuregate enabled,
ensuring that the installer generates short-duration certificates on day 1.
Afterwards a standard minimal conformance test runs to verify that the cluster is functional. Test
monitors would record API / service disruptions along with certificate rotation events so that
we can measure their effect on availability.

The new test job should run for a reasonable amount of time (e.g. 8 hours) and finish with a test
which makes sure the content has changed for every certificate (except "indefinite" certs, see above).
This test will ensure that all certificates are being rotated and that the effect on availability
is being measured.

### Workflow Description

N/A

### API Extensions

N/A

### Topology Considerations

#### Hypershift / Hosted Control Planes

N/A

#### Standalone Clusters

N/A

#### Single-node Deployments or MicroShift

Not applicable to MicroShift: it doesn't issue certificates via operators.

### Implementation Details/Notes/Constraints


### Risks and Mitigations


### Drawbacks


## Open Questions [optional]


## Test Plan

End-to-end testing of this feature would:
* enable the ShortCertificateRotation featuregate
* run a minimal test suite to ensure that main cluster functions are not affected
* observe the cluster for 6-8 hours
* run a new test which verifies that certificates have rotated

Some certificates (e.g. ingress or CSR signer) are expected to remain unrotated, so the test
would have a list of known exceptions.
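
The rotation check with known exceptions could look roughly like the following Go sketch. It compares SHA-256 fingerprints of certificate bytes captured at the start and at the end of the run. The snapshot format (a map from certificate name to raw bytes) and all certificate names are hypothetical; how a real test would collect the certificates is outside the scope of this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint returns the hex-encoded SHA-256 digest of certificate bytes.
func fingerprint(certBytes []byte) string {
	sum := sha256.Sum256(certBytes)
	return hex.EncodeToString(sum[:])
}

// unrotated returns the sorted names of certificates whose content is
// identical in both snapshots, skipping the known exceptions. Certificates
// missing from the second snapshot are not reported here.
func unrotated(before, after map[string][]byte, exceptions map[string]bool) []string {
	var stale []string
	for name, old := range before {
		if exceptions[name] {
			continue
		}
		current, ok := after[name]
		if ok && fingerprint(current) == fingerprint(old) {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Placeholder byte strings stand in for real certificate content.
	before := map[string][]byte{
		"example-serving": []byte("serving-v1"),
		"example-client":  []byte("client-v1"),
		"csr-signer":      []byte("signer-v1"),
	}
	after := map[string][]byte{
		"example-serving": []byte("serving-v2"),
		"example-client":  []byte("client-v1"), // not rotated
		"csr-signer":      []byte("signer-v1"), // expected to stay the same
	}
	exceptions := map[string]bool{"csr-signer": true}

	fmt.Println(unrotated(before, after, exceptions)) // [example-client]
}
```

A test built this way fails when the returned slice is non-empty, and the exception list makes the expected long-lived signers explicit instead of silently ignoring them.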

## Graduation Criteria

This featuregate is not meant to graduate to GA: it's intended to be a developer-only setting.

### Dev Preview -> Tech Preview
A new set of periodics would show whether the cluster is functional and adheres to availability requirements.
Once this is achieved the featuregate may be promoted to TechPreview. This will give us additional
signal, as TechPreview jobs run during nightly verification.

### Tech Preview -> GA
N/A

### Removing a deprecated feature


## Upgrade / Downgrade Strategy

Enabling the DevPreview feature set is permanent: there is no way to upgrade or downgrade the cluster afterwards.

## Version Skew Strategy

N/A

## Operational Aspects of API Extensions

N/A

## Support Procedures

This setting is unsupported

## Alternatives