Skip to content

Commit a7535bf

Browse files
jtschellingclaude
andcommitted
docs: reframe cert rotation section as forward-looking design
The certificate rotation section read as if describing bugs in existing code. Rewritten to present the design decisions prescriptively — how rotation will be handled, not how to fix what's already there. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9b58725 commit a7535bf

File tree

1 file changed

+30
-32
lines changed

1 file changed

+30
-32
lines changed

docs/designs/029-grpc-tls-authentication.md

Lines changed: 30 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -282,55 +282,52 @@ self-signed Issuer, reusing NVSentinel's existing PKI infrastructure.
282282

283283
### Certificate Rotation
284284

285-
cert-manager automatically rotates certificates before expiry. The current reference
286-
implementation loads certificates once at startup — neither the server's TLS cert
287-
nor the client's CA bundle is reloaded at runtime. This means a cert rotation by
288-
cert-manager would not take effect until the affected pod restarts.
285+
cert-manager automatically rotates leaf certificates before expiry. The design needs
286+
to account for how each component picks up rotated credentials without manual
287+
intervention.
289288

290-
For SA tokens, rotation is already handled: the client interceptor re-reads the
291-
token file on every gRPC call, so kubelet's automatic token refresh is picked up
292-
immediately.
289+
**SA tokens** are the simplest case: because the client interceptor re-reads the
290+
token file on every gRPC call, kubelet's automatic token refresh is picked up
291+
immediately with no additional machinery.
293292

294-
#### Server-Side Cert Rotation
293+
**TLS certificates** require more consideration. A naive implementation that loads
294+
certs once at startup would serve stale credentials after cert-manager rotates the
295+
certificate, requiring a pod restart to pick up the new cert. The approaches below
296+
avoid that.
295297

296-
The janitor-provider loads its TLS key pair once via `tls.LoadX509KeyPair()` in
297-
`main.go`. When cert-manager rotates the server certificate, the provider continues
298-
serving the old cert until restarted.
298+
#### Server-Side Cert Rotation
299299

300-
**Recommended fix**: Use the `certwatcher` package from `controller-runtime`, which
301-
watches certificate files for changes and hot-reloads them via the
302-
`tls.Config.GetCertificate` callback. This is the same pattern already used in the
303-
janitor for webhook and metrics TLS:
300+
The janitor-provider should use the `certwatcher` package from `controller-runtime`
301+
to watch certificate files and hot-reload them via the `tls.Config.GetCertificate`
302+
callback. This is the same pattern the janitor already uses for webhook and metrics
303+
TLS:
304304

305305
```go
306-
// Existing pattern in janitor/main.go for webhook certs:
307306
watcher, err := certwatcher.New(certPath, keyPath)
308-
mgr.Add(watcher) // starts filesystem watch
307+
go watcher.Start(ctx) // starts filesystem watch
309308
310309
tlsConfig := &tls.Config{
311310
GetCertificate: watcher.GetCertificate,
312311
MinVersion: tls.VersionTLS12,
313312
}
314313
```
315314

316-
The janitor-provider does not use `controller-runtime`'s manager, so it would need
317-
to either:
318-
319-
1. Start the `certwatcher` goroutine manually (it implements `Runnable` with a
320-
`Start(ctx)` method)
321-
2. Use a standalone filesystem watcher (e.g., `fsnotify`) to reload the key pair
315+
Since the janitor-provider does not use `controller-runtime`'s manager, the
316+
`certwatcher` goroutine would be started manually (it implements `Runnable` with a
317+
`Start(ctx)` method). This reuses a well-tested package already in the dependency
318+
tree.
322319

323-
Option 1 is preferred since it reuses the same well-tested package.
320+
An alternative is a standalone `fsnotify` watcher, but `certwatcher` is preferred
321+
for consistency with the janitor's existing pattern.
324322

325323
#### Client-Side CA Bundle Rotation
326324

327-
The janitor controller loads the CA bundle once in `NewCSPProviderDialOptions()` and
328-
builds a static `x509.CertPool`. If the CA itself rotates (e.g., the self-signed
329-
Issuer regenerates its CA key), the client will reject the new server cert because
330-
its cert pool is stale.
325+
The client needs to trust the CA that signed the server's certificate. If the CA
326+
itself rotates (e.g., the self-signed Issuer regenerates its CA key), the client
327+
must pick up the new CA bundle or it will reject the new server cert.
331328

332-
In practice, CA rotation is infrequent — cert-manager rotates leaf certificates
333-
but the CA key typically remains stable. However, for robustness:
329+
In practice, CA rotation is infrequent — cert-manager rotates leaf certificates but
330+
the CA key typically remains stable. Three approaches for handling CA bundle updates:
334331

335332
**Option A — Periodic reload**: A background goroutine re-reads the CA bundle file
336333
on an interval (e.g., every 5 minutes) and swaps the `RootCAs` cert pool atomically.
@@ -342,8 +339,9 @@ file and reload on change.
342339
typically planned, accept that the janitor pod needs a restart when the CA changes.
343340
This is the simplest approach and may be sufficient for most deployments.
344341

345-
The reference implementation currently follows option C implicitly. Options A or B
346-
can be added if CA rotation without downtime becomes a requirement.
342+
This design recommends **Option C** as the initial approach, given that CA rotation
343+
is an infrequent, planned event. Options A or B can be adopted later if zero-downtime
344+
CA rotation becomes a requirement.
347345

348346
### Test Compatibility
349347

0 commit comments

Comments
 (0)