Skip to content

Commit ca928b9

Browse files
jtschellingclaude
andcommitted
docs: add certificate rotation plan to ADR-029
Documents the current gap where server TLS certs and client CA bundles are loaded once at startup without hot-reload. Proposes using the certwatcher pattern (already used for webhook/metrics TLS) for server-side rotation, and outlines three options for client-side CA bundle rotation. Signed-off-by: JT Schelling <jschelling@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: JT Schelling <jschelling@nvidia.com>
1 parent 6ef21e7 commit ca928b9

File tree

1 file changed

+65
-0
lines changed

1 file changed

+65
-0
lines changed

docs/designs/029-grpc-tls-authentication.md

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -282,6 +282,71 @@ When `tls.enabled=true`, the chart creates a cert-manager `Certificate` resource
282282
and mounts the resulting Secret. The `issuerName` defaults to the janitor's existing
283283
self-signed Issuer, reusing NVSentinel's existing PKI infrastructure.
284284

285+
### Certificate Rotation
286+
287+
cert-manager automatically rotates certificates before expiry. The current reference
288+
implementation loads certificates once at startup — neither the server's TLS cert
289+
nor the client's CA bundle is reloaded at runtime. This means a cert rotation by
290+
cert-manager would not take effect until the affected pod restarts.
291+
292+
For SA tokens, rotation is already handled: the client interceptor re-reads the
293+
token file on every gRPC call, so kubelet's automatic token refresh is picked up
294+
immediately.
295+
296+
#### Server-Side Cert Rotation
297+
298+
The janitor-provider loads its TLS key pair once via `tls.LoadX509KeyPair()` in
299+
`main.go`. When cert-manager rotates the server certificate, the provider continues
300+
serving the old cert until restarted.
301+
302+
**Recommended fix**: Use the `certwatcher` package from `controller-runtime`, which
303+
watches certificate files for changes and hot-reloads them via the
304+
`tls.Config.GetCertificate` callback. This is the same pattern already used in the
305+
janitor for webhook and metrics TLS:
306+
307+
```go
308+
// Existing pattern in janitor/main.go for webhook certs:
309+
watcher, err := certwatcher.New(certPath, keyPath)
310+
mgr.Add(watcher) // starts filesystem watch
311+
312+
tlsConfig := &tls.Config{
313+
GetCertificate: watcher.GetCertificate,
314+
MinVersion: tls.VersionTLS12,
315+
}
316+
```
317+
318+
The janitor-provider does not use `controller-runtime`'s manager, so it would need
319+
to either:
320+
321+
1. Start the `certwatcher` goroutine manually (it implements `Runnable` with a
322+
`Start(ctx)` method)
323+
2. Use a standalone filesystem watcher (e.g., `fsnotify`) to reload the key pair
324+
325+
Option 1 is preferred since it reuses the same well-tested package.
326+
327+
#### Client-Side CA Bundle Rotation
328+
329+
The janitor controller loads the CA bundle once in `NewCSPProviderDialOptions()` and
330+
builds a static `x509.CertPool`. If the CA itself rotates (e.g., the self-signed
331+
Issuer regenerates its CA key), the client will reject the new server cert because
332+
its cert pool is stale.
333+
334+
In practice, CA rotation is infrequent — cert-manager rotates leaf certificates
335+
but the CA key typically remains stable. However, for robustness:
336+
337+
**Option A — Periodic reload**: A background goroutine re-reads the CA bundle file
338+
on an interval (e.g., every 5 minutes) and swaps the `RootCAs` cert pool atomically.
339+
340+
**Option B — File watcher**: Use `fsnotify` or `certwatcher` to watch the CA bundle
341+
file and reload on change.
342+
343+
**Option C — Accept restart-on-CA-rotation**: Since CA rotation is rare and
344+
typically planned, accept that the janitor pod needs a restart when the CA changes.
345+
This is the simplest approach and may be sufficient for most deployments.
346+
347+
The reference implementation currently follows option C implicitly. Options A or B
348+
can be added if CA rotation without downtime becomes a requirement.
349+
285350
### Test Compatibility
286351

287352
Existing unit tests use `bufconn` with insecure credentials. The `CSPProviderInsecure`

0 commit comments

Comments
 (0)