[DRAFT] Fix server restarting after cert rotation#553
[DRAFT] Fix server restarting after cert rotation#553zbud-msft wants to merge 3 commits intosonic-net:masterfrom
Conversation
Signed-off-by: Zain Budhwani <zainbudhwani@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Zain Budhwani <zainbudhwani@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
hdwhdw
left a comment
There was a problem hiding this comment.
AI Agent on behalf of Dawei:
Thanks for working on this — the direction of hot-reloading certs via CertCache + GetCertificate instead of restarting the server is the right call. A few things below: one bug, one architectural suggestion, and some smaller items.
Bug: VerifyPeerCertificate will never run when CA cert is configured
With ClientAuth: tls.RequireAndVerifyClientCert and ClientCAs never set, Go's built-in TLS verification runs first against the system root CAs. If the client cert is signed by a custom CA (the whole point of --ca_crt), built-in verification fails and the handshake aborts before VerifyPeerCertificate is ever called.
From Go docs:
If normal verification fails then the handshake will abort before considering this callback.
Fix: use tls.RequireAnyClientCert (requires cert, skips built-in verification) when you have a custom VerifyPeerCertificate. The AllowNoClientCert path (RequestClientCert) accidentally works fine since it's below VerifyClientCertIfGiven.
Suggestion: extract the new cert code into internal/certmgr
You're introducing CertCache, loadAndValidateCert, loadAndValidateCA, and reworking iNotifyCertMonitoring — that's a self-contained cert management domain. Since this is new code you're writing anyway, it would cost very little extra to put it in its own package rather than adding to the already-overloaded telemetry main package (~548 lines).
The repo already has an established internal/ pattern (internal/diskspace, internal/download, internal/hash, etc.), so this fits naturally:
internal/certmgr/
certmgr.go # CertCache, LoadAndValidateCert, LoadAndValidateCA
watcher.go # WatchCerts (the reworked iNotifyCertMonitoring)
Benefits:
certmgrhas zero dependency on gRPC, server config, or gnmi — clean boundary- Independently testable: cert loading/rotation tests don't need a gRPC server
telemetry.gojust importsinternal/certmgrand uses its exported types- Sets a pattern for gradually untangling the rest of the monolith
telemetry.go would still own the tls.Config construction and GetCertificate callback — it just delegates cert loading/caching/watching to certmgr.
Smaller items:
ServerStartandServerRestartconstants are now dead code — nothing sends themlog.FatalfinsideGetCertificatekills the entire process from a TLS handshake callback. Returning an error would just fail that one connection, which seems safer.- The PR needs a rebase — master has added fields (
UnixSocket,CaCertLnk,CertzMetaFile, etc.) that are missing here.
Why I did it
How I did it
How to verify it
Which release branch to backport (provide reason below if selected)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)