Skip to content

[DRAFT] Fix server restarting after cert rotation#553

Draft
zbud-msft wants to merge 3 commits intosonic-net:masterfrom
zbud-msft:stop_server_restart_cert_rotation
Draft

[DRAFT] Fix server restarting after cert rotation#553
zbud-msft wants to merge 3 commits intosonic-net:masterfrom
zbud-msft:stop_server_restart_cert_rotation

Conversation

@zbud-msft
Copy link
Contributor

Why I did it

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Zain Budhwani <zainbudhwani@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Zain Budhwani <zainbudhwani@microsoft.com>
Signed-off-by: Zain Budhwani <zainbudhwani@microsoft.com>
@mssonicbld
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@hdwhdw hdwhdw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI Agent on behalf of Dawei:

Thanks for working on this — the direction of hot-reloading certs via CertCache + GetCertificate instead of restarting the server is the right call. A few things below: one bug, one architectural suggestion, and some smaller items.


Bug: VerifyPeerCertificate will never run when CA cert is configured

With ClientAuth: tls.RequireAndVerifyClientCert and ClientCAs never set, Go's built-in TLS verification runs first against the system root CAs. If the client cert is signed by a custom CA (the whole point of --ca_crt), built-in verification fails and the handshake aborts before VerifyPeerCertificate is ever called.

From Go docs:

If normal verification fails then the handshake will abort before considering this callback.

Fix: use tls.RequireAnyClientCert (requires cert, skips built-in verification) when you have a custom VerifyPeerCertificate. The AllowNoClientCert path (RequestClientCert) accidentally works fine since it's below VerifyClientCertIfGiven.


Suggestion: extract the new cert code into internal/certmgr

You're introducing CertCache, loadAndValidateCert, loadAndValidateCA, and reworking iNotifyCertMonitoring — that's a self-contained cert management domain. Since this is new code you're writing anyway, it would cost very little extra to put it in its own package rather than adding to the already-overloaded telemetry main package (~548 lines).

The repo already has an established internal/ pattern (internal/diskspace, internal/download, internal/hash, etc.), so this fits naturally:

internal/certmgr/
  certmgr.go    # CertCache, LoadAndValidateCert, LoadAndValidateCA
  watcher.go    # WatchCerts (the reworked iNotifyCertMonitoring)

Benefits:

  • certmgr has zero dependency on gRPC, server config, or gnmi — clean boundary
  • Independently testable: cert loading/rotation tests don't need a gRPC server
  • telemetry.go just imports internal/certmgr and uses its exported types
  • Sets a pattern for gradually untangling the rest of the monolith

telemetry.go would still own the tls.Config construction and GetCertificate callback — it just delegates cert loading/caching/watching to certmgr.


Smaller items:

  • ServerStart and ServerRestart constants are now dead code — nothing sends them
  • log.Fatalf inside GetCertificate kills the entire process from a TLS handshake callback. Returning an error would just fail that one connection, which seems safer.
  • The PR needs a rebase — master has added fields (UnixSocket, CaCertLnk, CertzMetaFile, etc.) that are missing here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants