K8SPSMDB-1606: Add CA key persistence for manual TLS to support safe cert re-signing on SAN changes#2277

Merged
hors merged 12 commits into percona:main from myJamong:NO-TICKET-YET-manual-tls-ca-persistence
Jun 8, 2026

@myJamong myJamong commented Mar 9, 2026

Contributor

CHANGE DESCRIPTION

Problem:
When cert-manager is not installed, the operator's manual TLS path has two issues: (1) it never detects SAN changes (e.g., when splitHorizons are added), because it skips reconciliation if TLS secrets already exist, and (2) if secrets are manually deleted to force regeneration, a completely new CA is generated without merging with the old one, causing TLS verification failures during SmartUpdate rolling restarts.

I also opened an issue for this: #2278

Cause:
The manual TLS code path (createSSLManually) only creates secrets when they don't exist and returns immediately if they do — there is no SAN change detection. Additionally, each call to tls.Issue() generates an independent CA, meaning ssl and ssl-internal use different CAs, and any regeneration produces a CA that existing pods cannot trust.

Solution:
Introduce a persistent CA secret ({name}-ca-cert) for manual TLS management, mirroring the cert-manager CA secret structure. The CA key is preserved so that when SANs change (e.g., splitHorizon additions), the operator re-signs TLS certificates using the same CA — no CA merge is needed and rolling restarts are safe. Key changes:

  • IssueCA() / IssueWithCA(): Split certificate generation so CA and TLS certs can be managed independently.
  • getOrCreateManualCA(): Creates or reuses the persistent CA secret.
  • needsManualSSLUpdate(): Detects SAN changes by comparing the current certificate's SANs with expected SANs.
  • updateSSLManually(): Re-signs TLS certs with the existing CA when SANs change.
  • Both ssl and ssl-internal now share the same CA (consistent with cert-manager behavior).
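
The effect of keeping the CA key can be sketched with the Go standard library alone. This is a minimal illustration, not the operator's code: `issueCA`, `issueWithCA`, `trusts`, and the `ca` struct are hypothetical stand-ins for the `IssueCA()` / `IssueWithCA()` split described above, and the key type, serial numbers, and validity periods are assumptions chosen for brevity.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// ca bundles a CA certificate with its private key. Persisting both is what
// makes later re-signing with the same CA possible.
type ca struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

// issueCA creates a new self-signed CA (hypothetical stand-in for IssueCA).
func issueCA(commonName string) (*ca, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: commonName},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	// Template and parent are the same certificate: self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	return &ca{cert: cert, key: key}, nil
}

// issueWithCA signs a leaf certificate for dnsNames with an existing CA
// (hypothetical stand-in for IssueWithCA).
func issueWithCA(c *ca, serial int64, dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: "leaf"},
		DNSNames:     dnsNames,
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, c.cert, &key.PublicKey, c.key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// trusts reports whether leaf chains to root and is valid for dnsName.
func trusts(root, leaf *x509.Certificate, dnsName string) bool {
	pool := x509.NewCertPool()
	pool.AddCert(root)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool, DNSName: dnsName})
	return err == nil
}
```

In this sketch, a peer that trusts only the first CA accepts a leaf re-signed by that CA with an extra SAN, but rejects a leaf signed by a freshly generated CA, even when both CAs carry the same subject name. That is the failure mode the description attributes to regenerating the CA during a rolling restart.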

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size Bot added the size/XL 500-999 lines label Mar 9, 2026
@egegunes egegunes changed the title Add CA key persistence for manual TLS to support safe cert re-signing on SAN changes K8SPSMDB-1606: Add CA key persistence for manual TLS to support safe cert re-signing on SAN changes Mar 9, 2026
@egegunes egegunes added this to the v1.23.0 milestone Mar 9, 2026
Comment on lines 459 to 461
if cr.CompareVersion("1.17.0") < 0 {
secretObj.Labels = nil
caLabels = nil
}

Contributor

this condition can be removed

Contributor Author

Removed the condition - 0a1aa64

It seems like this line slipped in by accident.

Comment on lines +477 to +479
if cr.CompareVersion("1.17.0") < 0 {
secretLabels = nil
}

Contributor

we don't need this condition

Contributor Author

Removed the condition - 0a1aa64

It seems like this line slipped in by accident.

pooknull previously approved these changes Apr 14, 2026
@hors hors requested a review from egegunes May 6, 2026 13:54
egegunes previously approved these changes May 11, 2026
hors previously approved these changes May 15, 2026
gkech previously approved these changes May 25, 2026

@gkech gkech left a comment

Contributor

approving, but minor comment for future consideration

Comment thread pkg/psmdb/tls/tls.go Outdated
}

// GetSANsFromCert extracts DNS SANs from a PEM-encoded TLS certificate.
func GetSANsFromCert(tlsCertPEM []byte) ([]string, error) {

Contributor

Got confused by this function name. Either rename to GetDNSNamesFromCert or return all SAN types

Contributor Author

@gkech Renamed in ece2836

The function returns only cert.DNSNames, not all SAN types.
Renaming to match actual behavior per PR review feedback.
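
The naming point in this thread can be illustrated with a short sketch. It assumes the function does roughly what its comment says (decode a PEM certificate and return `cert.DNSNames`). `dnsNamesFromCert` and `selfSignedPEM` are hypothetical names, and the helper exists only to produce a certificate to inspect.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"math/big"
	"net"
	"time"
)

// dnsNamesFromCert returns only the dNSName SAN entries of the first
// certificate in certPEM. IP, email, and URI SANs are not included.
func dnsNamesFromCert(certPEM []byte) ([]string, error) {
	block, _ := pem.Decode(certPEM)
	if block == nil || block.Type != "CERTIFICATE" {
		return nil, errors.New("no PEM-encoded certificate found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return nil, err
	}
	return cert.DNSNames, nil
}

// selfSignedPEM is a demo-only helper: it builds a throwaway self-signed
// certificate carrying the given DNS SANs plus one IP SAN.
func selfSignedPEM(dnsNames []string, ip string) ([]byte, error) {
	parsedIP := net.ParseIP(ip)
	if parsedIP == nil {
		return nil, errors.New("invalid IP address")
	}
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
		IPAddresses:  []net.IP{parsedIP},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}
```

For a certificate built with `selfSignedPEM([]string{"a.example", "b.example"}, "127.0.0.1")`, `dnsNamesFromCert` yields the two DNS names and omits the IP SAN, which is why a name mentioning DNS names describes the behavior more accurately than one mentioning SANs in general.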
@myJamong myJamong dismissed stale reviews from hors, egegunes, and pooknull via ece2836 May 26, 2026 02:22
}

// needsManualSSLUpdate checks if the TLS certificate SANs differ from the expected SANs.
func (r *ReconcilePerconaServerMongoDB) needsManualSSLUpdate(ctx context.Context, cr *api.PerconaServerMongoDB, sslSecret *corev1.Secret) bool {

Member

ctx is not used here

Contributor Author

@mayankshah1607 Thanks. Used ctx to add the missing error log below — that resolves your other comment too. Fixed in f502c50

  currentSANs, err := tls.GetDNSNamesFromCert(sslSecret.Data["tls.crt"])
  if err != nil {
-     return errors.Wrap(err, "create TLS internal secret")
+     return false

Member

If we are swallowing the error, we should at least log it here (in that case we will need the context logger)

Contributor Author

@mayankshah1607 Thanks. Added an error log via the context logger. Kept the return false since the caller treats this as advisory and retrying wouldn't recover a corrupt cert. Fixed in f502c50
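
The pattern agreed on in this thread (log the parse error, then treat the check as advisory) might look like the sketch below. It is not the merged code: `sameDNSNames` and `needsUpdate` are hypothetical names, the `read` callback stands in for reading DNS names from the secret's `tls.crt`, and `log/slog` is used here in place of whatever context logger the operator actually uses. Comparing the lists as sets, ignoring order and duplicates, is also an assumption.

```go
package main

import (
	"context"
	"log/slog"
	"slices"
)

// sameDNSNames reports whether a and b contain the same names, ignoring
// order and duplicates. The inputs are not modified.
func sameDNSNames(a, b []string) bool {
	x, y := slices.Clone(a), slices.Clone(b)
	slices.Sort(x)
	slices.Sort(y)
	return slices.Equal(slices.Compact(x), slices.Compact(y))
}

// needsUpdate returns true when the certificate's current DNS names differ
// from the expected ones. If the names cannot be read, the error is logged
// and false is returned, so a corrupt certificate does not trigger a re-sign.
func needsUpdate(ctx context.Context, log *slog.Logger, read func() ([]string, error), expected []string) bool {
	current, err := read()
	if err != nil {
		log.ErrorContext(ctx, "failed to read DNS names from TLS certificate", "error", err)
		return false
	}
	return !sameDNSNames(current, expected)
}
```

A caller could pass `slog.Default()` and a closure that parses the certificate. Returning `false` on a parse error mirrors the author's reasoning that retrying would not recover a corrupt certificate, while the log line keeps the failure visible.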

@hors hors requested review from egegunes, gkech and mayankshah1607 May 26, 2026 10:46
@github-actions github-actions Bot added the tests label Jun 5, 2026
@JNKPercona

Collaborator
Test Name Result Time
arbiter passed 00:11:24
balancer passed 00:18:48
cross-site-sharded passed 00:19:11
custom-replset-name passed 00:10:42
custom-tls passed 00:14:43
custom-users-roles passed 00:10:41
custom-users-roles-sharded passed 00:11:53
data-at-rest-encryption passed 00:13:23
data-sharded passed 00:23:38
demand-backup passed 00:17:06
demand-backup-eks-credentials-irsa passed 00:00:09
demand-backup-fs passed 00:24:03
demand-backup-if-unhealthy passed 00:08:29
demand-backup-incremental-aws passed 00:12:00
demand-backup-incremental-azure passed 00:11:58
demand-backup-incremental-gcp-native passed 00:12:03
demand-backup-incremental-gcp-s3 passed 00:11:16
demand-backup-incremental-minio passed 00:26:26
demand-backup-incremental-sharded-aws passed 00:18:30
demand-backup-incremental-sharded-azure passed 00:19:06
demand-backup-incremental-sharded-gcp-native passed 00:17:53
demand-backup-incremental-sharded-gcp-s3 passed 00:19:33
demand-backup-incremental-sharded-minio passed 00:28:02
demand-backup-logical-minio-native-tls passed 00:09:01
demand-backup-physical-parallel passed 00:08:58
demand-backup-physical-aws passed 00:12:24
demand-backup-physical-azure passed 00:11:59
demand-backup-physical-gcp-s3 passed 00:12:16
demand-backup-physical-gcp-native passed 00:12:06
demand-backup-physical-minio passed 00:20:48
demand-backup-physical-minio-native passed 00:26:28
demand-backup-physical-minio-native-tls passed 00:20:10
demand-backup-physical-sharded-parallel passed 00:11:16
demand-backup-physical-sharded-aws passed 00:19:15
demand-backup-physical-sharded-azure passed 00:17:55
demand-backup-physical-sharded-gcp-native passed 00:18:01
demand-backup-physical-sharded-minio passed 00:18:08
demand-backup-physical-sharded-minio-native passed 00:17:48
demand-backup-sharded passed 00:27:04
demand-backup-snapshot passed 00:39:13
demand-backup-snapshot-vault passed 00:18:26
disabled-auth passed 00:17:25
expose-sharded passed 00:34:21
finalizer passed 00:10:20
ignore-labels-annotations passed 00:08:14
init-deploy passed 00:12:52
ldap passed 00:09:09
ldap-tls passed 00:12:35
limits passed 00:06:21
liveness passed 00:09:17
mongod-major-upgrade passed 00:13:39
mongod-major-upgrade-sharded passed 00:21:15
monitoring-2-0 passed 00:25:29
monitoring-pmm3 passed 00:29:06
multi-cluster-service passed 00:13:39
multi-storage passed 00:19:42
non-voting-and-hidden passed 00:17:03
one-pod passed 00:08:31
operator-self-healing-chaos passed 00:12:56
pitr passed 00:33:47
pitr-physical passed 01:00:20
pitr-sharded passed 00:21:36
pitr-to-new-cluster passed 00:25:59
pitr-physical-backup-source passed 00:54:03
preinit-updates passed 00:05:12
pvc-auto-resize passed 00:13:06
pvc-resize passed 00:17:23
recover-no-primary passed 00:27:51
replset-overrides passed 00:18:44
replset-remapping passed 00:16:34
replset-remapping-sharded passed 00:18:00
rs-shard-migration passed 00:14:54
scaling passed 00:11:20
scheduled-backup passed 00:18:55
security-context passed 00:07:17
self-healing-chaos passed 00:15:37
service-per-pod passed 00:18:51
serviceless-external-nodes passed 00:07:24
smart-update passed 00:08:13
split-horizon passed 00:14:12
split-horizon-manual-tls passed 00:12:16
stable-resource-version passed 00:04:44
storage passed 00:08:03
tls-issue-cert-manager passed 00:29:38
unsafe-psa passed 00:08:35
upgrade passed 00:10:07
upgrade-consistency passed 00:06:49
upgrade-consistency-sharded-tls passed 00:53:46
upgrade-sharded passed 00:19:46
upgrade-partial-backup passed 00:15:28
users passed 00:17:41
users-vault passed 00:13:28
version-service passed 00:25:21
Summary Value
Tests Run 93/93
Job Duration 03:50:22
Total Test Time 26:59:47

commit: 1aff7fd
image: perconalab/percona-server-mongodb-operator:PR-2277-1aff7fd55

@hors hors merged commit 2c36dca into percona:main Jun 8, 2026
14 checks passed
hors commented Jun 8, 2026

Collaborator

@myJamong thank you for the contribution

7 participants