fix: reuse existing cluster secrets when scaling Talos nodes during update #4586
…pdate

When `cluster update` adds new nodes to an existing Talos cluster, `ConfigManager.Load()` generates fresh CA certificates, tokens, and bootstrap secrets. New nodes receive configs with mismatched PKI and cannot join the cluster.

Fix by adding:
- `Configs.WithSecrets()`: regenerates the config bundle using a provided secrets bundle, preserving the cluster's existing PKI (follows the existing `WithEndpoint` pattern)
- `Configs.ExtractSecrets()`: extracts the secrets bundle from a loaded config for reuse
- `Provisioner.syncSecretsFromCluster()`: connects to a running control-plane node via the saved talosconfig, fetches its machine config via COSI state, extracts the secrets bundle, and rebuilds the in-memory `talosConfigs` with those secrets

The sync is called in `Update()` before any node scaling or in-place config changes, ensuring all applied configs use the same CA and tokens as the running cluster. The method is a no-op for nil `talosConfigs` (tests) and Omni-managed clusters (which handle their own config).

Fixes #4584

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
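The flow described in this commit message can be sketched in isolation. The following is a minimal illustration only, not the code from this PR: `SecretsBundle`, `Configs`, `fetchFn`, and `syncSecrets` are hypothetical stand-ins, the real `WithSecrets` and `ExtractSecrets` signatures may differ (for example by returning an error), and the Talos API call is replaced by an injected function.

```go
package main

import "fmt"

// SecretsBundle is a hypothetical stand-in for the Talos secrets bundle
// (CA material, tokens, bootstrap secrets).
type SecretsBundle struct {
	CACert string
	Token  string
}

// Configs is a hypothetical stand-in for the in-memory Talos config bundle.
type Configs struct {
	Endpoint string
	secrets  *SecretsBundle
}

// WithSecrets returns a copy of the configs rebuilt around the given
// secrets. A nil bundle is treated as a no-op.
func (c *Configs) WithSecrets(b *SecretsBundle) *Configs {
	if b == nil {
		return c
	}
	return &Configs{Endpoint: c.Endpoint, secrets: b}
}

// ExtractSecrets returns the secrets bundle carried by the configs.
func (c *Configs) ExtractSecrets() *SecretsBundle { return c.secrets }

// fetchFn is an assumed hook that reads the live machine config from a
// running control-plane node.
type fetchFn func() (*Configs, error)

// syncSecrets swaps freshly generated secrets for the running cluster's
// secrets. It is a no-op for nil configs and for Omni-managed clusters.
func syncSecrets(local *Configs, omni bool, fetch fetchFn) (*Configs, error) {
	if local == nil || omni {
		return local, nil
	}
	live, err := fetch()
	if err != nil {
		return nil, fmt.Errorf("fetch live machine config: %w", err)
	}
	return local.WithSecrets(live.ExtractSecrets()), nil
}

func main() {
	local := &Configs{
		Endpoint: "https://10.0.0.10:6443",
		secrets:  &SecretsBundle{CACert: "freshly-generated-ca", Token: "fresh"},
	}
	fetch := func() (*Configs, error) {
		return &Configs{secrets: &SecretsBundle{CACert: "running-cluster-ca", Token: "running"}}, nil
	}
	synced, err := syncSecrets(local, false, fetch)
	if err != nil {
		fmt.Println("sync failed:", err)
		return
	}
	fmt.Println(synced.ExtractSecrets().CACert)
}
```

Injecting the fetch step keeps the no-op branches testable without a live cluster, which matches the nil-config and Omni test cases the PR describes.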
✅ MegaLinter analysis: Success
✅ Linters with no issues: actionlint, git_diff, hadolint, jscpd, jsonlint, lychee, markdown-table-formatter, markdownlint, prettier, prettier, stylelint, syft, trivy-sbom, trufflehog, v8r, v8r, yamllint
See detailed reports in MegaLinter artifacts.
Pull request overview
This PR aims to fix Talos cluster updates so scale-up operations reuse the running cluster’s existing PKI and bootstrap secrets instead of regenerating fresh ones. It fits into the Talos provisioner/config-manager path by rebuilding in-memory machine configs from live cluster state before update actions run.
Changes:
- Added `Configs.WithSecrets` and `Configs.ExtractSecrets` to regenerate Talos config bundles while preserving existing secret material.
- Added `syncSecretsFromCluster()` to read a live control-plane machine config and rebuild `talosConfigs` before update operations.
- Added unit tests for the new config helpers and the no-op branches of secret sync.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pkg/svc/provisioner/cluster/talos/update.go` | Adds live secret-sync before Talos update workflows. |
| `pkg/svc/provisioner/cluster/talos/update_test.go` | Adds tests for nil-config and Omni skip behavior. |
| `pkg/svc/provisioner/cluster/talos/export_test.go` | Exposes the new helper for black-box tests. |
| `pkg/fsutil/configmanager/talos/configs.go` | Adds secret extraction/regeneration helpers to Talos configs. |
| `pkg/fsutil/configmanager/talos/configs_test.go` | Adds tests for `WithSecrets` and `ExtractSecrets`. |
…ng CP

- Add `needsSecretSync()` helper that returns true only when the update will push machine configs (scale-up, in-place changes, or reboot changes)
- Gate the `syncSecretsFromCluster()` call behind `needsSecretSync()` to avoid unnecessary Talos API calls on no-op updates or pure scale-down
- Change 'no CP node found' from silent fallback to error (fail closed) to prevent PKI mismatch when secrets are actually needed
- Add unit tests for `needsSecretSync`: scale-up, no-changes, scale-down, nil-configs, and Omni-skipped scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
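The gating rule in this commit can be written as a small predicate. This is a sketch under assumptions, not the PR's implementation: the `updatePlan` struct and its fields are hypothetical names for whatever the provisioner uses to describe a pending update.

```go
package main

import "fmt"

// updatePlan is a hypothetical summary of what an update intends to do.
type updatePlan struct {
	ScaleUp        int // nodes to add
	ScaleDown      int // nodes to remove
	InPlaceChanges int // config changes applied without reboot
	RebootChanges  int // config changes that require a reboot
}

// needsSecretSync reports whether the update will push machine configs
// and therefore needs the running cluster's secrets. Nil configs and
// Omni-managed clusters never need a sync in this sketch.
func needsSecretSync(hasConfigs, omni bool, p updatePlan) bool {
	if !hasConfigs || omni {
		return false
	}
	return p.ScaleUp > 0 || p.InPlaceChanges > 0 || p.RebootChanges > 0
}

func main() {
	cases := []struct {
		name       string
		hasConfigs bool
		omni       bool
		plan       updatePlan
	}{
		{"scale-up", true, false, updatePlan{ScaleUp: 2}},
		{"no changes", true, false, updatePlan{}},
		{"scale-down only", true, false, updatePlan{ScaleDown: 1}},
		{"nil configs", false, false, updatePlan{ScaleUp: 1}},
		{"omni", true, true, updatePlan{ScaleUp: 1}},
	}
	for _, c := range cases {
		fmt.Printf("%s: %v\n", c.name, needsSecretSync(c.hasConfigs, c.omni, c.plan))
	}
}
```

Pure scale-down returns false because removing nodes pushes no new machine config, so no Talos API round trip is needed.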
- Use static `ErrNoControlPlaneForSecretSync` error (err113)
- Rename `mc` to `machineConfig` for clarity (varnamelen)
- Move `needsSecretSync` check inside `syncSecretsFromCluster` to reduce `Update()` cyclomatic complexity from 12 to 11 (cyclop)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
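A static sentinel error of the kind err113 asks for, combined with the fail-closed behavior from the previous commit, might look like the sketch below. Only the name `ErrNoControlPlaneForSecretSync` comes from this PR; the error message, the `node` type, and the helper functions are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoControlPlaneForSecretSync is a package-level sentinel error.
// The message text here is an assumption.
var ErrNoControlPlaneForSecretSync = errors.New("no control-plane node available for secret sync")

// node is a hypothetical description of a cluster node.
type node struct {
	Name         string
	ControlPlane bool
	Running      bool
}

// pickControlPlane returns the first running control-plane node and
// fails closed with the sentinel error when none exists.
func pickControlPlane(nodes []node) (node, error) {
	for _, n := range nodes {
		if n.ControlPlane && n.Running {
			return n, nil
		}
	}
	return node{}, ErrNoControlPlaneForSecretSync
}

// syncFrom wraps the sentinel with context so callers can still match
// it with errors.Is.
func syncFrom(nodes []node) (string, error) {
	cp, err := pickControlPlane(nodes)
	if err != nil {
		return "", fmt.Errorf("sync secrets from cluster: %w", err)
	}
	return cp.Name, nil
}

// isNoControlPlane reports whether err is, or wraps, the sentinel.
func isNoControlPlane(err error) bool {
	return errors.Is(err, ErrNoControlPlaneForSecretSync)
}

func main() {
	_, err := syncFrom([]node{{Name: "worker-1", Running: true}})
	fmt.Println(isNoControlPlane(err), err)
}
```

Returning an error instead of silently falling back means an update that needs secrets stops before it can push configs signed by a mismatched CA.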
Extract `applyRebootChangesIfNeeded` helper to move the `HasRebootRequired && RollingReboot` guard out of `Update()`, reducing McCabe complexity from 11 to 9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
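The refactor described here follows a common pattern: a compound guard moves into a helper so the caller keeps a single unconditional call. A minimal sketch, in which `changeSet`, `updateOptions`, and the injected `apply` function are hypothetical and only the helper name and guard condition come from the commit message:

```go
package main

import "fmt"

// changeSet is a hypothetical list of nodes whose changes need a reboot.
type changeSet struct {
	RebootRequired []string
}

// HasRebootRequired reports whether any change needs a reboot.
func (c changeSet) HasRebootRequired() bool { return len(c.RebootRequired) > 0 }

// updateOptions is a hypothetical options struct.
type updateOptions struct {
	RollingReboot bool
}

// applyRebootChangesIfNeeded owns the guard that previously sat in the
// caller: it applies reboot changes only when some exist and rolling
// reboots are enabled, and is otherwise a no-op.
func applyRebootChangesIfNeeded(c changeSet, o updateOptions, apply func([]string) error) error {
	if !c.HasRebootRequired() || !o.RollingReboot {
		return nil
	}
	return apply(c.RebootRequired)
}

func main() {
	apply := func(nodes []string) error {
		fmt.Println("rebooting:", nodes)
		return nil
	}
	pending := changeSet{RebootRequired: []string{"cp-1", "worker-1"}}
	_ = applyRebootChangesIfNeeded(pending, updateOptions{RollingReboot: false}, apply)
	_ = applyRebootChangesIfNeeded(pending, updateOptions{RollingReboot: true}, apply)
}
```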

Summary
When `cluster update` adds new nodes to an existing Talos cluster, `ConfigManager.Load()` generates fresh CA certificates, tokens, and bootstrap secrets. New nodes receive configs with mismatched PKI and cannot join the cluster ("certificate signed by unknown authority").

Changes
`pkg/fsutil/configmanager/talos/configs.go`
- `WithSecrets(*secrets.Bundle)`: regenerates the config bundle using a provided secrets bundle, preserving the cluster's existing PKI. Follows the existing `WithEndpoint` pattern.
- `ExtractSecrets()`: extracts the `*secrets.Bundle` from a loaded config for reuse.

`pkg/svc/provisioner/cluster/talos/update.go`
- `syncSecretsFromCluster()`: connects to a running control-plane node via the saved talosconfig, fetches its machine config via the COSI state API, extracts the secrets bundle, and rebuilds the in-memory `talosConfigs` with those secrets.
- Called in `Update()` before `applyNodeScalingChanges()` and `applyInPlaceConfigChanges()`, ensuring all applied configs use the same CA and tokens as the running cluster.
- No-op for nil `talosConfigs` (tests) and Omni-managed clusters.

Tests
- `TestConfigs_WithSecrets`: verifies PKI preservation across regeneration and nil no-op.
- `TestConfigs_ExtractSecrets`: verifies secret extraction from loaded configs.
- `TestSyncSecretsFromCluster_NilTalosConfigs`: verifies no-op for nil configs.
- `TestSyncSecretsFromCluster_OmniSkipped`: verifies no-op for Omni clusters.

Fixes #4584