fix: reuse existing cluster secrets when scaling Talos nodes during update#4586

Merged
ksail-bot[bot] merged 5 commits into main from devantler/fix-reuse-cluster-secrets on May 4, 2026

Conversation

@devantler

Contributor

Summary

When `cluster update` adds new nodes to an existing Talos cluster, ConfigManager.Load() generates fresh CA certificates, tokens, and bootstrap secrets instead of reusing the cluster's existing ones. The new nodes therefore receive configs with mismatched PKI and cannot join the cluster ("certificate signed by unknown authority").

Changes

pkg/fsutil/configmanager/talos/configs.go

  • WithSecrets(*secrets.Bundle) — regenerates the config bundle using a provided secrets bundle, preserving the cluster's existing PKI. Follows the existing WithEndpoint pattern.
  • ExtractSecrets() — extracts the *secrets.Bundle from a loaded config for reuse.
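The two helpers can be pictured with the following stdlib-only sketch. It is not the PR's implementation: `Bundle`, `MachineConfig`, `Configs`, and `generate` are hypothetical stand-ins (the real code uses Talos' `*secrets.Bundle` and its config generator), and the single-return signature of `WithSecrets` is an assumption. What it shows is the idea: configs rendered from one bundle share PKI, `ExtractSecrets` hands that bundle back, and `WithSecrets` rebuilds a fresh set of configs around it, returning the receiver unchanged for a nil bundle.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// Bundle is a hypothetical stand-in for Talos' *secrets.Bundle: the shared
// material (CA, bootstrap token) every machine config in a cluster must agree on.
type Bundle struct {
	CA    string
	Token string
}

// newBundle mimics a fresh Load(): it mints brand-new, random secrets.
func newBundle() *Bundle {
	buf := make([]byte, 8)
	_, _ = rand.Read(buf)
	id := hex.EncodeToString(buf)
	return &Bundle{CA: "ca-" + id, Token: "token-" + id}
}

// MachineConfig is a drastically simplified, hypothetical machine config.
type MachineConfig struct {
	Role, Endpoint, CA, Token string
}

// Configs is a hypothetical stand-in for the config manager's Configs type.
type Configs struct {
	ClusterName  string
	Endpoint     string
	ControlPlane MachineConfig
	Worker       MachineConfig
	secrets      *Bundle
}

// generate renders both roles from one bundle, so they always share PKI.
func generate(name, endpoint string, b *Bundle) *Configs {
	render := func(role string) MachineConfig {
		return MachineConfig{Role: role, Endpoint: endpoint, CA: b.CA, Token: b.Token}
	}
	return &Configs{
		ClusterName:  name,
		Endpoint:     endpoint,
		ControlPlane: render("controlplane"),
		Worker:       render("worker"),
		secrets:      b,
	}
}

// WithSecrets regenerates the configs from the given bundle, keeping the
// cluster name and endpoint (a WithEndpoint-style rebuild). Nil is a no-op.
func (c *Configs) WithSecrets(b *Bundle) *Configs {
	if b == nil {
		return c
	}
	return generate(c.ClusterName, c.Endpoint, b)
}

// ExtractSecrets returns the bundle the configs were generated from.
func (c *Configs) ExtractSecrets() *Bundle {
	return c.secrets
}

func main() {
	running := generate("demo", "https://10.0.0.10:6443", newBundle())
	fresh := generate("demo", "https://10.0.0.10:6443", newBundle())
	reused := fresh.WithSecrets(running.ExtractSecrets())
	fmt.Println("fresh worker matches cluster CA:", fresh.Worker.CA == running.ControlPlane.CA)
	fmt.Println("reused worker matches cluster CA:", reused.Worker.CA == running.ControlPlane.CA)
}
```

Returning a new value instead of mutating the receiver keeps the original configs usable if regeneration is abandoned, which is one plausible reason to follow a `WithEndpoint`-style pattern.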

pkg/svc/provisioner/cluster/talos/update.go

  • syncSecretsFromCluster() — connects to a running control-plane node via the saved talosconfig, fetches its machine config via COSI state API, extracts the secrets bundle, and rebuilds the in-memory talosConfigs with those secrets.
  • Wired into Update() before applyNodeScalingChanges() and applyInPlaceConfigChanges(), ensuring all applied configs use the same CA and tokens as the running cluster.
  • No-op for nil talosConfigs (tests) and Omni-managed clusters.

Tests

  • TestConfigs_WithSecrets — verifies PKI preservation across regeneration and nil no-op.
  • TestConfigs_ExtractSecrets — verifies secret extraction from loaded configs.
  • TestSyncSecretsFromCluster_NilTalosConfigs — verifies no-op for nil configs.
  • TestSyncSecretsFromCluster_OmniSkipped — verifies no-op for Omni clusters.

Fixes #4584

…pdate

When `cluster update` adds new nodes to an existing Talos cluster,
ConfigManager.Load() generates fresh CA certificates, tokens, and
bootstrap secrets. New nodes receive configs with mismatched PKI and
cannot join the cluster.

Fix by adding:
- Configs.WithSecrets() — regenerates the config bundle using a provided
  secrets bundle, preserving the cluster's existing PKI (follows the
  existing WithEndpoint pattern)
- Configs.ExtractSecrets() — extracts the secrets bundle from a loaded
  config for reuse
- Provisioner.syncSecretsFromCluster() — connects to a running
  control-plane node via the saved talosconfig, fetches its machine
  config via COSI state, extracts the secrets bundle, and rebuilds
  the in-memory talosConfigs with those secrets

The sync is called in Update() before any node scaling or in-place config
changes, ensuring all applied configs use the same CA and tokens as the
running cluster. The method is a no-op for nil talosConfigs (tests) and
Omni-managed clusters (which handle their own config).

Fixes #4584

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 4, 2026 18:58
@github-project-automation github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 4, 2026
@github-actions

github-actions Bot commented May 4, 2026

Contributor

MegaLinter analysis: Success

✅ Linters with no issues

actionlint, git_diff, hadolint, jscpd, jsonlint, lychee, markdown-table-formatter, markdownlint, prettier, prettier, stylelint, syft, trivy-sbom, trufflehog, v8r, v8r, yamllint

See detailed reports in MegaLinter artifacts


@devantler devantler marked this pull request as ready for review May 4, 2026 19:02

Copilot AI left a comment

Contributor


Pull request overview

This PR aims to fix Talos cluster updates so scale-up operations reuse the running cluster’s existing PKI and bootstrap secrets instead of regenerating fresh ones. It fits into the Talos provisioner/config-manager path by rebuilding in-memory machine configs from live cluster state before update actions run.

Changes:

  • Added Configs.WithSecrets and Configs.ExtractSecrets to regenerate Talos config bundles while preserving existing secret material.
  • Added syncSecretsFromCluster() to read a live control-plane machine config and rebuild talosConfigs before update operations.
  • Added unit tests for the new config helpers and the no-op branches of secret sync.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • pkg/svc/provisioner/cluster/talos/update.go: Adds live secret-sync before Talos update workflows.
  • pkg/svc/provisioner/cluster/talos/update_test.go: Adds tests for nil-config and Omni skip behavior.
  • pkg/svc/provisioner/cluster/talos/export_test.go: Exposes the new helper for black-box tests.
  • pkg/fsutil/configmanager/talos/configs.go: Adds secret extraction/regeneration helpers to Talos configs.
  • pkg/fsutil/configmanager/talos/configs_test.go: Adds tests for WithSecrets and ExtractSecrets.

Comment thread pkg/svc/provisioner/cluster/talos/update.go
Comment thread pkg/svc/provisioner/cluster/talos/update.go
Comment thread pkg/svc/provisioner/cluster/talos/update.go Outdated
@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 47.43590% with 41 lines in your changes missing coverage. Please review.

Files with missing lines:

  • pkg/svc/provisioner/cluster/talos/update.go: 28.84% patch coverage, 35 lines missing and 2 partials ⚠️
  • pkg/fsutil/configmanager/talos/configs.go: 84.61% patch coverage, 2 lines missing and 2 partials ⚠️


…ng CP

- Add needsSecretSync() helper that returns true only when the update
  will push machine configs (scale-up, in-place changes, or reboot changes)
- Gate syncSecretsFromCluster() call behind needsSecretSync() to avoid
  unnecessary Talos API calls on no-op updates or pure scale-down
- Change 'no CP node found' from silent fallback to error (fail closed)
  to prevent PKI mismatch when secrets are actually needed
- Add unit tests for needsSecretSync: scale-up, no-changes, scale-down,
  nil-configs, and Omni-skipped scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
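The gate introduced in this commit reduces to a simple predicate. The sketch below assumes a hypothetical `UpdateDiff` summary type; the PR's real inputs to `needsSecretSync()` are not shown here. It encodes the rule from the commit message: sync only when the update will push machine configs (scale-up, in-place changes, or reboot changes), and skip it for no-op updates and pure scale-down.

```go
package main

import "fmt"

// UpdateDiff is a hypothetical summary of what a planned update will change.
type UpdateDiff struct {
	ControlPlanesAdded   int
	WorkersAdded         int
	ControlPlanesRemoved int
	WorkersRemoved       int
	InPlaceChanges       bool // config changes applied without a reboot
	RebootChanges        bool // config changes that require a reboot
}

// needsSecretSync reports whether the update will push machine configs to
// nodes. Removals alone push nothing, so pure scale-down skips the sync.
func needsSecretSync(d UpdateDiff) bool {
	scaleUp := d.ControlPlanesAdded > 0 || d.WorkersAdded > 0
	return scaleUp || d.InPlaceChanges || d.RebootChanges
}

func main() {
	cases := map[string]UpdateDiff{
		"scale-up":   {WorkersAdded: 1},
		"no-changes": {},
		"scale-down": {WorkersRemoved: 2},
		"reboot":     {RebootChanges: true},
	}
	for _, name := range []string{"scale-up", "no-changes", "scale-down", "reboot"} {
		fmt.Printf("%s: %v\n", name, needsSecretSync(cases[name]))
	}
}
```

Skipping the sync when nothing is pushed avoids a Talos API round trip on updates that cannot introduce a PKI mismatch in the first place.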
@ksail-bot ksail-bot Bot enabled auto-merge May 4, 2026 19:15
- Use static ErrNoControlPlaneForSecretSync error (err113)
- Rename mc to machineConfig for clarity (varnamelen)
- Move needsSecretSync check inside syncSecretsFromCluster to
  reduce Update() cyclomatic complexity from 12 to 11 (cyclop)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 4, 2026 19:30

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment thread pkg/fsutil/configmanager/talos/configs.go
Extract applyRebootChangesIfNeeded helper to move the
HasRebootRequired && RollingReboot guard out of Update(),
reducing McCabe complexity from 11 to 9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 4, 2026 20:13
@devantler devantler removed the request for review from Copilot May 4, 2026 20:13
@ksail-bot ksail-bot Bot added this pull request to the merge queue May 4, 2026
Merged via the queue into main with commit 5e69458 May 4, 2026
71 checks passed
@ksail-bot ksail-bot Bot deleted the devantler/fix-reuse-cluster-secrets branch May 4, 2026 21:44
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 4, 2026

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

cluster update generates new CA/token for added worker nodes instead of reusing existing cluster secrets

3 participants