Copilot AI commented Jan 22, 2026

When a TLS certificate expired during node version checks, the controller treated the resulting connection failure as "node doesn't need upgrade" instead of propagating the error. As a result, upgrades completed prematurely with 0 nodes upgraded.

Changes

Error handling fix:

  • Changed nodeNeedsUpgrade() from returning bool to (bool, error) to distinguish between "doesn't need upgrade" and "cannot determine"
  • Refactored findNextNode() from slices.IndexFunc to explicit loop to propagate errors
  • Controller now returns error on certificate failures, triggering retry via existing refreshTalosClient() logic

// Before: swallowed errors, returned false
func (r *TalosUpgradeReconciler) nodeNeedsUpgrade(...) bool {
    if err := r.TalosClient.GetNodeVersion(...); err != nil {
        return false // Treated as "no upgrade needed"
    }
}

// After: propagates errors
func (r *TalosUpgradeReconciler) nodeNeedsUpgrade(...) (bool, error) {
    if err := r.TalosClient.GetNodeVersion(...); err != nil {
        return false, fmt.Errorf("failed to get node version: %w", err)
    }
}
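
The move from slices.IndexFunc to an explicit loop matters because an IndexFunc predicate can only return a bool, so a failed lookup has nowhere to go. A minimal sketch of the loop shape follows, assuming hypothetical stand-ins throughout: the Node type, versionLookup, staticLookup, errNoNode, and the function signatures are illustrative and are not taken from the tuppr source.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for the controller's node type.
type Node struct {
	Name string
	IP   string
}

// versionLookup is a hypothetical stand-in for the client call that
// reports which version a node is currently running.
type versionLookup func(n Node) (string, error)

// errNoNode is a hypothetical sentinel meaning "every node is up to date".
var errNoNode = errors.New("no node needs upgrade")

// staticLookup is a test double: nodes missing from the map are treated
// as unreachable and produce an error.
func staticLookup(versions map[string]string) versionLookup {
	return func(n Node) (string, error) {
		v, ok := versions[n.Name]
		if !ok {
			return "", fmt.Errorf("node %s unreachable", n.Name)
		}
		return v, nil
	}
}

// nodeNeedsUpgrade reports whether the node runs a version other than
// target. A lookup failure is returned as an error, never as "false".
func nodeNeedsUpgrade(lookup versionLookup, n Node, target string) (bool, error) {
	current, err := lookup(n)
	if err != nil {
		return false, fmt.Errorf("failed to get node version from %s: %w", n.IP, err)
	}
	return current != target, nil
}

// findNextNode walks the nodes in order and stops at the first error,
// so an undeterminable node cannot be mistaken for an up-to-date one.
func findNextNode(lookup versionLookup, nodes []Node, target string) (Node, error) {
	for _, n := range nodes {
		needs, err := nodeNeedsUpgrade(lookup, n, target)
		if err != nil {
			return Node{}, err
		}
		if needs {
			return n, nil
		}
	}
	return Node{}, errNoNode
}
```

With this shape the caller can tell three outcomes apart: a node to upgrade, errNoNode when all nodes are current, and any other error when the answer could not be determined.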

Behavior change:

  • Before: Certificate error → nodes skipped → upgrade completes (0 nodes)
  • After: Certificate error → retry triggered → certificate refresh attempted
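
The description does not show how the retry reaches refreshTalosClient(), so the following is only one possible shape of a "refresh, then retry once" step, not the controller's actual code. The names isCertificateError, checkWithRefresh, and errExpiredCert are hypothetical, and matching on the message text from the issue log is an assumption made to keep the sketch dependency-free; real code would more likely inspect the gRPC status or the underlying TLS error.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errExpiredCert is a hypothetical sample error mirroring the message
// seen in the issue log.
var errExpiredCert = errors.New("remote error: tls: expired certificate")

// isCertificateError is an assumed classifier based on message text.
func isCertificateError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "tls: expired certificate")
}

// checkWithRefresh runs check once. If it fails with a certificate
// error, it calls refresh and runs check a second time. Any other error,
// or a failed refresh, is returned to the caller unchanged or wrapped.
func checkWithRefresh(check func() (bool, error), refresh func() error) (bool, error) {
	needs, err := check()
	if !isCertificateError(err) {
		return needs, err
	}
	if rerr := refresh(); rerr != nil {
		return false, fmt.Errorf("refresh client after %v: %w", err, rerr)
	}
	return check()
}
```

The key property is that a certificate failure never collapses into "no upgrade needed": the caller either gets a definite answer after the refresh or an error it can retry on.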
Original prompt

This section details the original issue you should resolve.

<issue_title>Failed to get current version from Talos client</issue_title>
<issue_description>Since this has now happened a second time (1.11.6 --> 1.12.0 --> 1.12.1), I thought it would be worth reporting here, at least so others can track it, even if it is just a configuration issue on my end.

1/5/2026 3:32:07 PM | 2026-01-05T14:32:07Z	DEBUG	Starting upgrade processing	{"controller": "talosupgrade", "controllerGroup": "tuppr.home-operations.com", "controllerKind": "TalosUpgrade", "TalosUpgrade": {"name":"talos"}, "namespace": "", "name": "talos", "reconcileID": "2d38f146-15dc-4d22-b981-959c8956d8d5", "talosupgrade": "talos", "generation": 2}
1/5/2026 3:32:07 PM | 2026-01-05T14:32:07Z	DEBUG	Finding next node to upgrade	{"controller": "talosupgrade", "controllerGroup": "tuppr.home-operations.com", "controllerKind": "TalosUpgrade", "TalosUpgrade": {"name":"talos"}, "namespace": "", "name": "talos", "reconcileID": "2d38f146-15dc-4d22-b981-959c8956d8d5", "talosupgrade": "talos"}
1/5/2026 3:32:07 PM | 2026-01-05T14:32:07Z	DEBUG	Creating new Talos client	{"controller": "talosupgrade", "controllerGroup": "tuppr.home-operations.com", "controllerKind": "TalosUpgrade", "TalosUpgrade": {"name":"talos"}, "namespace": "", "name": "talos", "reconcileID": "2d38f146-15dc-4d22-b981-959c8956d8d5"}
1/5/2026 3:32:07 PM | 2026-01-05T14:32:07Z	DEBUG	Successfully created Talos client	{"controller": "talosupgrade", "controllerGroup": "tuppr.home-operations.com", "controllerKind": "TalosUpgrade", "TalosUpgrade": {"name":"talos"}, "namespace": "", "name": "talos", "reconcileID": "2d38f146-15dc-4d22-b981-959c8956d8d5"}
1/5/2026 3:32:07 PM | 2026-01-05T14:32:07Z	ERROR	Failed to get current version from Talos client, cannot determine upgrade need	{"controller": "talosupgrade", "controllerGroup": "tuppr.home-operations.com", "controllerKind": "TalosUpgrade", "TalosUpgrade": {"name":"talos"}, "namespace": "", "name": "talos", "reconcileID": "2d38f146-15dc-4d22-b981-959c8956d8d5", "node": "k8s-control-1", "nodeIP": "10.0.0.48", "error": "failed to get node version from 10.0.0.48: 1 error(s) occurred:\n\trpc error: code = Unavailable desc = connection error: desc = \"error reading server preface: remote error: tls: expired certificate\""}
1/5/2026 3:32:07 PM | github.com/home-operations/tuppr/internal/controller.(*TalosUpgradeReconciler).nodeNeedsUpgrade
1/5/2026 3:32:07 PM | 	/workspace/internal/controller/talosupgrade_controller.go:640
1/5/2026 3:32:07 PM | github.com/home-operations/tuppr/internal/controller.(*TalosUpgradeReconciler).findNextNode.func1
1/5/2026 3:32:07 PM | 	/workspace/internal/controller/talosupgrade_controller.go:600
1/5/2026 3:32:07 PM | slices.IndexFunc[...]
1/5/2026 3:32:07 PM | 	/usr/local/go/src/slices/slices.go:109
1/5/2026 3:32:07 PM | github.com/home-operations/tuppr/internal/controller.(*TalosUpgradeReconciler).findNextNode
1/5/2026 3:32:07 PM | 	/workspace/internal/controller/talosupgrade_controller.go:586
1/5/2026 3:32:07 PM | github.com/home-operations/tuppr/internal/controller.(*TalosUpgradeReconciler).processNextNode
1/5/2026 3:32:07 PM | 	/workspace/internal/controller/talosupgrade_controller.go:265
1/5/2026 3:32:07 PM | github.com/home-operations/tuppr/internal/controller.(*TalosUpgradeReconciler).processUpgrade
1/5/2026 3:32:07 PM | 	/workspace/internal/controller/talosupgrade_controller.go:152
1/5/2026 3:32:07 PM | github.com/home-operations/tuppr/internal/controller.(*TalosUpgradeReconciler).Reconcile
1/5/2026 3:32:07 PM | 	/workspace/internal/controller/talosupgrade_controller.go:83
1/5/2026 3:32:07 PM | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
1/5/2026 3:32:07 PM | 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:216
1/5/2026 3:32:07 PM | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
1/5/2026 3:32:07 PM | 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:461
1/5/2026 3:32:07 PM | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
1/5/2026 3:32:07 PM | 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:421
1/5/2026 3:32:07 PM | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1
1/5/2026 3:32:07 PM | 	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.22.4/pkg/internal/controller/controller.go:296
1/5/2026 3:32:07 PM | 2026-01-05T14:32:07Z	DEBUG	No nodes need upgrade	{"controller": "talosupgrade", "controllerGroup": "tuppr.home-operations.com", "controllerKind": "TalosUpgrade", "TalosUpgrade": {"name":"talos"}, "namespace": "", "name": "talos", "reconcileID": "2d38f146-15dc-4d22-b981-959c8956d8d5"}
1/5/2026 3:32:07 PM | 2026-01-05T14:32:07Z	INFO	Upgrade completed successfully	{"controller": "talosupgrade", "controllerGroup": "tupp...

</details>




- Fixes home-operations/tuppr#65


- Change nodeNeedsUpgrade to return (bool, error) instead of just bool
- Propagate errors when unable to determine upgrade need (e.g., cert errors)
- Refactor findNextNode to handle errors properly
- Controller will now retry on certificate errors instead of completing upgrade prematurely

Co-authored-by: onedr0p <213795+onedr0p@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix failure to get current version from Talos client" to "Fix certificate expiration causing premature upgrade completion" Jan 22, 2026
Copilot AI requested a review from onedr0p January 22, 2026 17:20