
Native SSH Infrastructure Provider for CAPRKE2 (airgap-compatible, RKE2-aware lifecycle) #873

@CYBERBUGJR

Describe the solution you'd like:
A native SSH infrastructure provider for CAPRKE2 that bootstraps RKE2 on pre-provisioned machines (bare metal or VMs) via SSH, without requiring a cloud-specific provider that creates machines dynamically.

The provider would manage the full machine lifecycle:

  • SSH into a pre-existing node,
  • Deliver the CAPRKE2-generated bootstrap payload,
  • Monitor readiness,
  • Cleanly uninstall RKE2 on deletion.

A pool of pre-registered machines (provisioned by existing tooling like Ansible or Puppet) would be claimed on demand by CAPI.
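To make the "claimed on demand" idea concrete, here is a minimal Go sketch of pool bookkeeping. Everything in it is hypothetical: `PooledMachine`, `Claim`, `Release` and the in-memory slice are illustration only, not an existing CAPRKE2 or k0smotron API. A real provider would persist the claim on a Kubernetes object rather than in memory.

```go
package main

import (
	"errors"
	"fmt"
)

// PooledMachine is a hypothetical record for one pre-registered host.
type PooledMachine struct {
	Address   string
	ClaimedBy string // name of the owning Machine; empty when free
}

// ErrPoolExhausted is returned when every machine is already claimed.
var ErrPoolExhausted = errors.New("no free machine in pool")

// Claim assigns the first free machine to owner and returns it.
// It is idempotent: a second call with the same owner returns the
// machine that owner already holds, since reconcile loops may repeat.
func Claim(pool []PooledMachine, owner string) (*PooledMachine, error) {
	for i := range pool {
		if pool[i].ClaimedBy == owner {
			return &pool[i], nil
		}
	}
	for i := range pool {
		if pool[i].ClaimedBy == "" {
			pool[i].ClaimedBy = owner
			return &pool[i], nil
		}
	}
	return nil, ErrPoolExhausted
}

// Release frees the machine held by owner, for example once RKE2
// has been uninstalled from it.
func Release(pool []PooledMachine, owner string) {
	for i := range pool {
		if pool[i].ClaimedBy == owner {
			pool[i].ClaimedBy = ""
		}
	}
}

func main() {
	pool := []PooledMachine{{Address: "10.0.0.11"}, {Address: "10.0.0.12"}}
	m, err := Claim(pool, "cp-0")
	fmt.Println(m.Address, err)
}
```

Keeping `Claim` idempotent is a deliberate choice in this sketch: a controller that is requeued must not consume a second machine for the same owner.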

Concretely, the proposal is a native RKE2SSHMachine / RKE2SSHMachineTemplate infrastructure provider, built into or alongside CAPRKE2, designed for pre-provisioned bare metal and VM environments reachable via SSH.
This is distinct from using k0smotron's RemoteMachine provider as a shim: the key requirement is RKE2-aware lifecycle management, particularly for rolling upgrades, which is where the k0smotron combination breaks down fundamentally.

Proposed CRDs:

  • RKE2SSHCluster / RKE2SSHClusterTemplate - infrastructure cluster object (VIP, ports)
  • RKE2SSHMachine / RKE2SSHMachineTemplate - per-machine SSH target (address, port, user, SSH key Secret ref, useSudo)
  • PooledRKE2SSHMachine - pool of pre-registered machines assigned on demand.
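As a rough sketch of what the per-machine type could look like, the Go struct below mirrors the fields listed for RKE2SSHMachine. The JSON field names, the `SecretRef` shape, and the `WithDefaults` / `Validate` helpers are assumptions made for illustration; only the field list (address, port, user, SSH key Secret ref, useSudo) comes from the proposal. Defaulting the port to 22 is likewise an assumption, based on the standard SSH port.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// SecretRef names the Secret holding the SSH private key (hypothetical shape).
type SecretRef struct {
	Name string `json:"name"`
}

// RKE2SSHMachineSpec mirrors the fields listed in the proposal.
type RKE2SSHMachineSpec struct {
	Address   string    `json:"address"`
	Port      int       `json:"port,omitempty"`
	User      string    `json:"user"`
	SSHKeyRef SecretRef `json:"sshKeyRef"`
	UseSudo   bool      `json:"useSudo,omitempty"`
}

// WithDefaults returns a copy with unset optional fields filled in.
func (s RKE2SSHMachineSpec) WithDefaults() RKE2SSHMachineSpec {
	if s.Port == 0 {
		s.Port = 22 // assumption: standard SSH port as default
	}
	return s
}

// Validate rejects specs that cannot describe a reachable SSH target.
func (s RKE2SSHMachineSpec) Validate() error {
	if s.Address == "" {
		return errors.New("address is required")
	}
	if s.User == "" {
		return errors.New("user is required")
	}
	if s.SSHKeyRef.Name == "" {
		return errors.New("sshKeyRef.name is required")
	}
	return nil
}

func main() {
	spec := RKE2SSHMachineSpec{
		Address:   "10.0.0.11",
		User:      "rke2",
		SSHKeyRef: SecretRef{Name: "node-ssh-key"},
		UseSudo:   true,
	}.WithDefaults()
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out), spec.Validate())
}
```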

Why do you want this feature:

We validated the closest existing alternative by combining k0smotron's RemoteMachine provider with CAPRKE2 in a production airgapped environment and hit two blockers that cannot be worked around.
I understand the two providers are not meant to be compatible; this is only feedback on my experience using Rancher's Cluster API providers:

The need to adapt k0smotron:

k0smotron Remote SSH produces a providerID format incompatible with what RKE2 writes to the Node, causing machines to stall in WaitingForNodeRef, and has no built-in RKE2 drain/uninstall on deletion.
For reference, the two components set it independently with incompatible formats:

k0smotron sets Machine.spec.providerID = remote-machine://X.X.X.X:22
RKE2 sets Node.spec.providerID = rke2://node-name

We worked around this with a kubelet-arg override:

kubelet-arg:
  - provider-id="remote-machine://${NODE_IP}:22"
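The stall can be illustrated with a small Go sketch. It assumes, as far as we understand Cluster API's behavior, that a Machine receives its nodeRef only when a workload-cluster Node carries exactly the same providerID string. The `Node` type and both helper functions are hypothetical stand-ins, and the two formats are the ones quoted above.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the relevant part of a Kubernetes Node.
type Node struct {
	Name       string
	ProviderID string
}

// RemoteMachineProviderID builds the format k0smotron sets on the Machine,
// as quoted in this issue.
func RemoteMachineProviderID(address string, port int) string {
	return fmt.Sprintf("remote-machine://%s:%d", address, port)
}

// FindNodeByProviderID returns the Node whose providerID is an exact
// string match for the Machine's providerID.
func FindNodeByProviderID(nodes []Node, machineProviderID string) (Node, bool) {
	for _, n := range nodes {
		if n.ProviderID == machineProviderID {
			return n, true
		}
	}
	return Node{}, false
}

func main() {
	want := RemoteMachineProviderID("10.0.0.11", 22)

	// Default case: RKE2 writes rke2://<node-name>, so nothing matches.
	defaults := []Node{{Name: "node-1", ProviderID: "rke2://node-1"}}
	_, ok := FindNodeByProviderID(defaults, want)
	fmt.Println("default match:", ok)

	// With the kubelet provider-id override, the strings line up.
	overridden := []Node{{Name: "node-1", ProviderID: want}}
	_, ok = FindNodeByProviderID(overridden, want)
	fmt.Println("override match:", ok)
}
```

A native provider would avoid the override by setting the providerID on its own infrastructure machine object to whatever RKE2 writes on the Node.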

Etcd TLS CA mismatch:

CAPRKE2 authenticates to etcd using the kube-apiserver CA. RKE2 uses a separate, dedicated CA for etcd. Even when port 2379 is reachable, the TLS handshake fails, and there is no supported workaround. A native RKE2-aware provider would resolve both blockers by performing etcd operations over SSH on the node itself rather than requiring a direct TCP connection from the management cluster: the correct etcd CA path is known on the node, and no firewall changes are needed.
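A sketch of what "etcd operations over SSH on the node" could mean in practice: the Go helper below only builds the command line that would be run on the node. Several things are assumptions to verify against your own hosts: the certificate directory `/var/lib/rancher/rke2/server/tls/etcd` and its file names, the availability of an `etcdctl` binary on the node, and the hypothetical `EtcdctlCommand` helper itself. The `--endpoints`, `--cacert`, `--cert` and `--key` flags are standard etcdctl options.

```go
package main

import (
	"fmt"
	"strings"
)

// etcdTLSDir is the assumed location of RKE2's dedicated etcd certificates.
// Verify this path on your nodes before relying on it.
const etcdTLSDir = "/var/lib/rancher/rke2/server/tls/etcd"

// EtcdctlCommand builds an etcdctl invocation that talks to the local etcd
// member using the node-local etcd CA and client certificate.
// args are assumed to be simple tokens that need no shell quoting.
func EtcdctlCommand(useSudo bool, args ...string) string {
	parts := []string{}
	if useSudo {
		parts = append(parts, "sudo")
	}
	parts = append(parts,
		"etcdctl",
		"--endpoints=https://127.0.0.1:2379",
		"--cacert="+etcdTLSDir+"/server-ca.crt",
		"--cert="+etcdTLSDir+"/server-client.crt",
		"--key="+etcdTLSDir+"/server-client.key",
	)
	parts = append(parts, args...)
	return strings.Join(parts, " ")
}

func main() {
	// The resulting string would be handed to an SSH session on the node.
	fmt.Println(EtcdctlCommand(true, "member", "list"))
}
```

Because the command runs against 127.0.0.1 on the node, the management cluster never opens a TCP connection to port 2379 and never needs the etcd CA itself.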

Anything else you would like to add:

We know this departs from the philosophy of Cluster API, which takes care of the end-to-end provisioning of the Kubernetes cluster.
This use case covers any organization running RKE2 on bare metal or on-premises VMs with existing provisioning pipelines and segmented networks, not just our specific environment. We would be willing to contribute to or test such a provider given our validated deployment setup.

Labels: kind/feature, needs-triage