Common issues and solutions, organized by component.
Symptom: Missing prerequisites.
Fix: Install the missing tools, then re-run:
# Python deps
pip install -r requirements.txt
# Ansible collections
ansible-galaxy collection install -r requirements.yml
# or
make collectionsSymptom: Error: Inventory file not found
Cause: make cluster or make rancher was run before make infra-up.
Fix: Either run make infra-up first, or create a manual inventory.
Symptom: make ping or ansible ... -m ping returns UNREACHABLE.
Checklist:
- Verify the IP is reachable:
ssh <user>@<ip> - Check
ansible_userhas SSH key access (or setansible_ssh_private_key_file) - Ensure the security group / firewall allows SSH (port 22) from your IP
- For airgap nodes, verify the bastion is reachable first
Cause: SSH key mismatch between what Tofu injected and what Ansible is using.
Fix:
- Verify
public_ssh_keyinterraform.tfvarsmatches the key pair Ansible uses - For BYO nodes, ensure the public key is in
~/.ssh/authorized_keyson the target
Cause: SSH ProxyCommand through the bastion fails.
Checklist:
- Can you SSH to the bastion directly?
ssh -i <key> <bastion_user>@<bastion_ip> - From the bastion, can you reach airgap nodes?
ssh <airgap_node_private_ip> - Check
ssh_private_key_filein group vars matches the key on disk
See also: SSH Troubleshooting (airgap)
Cause: Network issues downloading providers, or version constraints.
Fix:
- Check internet connectivity
- If behind a proxy, set
HTTPS_PROXY - Delete
.terraform/and retry:rm -rf tofu/aws/modules/cluster_nodes/.terraform && make infra-init
Fix:
- Check
aws_access_keyandaws_secret_keyinterraform.tfvars - Alternatively, use environment variables:
export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... - Verify the IAM user has permissions for EC2, VPC, and (optionally) Route53
Cause: Running tofu apply in the default workspace when a previous deployment wasn't destroyed.
Fix:
- Destroy first:
make infra-down - Or use workspaces:
cd tofu/aws/modules/cluster_nodes && tofu workspace new my-test
Symptom: Ansible task hangs or fails at "start rke2-server".
Debug:
ssh <user>@<node_ip>
sudo journalctl -u rke2-server --no-pager -n 100
sudo systemctl status rke2-serverCommon causes:
- Port conflicts: ensure 6443, 9345, 10250 are open between nodes
- Insufficient resources: RKE2 needs at least 2 vCPU, 4 GB RAM
- CNI issues: check
/var/lib/rancher/rke2/agent/logs/
Cause: Usually CNI-related — pods can't get network assigned.
Debug:
kubectl --kubeconfig <kubeconfig> get pods -n kube-system
kubectl --kubeconfig <kubeconfig> describe node <node-name>Fix: Ensure the security group allows the CNI port range (e.g., VXLAN 8472/UDP for Canal, BGP 179/TCP for Calico).
Cause: Cluster is still converging. Nodes or system pods may need more time.
Fix: Wait 2–3 minutes and re-run just the health check:
ansible-playbook -i <inventory> ansible/rke2/default/rke2-playbook.yml --tags healthCause: Master node didn't start successfully, so the token file wasn't created.
Fix: Check that the master node's RKE2 service is running, then re-run:
ansible-playbook -i <inventory> ansible/rke2/default/rke2-playbook.yml --tags clusterDebug:
ssh <user>@<node_ip>
sudo journalctl -u k3s --no-pager -n 100Common causes:
- Port 6443 already in use
- Insufficient permissions (K3s needs root or a user with full sudo)
Fix: Check that all nodes can reach the master on port 6443. Verify the join token was distributed correctly in the Ansible output.
Debug:
kubectl --kubeconfig <kubeconfig> get pods -n cattle-system
kubectl --kubeconfig <kubeconfig> logs -n cattle-system -l app=rancher
kubectl --kubeconfig <kubeconfig> get pods -n cert-managerCommon causes:
- cert-manager not ready — wait and check
cert-managernamespace pods - Image pull errors — check node internet access (or registry configuration in airgap)
Checklist:
- DNS:
nslookup <fqdn>— must resolve to your cluster LB or node IP - Ports: 80 and 443 must be open
- Ingress:
kubectl get ingress -n cattle-system - For local testing: add
<node-ip> <fqdn>to/etc/hosts
After the first login, Rancher switches to the permanent password. The bootstrap_password is single-use.
Cause: Bastion can't reach GitHub to download RKE2 releases.
Fix: Verify bastion has internet: ssh bastion && curl -I https://github.com
Cause: Disk space or SSH connectivity.
Fix:
- Check bastion disk:
df -h(need ~5 GB for tarballs) - Check airgap node disk:
df -h(need ~20 GB for installation) - Test SSH:
make test-ssh ENV=airgap
Symptom: download sha256 does not match
Fix:
ansible-playbook -i ansible/rke2/airgap/inventory/inventory.yml \
ansible/rke2/airgap/playbooks/debug/fix-checksum-issues.ymlSymptom: failed to pull image after RKE2 is running.
Debug:
# Check registry config
ssh <bastion> -> ssh <airgap-node>
cat /etc/rancher/rke2/registries.yaml
# Test registry connectivity
curl -k https://<registry-url>/v2/
# Check containerd logs
cat /var/lib/rancher/rke2/agent/containerd/containerd.logFix: Re-apply registry config: make registry ENV=airgap
The airgap deployment includes several diagnostic playbooks:
# Test SSH connectivity
ansible-playbook -i <inventory> ansible/rke2/airgap/playbooks/debug/test-ssh-connectivity.yml
# Fix checksum issues
ansible-playbook -i <inventory> ansible/rke2/airgap/playbooks/debug/fix-checksum-issues.yml
# Fix RKE2 config
ansible-playbook -i <inventory> ansible/rke2/airgap/playbooks/debug/fix-rke2-config.yml
# Validate upgrade readiness
ansible-playbook -i <inventory> ansible/rke2/airgap/playbooks/debug/validate-upgrade-readiness.yml- Add
-v,-vv, or-vvvto anyansible-playbookcommand for more verbose output - Check the FAQ
- Review the component-specific READMEs in
ansible/rke2/default/README.md,ansible/k3s/default/README.md, oransible/rke2/airgap/README.md