Commit 1df9283

cleanup
1 parent 952242d commit 1df9283

13 files changed

Lines changed: 1023 additions & 821 deletions

.github/copilot-instructions.md

Lines changed: 5 additions & 10 deletions
@@ -4,7 +4,7 @@

 This is a **production-grade GitOps Kubernetes cluster** running on Talos OS with self-managing ArgoCD. The key differentiator is that ArgoCD manages its own configuration and automatically discovers applications through directory structure - no manual Application manifests needed.

-**Tech Stack**: Talos OS + K3s + ArgoCD + Cilium + Gateway API + Longhorn + 1Password + GPU support
+**Tech Stack**: Talos OS + ArgoCD + Cilium + Gateway API + Longhorn + 1Password + GPU support

 ## How This Project Works

@@ -28,16 +28,11 @@ monitoring/prometheus-stack/ → ArgoCD Application "prometheus-stack"
 ## Essential Commands for This Project

 ### Talos Cluster Management
+Nodes are managed via **Omni UI** (upgrades, configuration, patches). No `talhelper` or manual `talosctl` needed.
+
 ```bash
-# Node health check
+# Node health check (if needed from CLI)
 talosctl health --nodes <node-ip>
-
-# Apply config changes (for Talos settings)
-talosctl apply-config --nodes <node-ip> --file iac/talos/clusterconfig/<node>.yaml
-
-# Upgrade nodes (for Talos version/extensions changes)
-INSTALLER_URL=$(talhelper genurl installer -c iac/talos/talconfig.yaml -n "<node-name>")
-talosctl upgrade --nodes "<node-ip>" --image "$INSTALLER_URL"
 ```

 ### ArgoCD Bootstrap (Critical Sequence)
@@ -139,7 +134,7 @@ kubectl get pods -n gpu-operator
 ## Key Reference Files

 - **GitOps Core**: `infrastructure/controllers/argocd/root.yaml` + `infrastructure/controllers/argocd/apps/*-appset.yaml`
-- **Talos Config**: `iac/talos/talconfig.yaml` (complete node definitions)
+- **Omni Config**: `omni/` (machine classes, cluster templates, patches)
 - **GPU Example**: `my-apps/ai/comfyui/` (complete GPU app pattern)
 - **Helm Pattern**: `infrastructure/controllers/1passwordconnect/kustomization.yaml`
 - **Web Access**: `my-apps/home/frigate/httproute.yaml` + service with named ports
Lines changed: 32 additions & 41 deletions
@@ -1,55 +1,44 @@
 ---
-applies_to:
-- "iac/talos/**"
+applies_to:
+- "omni/**"
 - "**/*talos*"
 ---

 # Talos OS Management Instructions

 ## Overview
-Talos OS is an immutable Linux distribution designed for Kubernetes - no shell, no SSH, API-only management.
+Talos OS is an immutable Linux distribution designed for Kubernetes - no shell, no SSH, API-only management. This cluster is managed via **Omni** (Sidero's Talos management platform) with the Proxmox Infrastructure Provider.

 ## Key Concepts
 - **Immutable OS**: No package manager, all changes via configuration
-- **API-only**: All management via `talosctl`, never SSH
-- **Declarative**: Configuration defined in `iac/talos/talconfig.yaml`
+- **API-only**: All management via Omni UI, never SSH
+- **Declarative**: Configuration managed in Omni (machine classes, cluster templates)
 - **System Extensions**: Drivers and modules loaded at boot time

-## Configuration Management
+## Cluster Management via Omni

-### Talhelper Workflow
-```bash
-# Generate machine configs from talconfig.yaml
-cd iac/talos
-talhelper genconfig
+### Node Operations
+- **Provisioning**: Omni + Sidero Proxmox Provider handles VM creation and Talos installation
+- **Upgrades**: Managed through Omni UI (Talos version, system extensions)
+- **Configuration**: Machine classes and patches in `omni/` directory
+- **Kubeconfig**: Download from Omni UI > cluster > "Download Kubeconfig"

-# Generate installer URLs for upgrades
-talhelper genurl installer -c talconfig.yaml -n "<node-name>"
-```
+### Machine Classes
+Defined in `omni/machine-classes/`:
+- `control-plane.yaml` - Control plane nodes
+- `worker.yaml` - Regular worker nodes
+- `gpu-worker.yaml` - GPU worker nodes with NVIDIA extensions

-### Applying Changes
-```bash
-# For configuration changes (non-image changes)
-talosctl apply-config --nodes <node-ip> --file iac/talos/clusterconfig/<node>.yaml
+### Cluster Template
+`omni/cluster-template/cluster-template.yaml` defines the cluster layout with patches in `omni/cluster-template/patches/`.

-# For Talos version or system extension changes (requires reboot)
-INSTALLER_URL=$(talhelper genurl installer -c iac/talos/talconfig.yaml -n "<node-name>")
-talosctl upgrade --nodes "<node-ip>" --image "$INSTALLER_URL"
-```
-
-### Secrets Management
-- `talsecret.sops.yaml` contains cluster encryption keys
-- Always encrypted with SOPS before committing
-- Generated once with `talhelper gensecret > talsecret.sops.yaml`
-
-## Node Types and Configuration
+## Node Types

 ### Control Plane Nodes
 - Run etcd, kube-apiserver, kube-controller-manager
 - Default container runtime: `runc`
-- Label: `node.kubernetes.io/exclude-from-external-load-balancers`

-### GPU Worker Nodes
+### GPU Worker Nodes
 - NVIDIA system extensions: `nonfree-kmod-nvidia-production`, `nvidia-container-toolkit-production`
 - Default container runtime: `nvidia`
 - Kernel modules: `nvidia`, `nvidia_uvm`, `nvidia_drm`, `nvidia_modeset`
@@ -65,7 +54,7 @@ System extensions are loaded at boot time and cannot be changed at runtime.

 ### Common Extensions
 - `siderolabs/amd-ucode`: AMD CPU microcode
-- `siderolabs/gasket-driver`: Google Coral TPU support
+- `siderolabs/gasket-driver`: Google Coral TPU support
 - `siderolabs/iscsi-tools`: iSCSI storage support
 - `siderolabs/nfsd`: NFS server support
 - `siderolabs/qemu-guest-agent`: VM guest tools
@@ -83,7 +72,11 @@ System extensions are loaded at boot time and cannot be changed at runtime.

 ## Troubleshooting

-### Health Checks
+### From Omni UI
+- View node health, logs, and events directly in Omni
+- Trigger upgrades and configuration changes
+
+### From CLI (if needed)
 ```bash
 # Check node health
 talosctl health --nodes <node-ip>
@@ -97,13 +90,11 @@ talosctl logs -n <node-ip> kubelet # kubelet logs
 ```

 ### Common Issues
-- **Config changes not applied**: Use `talosctl apply-config`, not `kubectl edit`
-- **GPU not available**: Verify system extensions in talconfig.yaml, may need upgrade
-- **Network issues**: Check static IP configuration in networkInterfaces
+- **Config changes not applied**: Use Omni UI, not `kubectl edit`
+- **GPU not available**: Verify system extensions in machine class, may need upgrade via Omni
+- **Network issues**: Check static IP configuration in Omni node patches

 ## Critical Rules
--**Never SSH to nodes** - API-only management
--**Never use `kubectl edit` for node config** - changes are ephemeral
--**Always regenerate configs** when changing talconfig.yaml
--**Use `upgrade` command** for system extension changes
--**Encrypt secrets with SOPS** before committing
+- Never SSH to nodes - API-only management
+- Never use `kubectl edit` for node config - changes are ephemeral
+- Use Omni UI for all node lifecycle operations (upgrades, patches, extensions)
