Commit 1cfda15

cleanup
1 parent 2ec84e3 commit 1cfda15

4 files changed

Lines changed: 122 additions & 21 deletions

README.md

Lines changed: 12 additions & 0 deletions
@@ -72,6 +72,18 @@ graph TD;
  - **GPU Integration**: Full NVIDIA GPU support via Talos system extensions and GPU Operator
  - **Zero SSH**: All node management via Talosctl API

+ ### 🌊 Sync Wave Architecture
+
+ The cluster uses **ArgoCD Sync Waves** to strictly order deployments, preventing "chicken-and-egg" dependency issues:
+
+ 1. **Wave 0 (Foundation)**: Networking (Cilium) & Secrets (1Password/External Secrets)
+ 2. **Wave 1 (Storage)**: Persistent Storage (Longhorn) & Object Storage (Garage)
+ 3. **Wave 2 (System)**: Core Infrastructure (Cert-Manager, Databases, GPU)
+ 4. **Wave 3 (Monitoring)**: Observability Stack (Prometheus, Grafana)
+ 5. **Wave 4 (Apps)**: User Workloads
+
+ *See [docs/argocd.md](docs/argocd.md) for the deep dive on health checks and dependency management.*
+
  ## 🚀 Quick Start (Manual Talos Method)

  > **Note:** If you're using Omni + Sidero Proxmox Provider, see **[BOOTSTRAP.md](BOOTSTRAP.md)** instead.

docs/argocd.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+ # ArgoCD & GitOps Architecture
+
+ This document details the "App of Apps" GitOps architecture used in this cluster, specifically focusing on the **Sync Wave** strategy and **Health Check Customizations** that enable a fully self-managing cluster.
+
+ ## 🏗️ The "App of Apps" Pattern
+
+ We use a hierarchical "App of Apps" pattern to manage the entire cluster state.
+
+ ```mermaid
+ graph TD;
+     RootApp[Root Application] -->|Manages| AppSets[ApplicationSets];
+     AppSets -->|Generates| Apps[Applications];
+     Apps -->|Deploys| Resources[Kubernetes Resources];
+ ```
+
+ ### The Root Application
+ The entry point is `infrastructure/controllers/argocd/root.yaml`. This application:
+ 1. Points to `infrastructure/controllers/argocd/apps/`
+ 2. Deploys the `ApplicationSet` definitions found there.
+ 3. Is the *only* thing applied manually (during bootstrap).
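As a rough illustration of the three points above, a root Application along these lines would point ArgoCD at the apps directory. This is a sketch, not the repository's actual `root.yaml`: the `metadata.name`, `project`, namespaces, and `syncPolicy` values are assumptions, and only the `repoURL`, revision, and `path` are taken from this commit.

```yaml
# Hypothetical root Application; not copied from root.yaml.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root                # assumed name
  namespace: argocd         # assumed: the namespace ArgoCD runs in
spec:
  project: default          # assumed project
  source:
    repoURL: https://github.com/mitchross/talos-argocd-proxmox.git
    targetRevision: main
    path: infrastructure/controllers/argocd/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:              # assumed: keep child definitions in sync without manual action
      prune: true
      selfHeal: true
```

Applying a manifest like this once (for example with `kubectl apply -f`) would be the single manual bootstrap step; everything under `path` is then reconciled from Git.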
+
+ ### ApplicationSets
+ We use three primary ApplicationSets to categorize workloads:
+ 1. **Infrastructure** (`infrastructure-appset.yaml`): Core system components (Cilium, Longhorn, Cert-Manager).
+ 2. **Monitoring** (`monitoring-appset.yaml`): Observability stack (Prometheus, Grafana).
+ 3. **My Apps** (`my-apps-appset.yaml`): User workloads.
+
+ ## 🌊 Sync Waves & Dependency Management
+
+ To solve the "chicken-and-egg" problem of bootstrapping a cluster (e.g., needing storage for apps, but networking for storage), we use **ArgoCD Sync Waves**.
+
+ ### The Wave Strategy
+
+ | Wave | Phase | Components | Description |
+ |------|-------|------------|-------------|
+ | **0** | **Foundation** | `cilium`, `1password-connect`, `external-secrets` | **Networking & Secrets**. The absolute minimum required for other pods to start and pull credentials. |
+ | **1** | **Storage** | `longhorn`, `garage` | **Persistence**. Depends on Wave 0 for Pod-to-Pod communication and S3 backup credentials. |
+ | **2** | **System** | `cert-manager`, `gpu-operator`, `databases` | **Core Services**. Depends on Storage (PVCs) and Networking (Ingress/Gateway). |
+ | **3** | **Observability** | `kube-prometheus-stack`, `loki` | **Monitoring**. Monitors the healthy stack. |
+ | **4** | **User** | `my-apps/*` | **Workloads**. The actual applications running on the cluster. |
+
+ ### How It Works
+ Each `Application` resource in `infrastructure/controllers/argocd/apps/` is annotated with a sync wave:
+
+ ```yaml
+ apiVersion: argoproj.io/v1alpha1
+ kind: Application
+ metadata:
+   name: cilium
+   annotations:
+     argocd.argoproj.io/sync-wave: "0"
+ ```
+
+ ArgoCD processes these waves sequentially. **Wave 1 will NOT start until Wave 0 is healthy.**
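For contrast with the Wave 0 example, a standalone Wave 1 Application might look like the following. Only the `infrastructure/storage/longhorn` path, the repository URL, and the wave number come from this commit; the `longhorn-system` namespace, the project, and the sync options are assumptions for illustration.

```yaml
# Hypothetical standalone Application for Wave 1.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: longhorn
  namespace: argocd                       # assumed ArgoCD namespace
  annotations:
    argocd.argoproj.io/sync-wave: "1"     # synced only after Wave 0 resources report healthy
spec:
  project: default                        # assumed project
  source:
    repoURL: https://github.com/mitchross/talos-argocd-proxmox.git
    targetRevision: main
    path: infrastructure/storage/longhorn
  destination:
    server: https://kubernetes.default.svc
    namespace: longhorn-system            # assumed target namespace
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
      - CreateNamespace=true              # assumed
```

The wave is quoted because annotation values must be strings. Lower waves sync first, and a resource without the annotation is treated as wave 0.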
+
+ ## 🏥 Health Check Customizations
+
+ Standard ArgoCD behavior is to mark a parent Application as "Healthy" as soon as the child Application resource is created, *even if the child app is still syncing or degraded*. This breaks the Sync Wave logic for App-of-Apps.
+
+ To fix this, we inject a custom Lua health check in `infrastructure/controllers/argocd/values.yaml`.
+
+ ### The "Wait for Child" Script
+
+ ```lua
+ resource.customizations.health.argoproj.io_Application: |
+   hs = {}
+   hs.status = "Progressing"
+   hs.message = ""
+   if obj.status ~= nil then
+     if obj.status.health ~= nil then
+       hs.status = obj.status.health.status
+       if obj.status.health.message ~= nil then
+         hs.message = obj.status.health.message
+       end
+     end
+   end
+   return hs
+ ```
+
+ **What this does:**
+ 1. It overrides the health assessment of `Application` resources.
+ 2. It forces the parent (Root App) to report the *actual status* of the child Application.
+ 3. If `cilium` (Wave 0) is "Progressing", the Root App sees it as "Progressing".
+ 4. The Root App **pauses** processing Wave 1 until all Wave 0 apps report "Healthy".
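In Helm values, the key above has to sit under the chart's ConfigMap section, with the Lua body indented beneath the `|` block scalar. The following is a sketch that assumes the community `argo-cd` Helm chart, where entries under `configs.cm` are rendered into the `argocd-cm` ConfigMap. The source only names `values.yaml`, so this nesting is an assumption.

```yaml
# Assumed layout for the argo-cd Helm chart's values.yaml.
configs:
  cm:
    resource.customizations.health.argoproj.io_Application: |
      hs = {}
      hs.status = "Progressing"      -- default until the child reports its own health
      hs.message = ""
      if obj.status ~= nil then
        if obj.status.health ~= nil then
          hs.status = obj.status.health.status
          if obj.status.health.message ~= nil then
            hs.message = obj.status.health.message
          end
        end
      end
      return hs
```

Starting from "Progressing" means a freshly created child Application with no `status` yet still holds back the next wave.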
+
+ ## 🔄 Self-Management Loop
+
+ 1. **Bootstrap**: You apply `root.yaml`.
+ 2. **Adoption**: ArgoCD sees `cilium` defined in Git (Wave 0). It adopts the running Cilium instance.
+ 3. **Expansion**: ArgoCD deploys `external-secrets` (Wave 0).
+ 4. **Wait**: ArgoCD waits for Cilium and External Secrets to be green.
+ 5. **Storage**: ArgoCD deploys `longhorn` (Wave 1).
+ 6. **Completion**: The process continues until all waves are healthy.
+
+ This ensures a deterministic, reliable boot sequence every time.

infrastructure/controllers/argocd/apps/infrastructure-appset.yaml

Lines changed: 13 additions & 19 deletions
@@ -11,27 +11,21 @@ spec:
  repoURL: https://github.com/mitchross/talos-argocd-proxmox.git
  revision: main
  directories:
- # Controllers (argocd managed via Helm, cilium/external-secrets/1password have their own apps)
- - path: infrastructure/controllers/cert-manager
- - path: infrastructure/controllers/gpu-priority-classes
- - path: infrastructure/controllers/intel-device-plugins
- - path: infrastructure/controllers/node-feature-discovery
- - path: infrastructure/controllers/nvidia-device-plugin
- - path: infrastructure/controllers/nvidia-gpu-operator
- # Databases
+ - path: infrastructure/controllers/*
+ - path: infrastructure/networking/*
+ - path: infrastructure/storage/*
  - path: infrastructure/database/*/*
- # Networking (cilium has its own app)
- - path: infrastructure/networking/cloudflared
- - path: infrastructure/networking/coredns
- - path: infrastructure/networking/gateway
- # Storage (longhorn has its own app)
- - path: infrastructure/storage/container-registry
- - path: infrastructure/storage/csi-driver-nfs
- - path: infrastructure/storage/csi-driver-smb
- - path: infrastructure/storage/local-storage
- - path: infrastructure/storage/openebs
- # CRDs
  - path: infrastructure/crds
+
+ # EXCLUDES: These are managed by standalone Apps for Sync Wave control (Wave 0/1)
+ # or are the ArgoCD controller itself.
+ exclude:
+ - path: infrastructure/controllers/argocd
+ - path: infrastructure/controllers/1passwordconnect
+ - path: infrastructure/controllers/external-secrets
+ - path: infrastructure/networking/cilium
+ - path: infrastructure/storage/longhorn
+ - path: infrastructure/storage/garage
  template:
  metadata:
  name: '{{path.basename}}'
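For reference, the ApplicationSet Git directory generator documents exclusions as entries in the same `directories` list marked with `exclude: true`, not as a separate `exclude:` block. Indentation is not recoverable from this diff view, so how the block above is nested is unclear. The following is a sketch of the documented form using the paths from this hunk, with the surrounding generator nesting assumed.

```yaml
# Sketch of the documented per-entry exclusion form; nesting under spec.generators is assumed.
generators:
  - git:
      repoURL: https://github.com/mitchross/talos-argocd-proxmox.git
      revision: main
      directories:
        - path: infrastructure/controllers/*
        - path: infrastructure/networking/*
        - path: infrastructure/storage/*
        - path: infrastructure/database/*/*
        - path: infrastructure/crds
        # Standalone Apps (Wave 0/1) and ArgoCD itself are skipped here.
        - path: infrastructure/controllers/argocd
          exclude: true
        - path: infrastructure/controllers/1passwordconnect
          exclude: true
        - path: infrastructure/controllers/external-secrets
          exclude: true
        - path: infrastructure/networking/cilium
          exclude: true
        - path: infrastructure/storage/longhorn
          exclude: true
        - path: infrastructure/storage/garage
          exclude: true
```

An excluded path takes precedence over a wildcard that would otherwise match it, so the excluded directories stay with their standalone Applications.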

scripts/bootstrap-argocd.sh

Lines changed: 2 additions & 2 deletions
@@ -52,8 +52,8 @@ echo ""
  echo "✅ ArgoCD bootstrap complete!"
  echo ""
  echo "📊 ArgoCD will now sync applications in this order:"
- echo " Wave 0: Cilium (networking)"
- echo " Wave 1: Longhorn (storage)"
+ echo " Wave 0: Cilium (networking) & Secrets"
+ echo " Wave 1: Longhorn (storage) & Garage (S3)"
  echo " Wave 2: Infrastructure (core services)"
  echo " Wave 3: Monitoring (observability)"
  echo " Wave 4: My-Apps (workloads)"
