| 1 | +# ArgoCD & GitOps Architecture |
| 2 | + |
| 3 | +This document details the "App of Apps" GitOps architecture used in this cluster, specifically focusing on the **Sync Wave** strategy and **Health Check Customizations** that enable a fully self-managing cluster. |
| 4 | + |
| 5 | +## 🏗️ The "App of Apps" Pattern |
| 6 | + |
| 7 | +We use a hierarchical "App of Apps" pattern to manage the entire cluster state. |
| 8 | + |
| 9 | +```mermaid |
| 10 | +graph TD; |
| 11 | + RootApp[Root Application] -->|Manages| AppSets[ApplicationSets]; |
| 12 | + AppSets -->|Generates| Apps[Applications]; |
| 13 | + Apps -->|Deploys| Resources[Kubernetes Resources]; |
| 14 | +``` |
| 15 | + |
| 16 | +### The Root Application |
| 17 | +The entry point is `infrastructure/controllers/argocd/root.yaml`. This application: |
| 18 | +1. Points to `infrastructure/controllers/argocd/apps/`. |
| 19 | +2. Deploys the `ApplicationSet` definitions found there. |
| 20 | +3. Is the *only* thing applied manually (during bootstrap). |
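
For orientation, a root Application along these lines might look like the following. This is a minimal sketch, not the contents of the actual `root.yaml`: the repository URL, the `argocd` namespace, the `default` project, and the automated sync policy are all assumptions. Only the `path` comes from the description above.

```yaml
# Hypothetical sketch of a root Application; not the repository's actual root.yaml.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd              # assumption: ArgoCD runs in the "argocd" namespace
spec:
  project: default               # assumption
  source:
    repoURL: https://example.com/your-org/your-cluster.git  # placeholder URL
    targetRevision: HEAD
    path: infrastructure/controllers/argocd/apps            # directory named above
  destination:
    server: https://kubernetes.default.svc                  # the in-cluster API server
    namespace: argocd
  syncPolicy:
    automated:                   # assumption: automated sync keeps the tree self-managing
      prune: true
      selfHeal: true
```

Applying a manifest like this once (for example with `kubectl apply -f infrastructure/controllers/argocd/root.yaml`) is the manual bootstrap step; everything beneath it is then reconciled from Git.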
| 21 | + |
| 22 | +### ApplicationSets |
| 23 | +We use three primary ApplicationSets to categorize workloads: |
| 24 | +1. **Infrastructure** (`infrastructure-appset.yaml`): Core system components (Cilium, Longhorn, Cert-Manager). |
| 25 | +2. **Monitoring** (`monitoring-appset.yaml`): Observability stack (Prometheus, Grafana). |
| 26 | +3. **My Apps** (`my-apps-appset.yaml`): User workloads. |
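
As an illustration of how one of these ApplicationSets could generate Applications, here is a sketch of `my-apps-appset.yaml` using the Git directory generator. The generator choice, repository URL, project, and namespace-per-app layout are assumptions; the real file may be structured differently.

```yaml
# Hypothetical sketch of my-apps-appset.yaml; generator and template values are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-apps
  namespace: argocd              # assumption
spec:
  generators:
    - git:
        repoURL: https://example.com/your-org/your-cluster.git  # placeholder URL
        revision: HEAD
        directories:
          - path: my-apps/*      # one Application per subdirectory
  template:
    metadata:
      name: '{{path.basename}}'  # e.g. my-apps/foo -> "foo"
    spec:
      project: default           # assumption
      source:
        repoURL: https://example.com/your-org/your-cluster.git
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'   # assumption: one namespace per app
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
```

With a layout like this, adding a new workload would only require committing a new directory under `my-apps/`.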
| 27 | + |
| 28 | +## 🌊 Sync Waves & Dependency Management |
| 29 | + |
| 30 | +To solve the "chicken-and-egg" problem of bootstrapping a cluster (e.g., needing storage for apps, but networking for storage), we use **ArgoCD Sync Waves**. |
| 31 | + |
| 32 | +### The Wave Strategy |
| 33 | + |
| 34 | +| Wave | Phase | Components | Description | |
| 35 | +|------|-------|------------|-------------| |
| 36 | +| **0** | **Foundation** | `cilium`, `1password-connect`, `external-secrets` | **Networking & Secrets**. The absolute minimum required for other pods to start and pull credentials. | |
| 37 | +| **1** | **Storage** | `longhorn`, `garage` | **Persistence**. Depends on Wave 0 for Pod-to-Pod communication and S3 backup credentials. | |
| 38 | +| **2** | **System** | `cert-manager`, `gpu-operator`, `databases` | **Core Services**. Depends on Storage (PVCs) and Networking (Ingress/Gateway). | |
| 39 | +| **3** | **Observability** | `kube-prometheus-stack`, `loki` | **Monitoring**. Deployed once the layers it observes are healthy. | |
| 40 | +| **4** | **User** | `my-apps/*` | **Workloads**. The actual applications running on the cluster. | |
| 41 | + |
| 42 | +### How It Works |
| 43 | +Each `Application` resource in `infrastructure/controllers/argocd/apps/` is annotated with a sync wave: |
| 44 | + |
| 45 | +```yaml |
| 46 | +apiVersion: argoproj.io/v1alpha1 |
| 47 | +kind: Application |
| 48 | +metadata: |
| 49 | +  name: cilium |
| 50 | +  annotations: |
| 51 | +    argocd.argoproj.io/sync-wave: "0" |
| 52 | +``` |
| 53 | + |
| 54 | +ArgoCD applies waves in ascending order and waits for every resource in a wave to be synced and healthy before moving on. **Wave 1 will NOT start until Wave 0 is healthy.** |
| 55 | + |
| 56 | +## 🏥 Health Check Customizations |
| 57 | + |
| 58 | +Since version 1.8, ArgoCD ships without a built-in health check for `Application` resources, so a parent Application counts a child as "Healthy" as soon as the child Application resource is created, *even if the child app is still syncing or degraded*. This breaks the Sync Wave logic for App-of-Apps. |
| 59 | + |
| 60 | +To fix this, we inject a custom Lua health check in `infrastructure/controllers/argocd/values.yaml`. |
| 61 | + |
| 62 | +### The "Wait for Child" Script |
| 63 | + |
| 64 | +```yaml |
| 65 | +resource.customizations.health.argoproj.io_Application: | |
| 66 | +  hs = {} |
| 67 | +  hs.status = "Progressing"  -- default until the child reports its own health |
| 68 | +  hs.message = "" |
| 69 | +  if obj.status ~= nil then |
| 70 | +    if obj.status.health ~= nil then |
| 71 | +      hs.status = obj.status.health.status  -- mirror the child's health |
| 72 | +      if obj.status.health.message ~= nil then |
| 73 | +        hs.message = obj.status.health.message |
| 74 | +      end |
| 75 | +    end |
| 76 | +  end |
| 77 | +  return hs |
| 78 | +``` |
| 79 | + |
| 80 | +**What this does:** |
| 81 | +1. It overrides the health assessment of `Application` resources. |
| 82 | +2. It makes the parent (Root App) assess each child Application by the child's own reported health (`status.health`) instead of by its mere existence. |
| 83 | +3. If `cilium` (Wave 0) is "Progressing", the Root App sees it as "Progressing". |
| 84 | +4. The Root App **pauses** processing Wave 1 until all Wave 0 apps report "Healthy". |
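
The key used by the script above is an entry of the `argocd-cm` ConfigMap, so where it sits inside `values.yaml` depends on the chart. A sketch, assuming the community `argo-cd` Helm chart, which renders everything under `configs.cm` into `argocd-cm`. If ArgoCD is pulled in as a subchart, the whole block would sit one level deeper, under the subchart's name.

```yaml
# Assumed layout for the community argo-cd Helm chart; adjust if your chart differs.
configs:
  cm:
    resource.customizations.health.argoproj.io_Application: |
      hs = {}
      hs.status = "Progressing"
      hs.message = ""
      if obj.status ~= nil then
        if obj.status.health ~= nil then
          hs.status = obj.status.health.status
          if obj.status.health.message ~= nil then
            hs.message = obj.status.health.message
          end
        end
      end
      return hs
```

After a change like this is rolled out, the rendered `argocd-cm` ConfigMap should contain the same key, which is a quick way to confirm that the customization reached the cluster.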
| 85 | + |
| 86 | +## 🔄 Self-Management Loop |
| 87 | + |
| 88 | +1. **Bootstrap**: You apply `root.yaml`. |
| 89 | +2. **Adoption**: ArgoCD sees `cilium` defined in Git (Wave 0). It adopts the running Cilium instance. |
| 90 | +3. **Expansion**: ArgoCD deploys `external-secrets` (Wave 0). |
| 91 | +4. **Wait**: ArgoCD waits for Cilium and External Secrets to report "Healthy". |
| 92 | +5. **Storage**: ArgoCD deploys `longhorn` (Wave 1). |
| 93 | +6. **Completion**: The process continues until all waves are healthy. |
| 94 | + |
| 95 | +This ensures a deterministic, reliable boot sequence every time. |