Commit 8156697

mitchross and claude authored
Claude/root readme bootstrap gv3b2t (#1516)
* docs: overhaul root README around the real bootstrap sequence

  Rewrite the root README to lead with the clean, linear provision → bootstrap flow that's actually run end-to-end (machine classes → template sync → Omni access → Gateway CRDs → Cilium → secrets → bootstrap-argocd.sh), with a copy-paste "whole sequence" quick reference plus annotated per-step gotchas.

  Correct the backup system everywhere: the retired pvc-plumber + VolSync stack is replaced by kopiur (sync-wave table, dedicated Backup System section, troubleshooting row, and the bootstrap script's wave echo all updated to match the live waves).

  Refresh version pins (Omni v1.9.0, ArgoCD chart 9.7.0) and drop the duplicated cluster-access steps.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
  Claude-Session: https://claude.ai/code/session_01P936TdJY9XieDQYKRfq9g8

* docs: refresh stale references across docs/ and omni/ trees

  Audit-and-fix pass aligning the documentation with the current cluster state (same staleness class as the root README overhaul):

  Backup system — replace the retired pvc-plumber + VolSync + Kyverno references with kopiur where they were presented as the live system:
  - argocd.md / entrypoints.md: wave table + entrypoint map now show the kopiur operator (Wave 2) and kopiur config (Wave 3); dropped the removed volsync/pvc-plumber entrypoints; added the two VPA entrypoints; fixed the manual root seed path to root.yaml.
  - cnpg/disaster-recovery.md + cnpg/explained.md: kopiur wording, removed the obsolete mutating-webhook/SYSTEM_NAMESPACES mechanics, fixed a dead docs/plans/ link.
  - rustfs/credential-runbook.md: VolSync per-PVC ExternalSecrets -> the kopiur-rustfs ClusterExternalSecret fanout.
  - index.md: Longhorn replica count (single-node) + AI line now leads with the vLLM default.

  Hardware — correct the GPU host from a misattributed Xeon DL360 to the actual AMD Threadripper 2950X / X399 in the ai-gpu docs.

  Omni docs — version pins (Omni 1.8 -> 1.9, Cilium 1.19.4 -> 1.19.5), kubeProxyReplacement=strict -> true, authoritative port numbers (8090/8100/8091/50180-udp), multi-disk support note, and repointed broken links (talos-configs/, examples/, proxmox-provider/README.md, test-multi-disk.yaml) to real paths.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
  Claude-Session: https://claude.ai/code/session_01P936TdJY9XieDQYKRfq9g8

* docs: add root llms.txt with bootstrap/DR operational truths

  Capture the non-inferable operational facts surfaced by the first kopiur-only full nuke (2026-06-28): the two independent restore systems (kopiur PVC restore-before-bind vs CNPG/Barman), self-healing vs. real-intervention pod states, the now-automated gitea-actions runner token, the cosmetic kopiur SnapshotPolicy OutOfSync, and the Omni/Proxmox infra-provider gotchas (stuck finalizers, API-token format, USB-dongle passthrough). Curated LLM-facing context, separate from the docs/ tree.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
  Claude-Session: https://claude.ai/code/session_01P936TdJY9XieDQYKRfq9g8

* docs: correct llms.txt against committed repo state

  Two claims in the restore-context block didn't match the repo:
  - gitea-actions runner token is still a MANUAL rebuild step, not automated — externalsecret.yaml is committed but commented out of gitea-actions/kustomization.yaml until the 1Password field exists. Rewrote the section + headline DR claim to reflect the staged-but-disabled automation and the real Secret name + token-gen command.
  - No AppSet ignoreDifferences masks kopiur SnapshotPolicy defaults; softened the cosmetic-OutOfSync note to drop the false mechanism.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
  Claude-Session: https://claude.ai/code/session_01P936TdJY9XieDQYKRfq9g8

---------

Co-authored-by: Claude <noreply@anthropic.com>
1 parent 602cfe3 commit 8156697

18 files changed

Lines changed: 396 additions & 439 deletions

README.md

Lines changed: 206 additions & 330 deletions
Large diffs are not rendered by default.

docs/domains/ai-gpu/3090-llm-optimization.md

Lines changed: 7 additions & 7 deletions
@@ -16,8 +16,8 @@
 GPU1 = ComfyUI. **Pool both on-demand** (layer-split) only for the occasional
 256K research / full-context coding burst. Keep both 3090s in the box; do
 **not** redistribute a card to the gaming PC.
-- **Why not single-card + CPU offload:** the node is a **Xeon E5 v4 (Broadwell)
-DL360 Gen9, DDR4-2400, no AVX-512, PCIe 3.0, under Proxmox**. MoE expert
+- **Why not single-card + CPU offload:** the node is an **AMD Threadripper 2950X
+(16c/32t, Zen+), 128GB ECC DDR4, no AVX-512, PCIe 3.0, under Proxmox**. MoE expert
 offload is memory-bandwidth-bound and would land ~8–12 TPS on this CPU. The
 second 3090 is what keeps the long-context path on-GPU and fast — on this box
 48GB is close to essential, not a nice-to-have.
@@ -57,7 +57,7 @@ three ways to buy more usable context, in priority order **on this CPU**:
 
 1. **More VRAM** — pool the second 3090 (cleanest here).
 2. **Smaller KV** — quantize KV (q8→q4 ≈ half; TurboQuant `turbo3` ≈ ⅕, see below).
-3. **CPU expert offload***last resort* on Broadwell; avoid.
+3. **CPU expert offload***last resort* on the Threadripper 2950X; avoid.
 
 This ordering is **inverted** from a modern DDR5 / AVX-512 box, where single-card
 + offload is fine.
@@ -79,7 +79,7 @@ both, on-demand ── pool layer-split for 256K research / full-context coding
 - **Rejected: redistribute** (single 3090 + AMD 6800 in cluster + 3090 →
 gaming PC). The 6800 is RDNA2/ROCm and much of the stack is CUDA-locked
 (faster-whisper = CTranslate2, many ComfyUI nodes), and a single cluster 3090
-would be stuck on the slow Broadwell offload path. Only revisit if the LLM
+would be stuck on the slow Threadripper-2950X offload path. Only revisit if the LLM
 workload moves off this node.
 
 ## Daily driver + single-vs-dual (decision)
@@ -338,9 +338,9 @@ offload, no spill**:
 used by `my-apps/ai/llmfit/` dual-GPU jobs.)
 3. On that 2-GPU deployment: drop `GGML_CUDA_ENABLE_UNIFIED_MEMORY`, keep KV
 symmetric q8/q8, optionally raise `-ub 1024`.
-4. **Proxmox/DL360 checks:** NUMA-pin the VM to one socket with both 3090s on
-that socket's PCIe lanes; confirm both cards are x16 Gen3; NVLink optional
-(helps spec-decode, not layer-split much).
+4. **Proxmox/X399 checks:** confirm both 3090s are passed through on PCIe Gen3
+x16 lanes; pin the VM's vCPUs to physical cores and enable hugepages; NVLink
+optional (helps spec-decode, not layer-split much).
 
 ## Model download checklist (NFS: `192.168.10.133:/mnt/ai-pool/llama-cpp`)
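
Reviewer note: the new "Proxmox/X399 checks" step asks to confirm both 3090s sit on PCIe Gen3 x16. One way that could be scripted is sketched below. It is not part of this commit. It assumes `nvidia-smi` is available where it runs (the passthrough VM or a GPU pod), and the `NVIDIA_SMI` override and the function name are hypothetical.

```shell
# Sketch: report any GPU whose current PCIe link is below Gen3 x16.
# NVIDIA_SMI is a hypothetical override so the binary path can be swapped.
check_pcie_links() {
  "${NVIDIA_SMI:-nvidia-smi}" \
    --query-gpu=index,pcie.link.gen.current,pcie.link.width.current \
    --format=csv,noheader |
  awk -F', *' '
    # Fields: GPU index, current link generation, current link width.
    $2 < 3 || $3 < 16 {
      printf "GPU %s: Gen%s x%s (below Gen3 x16)\n", $1, $2, $3
      bad = 1
    }
    END { exit bad }   # non-zero exit when at least one card is degraded
  '
}
```

Calling `check_pcie_links` prints one line per degraded card and returns non-zero if any were found. Idle cards often downshift their link generation to save power, so a check like this is best run while the GPUs are under load.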

docs/domains/ai-gpu/model-catalog.md

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ coding, tool calls, and vision.
 - **For:** the occasional very long research dig or full-repo context, when both
 3090s are pooled (layer-split, 48GB → 256K KV resident).
 - **On a single 3090 it's a trap:** 256K KV (~14GB) spills ~7GB to host RAM →
-CPU-driven attention on the Broadwell CPU = slow, *and* it's a distinct
+CPU-driven attention on the Threadripper 2950X CPU = slow, *and* it's a distinct
 instance that thrashes against the daily driver. Perplexica's embedding filter
 keeps prompts <64K anyway, so it does **not** default here. Select manually
 only when you've pooled both cards (scale ComfyUI→0). See optimization doc.

docs/domains/argocd/argocd.md

Lines changed: 5 additions & 5 deletions
@@ -19,7 +19,7 @@ Do not install Prometheus Operator CRDs early just to satisfy bootstrap apps. Se
 ArgoCD starts from the manually seeded root application:
 
 ```text
-infrastructure/controllers/argocd/bootstrap/root-application.yaml
+infrastructure/controllers/argocd/root.yaml
 ```
 
 The root application renders three layers:
@@ -35,14 +35,14 @@ See [ArgoCD entrypoints](entrypoints.md) for the concrete files.
 | Wave | Applications |
 |---|---|
 | `0` | ArgoCD projects/bootstrap, Cilium, 1Password Connect, External Secrets |
-| `1` | cert-manager, Longhorn, snapshot-controller, VolSync |
-| `2` | pvc-plumber core, VolSync backup-cluster wiring |
-| `3` | CNPG Barman plugin |
+| `1` | cert-manager, Longhorn, snapshot-controller |
+| `2` | kopiur operator (CRDs + controller + webhook; volume populator) |
+| `3` | CNPG Barman plugin, kopiur config (ClusterRepository `cluster-kopia` + credential fanout + VolumeSnapshotClass) |
 | `4` | KEDA core, Temporal worker, infrastructure and database AppSets |
 | `5` | OpenTelemetry operator core, monitoring AppSet including `kube-prometheus-stack` |
 | `6` | KEDA observability, OpenTelemetry operator observability, workload AppSet |
 
-cert-manager is intentionally Wave `1`: the CNPG Barman plugin depends on it. pvc-plumber Wave `2` is core-only. KEDA and OpenTelemetry ServiceMonitor resources render from Wave `6` observability overlays.
+cert-manager is intentionally Wave `1`: the CNPG Barman plugin depends on it. The kopiur operator is Wave `2` (CRDs + controller + webhook), with its repo/credential config at Wave `3`. KEDA and OpenTelemetry ServiceMonitor resources render from Wave `6` observability overlays.
 
 CNPG `enablePodMonitor: true` remains an accepted runtime soft-coupling. It can log transient errors before monitoring exists, but it is not an ArgoCD dry-run blocker.
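
Reviewer note: the updated wave table could be cross-checked against a live cluster with something like the sketch below. It is not part of this commit. It assumes the waves are set through the standard `argocd.argoproj.io/sync-wave` annotation on each Application and that ArgoCD runs in the `argocd` namespace. The `ARGOCD_NS` override and the function name are hypothetical.

```shell
# Sketch: print "<wave> <application>" for every ArgoCD Application,
# lowest wave first, for comparison with the documented table.
apps_by_wave() {
  kubectl get applications.argoproj.io -n "${ARGOCD_NS:-argocd}" \
    -o custom-columns='WAVE:.metadata.annotations.argocd\.argoproj\.io/sync-wave,NAME:.metadata.name' \
    --no-headers |
  sort -n -k1,1   # numeric sort on the wave column
}
```

Applications without the annotation show `<none>` in the wave column and sort to the top, which makes missing annotations easy to spot.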

docs/domains/argocd/entrypoints.md

Lines changed: 5 additions & 5 deletions
@@ -25,14 +25,15 @@ This is the review map for everything directly rendered by the root Application
 | `core-dependencies/cert-manager-app.yaml` | Application | 1 | Certificate controller required before CNPG Barman plugin | No, required before cert-dependent apps |
 | `core-dependencies/longhorn-app.yaml` | Application | 1 | Storage foundation before PVC consumers | No, required before restore/app PVC flows |
 | `core-dependencies/snapshot-controller-app.yaml` | Application | 1 | VolumeSnapshot CRDs and controller | No, required by backup/restore flows |
-| `core-dependencies/volsync-app.yaml` | Application | 1 | Backup/restore engine | No, required before PVC Plumber and restore policies |
-| `core-dependencies/pvc-plumber-app.yaml` | Application | 2 | pvc-plumber v4.0.1 bootstrap-core: permissive RS/RD controller with no monitoring dependency | No, required before managed app PVCs |
-| `core-dependencies/volsync-backup-cluster-app.yaml` | Application | 2 | Shared Kopia credentials and VolSync backup-cluster wiring | No, required before managed app PVCs |
+| `core-dependencies/kopiur-operator-app.yaml` | Application | 2 | kopiur operator (Kopia-native backup): CRDs + controller + webhook + volume populator; no monitoring dependency | No, required before managed app PVCs |
+| `core-dependencies/kopiur-config-app.yaml` | Application | 3 | kopiur repo config: `ClusterRepository cluster-kopia` + credential fanout + `VolumeSnapshotClass longhorn-snapclass` | No, required before managed app PVCs |
 | `custom-entrypoints/cnpg-barman-plugin-app.yaml` | Application | 3 | CNPG clusters reference the plugin in wave 4 | Not now, dependency must precede database AppSet |
 | `custom-entrypoints/keda-app.yaml` | Application | 4 | Standalone after prior AppSet generator/render-cache loop | Maybe, after proving AppSet render stability |
+| `custom-entrypoints/vertical-pod-autoscaler-app.yaml` | Application | 4 | VPA controller (recommender/updater/admission) | Maybe, after proving AppSet render stability |
 | `custom-entrypoints/temporal-worker-controller-app.yaml` | Application | 4 | Same AppSet render-cache history as KEDA | Maybe, after proving AppSet render stability |
 | `custom-entrypoints/opentelemetry-operator-app.yaml` | Application | 5 | Core operator after cert-manager; ServiceMonitor removed from core | Maybe, if cert-manager dependency is otherwise enforced |
 | `custom-entrypoints/keda-observability-app.yaml` | Application | 6 | Optional KEDA ServiceMonitor resources after monitoring CRDs exist | No, keeps observability out of core |
+| `custom-entrypoints/vpa-recommendations-app.yaml` | Application | 6 | Optional VPA recommendation CRs after monitoring CRDs exist | No, keeps observability out of core |
 | `custom-entrypoints/opentelemetry-operator-observability-app.yaml` | Application | 6 | Optional OpenTelemetry ServiceMonitor after monitoring CRDs exist | No, keeps observability out of core |
 | `appsets/infrastructure-appset.yaml` | ApplicationSet | 4 | Explicit list of core infrastructure directories | N/A |
 | `appsets/database-appset.yaml` | ApplicationSet | 4 | Discovers `infrastructure/database/*/*`; uses `selfHeal: false` for DR | N/A |
@@ -45,8 +46,7 @@ This is the review map for everything directly rendered by the root Application
 PrometheusRule, Probe, AlertmanagerConfig) in its core kustomization.** Those CRDs don't exist until
 kube-prometheus-stack (Wave 5); an earlier-wave app shipping them fails dry-run and deadlocks the
 App-of-Apps wave gate (proven by the 2026-06-01 nuke drill). Put observability CRs in a **separate
-optional app that syncs after Wave 5** (e.g. `keda-observability` at Wave 6; pvc-plumber's were
-removed from its Wave-2 core). We deliberately do **not** install Prometheus Operator CRDs early —
+optional app that syncs after Wave 5** (e.g. `keda-observability` at Wave 6, split out of KEDA's Wave-4 core). We deliberately do **not** install Prometheus Operator CRDs early —
 `SkipDryRunOnMissingResource` is only an escape hatch / observability-app option, never a core fix.
 `cert-manager` is at **Wave 1** (not 4) so cert-dependent apps (cnpg-barman-plugin, Wave 3) can start.
 Full detail: [cluster DR nuke restore runbook](../../disaster-recovery.md).

docs/domains/cnpg/disaster-recovery.md

Lines changed: 1 addition & 5 deletions
@@ -22,11 +22,7 @@ CNPG databases live in two layers:
 | **Postgres data** | inside the CNPG `Cluster` CR | Barman Cloud → RustFS S3 | `spec.bootstrap.recovery` + `externalClusters` |
 | **App state** | outside (ExternalSecret, ScheduledBackup) | committed to Git as declarative state | ArgoCD sync |
 
-**Barman ≠ PVC backups.** The PVC/Kopia system (pvc-plumber v4 operator + VolSync,
-writing to RustFS S3) handles *file-level* PVC backups. (Kyverno was removed from
-this path in 2026-05 and is no longer involved — see `docs/storage-architecture.md`.)
-CNPG has its own SQL-aware backup path: Barman Cloud → RustFS S3. The two never
-touch each other. See
+**Barman ≠ PVC backups.** The PVC/Kopia system (the **kopiur** operator, writing to RustFS S3) handles *file-level* PVC backups. CNPG has its own SQL-aware backup path: Barman Cloud → RustFS S3. The two never touch each other. See
 [docs/disaster-recovery.md](../../disaster-recovery.md) for why both exist.
 
 ### How recovery works (the 30-second version)

docs/domains/cnpg/explained.md

Lines changed: 8 additions & 6 deletions
@@ -19,7 +19,7 @@ Two layers of database state, two backup paths, **never confuse them**:
 | **Postgres data** (the actual database content — tables, rows, WAL) | Barman Cloud → S3 | Cluster CR + PVCs |
 | **App-side stuff** (ExternalSecret, ScheduledBackup, Cluster YAML) | Git | The repo |
 
-When you do a disaster recovery, you're using **Barman** to restore Postgres data into a fresh PVC. **pvc-plumber/VolSync has nothing to do with this.** The two backup systems run side-by-side and never touch each other. PVC-level kopia backups would corrupt a running Postgres mid-snapshot — that's why CNPG PVCs explicitly do NOT carry the `backup` label.
+When you do a disaster recovery, you're using **Barman** to restore Postgres data into a fresh PVC. **kopiur (the PVC-backup system) has nothing to do with this.** The two backup systems run side-by-side and never touch each other. PVC-level kopia backups would corrupt a running Postgres mid-snapshot — that's why CNPG PVCs explicitly do NOT carry the `backup` label.
 
 ---

@@ -255,19 +255,21 @@ operator orchestration was ~2 minutes.
 
 ---
 
-## Why this is separate from pvc-plumber
+## Why this is separate from the kopiur PVC backups
 
-A reasonable question: pvc-plumber backs up PVCs to kopia. CNPG database files live on PVCs. Why not just label the CNPG PVCs and let pvc-plumber back them up?
+A reasonable question: kopiur backs up PVCs to kopia. CNPG database files live on PVCs. Why not just back the CNPG PVCs up with kopiur too?
 
-Three reasons, in increasing severity:
+kopiur only backs up PVCs that carry an explicit per-PVC `SnapshotPolicy`/`Restore` stub (via the `kopiur-backup` Kustomize component) in a namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`. CNPG PVCs deliberately get none of that, so kopiur never touches them. There is no admission webhook injecting `dataSourceRef` anymore — coverage is opt-in by the per-PVC stub, not enforced at admission.
+
+Three reasons it stays that way, in increasing severity:
 
 1. **Snapshot consistency.** A kopia snapshot of a running Postgres data directory is *not* a consistent backup. The on-disk state at any moment includes half-written files, WAL not yet flushed, etc. Restoring from a CSI snapshot of running Postgres almost works, but recovery is unsafe and Postgres might not even start. Barman uses `pg_basebackup` which IS Postgres-aware and produces a consistent backup.
 
 2. **WAL.** Postgres recovery requires both a base backup AND the WAL records that follow it. Barman archives WAL continuously. PVC snapshots don't archive WAL — they just snapshot whatever WAL was on disk at snapshot time, which could be hours old.
 
 3. **PITR.** With Barman + WAL archiving, you can restore to any point in time within retention. With PVC snapshots, you can only restore to whenever the last snapshot was taken (default 1h or 1d).
 
-So: CNPG PVCs are explicitly **NOT** labeled `backup: hourly|daily`. pvc-plumber's mutating webhook would refuse to inject `dataSourceRef` on those anyway (operator's `SYSTEM_NAMESPACES` excludes `cloudnative-pg`), but as defense-in-depth, the manifest convention is to omit the label.
+So: CNPG PVCs explicitly carry **no** kopiur backup stub and the `cloudnative-pg` namespace is **not** labeled `kopiur.home-operations.com/repo: cluster-kopia`, so kopiur never enrolls them.
 
 ---
 
@@ -302,6 +304,6 @@ Yes — CNPG supports `bootstrap.recovery` with a different `metadata.name`. The
 ## Where to go deeper
 
 - [docs/domains/cnpg/disaster-recovery.md](disaster-recovery.md) — the technical runbook (this doc's reference)
-- [docs/plans/cnpg-plugin-migration.md](../../disaster-recovery.md) — why this cluster uses the Barman Cloud Plugin instead of the deprecated `spec.backup.barmanObjectStore`
+- [disaster-recovery.md](disaster-recovery.md) — why this cluster uses the Barman Cloud Plugin instead of the deprecated `spec.backup.barmanObjectStore`
 - [docs/disaster-recovery.md](../../disaster-recovery.md) — the OTHER backup system (PVC-level, kopia, NEVER use on CNPG PVCs)
 - [docs/pvc-plumber-explained.md](https://github.com/mitchross/pvc-plumber#readme) — pvc-plumber walkthrough for comparison
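
Reviewer note: the rewritten section now rests on the `cloudnative-pg` namespace not carrying the `kopiur.home-operations.com/repo` label. A guard for that claim might look like the sketch below. It is not part of this commit, and the function name is hypothetical. The namespace and label key are taken from the diff above.

```shell
# Sketch: succeed only if the namespace does NOT carry the kopiur repo label.
# Returns 1 if the label is present, 2 if the namespace lookup itself fails.
ns_not_enrolled() {
  ns="${1:-cloudnative-pg}"
  repo="$(kubectl get namespace "$ns" \
    -o jsonpath='{.metadata.labels.kopiur\.home-operations\.com/repo}')" || return 2
  if [ -n "$repo" ]; then
    echo "namespace $ns is enrolled in kopiur repo: $repo"
    return 1
  fi
}
```

`ns_not_enrolled cloudnative-pg` prints nothing and returns 0 when the namespace is unlabeled, which is the state the doc describes.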

docs/domains/rustfs/credential-runbook.md

Lines changed: 6 additions & 15 deletions
@@ -110,12 +110,9 @@ These GitOps-managed ExternalSecrets read `rustfs-workload-access-key` and `rust
 | `monitoring/tempo-s3-credentials` | `tempo-s3-credentials` |
 | `posthog/posthog-secrets` | `posthog-secrets` |
 | `rustfs-lifecycle/rustfs-admin-credentials` | `rustfs-admin-credentials` |
-| Each chart-rendered `<ns>/volsync-<pvc>` per backed-up PVC | `volsync-<pvc>` |
+| `kopiur/kopiur-rustfs` (ClusterExternalSecret → every namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`) | `kopiur-rustfs` |
 
-The `volsync-system/pvc-plumber-kopia` ExternalSecret was removed
-2026-05-21 along with the pvc-plumber operator decommission. Per-PVC
-ExternalSecrets are now rendered by the `volsync-backup` Helm chart
-at `infrastructure/storage/volsync-backup/` rather than the operator.
+Per-PVC backup credentials are now delivered by the single `kopiur-rustfs` ClusterExternalSecret (`infrastructure/controllers/kopiur/externalsecret.yaml`), which fans the repo credentials into every namespace labeled `kopiur.home-operations.com/repo: cluster-kopia`. The retired `volsync-backup` per-PVC ExternalSecrets are gone.
 
 Force ESO refresh after changing 1Password:

@@ -128,13 +125,9 @@ kubectl annotate externalsecret -n monitoring tempo-s3-credentials force-sync="$
 kubectl annotate externalsecret -n posthog posthog-secrets force-sync="$TS" --overwrite
 kubectl annotate externalsecret -n rustfs-lifecycle rustfs-admin-credentials force-sync="$TS" --overwrite
 
-# Also force every chart-rendered per-PVC ES:
-kubectl get externalsecret -A -l app.kubernetes.io/managed-by=volsync-backup-chart \
--o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' | \
-while read ns name; do
-[ -z "$ns" ] && continue
-kubectl annotate externalsecret -n "$ns" "$name" force-sync="$TS" --overwrite
-done
+# Also refresh the kopiur repo-credential fanout (one ClusterExternalSecret
+# feeds the per-namespace kopiur-rustfs Secret into every backed-up namespace):
+kubectl annotate clusterexternalsecret kopiur-rustfs force-sync="$TS" --overwrite
 ```
 
 Restart consumers that load S3 credentials from environment variables:
@@ -153,8 +146,6 @@ kubectl rollout restart deploy/db deploy/feature-flags deploy/plugins deploy/web
 -n posthog
 ```
 
-VolSync mover Jobs read the per-PVC Secret at Job creation time, so the
-NEXT scheduled (or manually triggered) backup run picks up the new
-credentials automatically — no restart of VolSync itself needed.
+kopiur mover Jobs read the namespace `kopiur-rustfs` Secret at Job creation time, so the next scheduled (or manually triggered) Snapshot picks up rotated credentials automatically — no operator restart needed.
 RustFS lifecycle Job is spawned by its CronJob — next scheduled run
 uses the refreshed Secret.
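
Reviewer note: since one ClusterExternalSecret now fans `kopiur-rustfs` into every labeled namespace, a post-rotation sanity check could list labeled namespaces where the Secret has not materialized. The sketch below is not part of this commit, and the function name is hypothetical. The label and Secret name are taken from the runbook text above.

```shell
# Sketch: print every namespace labeled for the cluster-kopia repo
# that does not (yet) contain the fanned-out kopiur-rustfs Secret.
missing_kopiur_secret() {
  kubectl get namespaces -l kopiur.home-operations.com/repo=cluster-kopia -o name |
  while read -r ns; do
    ns="${ns#namespace/}"   # "-o name" prints "namespace/<name>"
    # </dev/null keeps the inner kubectl from consuming the loop's stdin.
    kubectl get secret kopiur-rustfs -n "$ns" >/dev/null 2>&1 </dev/null || echo "$ns"
  done
}
```

Empty output from `missing_kopiur_secret` suggests the fanout reached every enrolled namespace. Any name printed is a namespace whose next Snapshot would likely fail for lack of credentials.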

docs/index.md

Lines changed: 2 additions & 2 deletions
@@ -20,12 +20,12 @@ cluster can be destroyed and rebuilt **unattended** — restores included.
 - **OS**: Talos Linux on Proxmox VMs, provisioned via Omni / Sidero
 - **CNI**: Cilium with Gateway API + LoadBalancer
 - **GitOps**: ArgoCD (self-managing) + ApplicationSets for auto-discovery
-- **Storage**: Longhorn (V1 engine, 2× replicas)
+- **Storage**: Longhorn (V1 engine, 1 replica — single-node)
 - **Backup**: [kopiur](https://github.com/home-operations/kopiur) (Kopia-native) → RustFS S3, per-PVC `SnapshotPolicy`/`Restore` with restore-before-bind
 - **Database**: CloudNativePG (Postgres) with Barman backups to S3
 - **Secrets**: 1Password Connect + External Secrets Operator
 - **Observability**: kube-prometheus-stack, Loki, Tempo, OpenTelemetry
-- **AI**: llama-cpp (Qwen3.6-35B multimodal) + ComfyUI on dedicated GPUs
+- **AI**: vLLM (Qwen3.6-27B, default app inference) + llama-cpp (Qwen3.6-35B multimodal, for ComfyUI) on mutually-exclusive whole-card GPUs
 
 ## Documentation
