Document post-nuke Gitea recovery

mitchross · mitchross · commit e7e17d559cd8 · 2026-06-22T16:10:26.000-04:00
diff --git a/docs/disaster-recovery.md b/docs/disaster-recovery.md
@@ -116,6 +116,80 @@ provision, or VMs are built from stale state and must be reprovisioned.
   an RD with no `latestImage` long after its mover completed, or `/audit`
   showing `needs-human-review`.
 
+## In-cluster registry and Gitea Actions
+
+`registry.vanillax.me` is an in-cluster registry backed by cluster storage.
+After a full nuke, the registry pod, Service, and HTTPRoute can all be healthy
+while the registry catalog is still empty. Any workload pinned to
+`registry.vanillax.me/...` will then fail with `ImagePullBackOff` until those
+images are rebuilt or repushed.
+
+Check the catalog from inside the registry pod:
+
+```bash
+kubectl exec -n kube-system deploy/registry -- \
+  wget -qO- http://127.0.0.1:5000/v2/_catalog
+```
+
+Restore Gitea first, then get the Gitea Actions runner online. The runner
+needs `Secret/gitea-actions/act-runner-token`; Git declares that as an
+ExternalSecret and 1Password stores the generated token:
+
+- vault: `homelab-prod`
+- item: `gitea-actions`
+- field: `act_runner_token`
+
+Generate or rotate the token from the restored Gitea pod:
+
+```bash
+kubectl exec -n gitea deploy/gitea -- gitea actions generate-runner-token
+```
+
+If the 1Password item has not been updated yet, create the Secret by hand to unblock the live runner:
+
+```bash
+TOKEN="$(kubectl exec -n gitea deploy/gitea -- \
+  gitea actions generate-runner-token | tail -n 1 | tr -d '\r\n')"
+kubectl create secret generic act-runner-token \
+  -n gitea-actions \
+  --from-literal=token="$TOKEN" \
+  --dry-run=client -o yaml | kubectl apply -f -
+kubectl rollout restart -n gitea-actions deploy/act-runner
+kubectl logs -n gitea-actions deploy/act-runner -c runner --tail=50
+```
+
+Expected runner log:
+
+```text
+runner: cluster-runner-1 ... declare successfully
+```
+
+For radar-ng, the recovery images are pinned in
+`my-apps/development/radar-ng/`. If the registry is empty and the runner is not
+usable yet, manually refill the exact pinned tags from local checkouts:
+
+```bash
+cd ~/programming/radar-ng/backend
+VERSION=v1.1.4 ./scripts/build-push.sh tile-server
+VERSION=v1.1.1 ./scripts/build-push.sh basemap open-meteo-worker
+VERSION=v1.1.7 ./scripts/build-push.sh temporal-worker
+
+cd ~/programming/talos-argocd-proxmox
+./scripts/build-push-custom-apps.sh basemap-bootstrap
+kubectl -n radar-ng delete job basemap-bootstrap
+kubectl -n radar-ng rollout restart deploy/tile-server deploy/basemap deploy/open-meteo
+kubectl -n radar-ng delete pod -l app=radar-ng-worker
+```
+
+On this single-worker cluster, `Insufficient cpu` during recovery usually means pod
+CPU requests exceed the node's allocatable CPU, not that the Proxmox host is busy. Verify with:
+
+```bash
+kubectl describe node talos-singlenode-gpu-prod-gpu-workers-f7x5ct \
+  | sed -n '/Allocated resources:/,/Events:/p'
+kubectl top nodes
+```
+
 ## Post-restore acceptance
 
 State BOTH claims, with live numbers:
@@ -158,4 +232,3 @@ Worked fixes for everything the hostile rebuild threw at us — stale CSI
 attachments, read-only filesystems, wedged clone PVCs, finalizer-stuck
 resources — live in the
 [common failure modes table](storage-architecture.md#common-failure-modes).
-
diff --git a/docs/domains/cnpg/disaster-recovery.md b/docs/domains/cnpg/disaster-recovery.md
@@ -429,6 +429,22 @@ Remove the target (restore to latest-WAL-available) OR pick an earlier
 timestamp. Symptom: `full-recovery` pods CrashLoopBackOff with this FATAL in
 the Postgres log.
 
+### `barman-cloud-check-wal-archive`: `Expected empty archive`
+
+This error means the new forward write lineage is already dirty: its archive
+prefix is not empty. Do not reuse that `serverName`, and do not delete RustFS
+objects until you have identified the exact abandoned prefix. The safe recovery is:
+
+1. Keep the recovery source pointed at the last known-good lineage.
+2. Bump `base/cluster.yaml` `spec.plugins[0].parameters.serverName` to the next
+   clean forward lineage.
+3. Hard-refresh Argo so it renders the new `serverName` before you delete the Cluster/PVCs.
+4. Delete the live Cluster, recovery Jobs, and CNPG PVCs.
+5. Let Argo recreate the Cluster from the current render.
+
+2026-06-22 Gitea example: v6 held real data, v7 was polluted, v8 brought the
+restore online, and v9 became the steady clean write target.
+
 ### "relation does not exist" after a successful recovery
 
 The restored DB is empty (or has a subset of data). Common causes:
@@ -493,13 +509,16 @@ kubectl -n cloudnative-pg patch externalsecret <name> \
 
 ### Polluted S3 lineage after a failed DR attempt
 
-If post-DR scheduled backups wrote EMPTY base backups to the wrong `serverName`
-(happened in our session), the cleanest fix is:
+If post-DR scheduled backups wrote empty base backups to the wrong
+`serverName`, the cleanest fix is:
+
+1. Leave the known-good recovery source alone.
+2. Bump `base/cluster.yaml` `spec.plugins[0].parameters.serverName` to the next
+   clean forward lineage.
+3. Let the next scheduled backup populate the clean prefix.
 
-1. Wipe the polluted `serverName` directory on RustFS (`postgres-backups/cnpg/<db>/<serverName>/`).
-2. Bump `base/cluster.yaml` `backup.barmanObjectStore.serverName` to the next
-   lineage (e.g. `-v4`).
-3. Let the next scheduled backup populate cleanly.
+Only wipe an abandoned RustFS prefix after confirming no live Cluster points at
+it as a write target or recovery source.
 
 ---
 
diff --git a/infrastructure/database/CLAUDE.md b/infrastructure/database/CLAUDE.md
@@ -56,24 +56,26 @@ The `serverName` values below live in each DB's `base/cluster.yaml` and
 
 | Database  | Current write target (base)  | Prior lineage (recovery source) |
 |-----------|------------------------------|---------------------------------|
-| gitea     | `gitea-database-v6`          | `gitea-database-v5`             |
+| gitea     | `gitea-database-v9`          | `gitea-database-v6`             |
 | immich    | `immich-database-v4`         | `immich-database-v3`            |
 | paperless | `paperless-database-v4`      | `paperless-database-v3`         |
 | temporal  | `temporal-database-v6`       | `temporal-database-v5`          |
 
-All four bumped TWICE on 2026-06-11: once for the Longhorn V2 rebuild
-nuke, and again for the same-day re-nuke (SPDK cpu-mask validation run)
-because the aborted first attempt dirtied the fresh prefixes (immich and
-paperless archived WALs before the SPDK wedge stalled the rebuild).
-Fresh initdb on clean prefixes keeps the WAL-archive empty check passing. The prior lineages exist on RustFS but are
-**unrestorable** until the RustFS multipart bug is fixed — all Barman base
-backups upload multipart and RustFS cannot serve multipart objects
-("encrypted object metadata is incomplete"). DB DR via Barman is therefore
-non-functional cluster-wide; treat DB data as disposable until RustFS is
-fixed or backups are rerouted. History: all DBs reset to `-v1` on
+All four bumped TWICE on 2026-06-11: once for the Longhorn V2 rebuild nuke,
+and again for the same-day re-nuke (SPDK cpu-mask validation run) because the
+aborted first attempt dirtied the fresh prefixes (immich and paperless archived
+WALs before the SPDK wedge stalled the rebuild). Fresh initdb on clean prefixes
+keeps the WAL-archive empty check passing. History: all DBs reset to `-v1` on
 2026-04-19 (S3 wipe); gitea `-v2` 2026-05-02 (GPU node loss, real Barman
 restore); gitea/temporal `-v3` opened around the 2026-06-02 first nuke.
 
+2026-06-22: Gitea proved the Barman path is usable again. v6 contained the real
+data; v7 was polluted by an aborted restore attempt and failed
+`barman-cloud-check-wal-archive` with `Expected empty archive`. v8 brought the
+successful restore online; v9 is the steady clean forward write target after
+that recovery. Until the next real Gitea DR event: **last restore read v6,
+current writes go to v9**.
+
 ## Normal operation (add a new CNPG DB)
 
 1. Copy an existing DB directory (e.g. `gitea/`) to `<newapp>/`.
@@ -113,6 +115,9 @@ See the full runbook in [`docs/domains/cnpg/disaster-recovery.md`](../../docs/do
 - **Specify `database` + `owner` + `secret` in recovery bootstrap.** CNPG
   defaults to `database: app, owner: app` if omitted.
 - **Don't add CNPG PVCs to Kyverno backup labels.** They use Barman, not Kopia.
+- **If Barman says `Expected empty archive`, do not reuse that forward
+  `serverName`.** Bump the write target to the next clean lineage and keep the
+  recovery source pointed at the last known-good lineage.
 
 ## Deprecation warnings
 
diff --git a/infrastructure/database/cloudnative-pg/gitea/base/cluster.yaml b/infrastructure/database/cloudnative-pg/gitea/base/cluster.yaml
@@ -70,7 +70,8 @@ spec:
         # archive was already non-empty.
         # v7 opened 2026-06-21 and was polluted before this rebuild completed,
         # so the 2026-06-22 restore failed barman-cloud-check-wal-archive with
-        # "Expected empty archive". v8 is the next clean forward write target
-        # while restoring FROM v6.
+        # "Expected empty archive". v8 was used for the successful restore
+        # bring-up from v6; v9 is the steady forward write target after that
+        # recovery proved good.
         # DO NOT change this value without bumping lineage in lockstep.
-        serverName: gitea-database-v8
+        serverName: gitea-database-v9
diff --git a/infrastructure/database/cloudnative-pg/gitea/kustomization.yaml b/infrastructure/database/cloudnative-pg/gitea/kustomization.yaml
@@ -21,7 +21,7 @@ commonAnnotations:
   argocd.argoproj.io/sync-wave: "-5"
 resources:
   - overlays/recovery        # ← ACTIVE: restore gitea DB FROM v6.
-                             # Writes forward to v8 because v7 is polluted and
+                             # Writes forward to v9 (v8 ran the restore bring-up); v7 is polluted and
                              # fails barman-cloud-check-wal-archive with
                              # "Expected empty archive".
   # - overlays/initdb        # fresh empty DB (1 admin, 0 repos). Was active
diff --git a/my-apps/development/gitea-actions/README.md b/my-apps/development/gitea-actions/README.md
@@ -0,0 +1,77 @@
+# Gitea Actions runner recovery
+
+`act-runner` needs a Gitea runner registration token in the Kubernetes Secret
+`gitea-actions/act-runner-token`. The token is secret material, so Git only
+declares the `ExternalSecret`; 1Password stores the value.
+
+## 1Password item
+
+Vault: `homelab-prod`
+
+Item: `gitea-actions`
+
+Field: `act_runner_token`
+
+Generate or rotate the value from the restored Gitea pod:
+
+```bash
+kubectl exec -n gitea deploy/gitea -- gitea actions generate-runner-token
+```
+
+Paste the printed token into the 1Password field above. External Secrets then
+creates:
+
+```text
+Secret/gitea-actions/act-runner-token
+  token: <Gitea runner registration token>
+```
+
+After the 1Password item exists, uncomment `externalsecret.yaml` in this
+directory's `kustomization.yaml` and push the GitOps change. Until then, use
+the manual Secret patch below during rebuilds.
+
+## Post-nuke order
+
+1. Restore Gitea CNPG and the Gitea app.
+2. Verify 1Password Connect and External Secrets are healthy.
+3. Verify this ExternalSecret synced:
+
+```bash
+kubectl get externalsecret -n gitea-actions act-runner-token
+kubectl get secret -n gitea-actions act-runner-token
+kubectl rollout status -n gitea-actions deploy/act-runner
+```
+
+If the `act-runner-token` Secret is missing and `act-runner` is stuck in `CreateContainerConfigError`,
+the 1Password item or field is missing, or `externalsecret.yaml` is still commented out.
+
+## Registry refill after a nuke
+
+The in-cluster registry (`registry.vanillax.me`) uses a cluster PVC. After a
+full nuke it can come back empty even though the registry pod and HTTPRoute are
+healthy:
+
+```bash
+kubectl exec -n kube-system deploy/registry -- \
+  wget -qO- http://127.0.0.1:5000/v2/_catalog
+```
+
+An empty catalog means apps pinned to `registry.vanillax.me/...` will hit
+`ImagePullBackOff` until their images are rebuilt or repushed. For radar-ng,
+use Gitea Actions once the runner is healthy, or build locally from the
+`radar-ng` repo:
+
+```bash
+cd ~/programming/radar-ng/backend
+VERSION=v1.1.4 ./scripts/build-push.sh tile-server
+VERSION=v1.1.1 ./scripts/build-push.sh basemap open-meteo-worker
+VERSION=v1.1.7 ./scripts/build-push.sh temporal-worker
+```
+
+`basemap-bootstrap:latest` is maintained in this GitOps repo:
+
+```bash
+cd ~/programming/talos-argocd-proxmox
+./scripts/build-push-custom-apps.sh basemap-bootstrap
+kubectl -n radar-ng delete job basemap-bootstrap
+```
diff --git a/my-apps/development/gitea-actions/externalsecret.yaml b/my-apps/development/gitea-actions/externalsecret.yaml
@@ -0,0 +1,23 @@
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: act-runner-token
+  namespace: gitea-actions
+  annotations:
+    argocd.argoproj.io/sync-wave: "-1"
+spec:
+  refreshInterval: "1h"
+  secretStoreRef:
+    kind: ClusterSecretStore
+    name: 1password
+  target:
+    name: act-runner-token
+    creationPolicy: Owner
+  data:
+    - secretKey: token
+      remoteRef:
+        key: gitea-actions
+        property: act_runner_token
+        conversionStrategy: Default
+        decodingStrategy: None
+        metadataPolicy: None
diff --git a/my-apps/development/gitea-actions/kustomization.yaml b/my-apps/development/gitea-actions/kustomization.yaml
@@ -3,10 +3,14 @@ kind: Kustomization
 namespace: gitea-actions
 resources:
   - namespace.yaml
+  # Enable after 1Password has homelab-prod/gitea-actions field
+  # act_runner_token. Until then, the live cluster uses the manual
+  # Secret/gitea-actions/act-runner-token created during rebuild.
+  # - externalsecret.yaml
   - configmap.yaml
   - pvc.yaml
   - deployment.yaml
 
-# act-runner-token Secret is created out-of-band via kubectl — the
-# token comes from `gitea actions generate-runner-token` inside the
-# gitea pod and isn't checked into git. See README in this dir.
+# Preferred steady state: source act-runner-token from 1Password via
+# externalsecret.yaml. The token itself is generated by Gitea and stored in
+# homelab-prod/gitea-actions.