Commit e7e17d5

Document post-nuke Gitea recovery
1 parent 4b59086 commit e7e17d5

8 files changed

Lines changed: 227 additions & 25 deletions

docs/disaster-recovery.md

Lines changed: 74 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -116,6 +116,80 @@ provision, or VMs are built from stale state and must be reprovisioned.
116116
an RD with no `latestImage` long after its mover completed, or `/audit`
117117
showing `needs-human-review`.
118118

119+
## In-cluster registry and Gitea Actions
120+
121+
`registry.vanillax.me` is an in-cluster registry backed by cluster storage.
122+
After a full nuke, the registry pod, Service, and HTTPRoute can all be healthy
123+
while the registry catalog is still empty. Any workload pinned to
124+
`registry.vanillax.me/...` will then fail with `ImagePullBackOff` until those
125+
images are rebuilt or repushed.
126+
127+
Check the catalog from inside the registry pod:
128+
129+
```bash
130+
kubectl exec -n kube-system deploy/registry -- \
131+
wget -qO- http://127.0.0.1:5000/v2/_catalog
132+
```
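
The catalog endpoint answers with JSON of the form `{"repositories":[...]}`. A minimal sketch, assuming that response shape, of turning the check into a yes/no answer. `catalog_is_empty` is a hypothetical helper name and is not part of this repo:

```bash
# Hypothetical helper: succeeds (exit 0) when the /v2/_catalog JSON passed
# as $1 lists no repositories. Assumes the {"repositories":[...]} shape.
catalog_is_empty() {
  # Drop whitespace so '{"repositories": []}' and '{"repositories":[]}' match alike.
  compact="$(printf '%s' "$1" | tr -d ' \t\r\n')"
  case "$compact" in
    *'"repositories":[]'* | *'"repositories":null'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Capture the output of the `wget` command above in a variable such as `CATALOG`, then run `catalog_is_empty "$CATALOG" && echo "registry is empty"`.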
133+
134+
Restore Gitea first, then get the Gitea Actions runner online. The runner
135+
needs `Secret/gitea-actions/act-runner-token`; Git declares that as an
136+
ExternalSecret and 1Password stores the generated token:
137+
138+
- vault: `homelab-prod`
139+
- item: `gitea-actions`
140+
- field: `act_runner_token`
141+
142+
Generate or rotate the token from the restored Gitea pod:
143+
144+
```bash
145+
kubectl exec -n gitea deploy/gitea -- gitea actions generate-runner-token
146+
```
147+
148+
If 1Password is not updated yet, this manual patch gets the live runner moving:
149+
150+
```bash
151+
TOKEN="$(kubectl exec -n gitea deploy/gitea -- \
152+
gitea actions generate-runner-token | tail -n 1 | tr -d '\r\n')"
153+
kubectl create secret generic act-runner-token \
154+
-n gitea-actions \
155+
--from-literal=token="$TOKEN" \
156+
--dry-run=client -o yaml | kubectl apply -f -
157+
kubectl rollout restart -n gitea-actions deploy/act-runner
158+
kubectl logs -n gitea-actions deploy/act-runner -c runner --tail=50
159+
```
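
The `tail -n 1 | tr -d '\r\n'` pipeline assumes the token is the last line of the command output. A small guard, sketched here with a hypothetical helper `last_line_token`, refuses to continue when that line is empty, so an empty Secret never gets applied:

```bash
# Hypothetical helper: read command output on stdin, print the last line
# with CR/LF stripped, and fail if that line is empty.
last_line_token() {
  token="$(tail -n 1 | tr -d '\r\n')"
  if [ -z "$token" ]; then
    echo "no runner token found in command output" >&2
    return 1
  fi
  printf '%s\n' "$token"
}
```

It would slot into the patch as `TOKEN="$(kubectl exec -n gitea deploy/gitea -- gitea actions generate-runner-token | last_line_token)" || exit 1`.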
160+
161+
Expected runner log:
162+
163+
```text
164+
runner: cluster-runner-1 ... declare successfully
165+
```
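
A minimal sketch for scripting that check, assuming the success line always contains the phrase `declare successfully` as in the sample above. `runner_declared` is a hypothetical helper name:

```bash
# Hypothetical helper: read runner logs on stdin and succeed only if some
# line contains "runner: " followed later by "declare successfully".
runner_declared() {
  grep -q 'runner: .*declare successfully'
}
```

Usage: `kubectl logs -n gitea-actions deploy/act-runner -c runner --tail=50 | runner_declared && echo "runner registered"`.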
166+
167+
For radar-ng, the recovery images are pinned in
168+
`my-apps/development/radar-ng/`. If the registry is empty and the runner is not
169+
usable yet, manually refill the exact pinned tags from local checkouts:
170+
171+
```bash
172+
cd ~/programming/radar-ng/backend
173+
VERSION=v1.1.4 ./scripts/build-push.sh tile-server
174+
VERSION=v1.1.1 ./scripts/build-push.sh basemap open-meteo-worker
175+
VERSION=v1.1.7 ./scripts/build-push.sh temporal-worker
176+
177+
cd ~/programming/talos-argocd-proxmox
178+
./scripts/build-push-custom-apps.sh basemap-bootstrap
179+
kubectl -n radar-ng delete job basemap-bootstrap
180+
kubectl -n radar-ng rollout restart deploy/tile-server deploy/basemap deploy/open-meteo
181+
kubectl -n radar-ng delete pod -l app=radar-ng-worker
182+
```
183+
184+
On this single-worker cluster, `Insufficient cpu` during recovery usually means
185+
pod CPU requests have used up the node's allocatable CPU, not that the Proxmox host is busy. Verify with:
186+
187+
```bash
188+
kubectl describe node talos-singlenode-gpu-prod-gpu-workers-f7x5ct \
189+
| sed -n '/Allocated resources:/,/Events:/p'
190+
kubectl top nodes
191+
```
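
To reduce the `Allocated resources` table to a single number, a sketch like the following pulls the CPU request percentage out of the `kubectl describe node` output. It assumes the usual row layout (`cpu  <requests> (<pct>%)  <limits> (<pct>%)`), which may differ between kubectl versions. `cpu_request_pct` is a hypothetical helper:

```bash
# Hypothetical helper: read `kubectl describe node` output on stdin and print
# the CPU request percentage from the "Allocated resources" table.
# Assumes a row shaped like: cpu  3850m (96%)  7 (175%)
cpu_request_pct() {
  awk '$1 == "cpu" && $3 ~ /^\([0-9]+%\)$/ { gsub(/[()%]/, "", $3); print $3; exit }'
}
```

A value near 100 supports the "requests are saturated" reading even when `kubectl top nodes` shows low actual usage.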
192+
119193
## Post-restore acceptance
120194

121195
State BOTH claims, with live numbers:
@@ -158,4 +232,3 @@ Worked fixes for everything the hostile rebuild threw at us — stale CSI
158232
attachments, read-only filesystems, wedged clone PVCs, finalizer-stuck
159233
resources — live in the
160234
[common failure modes table](storage-architecture.md#common-failure-modes).
161-

docs/domains/cnpg/disaster-recovery.md

Lines changed: 25 additions & 6 deletions
@@ -429,6 +429,22 @@ Remove the target (restore to latest-WAL-available) OR pick an earlier
429429
timestamp. Symptom: `full-recovery` pods CrashLoopBackOff with this FATAL in
430430
the Postgres log.
431431

432+
### `barman-cloud-check-wal-archive`: `Expected empty archive`
433+
434+
This error means the new forward write lineage is already dirty: its archive prefix is non-empty. Do not reuse that
435+
`serverName`, and do not delete random RustFS objects unless you have already
436+
identified the exact abandoned prefix. The safe recovery is:
437+
438+
1. Keep the recovery source pointed at the last known-good lineage.
439+
2. Bump `base/cluster.yaml` `spec.plugins[0].parameters.serverName` to the next
440+
clean forward lineage.
441+
3. Hard-refresh Argo before deleting the Cluster/PVCs.
442+
4. Delete the live Cluster, recovery Jobs, and CNPG PVCs.
443+
5. Let Argo recreate the Cluster from the current render.
444+
445+
2026-06-22 Gitea example: v6 held real data, v7 was polluted, v8 brought the
446+
restore online, and v9 became the steady clean write target.
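
"Bump to the next clean forward lineage" is plain arithmetic on the `-vN` suffix. A minimal sketch, assuming every `serverName` ends in `-v<integer>` with no leading zeros, as in the table of lineages. `next_lineage` is a hypothetical helper, and it only computes the name; whether that prefix is actually clean on RustFS still has to be checked:

```bash
# Hypothetical helper: print the serverName that follows $1,
# e.g. gitea-database-v8 -> gitea-database-v9.
next_lineage() {
  name="$1"
  prefix="${name%-v*}"    # everything before the last "-v"
  version="${name##*-v}"  # everything after the last "-v"
  case "$version" in
    '' | *[!0-9]*)
      echo "not a -vN serverName: $name" >&2
      return 1
      ;;
  esac
  printf '%s-v%d\n' "$prefix" "$((version + 1))"
}
```

If the computed lineage turns out to be polluted as well, apply the helper again rather than reusing a dirty prefix.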
447+
432448
### "relation does not exist" after a successful recovery
433449

434450
The restored DB is empty (or has a subset of data). Common causes:
@@ -493,13 +509,16 @@ kubectl -n cloudnative-pg patch externalsecret <name> \
493509

494510
### Polluted S3 lineage after a failed DR attempt
495511

496-
If post-DR scheduled backups wrote EMPTY base backups to the wrong `serverName`
497-
(happened in our session), the cleanest fix is:
512+
If post-DR scheduled backups wrote empty base backups to the wrong
513+
`serverName`, the cleanest fix is:
514+
515+
1. Leave the known-good recovery source alone.
516+
2. Bump `base/cluster.yaml` `spec.plugins[0].parameters.serverName` to the next
517+
clean forward lineage.
518+
3. Let the next scheduled backup populate the clean prefix.
498519

499-
1. Wipe the polluted `serverName` directory on RustFS (`postgres-backups/cnpg/<db>/<serverName>/`).
500-
2. Bump `base/cluster.yaml` `backup.barmanObjectStore.serverName` to the next
501-
lineage (e.g. `-v4`).
502-
3. Let the next scheduled backup populate cleanly.
520+
Only wipe an abandoned RustFS prefix after confirming no live Cluster points at
521+
it as a write target or recovery source.
503522

504523
---
505524

infrastructure/database/CLAUDE.md

Lines changed: 16 additions & 11 deletions
@@ -56,24 +56,26 @@ The `serverName` values below live in each DB's `base/cluster.yaml` and
5656

5757
| Database | Current write target (base) | Prior lineage (recovery source) |
5858
|-----------|------------------------------|---------------------------------|
59-
| gitea | `gitea-database-v6` | `gitea-database-v5` |
59+
| gitea | `gitea-database-v9` | `gitea-database-v6` |
6060
| immich | `immich-database-v4` | `immich-database-v3` |
6161
| paperless | `paperless-database-v4` | `paperless-database-v3` |
6262
| temporal | `temporal-database-v6` | `temporal-database-v5` |
6363

64-
All four bumped TWICE on 2026-06-11: once for the Longhorn V2 rebuild
65-
nuke, and again for the same-day re-nuke (SPDK cpu-mask validation run)
66-
because the aborted first attempt dirtied the fresh prefixes (immich and
67-
paperless archived WALs before the SPDK wedge stalled the rebuild).
68-
Fresh initdb on clean prefixes keeps the WAL-archive empty check passing. The prior lineages exist on RustFS but are
69-
**unrestorable** until the RustFS multipart bug is fixed — all Barman base
70-
backups upload multipart and RustFS cannot serve multipart objects
71-
("encrypted object metadata is incomplete"). DB DR via Barman is therefore
72-
non-functional cluster-wide; treat DB data as disposable until RustFS is
73-
fixed or backups are rerouted. History: all DBs reset to `-v1` on
64+
All four bumped TWICE on 2026-06-11: once for the Longhorn V2 rebuild nuke,
65+
and again for the same-day re-nuke (SPDK cpu-mask validation run) because the
66+
aborted first attempt dirtied the fresh prefixes (immich and paperless archived
67+
WALs before the SPDK wedge stalled the rebuild). Fresh initdb on clean prefixes
68+
keeps the WAL-archive empty check passing. History: all DBs reset to `-v1` on
7469
2026-04-19 (S3 wipe); gitea `-v2` 2026-05-02 (GPU node loss, real Barman
7570
restore); gitea/temporal `-v3` opened around the 2026-06-02 first nuke.
7671

72+
2026-06-22: Gitea proved the Barman path is usable again. v6 contained the real
73+
data; v7 was polluted by an aborted restore attempt and failed
74+
`barman-cloud-check-wal-archive` with `Expected empty archive`. v8 brought the
75+
successful restore online; v9 is the steady clean forward write target after
76+
that recovery. Until the next real Gitea DR event: **last restore read v6,
77+
current writes go to v9**.
78+
7779
## Normal operation (add a new CNPG DB)
7880

7981
1. Copy an existing DB directory (e.g. `gitea/`) to `<newapp>/`.
@@ -113,6 +115,9 @@ See the full runbook in [`docs/domains/cnpg/disaster-recovery.md`](../../docs/do
113115
- **Specify `database` + `owner` + `secret` in recovery bootstrap.** CNPG
114116
defaults to `database: app, owner: app` if omitted.
115117
- **Don't add CNPG PVCs to Kyverno backup labels.** They use Barman, not Kopia.
118+
- **If Barman says `Expected empty archive`, do not reuse that forward
119+
`serverName`.** Bump the write target to the next clean lineage and keep the
120+
recovery source pointed at the last known-good lineage.
116121

117122
## Deprecation warnings
118123

infrastructure/database/cloudnative-pg/gitea/base/cluster.yaml

Lines changed: 4 additions & 3 deletions
@@ -70,7 +70,8 @@ spec:
7070
# archive was already non-empty.
7171
# v7 opened 2026-06-21 and was polluted before this rebuild completed,
7272
# so the 2026-06-22 restore failed barman-cloud-check-wal-archive with
73-
# "Expected empty archive". v8 is the next clean forward write target
74-
# while restoring FROM v6.
73+
# "Expected empty archive". v8 was used for the successful restore
74+
# bring-up from v6; v9 is the steady forward write target after that
75+
# recovery proved good.
7576
# DO NOT change this value without bumping lineage in lockstep.
76-
serverName: gitea-database-v8
77+
serverName: gitea-database-v9

infrastructure/database/cloudnative-pg/gitea/kustomization.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ commonAnnotations:
2121
argocd.argoproj.io/sync-wave: "-5"
2222
resources:
2323
- overlays/recovery # ← ACTIVE: restore gitea DB FROM v6.
24-
# Writes forward to v8 because v7 is polluted and
24+
# Writes forward to v9 because v7 is polluted and
2525
# fails barman-cloud-check-wal-archive with
2626
# "Expected empty archive".
2727
# - overlays/initdb # fresh empty DB (1 admin, 0 repos). Was active
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
1+
# Gitea Actions runner recovery
2+
3+
`act-runner` needs a Gitea runner registration token in the Kubernetes Secret
4+
`gitea-actions/act-runner-token`. The token is secret material, so Git only
5+
declares the `ExternalSecret`; 1Password stores the value.
6+
7+
## 1Password item
8+
9+
Vault: `homelab-prod`
10+
11+
Item: `gitea-actions`
12+
13+
Field: `act_runner_token`
14+
15+
Generate or rotate the value from the restored Gitea pod:
16+
17+
```bash
18+
kubectl exec -n gitea deploy/gitea -- gitea actions generate-runner-token
19+
```
20+
21+
Paste the printed token into the 1Password field above. External Secrets then
22+
creates:
23+
24+
```text
25+
Secret/gitea-actions/act-runner-token
26+
token: <Gitea runner registration token>
27+
```
28+
29+
After the 1Password item exists, uncomment `externalsecret.yaml` in this
30+
directory's `kustomization.yaml` and push the GitOps change. Until then, use
31+
the manual Secret patch below during rebuilds.
32+
33+
## Post-nuke order
34+
35+
1. Restore Gitea CNPG and the Gitea app.
36+
2. Verify 1Password Connect and External Secrets are healthy.
37+
3. Verify this ExternalSecret synced:
38+
39+
```bash
40+
kubectl get externalsecret -n gitea-actions act-runner-token
41+
kubectl get secret -n gitea-actions act-runner-token
42+
kubectl rollout status -n gitea-actions deploy/act-runner
43+
```
44+
45+
If `act-runner-token` is missing and `act-runner` is stuck in
46+
`CreateContainerConfigError`, the 1Password item/field is missing or stale, or `externalsecret.yaml` is still commented out in `kustomization.yaml`.
47+
48+
## Registry refill after a nuke
49+
50+
The in-cluster registry (`registry.vanillax.me`) uses a cluster PVC. After a
51+
full nuke it can come back empty even though the registry pod and HTTPRoute are
52+
healthy:
53+
54+
```bash
55+
kubectl exec -n kube-system deploy/registry -- \
56+
wget -qO- http://127.0.0.1:5000/v2/_catalog
57+
```
58+
59+
An empty catalog means apps pinned to `registry.vanillax.me/...` will hit
60+
`ImagePullBackOff` until their images are rebuilt or repushed. For radar-ng,
61+
use Gitea Actions once the runner is healthy, or build locally from the
62+
`radar-ng` repo:
63+
64+
```bash
65+
cd ~/programming/radar-ng/backend
66+
VERSION=v1.1.4 ./scripts/build-push.sh tile-server
67+
VERSION=v1.1.1 ./scripts/build-push.sh basemap open-meteo-worker
68+
VERSION=v1.1.7 ./scripts/build-push.sh temporal-worker
69+
```
70+
71+
`basemap-bootstrap:latest` is maintained in this GitOps repo:
72+
73+
```bash
74+
cd ~/programming/talos-argocd-proxmox
75+
./scripts/build-push-custom-apps.sh basemap-bootstrap
76+
kubectl -n radar-ng delete job basemap-bootstrap
77+
```
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
1+
apiVersion: external-secrets.io/v1
2+
kind: ExternalSecret
3+
metadata:
4+
name: act-runner-token
5+
namespace: gitea-actions
6+
annotations:
7+
argocd.argoproj.io/sync-wave: "-1"
8+
spec:
9+
refreshInterval: "1h"
10+
secretStoreRef:
11+
kind: ClusterSecretStore
12+
name: 1password
13+
target:
14+
name: act-runner-token
15+
creationPolicy: Owner
16+
data:
17+
- secretKey: token
18+
remoteRef:
19+
key: gitea-actions
20+
property: act_runner_token
21+
conversionStrategy: Default
22+
decodingStrategy: None
23+
metadataPolicy: None

my-apps/development/gitea-actions/kustomization.yaml

Lines changed: 7 additions & 3 deletions
@@ -3,10 +3,14 @@ kind: Kustomization
33
namespace: gitea-actions
44
resources:
55
- namespace.yaml
6+
# Enable after 1Password has homelab-prod/gitea-actions field
7+
# act_runner_token. Until then, the live cluster uses the manual
8+
# Secret/gitea-actions/act-runner-token created during rebuild.
9+
# - externalsecret.yaml
610
- configmap.yaml
711
- pvc.yaml
812
- deployment.yaml
913

10-
# act-runner-token Secret is created out-of-band via kubectl — the
11-
# token comes from `gitea actions generate-runner-token` inside the
12-
# gitea pod and isn't checked into git. See README in this dir.
14+
# Preferred steady state: source act-runner-token from 1Password via
15+
# externalsecret.yaml. The token itself is generated by Gitea and stored in
16+
# homelab-prod/gitea-actions.
