@@ -116,6 +116,80 @@ provision, or VMs are built from stale state and must be reprovisioned.
116116 an RD with no `latestImage` long after its mover completed, or `/audit`
117117 showing `needs-human-review`.
118118
119+ ## In-cluster registry and Gitea Actions
120+
121+ `registry.vanillax.me` is an in-cluster registry backed by cluster storage.
122+ After a full nuke, the registry pod, Service, and HTTPRoute can all be healthy
123+ while the registry catalog is still empty. Any workload pinned to
124+ `registry.vanillax.me/...` will then fail with `ImagePullBackOff` until those
125+ images are rebuilt or repushed.
126+
127+ Check the catalog from inside the registry pod:
128+
129+ ``` bash
130+ kubectl exec -n kube-system deploy/registry -- \
131+ wget -qO- http://127.0.0.1:5000/v2/_catalog
132+ ```
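
A minimal sketch for turning that check into a yes/no answer, assuming the registry replies with the standard Registry v2 catalog JSON (`{"repositories":[...]}`). The helper name `catalog_is_empty` is hypothetical and not part of this repo.

```bash
# Hypothetical helper (not part of the repo): succeeds when the catalog
# JSON passed as $1 lists no repositories.
catalog_is_empty() {
  # Drop whitespace so pretty-printed and compact responses compare equal.
  compact="$(printf '%s' "$1" | tr -d '[:space:]')"
  [ "$compact" = '{"repositories":[]}' ]
}

# Usage against the live registry (commented out so the sketch has no side effects):
# CATALOG="$(kubectl exec -n kube-system deploy/registry -- \
#   wget -qO- http://127.0.0.1:5000/v2/_catalog)"
# catalog_is_empty "$CATALOG" && echo "registry catalog is empty"
```

An empty catalog alongside healthy registry pods is the signal to rebuild or repush images before chasing `ImagePullBackOff` on individual workloads.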
133+
134+ Restore Gitea first, then get the Gitea Actions runner online. The runner
135+ needs `Secret/gitea-actions/act-runner-token`; Git declares that as an
136+ ExternalSecret and 1Password stores the generated token:
137+
138+ - vault: `homelab-prod`
139+ - item: `gitea-actions`
140+ - field: `act_runner_token`
141+
142+ Generate or rotate the token from the restored Gitea pod:
143+
144+ ``` bash
145+ kubectl exec -n gitea deploy/gitea -- gitea actions generate-runner-token
146+ ```
147+
148+ If 1Password is not updated yet, this manual patch gets the live runner moving:
149+
150+ ``` bash
151+ TOKEN="$(kubectl exec -n gitea deploy/gitea -- \
152+ gitea actions generate-runner-token | tail -n 1 | tr -d ' \r\n')"
153+ kubectl create secret generic act-runner-token \
154+ -n gitea-actions \
155+ --from-literal=token="$TOKEN" \
156+ --dry-run=client -o yaml | kubectl apply -f -
157+ kubectl rollout restart -n gitea-actions deploy/act-runner
158+ kubectl logs -n gitea-actions deploy/act-runner -c runner --tail=50
159+ ```
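
If the `kubectl exec` fails or prints something unexpected, the patch above would happily store an empty or mangled token. A small guard, sketched under the assumption that a valid token is a single non-empty string with no whitespace; `token_looks_valid` is a hypothetical name, not an existing script:

```bash
# Hypothetical guard (not part of the repo): succeeds only when $1 is
# non-empty and contains no whitespace characters.
token_looks_valid() {
  # Remove every whitespace character; a clean token is unchanged by this.
  stripped="$(printf '%s' "$1" | tr -d '[:space:]')"
  [ -n "$1" ] && [ "$stripped" = "$1" ]
}

# Usage before creating the Secret (commented out; needs cluster access):
# token_looks_valid "$TOKEN" || { echo "refusing to store an empty or malformed token" >&2; exit 1; }
```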
160+
161+ Expected runner log:
162+
163+ ``` text
164+ runner: cluster-runner-1 ... declare successfully
165+ ```
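
To check for that line without reading the log by eye, a sketch that greps standard input for the `declare successfully` text shown above. The function name `runner_registered` is an assumption for illustration:

```bash
# Hypothetical check (not part of the repo): reads runner logs on stdin and
# succeeds when the registration line from the expected output is present.
runner_registered() {
  # -F matches the phrase literally; -q suppresses output, only the exit code matters.
  grep -qF 'declare successfully'
}

# Usage against the live runner (commented out; needs cluster access):
# kubectl logs -n gitea-actions deploy/act-runner -c runner --tail=50 \
#   | runner_registered && echo "runner registered"
```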
166+
167+ For radar-ng, the recovery images are pinned in
168+ `my-apps/development/radar-ng/`. If the registry is empty and the runner is not
169+ usable yet, manually refill the exact pinned tags from local checkouts:
170+
171+ ``` bash
172+ cd ~/programming/radar-ng/backend
173+ VERSION=v1.1.4 ./scripts/build-push.sh tile-server
174+ VERSION=v1.1.1 ./scripts/build-push.sh basemap open-meteo-worker
175+ VERSION=v1.1.7 ./scripts/build-push.sh temporal-worker
176+
177+ cd ~/programming/talos-argocd-proxmox
178+ ./scripts/build-push-custom-apps.sh basemap-bootstrap
179+ kubectl -n radar-ng delete job basemap-bootstrap
180+ kubectl -n radar-ng rollout restart deploy/tile-server deploy/basemap deploy/open-meteo
181+ kubectl -n radar-ng delete pod -l app=radar-ng-worker
182+ ```
183+
184+ On this single-worker cluster, `Insufficient cpu` during recovery usually means
185+ requested CPU is saturated, not that the Proxmox host is busy. Verify with:
186+
187+ ``` bash
188+ kubectl describe node talos-singlenode-gpu-prod-gpu-workers-f7x5ct \
189+ | sed -n '/Allocated resources:/,/Events:/p'
190+ kubectl top nodes
191+ ```
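
The scheduler places pods by summing CPU requests against the node's allocatable CPU, so the number to watch is the requests percentage rather than live usage. A rough sketch for pulling that percentage out of the `kubectl describe node` text, assuming the usual `cpu  <requests> (<pct>%)  <limits> (<pct>%)` row layout in the "Allocated resources" table; `cpu_request_pct` is a hypothetical helper and the layout may differ between kubectl versions:

```bash
# Hypothetical helper (not part of the repo): reads `kubectl describe node`
# output on stdin and prints the CPU requests percentage as a bare number.
cpu_request_pct() {
  # The allocated-resources row starts with the bare word "cpu"; the Capacity
  # and Allocatable rows use "cpu:" and are skipped. Field 3 looks like "(98%)".
  awk '$1 == "cpu" { gsub(/[()%]/, "", $3); print $3; exit }'
}

# Usage (commented out; needs cluster access):
# kubectl describe node talos-singlenode-gpu-prod-gpu-workers-f7x5ct | cpu_request_pct
```

A value near 100 with a low `kubectl top nodes` reading points at oversized requests rather than a busy host.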
192+
119193## Post-restore acceptance
120194
121195State BOTH claims, with live numbers:
@@ -158,4 +232,3 @@ Worked fixes for everything the hostile rebuild threw at us — stale CSI
158232attachments, read-only filesystems, wedged clone PVCs, finalizer-stuck
159233resources — live in the
160234[common failure modes table](storage-architecture.md#common-failure-modes).
161-