@@ -114,6 +114,27 @@ kubectl delete cluster immich-database -n cloudnative-pg --wait=false; \
114114
115115The 15-second sleep ensures old PVCs are cleaned up by Longhorn.
116116
117+ If `kubectl create` returns `AlreadyExists`, ArgoCD recreated the Cluster before your create landed. Pause reconciliation and repeat the step 4 delete/create using this fallback:
118+
119+ ``` bash
120+ # Temporarily pause reconcile for both immich apps
121+ kubectl annotate application immich -n argocd argocd.argoproj.io/skip-reconcile=true --overwrite
122+ kubectl annotate application my-apps-immich -n argocd argocd.argoproj.io/skip-reconcile=true --overwrite
123+
124+ # Retry delete/create with explicit delete wait
125+ kubectl delete cluster immich-database -n cloudnative-pg --wait=false
126+ kubectl wait --for=delete cluster/immich-database -n cloudnative-pg --timeout=180s
127+ kubectl create -f /tmp/immich-recovery.yaml
128+ ```
129+
130+ **4b. Confirm the live cluster is actually in recovery mode:**
131+
132+ ``` bash
133+ kubectl get cluster immich-database -n cloudnative-pg -o yaml | sed -n '/bootstrap:/,/storage:/p'
134+ # Must show: bootstrap.recovery
135+ # Must NOT show: bootstrap.initdb
136+ ```
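
The visual check above can also be scripted. The following is a minimal sketch, assuming the manifest YAML arrives on stdin; `check_bootstrap_mode` is a hypothetical helper name, and the check is a plain text match on the same `bootstrap:` to `storage:` slice, not a real YAML parse.

```bash
#!/usr/bin/env bash
# Sketch: verify that a Cluster manifest on stdin bootstraps via recovery only.
# check_bootstrap_mode is a hypothetical helper, not part of kubectl or CNPG.
#
# Usage (either source works):
#   kubectl get cluster immich-database -n cloudnative-pg -o yaml | check_bootstrap_mode
#   check_bootstrap_mode < /tmp/immich-recovery.yaml
check_bootstrap_mode() {
  local section
  # Same slice as the sed command above: from bootstrap: through storage:.
  section="$(sed -n '/bootstrap:/,/storage:/p')"

  # Any indented initdb: key means the manifest is not recovery-only.
  if printf '%s\n' "$section" | grep -Eq '^[[:space:]]+initdb:'; then
    echo "FAIL: bootstrap.initdb present" >&2
    return 1
  fi

  # The recovery: key must be present.
  if ! printf '%s\n' "$section" | grep -Eq '^[[:space:]]+recovery:'; then
    echo "FAIL: bootstrap.recovery missing" >&2
    return 1
  fi

  echo "OK: recovery-only bootstrap"
}
```

A non-zero exit makes the helper easy to chain before `kubectl create`, so a manifest that still contains `initdb` stops the sequence early.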
137+
117138**5. Monitor recovery:**
118139
119140``` bash
@@ -180,12 +201,36 @@ aws --endpoint-url http://192.168.10.133:30293 s3 ls s3://postgres-backups/cnpg/
180201
181202**Fix**: Run `kubectl delete --wait=false; sleep 15; kubectl create` in rapid succession. The sleep gives the old PVCs time to terminate.
182203
204+ If ArgoCD still wins the race and `kubectl create` returns `AlreadyExists`, temporarily annotate both Applications with `argocd.argoproj.io/skip-reconcile=true`, then retry the delete/wait/create sequence.
205+
206+ ### `Error from server (AlreadyExists)` during `kubectl create`
207+
208+ **Cause**: ArgoCD recreated `immich-database` before your manual create landed.
209+
210+ **Fix**:
211+ 1. Pause reconciliation for the `immich` and `my-apps-immich` Applications.
212+ 2. Run `kubectl delete ... --wait=false`, then `kubectl wait --for=delete ...`.
213+ 3. Run `kubectl create -f /tmp/immich-recovery.yaml`.
214+ 4. Verify that the live spec shows `bootstrap.recovery` and not `bootstrap.initdb`.
215+
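The pause step can be wrapped in a pair of small helpers. This is a sketch under stated assumptions: `pause_reconcile` and `resume_reconcile` are hypothetical names, and the resume step (removing the annotation once recovery is verified) is not shown in this runbook, so treat it as an assumption about the intended workflow. The trailing `-` on the annotation key is standard `kubectl annotate` syntax for removing an annotation.

```bash
#!/usr/bin/env bash
# Sketch: pause and later resume ArgoCD reconciliation for both Immich Applications.
# KUBECTL is overridable for a dry run, e.g. KUBECTL="echo kubectl".
KUBECTL="${KUBECTL:-kubectl}"
APPS="immich my-apps-immich"
ANNOTATION="argocd.argoproj.io/skip-reconcile"

pause_reconcile() {
  local app
  for app in $APPS; do
    $KUBECTL annotate application "$app" -n argocd "${ANNOTATION}=true" --overwrite
  done
}

# Assumption: resuming is done by removing the annotation again.
resume_reconcile() {
  local app
  for app in $APPS; do
    # A trailing "-" after the key removes the annotation.
    $KUBECTL annotate application "$app" -n argocd "${ANNOTATION}-"
  done
}
```

Calling `pause_reconcile` before the delete/wait/create sequence and `resume_reconcile` only after the live spec has been verified keeps the window in which ArgoCD is paused explicit.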
183216### Recovery pod stuck in Pending
184217
185218**Cause**: Old PVCs from the previous cluster are still terminating (Longhorn cleanup).
186219
187220**Fix**: Wait 15-30 seconds for the PVCs to fully delete, then recreate the cluster.
188221
222+ ### Recovery pod stuck at `Init:0/1` with `volume is not ready for workloads`
223+
224+ **Cause**: The Longhorn data or WAL volume is still attaching or remounting after the restore.
225+
226+ **Fix**:
227+ ``` bash
228+ kubectl get pods -n cloudnative-pg -l cnpg.io/cluster=immich-database -o wide
229+ kubectl -n longhorn-system get volumes.longhorn.io | grep immich-database-1
230+ kubectl -n longhorn-system describe volumes.longhorn.io <wal-volume-name>
231+ ```
232+ Wait until the Longhorn volume reports `state=attached` and `robustness=healthy`; CNPG then proceeds automatically.
233+
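Waiting for the volume can be automated with a polling loop. The sketch below assumes the Longhorn Volume resource exposes `.status.state` and `.status.robustness`; `wait_for_longhorn_volume` is a hypothetical helper, and the timeout and interval defaults are arbitrary.

```bash
#!/usr/bin/env bash
# Sketch: poll a Longhorn volume until it is attached and healthy.
# KUBECTL is overridable so the loop can be exercised without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Args: <longhorn-volume-name> [timeout-seconds] [poll-interval-seconds]
wait_for_longhorn_volume() {
  local volume="$1" timeout="${2:-300}" interval="${3:-5}" waited=0 status
  while :; do
    # Assumption: the Volume CR reports .status.state and .status.robustness.
    status="$($KUBECTL -n longhorn-system get volumes.longhorn.io "$volume" \
      -o jsonpath='{.status.state}/{.status.robustness}' 2>/dev/null || true)"
    if [ "$status" = "attached/healthy" ]; then
      echo "volume $volume is attached and healthy"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s; last status: ${status:-unknown}" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

The volume name is whatever the `describe` command above was run against. If the PVC follows the usual CNPG naming and the PV name matches the Longhorn volume name (both assumptions here), it can be looked up with `kubectl get pvc immich-database-1-wal -n cloudnative-pg -o jsonpath='{.spec.volumeName}'`.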
189234### "Only one bootstrap method can be specified"
190235
191236**Cause**: Both `initdb` and `recovery` are present in the manifest (ArgoCD server-side apply merged them).
@@ -230,3 +275,59 @@ kubectl run -it --rm barman-ls --image=amazon/aws-cli:latest \
230275│ │ │ │
231276└──────────────────────────────────┘ └──────────────────────────────────┘
232277```
278+
279+ ## LLM Recovery Prompt Templates
280+
281+ Use these prompts when you want an AI assistant to guide or execute CNPG disaster recovery safely.
282+
283+ ### Option A: System Prompt (for agent/custom mode)
284+
285+ ``` text
286+ You are assisting with CloudNativePG disaster recovery in this repository.
287+
288+ Hard rules:
289+ 1) Recovery must bypass ArgoCD apply/SSA path for Cluster creation.
290+ 2) Never use kubectl apply for recovery cluster creation; use kubectl create.
291+ 3) Verify rendered recovery manifest contains bootstrap.recovery and does not contain bootstrap.initdb.
292+ 4) If create fails with AlreadyExists, treat as ArgoCD race; pause reconcile on both immich applications, then retry delete/wait/create.
293+ 5) After recovery, revert manifest to initdb mode but keep bumped backup serverName lineage (do not roll back lineage).
294+ 6) Always validate restored data with SQL query before declaring success.
295+
296+ Required sequence:
297+ - Confirm backup source lineage (e.g., externalClusters serverName=v2) and backup target lineage (backup serverName=v3).
298+ - Render /tmp/immich-recovery.yaml from kustomize output and verify recovery-only bootstrap.
299+ - Delete cluster and create recovery cluster from /tmp/immich-recovery.yaml.
300+ - Monitor cluster/pods until ready.
301+ - If pod is stuck with volume not ready, check Longhorn volume state and wait for attached/healthy.
302+ - Validate SQL (e.g., SELECT count(*) FROM "user";).
303+ - Revert cluster.yaml to normal initdb mode; keep backup lineage bumped.
304+ - Summarize exactly what changed and next operator actions.
305+
306+ Output requirements:
307+ - Be explicit, command-by-command.
308+ - Explain failures and fallback commands.
309+ - Do not skip verification steps.
310+ ```
311+
312+ ### Option B: Copy/Paste User Prompt (for ChatGPT/Copilot/Claude)
313+
314+ ``` text
315+ Help me perform CloudNativePG disaster recovery for Immich in this repo.
316+
317+ Context:
318+ - This cluster uses ArgoCD with self-heal and server-side apply.
319+ - CNPG recovery must be created with kubectl create (not apply).
320+ - Current backup lineage is [FILL ME, e.g. immich-database-v2].
321+ - New backup lineage target is [FILL ME, e.g. immich-database-v3].
322+
323+ What I need from you:
324+ 1) Give exact commands to render /tmp/immich-recovery.yaml from kustomize.
325+ 2) Include checks to confirm manifest has recovery and no initdb.
326+ 3) Give safe delete/create commands for immich-database.
327+ 4) Include fallback if kubectl create returns AlreadyExists (Argo race).
328+ 5) Include readiness checks and Longhorn attach troubleshooting.
329+ 6) Include SQL validation commands to confirm data restored.
330+ 7) Include exact post-recovery steps to revert manifest to initdb mode while keeping bumped backup serverName.
331+
332+ Do not skip any verification commands. Explain what success/failure looks like at each step.
333+ ```