Commit 101c192

Update cnpg-disaster-recovery.md
1 parent 272f5bd commit 101c192

1 file changed

Lines changed: 101 additions & 0 deletions

docs/cnpg-disaster-recovery.md
@@ -114,6 +114,27 @@ kubectl delete cluster immich-database -n cloudnative-pg --wait=false; \

The 15-second sleep gives Longhorn time to clean up the old PVCs.

If `kubectl create` returns `AlreadyExists`, ArgoCD recreated the Cluster first. Use this fallback, then retry step 4:

```bash
# Temporarily pause reconcile for both immich apps
kubectl annotate application immich -n argocd argocd.argoproj.io/skip-reconcile=true --overwrite
kubectl annotate application my-apps-immich -n argocd argocd.argoproj.io/skip-reconcile=true --overwrite

# Retry delete/create with explicit delete wait
kubectl delete cluster immich-database -n cloudnative-pg --wait=false
kubectl wait --for=delete cluster/immich-database -n cloudnative-pg --timeout=180s
kubectl create -f /tmp/immich-recovery.yaml
```
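
The fallback leaves both Applications paused, and this runbook does not show how to hand control back to ArgoCD. The following is a minimal sketch, assuming the annotation only needs to be removed once you are ready for ArgoCD to reconcile again. The function name `resume_reconcile` is hypothetical; the trailing `-` is the standard `kubectl annotate` syntax for removing an annotation.

```bash
# Hypothetical helper: remove the skip-reconcile annotation from both
# Applications so ArgoCD resumes reconciling them.
resume_reconcile() {
  local app
  for app in immich my-apps-immich; do
    # A trailing "-" after the key tells kubectl to delete that annotation.
    kubectl annotate application "$app" -n argocd argocd.argoproj.io/skip-reconcile-
  done
}
```

Call `resume_reconcile` only when the manifest in Git is in the state you want ArgoCD to apply.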

**4b. Confirm live cluster is actually in recovery mode:**

```bash
kubectl get cluster immich-database -n cloudnative-pg -o yaml | sed -n '/bootstrap:/,/storage:/p'
# Must show: bootstrap.recovery
# Must NOT show: bootstrap.initdb
```
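
The visual check can also be turned into a pass/fail test. This is a rough sketch, not part of the original runbook: `check_bootstrap` is a hypothetical helper that reads a manifest on stdin and uses plain text matching on the `recovery:` and `initdb:` keys, so it could be fooled by unrelated keys with the same names.

```bash
# Hypothetical helper: read a Cluster manifest (YAML) on stdin and succeed
# only if it mentions a recovery bootstrap and no initdb bootstrap.
# This is plain text matching, not YAML parsing.
check_bootstrap() {
  local yaml
  yaml="$(cat)"
  if printf '%s\n' "$yaml" | grep -Eq '^[[:space:]]*initdb:'; then
    echo "FAIL: bootstrap.initdb is present" >&2
    return 1
  fi
  if ! printf '%s\n' "$yaml" | grep -Eq '^[[:space:]]*recovery:'; then
    echo "FAIL: bootstrap.recovery is missing" >&2
    return 1
  fi
  echo "OK: recovery-only bootstrap"
}
```

Possible usage: `kubectl get cluster immich-database -n cloudnative-pg -o yaml | check_bootstrap` for the live object, or `check_bootstrap < /tmp/immich-recovery.yaml` for the rendered manifest before creating it.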

**5. Monitor recovery:**

```bash
@@ -180,12 +201,36 @@ aws --endpoint-url http://192.168.10.133:30293 s3 ls s3://postgres-backups/cnpg/

**Fix**: Run `kubectl delete --wait=false; sleep 15; kubectl create` in rapid succession. The sleep gives PVCs time to terminate.

If ArgoCD still wins the race and `kubectl create` returns `AlreadyExists`, temporarily annotate both Applications with `argocd.argoproj.io/skip-reconcile=true`, then retry delete/wait/create.

### `Error from server (AlreadyExists)` during `kubectl create`

**Cause**: ArgoCD recreated `immich-database` before your manual create landed.

**Fix**:
1. Pause reconcile for `immich` and `my-apps-immich` Applications.
2. `kubectl delete ... --wait=false` + `kubectl wait --for=delete ...`.
3. `kubectl create -f /tmp/immich-recovery.yaml`.
4. Verify live spec shows `bootstrap.recovery`.

### Recovery pod stuck in Pending

**Cause**: Old PVCs from the previous cluster are still terminating (Longhorn cleanup).

**Fix**: Wait 15-30 seconds for the PVCs to fully delete, then recreate the cluster.
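
Instead of guessing at a fixed delay, the wait can be expressed as a poll. The sketch below is an illustration rather than the runbook's own method: `wait_for_pvc_cleanup` is a hypothetical name, and it assumes the old PVCs carry the `cnpg.io/cluster=immich-database` label (the same label this document uses to select pods).

```bash
# Hypothetical helper: poll until no PVCs labelled for the old cluster remain,
# or give up after a timeout in seconds (default 60).
# Assumes the PVCs carry the cnpg.io/cluster label.
wait_for_pvc_cleanup() {
  local timeout="${1:-60}" waited=0 remaining
  while :; do
    remaining="$(kubectl get pvc -n cloudnative-pg \
      -l cnpg.io/cluster=immich-database -o name)" || {
      echo "kubectl failed; cannot tell whether PVCs remain" >&2
      return 2
    }
    if [ -z "$remaining" ]; then
      echo "No PVCs left for immich-database"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out; still present: $remaining" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}
```

A longer wait also gives ArgoCD more time to recreate the Cluster, so this is probably best combined with the `skip-reconcile` pause described for the `AlreadyExists` case.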

### Recovery pod stuck at `Init:0/1` with `volume is not ready for workloads`

**Cause**: Longhorn data/WAL volume is still attaching/remounting after restore.

**Fix**:
```bash
kubectl get pods -n cloudnative-pg -l cnpg.io/cluster=immich-database -o wide
kubectl -n longhorn-system get volumes.longhorn.io | grep immich-database-1
kubectl -n longhorn-system describe volumes.longhorn.io <wal-volume-name>
```
Wait until the Longhorn volume reports `state=attached` and `robustness=healthy`; CNPG will then proceed automatically.
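
The wait itself could be scripted along these lines. This is a sketch under assumptions: `wait_for_longhorn_volume` is a hypothetical helper, and it assumes the Longhorn Volume resource exposes the two values as `.status.state` and `.status.robustness`.

```bash
# Hypothetical helper: poll a Longhorn volume until it is attached and healthy.
# $1 = Longhorn volume name, $2 = timeout in seconds (default 300).
# Assumes the Volume resource exposes .status.state and .status.robustness.
wait_for_longhorn_volume() {
  local volume="$1" timeout="${2:-300}" waited=0 status
  while :; do
    status="$(kubectl -n longhorn-system get volumes.longhorn.io "$volume" \
      -o jsonpath='{.status.state} {.status.robustness}')"
    if [ "$status" = "attached healthy" ]; then
      echo "$volume is attached and healthy"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out; $volume is: ${status:-unknown}" >&2
      return 1
    fi
    sleep 10
    waited=$((waited + 10))
  done
}
```

Pass the volume name found with the commands above, for example `wait_for_longhorn_volume <wal-volume-name> 300`.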

### "Only one bootstrap method can be specified"

**Cause**: Both `initdb` and `recovery` present in manifest (ArgoCD SSA merged them).
@@ -230,3 +275,59 @@ kubectl run -it --rm barman-ls --image=amazon/aws-cli:latest \
│ │ │ │
└──────────────────────────────────┘ └──────────────────────────────────┘
```

## LLM Recovery Prompt Templates

Use these prompts when you want an AI assistant to guide or execute CNPG disaster recovery safely.

### Option A: System Prompt (for agent/custom mode)

```text
You are assisting with CloudNativePG disaster recovery in this repository.

Hard rules:
1) Recovery must bypass ArgoCD apply/SSA path for Cluster creation.
2) Never use kubectl apply for recovery cluster creation; use kubectl create.
3) Verify rendered recovery manifest contains bootstrap.recovery and does not contain bootstrap.initdb.
4) If create fails with AlreadyExists, treat as ArgoCD race; pause reconcile on both immich applications, then retry delete/wait/create.
5) After recovery, revert manifest to initdb mode but keep bumped backup serverName lineage (do not roll back lineage).
6) Always validate restored data with SQL query before declaring success.

Required sequence:
- Confirm backup source lineage (e.g., externalClusters serverName=v2) and backup target lineage (backup serverName=v3).
- Render /tmp/immich-recovery.yaml from kustomize output and verify recovery-only bootstrap.
- Delete cluster and create recovery cluster from /tmp/immich-recovery.yaml.
- Monitor cluster/pods until ready.
- If pod is stuck with volume not ready, check Longhorn volume state and wait for attached/healthy.
- Validate SQL (e.g., SELECT count(*) FROM "user";).
- Revert cluster.yaml to normal initdb mode; keep backup lineage bumped.
- Summarize exactly what changed and next operator actions.

Output requirements:
- Be explicit, command-by-command.
- Explain failures and fallback commands.
- Do not skip verification steps.
```

### Option B: Copy/Paste User Prompt (for ChatGPT/Copilot/Claude)

```text
Help me perform CloudNativePG disaster recovery for Immich in this repo.

Context:
- This cluster uses ArgoCD with self-heal and server-side apply.
- CNPG recovery must be created with kubectl create (not apply).
- Current backup lineage is [FILL ME, e.g. immich-database-v2].
- New backup lineage target is [FILL ME, e.g. immich-database-v3].

What I need from you:
1) Give exact commands to render /tmp/immich-recovery.yaml from kustomize.
2) Include checks to confirm manifest has recovery and no initdb.
3) Give safe delete/create commands for immich-database.
4) Include fallback if kubectl create returns AlreadyExists (Argo race).
5) Include readiness checks and Longhorn attach troubleshooting.
6) Include SQL validation commands to confirm data restored.
7) Include exact post-recovery steps to revert manifest to initdb mode while keeping bumped backup serverName.

Do not skip any verification commands. Explain what success/failure looks like at each step.
```
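
Both templates ask for SQL validation of the restored data. For reference, one way that check might look is sketched below. Several details are assumptions not stated in this document: that the primary is pod `immich-database-1`, that the database is named `immich`, and that `psql` can connect as the `postgres` user from inside the `postgres` container. The function name `validate_restore` is hypothetical.

```bash
# Hypothetical helper: run the validation query inside the primary pod.
# Assumptions: the primary is pod immich-database-1, the database is named
# "immich", and local access as the postgres user works inside the container.
validate_restore() {
  kubectl exec -n cloudnative-pg immich-database-1 -c postgres -- \
    psql -U postgres -d immich -tA -c 'SELECT count(*) FROM "user";'
}
```

With `-tA`, `psql` prints only the bare count, which makes it easy to compare against the number of users you expect.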