Commit ac94aea

Update cnpg-disaster-recovery.md
1 parent db198de commit ac94aea

1 file changed

Lines changed: 284 additions & 0 deletions

docs/cnpg-disaster-recovery.md

@@ -17,6 +17,130 @@ Additionally, ArgoCD ApplicationSets enforce `selfHeal: true`, recreating delete

**Solution**: Apply recovery manifests directly with `kubectl create`, bypassing ArgoCD entirely.

## GitOps During Recovery: Source of Truth & skip-reconcile

**Key principle: Git is ALWAYS the source of truth.** During recovery, however, we temporarily pause ArgoCD's auto-sync to avoid conflicts.

### Normal GitOps Flow (Always)

```
┌──────────────┐
│     Git      │ ← Source of truth (cluster.yaml, values, etc.)
└──────┬───────┘
       │
       │ (ArgoCD watches continuously)
       │ "Any change in Git = auto-sync to cluster"
       ↓
┌──────────────┐
│    ArgoCD    │
│ (auto-sync)  │
└──────┬───────┘
       │
       │ (SSA, Helm rendering, kustomize apply)
       ↓
┌──────────────┐
│   Cluster    │
│  (synced to  │
│  Git state)  │
└──────────────┘
```

Git change → ArgoCD auto-discovers → Cluster updates. Simple, automated, always consistent.

### Recovery Flow: Temporary skip-reconcile

During CNPG recovery, we PAUSE auto-sync to prevent conflicts:

```
STEP 1: Pause auto-sync (Set skip-reconcile=true)
┌──────────────┐
│     Git      │
└──────┬───────┘
       │
       X (ArgoCD paused)
       │ "Don't auto-sync yet"
       ↓
┌──────────────────────────────┐
│  ArgoCD (PAUSED)             │
│  skip-reconcile=true         │
│  (manual sync only)          │
└──────────────────────────────┘
       │
       │ (Manual sync via UI still works)
       ↓
┌──────────────────────────────┐
│  Cluster (unchanged so far)  │
└──────────────────────────────┘

STEP 2: Manual kubectl recovery (bypass ArgoCD)
┌──────────────────────────────┐
│  You (kubectl create)        │
│  recovery-cluster.yaml       │
└──────┬───────────────────────┘
       │
       │ (Direct API call, no SSA conflict)
       ↓
┌──────────────────────────────┐
│  Cluster                     │
│  (recovery pod running)      │
└──────────────────────────────┘

STEP 3: Recovery completes, unpause (Remove skip-reconcile)
┌──────────────┐
│     Git      │
└──────┬───────┘
       │
       │ (ArgoCD unpaused)
       │ "Resume auto-sync"
       ↓
┌──────────────────────────────┐
│  ArgoCD (RESUMING)           │
│  skip-reconcile removed      │
│  (auto-sync enabled)         │
└──────────────────────────────┘
       │
       │ (normal GitOps resumes)
       ↓
┌──────────────────────────────┐
│  Cluster (final state)       │
│  (recovered data + Git sync) │
└──────────────────────────────┘
```
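
The three steps above can be sketched as small shell helpers. This is a minimal sketch, not the repository's actual tooling: the annotation key `argocd.argoproj.io/skip-reconcile`, the `argocd` namespace, and the function names are assumptions for illustration. The `KUBECTL` variable exists only so the helpers can be dry-run with `KUBECTL=echo`.

```shell
# Sketch only: namespace, annotation key, and helper names are assumptions.
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo to print commands instead of running them
ARGO_NS="${ARGO_NS:-argocd}"    # assumed namespace of the ArgoCD Application objects
SKIP_KEY="argocd.argoproj.io/skip-reconcile"

# STEP 1: pause reconciliation for one Application ($1 = Application name).
pause_app() {
  "$KUBECTL" annotate applications.argoproj.io "$1" "${SKIP_KEY}=true" \
    --overwrite --namespace "$ARGO_NS"
}

# STEP 2: create the recovery Cluster directly, bypassing ArgoCD ($1 = manifest path).
create_recovery_cluster() {
  "$KUBECTL" create -f "$1"
}

# STEP 3: remove the annotation (a trailing "-" deletes it) so auto-sync resumes.
resume_app() {
  "$KUBECTL" annotate applications.argoproj.io "$1" "${SKIP_KEY}-" \
    --namespace "$ARGO_NS"
}
```

A typical run would be `pause_app <app>`, then `create_recovery_cluster recovery-cluster.yaml`, then `resume_app <app>` once the cleanup checklist is done; the Application name depends on how your ApplicationSet names its children.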

### Why skip-reconcile Doesn't Break GitOps

**Git remains the source of truth the whole time:**
- You commit the recovered state back to Git (cluster.yaml reverted to initdb, backup lineage bumped to v3)
- skip-reconcile only blocks **automatic reconciliation** (ArgoCD watching)
- Manual sync (UI click) still reads Git and applies it to the cluster
- Once skip-reconcile is removed, auto-sync resumes from the Git state

**Think of it like this:**
- Normal: ArgoCD is always watching Git, automatically syncing any changes
- skip-reconcile pause: You tell ArgoCD "ignore Git for now, let me work"
- Manual recovery: You directly fix the cluster
- Unpause: ArgoCD starts watching Git again and makes sure the cluster matches Git

**After unpause, if someone changed Git while paused:**
- ArgoCD syncs the newest Git state
- The old recovery state is overwritten by Git
- Git wins (as it should)

### Cleanup Checklist

```
[ ] Recovery cluster is healthy (pod Ready 1/1, data validated)
[ ] cluster.yaml reverted to initdb mode (not recovery)
[ ] cluster.yaml backup lineage bumped to v3 (not v2)
[ ] Commit cluster.yaml to Git
[ ] Push to main branch
[ ] Manual sync via Argo UI (still has skip-reconcile on, that's ok)
[ ] Remove skip-reconcile annotations
[ ] Verify auto-sync working again
```

After unpause, Git and cluster sync normally, and you're back to true GitOps.

## Backup Architecture

```
@@ -60,6 +184,166 @@ During recovery, treat these as two different values:

After recovery succeeds, keep backups on the new lineage (`v3`). Do **not** switch the backup target back to `v2`.

## CNPG Normal Operation (Continuous Backups)

This is what happens every day to keep backups current:

```
┌─────────────────────────────────────────────────────────────┐
│ CNPG Cluster (Normal Operation)                             │
│                                                             │
│  ┌──────────────┐                                           │
│  │   Postgres   │ ← Running, accepting transactions         │
│  │   (immich)   │                                           │
│  └──────┬───────┘                                           │
│         │ split into two paths:                             │
│         ├─────────────────────┐                             │
│         ↓                     ↓                             │
│  ┌──────────────┐    ┌──────────────────┐                   │
│  │  WAL Stream  │    │  Scheduled Base  │                   │
│  │ (every txn)  │    │ Backups (daily)  │                   │
│  └──────┬───────┘    └────────┬─────────┘                   │
│         │                     │                             │
│         │ (continuous)        │ (full physical backup)      │
│         ↓                     ↓                             │
│  ┌──────────────────────────────────────────┐               │
│  │  Barman (CloudNativePG operator)         │               │
│  │  "Archive everything to S3"              │               │
│  └──────┬───────────────────────────────────┘               │
│         │                                                   │
│         │ (upload to S3)                                    │
│         ↓                                                   │
│  ┌──────────────────────────────────────────┐               │
│  │  RustFS S3 Storage                       │               │
│  │                                          │               │
│  │  s3://postgres-backups/cnpg/immich/      │               │
│  │  ├── immich-database-v2/                 │               │
│  │  │   ├── base/ (full backups)            │               │
│  │  │   └── wals/ (transaction logs)        │               │
│  │  └── (encrypted, compressed)             │               │
│  └──────────────────────────────────────────┘               │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

**Result**: If something breaks tomorrow, a base backup plus the archived WAL, covering transactions up to the last archived WAL segment, is already sitting on S3.
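
The daily base-backup path is typically driven by a CNPG `ScheduledBackup` resource. The helper below is a hedged sketch that only prints such a manifest: the cluster name `immich-database`, the `-daily` suffix, and the 02:00 schedule are assumptions, not values from this repository. CNPG schedules use six cron fields, with seconds first.

```shell
# Print a ScheduledBackup manifest for a daily base backup.
# $1: name of the CNPG Cluster to back up (hypothetical default below).
render_scheduled_backup() {
  local cluster="${1:-immich-database}"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: ${cluster}-daily
spec:
  schedule: "0 0 2 * * *"   # six fields, seconds first: every day at 02:00:00
  cluster:
    name: ${cluster}
EOF
}
```

In a GitOps setup the output would normally be committed to Git rather than applied by hand, for example `render_scheduled_backup immich-database > scheduled-backup.yaml`.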

## CNPG Disaster Recovery (Reading from Backups)

When you nuke the cluster and rebuild, CNPG needs to restore from S3:

```
SCENARIO: Cluster crashed, PVCs deleted, you're rebuilding

STEP 1: You tell CNPG "Use recovery mode" (in cluster.yaml)
┌─────────────────────────────────────┐
│ cluster.yaml bootstrap section:     │
│   recovery:                         │
│     source: immich-backup           │ ← points to S3
├─────────────────────────────────────┤
│ externalClusters:                   │
│     serverName: v2                  │ ← restore FROM this version
└─────────────────────────────────────┘
                     │
                     │ (kubectl create - bypass ArgoCD)
                     ↓
┌─────────────────────────────────────────────────────────┐
│ CNPG Operator sees "recovery" mode                      │
│ Looks for source in externalClusters                    │
└────────────────────┬────────────────────────────────────┘
                     │
                     ↓
           ┌───────────────────────┐
           │       RustFS S3       │
           │     (look for v2)     │
           └─────────┬─────────────┘
                     │
                ┌────┴────┐
                ↓         ↓
           ┌────────┐  ┌───────┐
           │ base/  │  │ wals/ │ ← Latest transaction logs
           └────┬───┘  └───┬───┘
                │          │
                └────┬─────┘
                     │ (download + restore)
                     ↓
            ┌─────────────────────┐
            │  New Postgres Pod   │
            │  (recovering...)    │
            │  + Longhorn PVCs    │
            │ (data being written)│
            └────────┬────────────┘
                     │ (after restore completes)
                     ↓
            ┌─────────────────────┐
            │  Postgres Ready     │
            │  All data restored! │
            │  (v2 lineage)       │
            └─────────────────────┘

STEP 2: You change cluster.yaml back to initdb (normal mode)
        BUT change backup.serverName to v3 (new lineage)

This prevents WAL conflicts:
- Old backups stay at v2 (untouched, point-in-time recovery available)
- New writes go to v3 (fresh archive)
- Next recovery will restore from v3, then bump to v4
```
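
The diagram abbreviates the manifest, so here is a hedged sketch of what a recovery-mode `Cluster` could look like when spelled out. The resource name `immich-database`, the `immich-database-<lineage>` server names, the storage size, and the instance count are assumptions inferred from the bucket layout shown earlier. A real manifest also needs `endpointURL` and `s3Credentials` for the S3 endpoint, which are deliberately left out here.

```shell
# Print a recovery-mode Cluster manifest.
# $1: lineage to restore FROM (default v2), $2: lineage to archive TO (default v3).
# All names and sizes are illustrative assumptions; endpointURL and
# s3Credentials are omitted and must be added for a real S3 endpoint.
render_recovery_cluster() {
  local from="${1:-v2}" to="${2:-v3}"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: immich-database
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: immich-backup
  externalClusters:
    - name: immich-backup
      barmanObjectStore:
        destinationPath: s3://postgres-backups/cnpg/immich
        serverName: immich-database-${from}
  backup:
    barmanObjectStore:
      destinationPath: s3://postgres-backups/cnpg/immich
      serverName: immich-database-${to}
EOF
}
```

Usage would be along the lines of `render_recovery_cluster v2 v3 > recovery-cluster.yaml` followed by `kubectl create -f recovery-cluster.yaml`. Keeping the two lineages as separate arguments mirrors the rule that the restore source and the new backup target are different values.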

## Bootstrap Decision Tree

CNPG's bootstrap section determines what happens when a Cluster is created:

```
       ┌──────────────────────────────────┐
       │       CNPG Cluster Created       │
       │    (kubectl create or apply)     │
       └──────────────┬───────────────────┘
                      │
                      │ Check spec.bootstrap:
                      │
           ┌──────────┴──────────┐
           │                     │
           ↓                     ↓
   ┌───────────────┐      ┌──────────────┐
   │    initdb     │      │   recovery   │
   │   (default)   │      │  (restore)   │
   └───────┬───────┘      └──────┬───────┘
           │                     │
           ↓                     │ Look for externalClusters:
┌──────────────────────┐         │
│  Create fresh db     │         ↓
│  (empty, new owner)  │    ┌──────────────────────────┐
│                      │    │ Find serverName=v2 in S3 │
│  Starting postgres,  │    │ Download base backup     │
│  then run            │    │ + replay WALs            │
│  postInitSQL:        │    │                          │
│  - CREATE EXT        │    │ → Postgres starts with   │
│  - GRANT PRIVS       │    │   restored data!         │
│                      │    └──────────────────────────┘
│  RESULT: Empty DB    │
│  User must sign up   │    RESULT: Full data restored
│  or restore from     │    Users see their data
│  PVCs                │    All tables/users back
└──────────────────────┘
           OR
┌──────────────────────┐
│  BUG: Both present   │
│  (initdb + recovery) │
│                      │
│  CNPG webhook adds   │
│  defaults → merge    │
│  conflict → initdb   │
│  wins                │
│                      │
│  RESULT: Empty DB    │
│  (lost data!)        │
└──────────────────────┘
```

**Key takeaway:** Only ONE bootstrap method should be present under `spec.bootstrap`. If both `initdb` and `recovery` exist, `initdb` wins and you get an empty database instead of the restored data. Always remove the `recovery` section before pushing to Git.
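
One way to enforce this before a commit is a small guard script. The sketch below is a naive, line-based check under stated assumptions: `check_bootstrap` is a hypothetical helper, and it assumes the keys `initdb:`, `recovery:`, and `pg_basebackup:` appear in the file only as bootstrap methods. A YAML-aware tool would be more robust.

```shell
# Return 0 only when exactly one bootstrap method appears in the manifest.
# Naive line-based check: assumes these keys occur only under spec.bootstrap.
# $1: path to the Cluster manifest (e.g. cluster.yaml).
check_bootstrap() {
  local file="$1" methods
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  methods="$(grep -cE '^[[:space:]]+(initdb|recovery|pg_basebackup):' "$file" || true)"
  if [ "$methods" -ne 1 ]; then
    echo "expected exactly 1 bootstrap method in $file, found $methods" >&2
    return 1
  fi
}
```

Calling `check_bootstrap cluster.yaml` from a pre-commit hook would block a push that still carries both `initdb` and `recovery`.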

## Recovery Procedure

### Prerequisites
