Commit 4077f18

Add CNPG disaster recovery & backup docs
Document CloudNativePG (CNPG) backup and disaster recovery. Adds a new docs/cnpg-disaster-recovery.md with architecture, step-by-step recovery procedure (including SSA/ArgoCD bypass), serverName versioning, troubleshooting, and verification steps. Updates CLAUDE.md to mention CNPG database PVCs and show an example Cluster backup config, and updates docs/backup-restore.md to call out CNPG's separate Barman→S3 backup path and differences (no auto-restore, manual recovery required). This clarifies that CNPG uses Barman to RustFS S3 and requires manual recovery steps (bump serverName, apply via kubectl create) rather than the PVC VolSync/Kopia flow.
1 parent 84485bd commit 4077f18

3 files changed

Lines changed: 326 additions & 1 deletion

CLAUDE.md

Lines changed: 85 additions & 1 deletion
@@ -320,6 +320,49 @@ spec:
320320
- Data synced from external sources
321321
- System namespaces (auto-excluded anyway)
322322
- PVCs that will be frequently deleted/recreated
323+
- **CNPG database PVCs** — these use Barman to S3, not Kyverno/VolSync (see below)
324+
325+
### Application with Database (CloudNativePG / CNPG)
326+
327+
Databases use **CloudNativePG** with Barman backups to RustFS S3 — a separate backup path from the PVC/VolSync system.
328+
```yaml
# infrastructure/database/cloudnative-pg/<app>/cluster.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: <app>-database
  namespace: cloudnative-pg
spec:
  instances: 1
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2
  bootstrap:
    initdb:
      database: <app>
      owner: <app>
  storage:
    size: 20Gi
    storageClass: longhorn
  backup:
    barmanObjectStore:
      serverName: <app>-database  # IMPORTANT: bump on DR recovery (see DR docs)
      destinationPath: s3://postgres-backups/cnpg/<app>
      endpointURL: http://192.168.10.133:30293
      s3Credentials:
        accessKeyId:
          name: cnpg-s3-credentials
          key: AWS_ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-s3-credentials
          key: AWS_SECRET_ACCESS_KEY
    retentionPolicy: "14d"
```
360+
361+
**Key differences from PVC backups**:
362+
- Backups use **Barman** (SQL-aware) to RustFS S3, not Kopia to NFS
363+
- **No automatic restore** — recovery requires manual intervention (see [Database DR docs](docs/cnpg-disaster-recovery.md))
364+
- **Cannot go through ArgoCD** for recovery — CNPG webhook + SSA = `initdb` always wins
365+
- `serverName` must be bumped after each recovery (e.g. `-v2`, `-v3`) to avoid WAL archive conflicts
323366

324367
## Configuration Patterns
325368

@@ -636,6 +679,45 @@ kubectl apply -f pvc.yaml
636679
- Delete ReplicationSource/ReplicationDestination manually (Kyverno will recreate them if label still present)
637680
- Use backup labels on non-Longhorn PVCs (snapshot support required)
638681

682+
### Database Disaster Recovery (CNPG)
683+
684+
CNPG databases use Barman backups to S3 but **do NOT auto-restore**. After a cluster nuke:
685+
686+
**Recovery procedure** (must bypass ArgoCD — SSA + CNPG webhook makes `initdb` always win):
687+
```bash
# 1. Edit cluster.yaml: comment out initdb, uncomment recovery section
# 2. Update externalClusters.serverName to match CURRENT backup.serverName
# 3. Bump backup.serverName to next version (e.g. -v2 → -v3)
# 4. Render and apply directly (bypass ArgoCD):
kubectl kustomize infrastructure/database/cloudnative-pg/immich/ \
  | awk '/^apiVersion: postgresql.cnpg.io\/v1/{p=1} p{print} /^---/{if(p) exit}' \
  > /tmp/recovery.yaml

# 5. Delete existing empty cluster and immediately create recovery version:
kubectl delete cluster immich-database -n cloudnative-pg --wait=false; \
  kubectl create -f /tmp/recovery.yaml

# 6. Wait for recovery:
kubectl get clusters -n cloudnative-pg -w

# 7. Verify data:
kubectl exec -n cloudnative-pg immich-database-1 -- \
  psql -U postgres -d immich -c "SELECT count(*) FROM \"user\";"

# 8. Revert cluster.yaml to initdb (keep new serverName in backup section)
# 9. Commit and push; ArgoCD syncs, CNPG ignores bootstrap on existing clusters
```
711+
712+
**Current serverName versions** (track these — must match for recovery):
| Database | Current backup serverName |
|----------|---------------------------|
| immich | `immich-database-v2` |
| khoj | `khoj-database` (original) |
| paperless | `paperless-database` (original) |
718+
719+
See [docs/cnpg-disaster-recovery.md](docs/cnpg-disaster-recovery.md) for full details.
720+
639721
## Debugging & Troubleshooting
640722

641723
### ArgoCD Issues
@@ -813,7 +895,8 @@ kubectl exec -it gpu-pod -n app-name -- nvidia-smi
813895
| **Full backup/restore flow diagram** | `docs/pvc-plumber-full-flow.md` |
814896
| **VolSync configuration** | `infrastructure/storage/volsync/` |
815897
| **Helm + Kustomize** | `infrastructure/controllers/1passwordconnect/` |
816-
| **Database with operator** | `infrastructure/database/crunchy-postgres/immich/` |
898+
| **Database with CNPG** | `infrastructure/database/cloudnative-pg/immich/` |
899+
| **CNPG disaster recovery** | `docs/cnpg-disaster-recovery.md` |
817900
| **Gateway API routing** | `infrastructure/networking/gateway/` |
818901
| **Custom monitoring** | `monitoring/prometheus-stack/custom-alerts.yaml` |
819902
| **Secret management** | Any app with `externalsecret.yaml` |
@@ -828,3 +911,4 @@ kubectl exec -it gpu-pod -n app-name -- nvidia-smi
828911
- **[docs/network-topology.md](docs/network-topology.md)** - Network architecture details
829912
- **[docs/network-policy.md](docs/network-policy.md)** - Cilium network policies
830913
- **[docs/argocd.md](docs/argocd.md)** - ArgoCD-specific documentation
914+
- **[docs/cnpg-disaster-recovery.md](docs/cnpg-disaster-recovery.md)** - CNPG database backup/restore and disaster recovery procedures

docs/backup-restore.md

Lines changed: 25 additions & 0 deletions
@@ -276,3 +276,28 @@ The following namespaces are excluded from automatic backup:
276276
| `infrastructure/controllers/kyverno/policies/volsync-pvc-backup-restore.yaml` | Kyverno backup/restore policy |
277277
| `infrastructure/controllers/kyverno/policies/volsync-orphan-cleanup.yaml` | Cleanup orphaned backup resources |
278278
| `monitoring/prometheus-stack/volsync-alerts.yaml` | Prometheus alerting rules |
279+
| `infrastructure/database/cloudnative-pg/` | CNPG database clusters (separate backup path) |
280+
281+
## Database Backups (CNPG — Separate System)
282+
283+
The PVC backup system above covers **application data**. Database backups use a **completely separate path**:
284+
| | PVC Backups | Database Backups |
|---|---|---|
| **Tool** | VolSync + Kopia | CNPG + Barman |
| **Destination** | TrueNAS NFS | RustFS S3 (`s3://postgres-backups/cnpg/`) |
| **Trigger** | Kyverno auto-generates on PVC label | CNPG ScheduledBackup resource |
| **Auto-restore** | Yes (PVC Plumber + Kyverno) | **No** (manual recovery required) |
| **Schedule** | Hourly or daily (per PVC label) | Hourly or daily (per database) + continuous WAL archiving |
292+
293+
### Why databases don't use the PVC backup system
294+
295+
- Filesystem-level backup of a running Postgres database can be inconsistent
296+
- Barman uses `pg_basebackup` + WAL archiving for point-in-time recovery
297+
- CNPG manages its own PVCs (names are auto-generated, can't add Kyverno labels)
298+
299+
### Database disaster recovery
300+
301+
After a cluster nuke, CNPG creates **fresh empty databases** — it does NOT auto-restore from Barman backups. Recovery requires manually bypassing ArgoCD (SSA + CNPG webhook conflict prevents recovery mode through GitOps).
302+
303+
See **[docs/cnpg-disaster-recovery.md](cnpg-disaster-recovery.md)** for full recovery procedures.

docs/cnpg-disaster-recovery.md

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
1+
# CNPG Database Disaster Recovery
2+
3+
## Overview
4+
5+
CloudNativePG (CNPG) databases are backed up via **Barman** to RustFS S3 (`s3://postgres-backups/cnpg/`). Unlike PVC backups (which auto-restore via Kyverno + PVC Plumber), database recovery is **manual** and must bypass ArgoCD.
6+
7+
## Why Recovery Can't Go Through ArgoCD
8+
9+
ArgoCD uses **Server-Side Apply (SSA)**. CNPG has a **mutating admission webhook** that adds `initdb` defaults to every Cluster creation. When combined:
10+
11+
1. ArgoCD sends SSA patch with `bootstrap.recovery`
12+
2. CNPG webhook intercepts and adds `bootstrap.initdb` defaults
13+
3. SSA merges both field managers — `initdb` wins
14+
4. Result: fresh empty database, every time
15+
16+
Additionally, ArgoCD ApplicationSets enforce `selfHeal: true`, so a deleted cluster is recreated in under a second, which is too fast for manual intervention.
17+
18+
**Solution**: Apply recovery manifests directly with `kubectl create`, bypassing ArgoCD entirely.
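To see which bootstrap method a live Cluster actually ended up with, a small check along these lines may help. It is a minimal sketch, not something from this repository: the function name `bootstrap_method` is hypothetical, and it only does a crude substring match on whatever `kubectl` prints for `.spec.bootstrap`.

```bash
# Hypothetical helper (not in the repo): report which bootstrap method the
# live Cluster object carries after admission and SSA merging.
bootstrap_method() {
  cluster=$1
  namespace=${2:-cloudnative-pg}
  # .spec.bootstrap is printed as one blob of text; only key names are inspected.
  spec=$(kubectl get cluster "$cluster" -n "$namespace" \
    -o jsonpath='{.spec.bootstrap}') || return 1
  case "$spec" in
    *initdb*recovery*|*recovery*initdb*) echo "both" ;;
    *recovery*) echo "recovery" ;;
    *initdb*)   echo "initdb" ;;
    *)          echo "none" ;;
  esac
}
```

Running `bootstrap_method immich-database` right after a manual `kubectl create` should print `recovery`; `initdb` would suggest that ArgoCD or the webhook defaults won.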
19+
20+
## Backup Architecture
21+
```
CNPG Cluster
    ↓ (continuous WAL archiving + scheduled base backups)
Barman → RustFS S3
    s3://postgres-backups/cnpg/<app>/<serverName>/base/   (base backups)
    s3://postgres-backups/cnpg/<app>/<serverName>/wals/   (WAL files)
```
29+
30+
### Current Database Inventory
31+
| Database | S3 Path | Current serverName | Schedule |
|----------|---------|--------------------|----------|
| immich | `s3://postgres-backups/cnpg/immich` | `immich-database-v2` | Hourly + WAL |
| khoj | `s3://postgres-backups/cnpg/khoj` | `khoj-database` | Daily 2am + WAL |
| paperless | `s3://postgres-backups/cnpg/paperless` | `paperless-database` | Daily 2am + WAL |
37+
38+
### serverName Versioning
39+
40+
CNPG requires a **clean WAL archive** for new clusters. After recovery, the new cluster can't write WALs to the same path as the old cluster. The `serverName` in `backup.barmanObjectStore` controls the subdirectory:
41+
```
s3://postgres-backups/cnpg/immich/
├── immich-database/          ← original (pre-recovery backups)
│   ├── base/
│   └── wals/
└── immich-database-v2/       ← current (post-recovery backups)
    ├── base/
    └── wals/
```
51+
52+
**Each recovery bumps the version**: `-v2` → `-v3` → `-v4`, etc.
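The bump rule can be written down as a tiny shell function. This is only an illustrative sketch: `next_server_name` is a hypothetical name, and it assumes an unversioned name such as `immich-database` counts as the implicit first version, so its successor is `-v2`.

```bash
# Hypothetical helper (not in the repo): compute the next versioned serverName.
next_server_name() {
  current=$1
  version=${current##*-v}   # text after the last "-v" (whole string if none)
  base=${current%-v*}       # text before the last "-v"
  case "$version" in
    "$current"|''|*[!0-9]*)
      # No numeric "-vN" suffix: treat the name as the original and start at v2.
      printf '%s-v2\n' "$current"
      ;;
    *)
      printf '%s-v%d\n' "$base" "$((version + 1))"
      ;;
  esac
}
```

For example, `next_server_name immich-database-v2` prints `immich-database-v3`, and `next_server_name khoj-database` prints `khoj-database-v2`.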
53+
54+
## Recovery Procedure
55+
56+
### Prerequisites
57+
58+
- Cluster is running (ArgoCD has bootstrapped)
59+
- CNPG operator is deployed
60+
- `cnpg-s3-credentials` secret exists in `cloudnative-pg` namespace
61+
- Barman backups exist on RustFS S3
62+
63+
### Step-by-Step (example: immich)
64+
65+
**1. Check if backups exist:**
66+
```bash
kubectl run -it --rm barman-check --image=amazon/aws-cli:latest \
  --restart=Never --namespace=cloudnative-pg --overrides='{
  "spec":{"containers":[{"name":"check","image":"amazon/aws-cli:latest",
    "command":["sh","-c","aws --endpoint-url http://192.168.10.133:30293 s3 ls s3://postgres-backups/cnpg/immich/immich-database-v2/base/ 2>&1 | tail -5"],
    "env":[
      {"name":"AWS_ACCESS_KEY_ID","valueFrom":{"secretKeyRef":{"name":"cnpg-s3-credentials","key":"AWS_ACCESS_KEY_ID"}}},
      {"name":"AWS_SECRET_ACCESS_KEY","valueFrom":{"secretKeyRef":{"name":"cnpg-s3-credentials","key":"AWS_SECRET_ACCESS_KEY"}}}
    ]}]}}'
```
77+
78+
**2. Edit the cluster.yaml:**
79+
80+
In `infrastructure/database/cloudnative-pg/immich/cluster.yaml`:
81+
- Comment out the `initdb` bootstrap section
82+
- Uncomment the `recovery` bootstrap + `externalClusters` section
83+
- Set `externalClusters[].barmanObjectStore.serverName` to the **current** backup serverName (e.g. `immich-database-v2`)
84+
- Bump `backup.barmanObjectStore.serverName` to the **next** version (e.g. `immich-database-v3`)
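As a rough illustration of what the edited spec might contain in recovery mode, the snippet below writes an example fragment to a scratch file for reading. It is a sketch under assumptions: the scratch path and the external cluster name `immich-backup-source` are hypothetical, and the real `cluster.yaml` may name or arrange things differently. The S3 path, endpoint, and secret names are the ones used elsewhere in this document.

```bash
# Sketch only: writes an illustrative fragment to a scratch file.
# "immich-backup-source" and the output path are made-up names.
cat > /tmp/immich-recovery-fragment.yaml <<'EOF'
spec:
  bootstrap:
    recovery:
      source: immich-backup-source        # must match externalClusters[].name
  externalClusters:
    - name: immich-backup-source
      barmanObjectStore:
        serverName: immich-database-v2    # CURRENT backup serverName (restore from here)
        destinationPath: s3://postgres-backups/cnpg/immich
        endpointURL: http://192.168.10.133:30293
        s3Credentials:
          accessKeyId:
            name: cnpg-s3-credentials
            key: AWS_ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-s3-credentials
            key: AWS_SECRET_ACCESS_KEY
  backup:
    barmanObjectStore:
      serverName: immich-database-v3      # NEXT version (new backups go here)
EOF
```

The point of the two different `serverName` values is that the cluster reads the old archive (`-v2`) while writing its own WALs and base backups to a fresh, empty one (`-v3`).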
85+
86+
**3. Extract just the Cluster resource:**
87+
```bash
kubectl kustomize infrastructure/database/cloudnative-pg/immich/ \
  | awk '/^apiVersion: postgresql.cnpg.io\/v1/{p=1} p{print} /^---/{if(p) exit}' \
  > /tmp/immich-recovery.yaml

# Verify it has recovery, not initdb:
grep -c "recovery" /tmp/immich-recovery.yaml  # should be >= 1
grep -c "initdb" /tmp/immich-recovery.yaml    # should be 0
```
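The two `grep -c` calls only print counts, so a wrong manifest is easy to miss in a hurry. A fail-fast wrapper could look like the sketch below; `check_recovery_manifest` is a hypothetical name, and the check is a plain text match on the rendered file, not a schema validation.

```bash
# Hypothetical helper (not in the repo): fail loudly if the rendered manifest
# does not look safe to use for recovery.
check_recovery_manifest() {
  manifest=$1
  if ! grep -q '^kind: Cluster$' "$manifest"; then
    echo "no Cluster resource in $manifest" >&2
    return 1
  fi
  if grep -q 'initdb' "$manifest"; then
    echo "initdb still present in $manifest" >&2
    return 1
  fi
  if ! grep -q 'recovery' "$manifest"; then
    echo "no recovery section in $manifest" >&2
    return 1
  fi
  echo "ok: $manifest uses recovery bootstrap"
}
```

It can be chained in front of the delete/create step, e.g. `check_recovery_manifest /tmp/immich-recovery.yaml && ...`, so that a manifest still containing `initdb` never reaches the cluster.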
97+
98+
**4. Delete and immediately recreate (one command — ArgoCD is fast):**
99+
```bash
kubectl delete cluster immich-database -n cloudnative-pg --wait=false; \
  sleep 15; \
  kubectl create -f /tmp/immich-recovery.yaml
```
105+
106+
The 15-second sleep gives Longhorn time to clean up the old PVCs. If the recovery pod still ends up stuck in Pending, see Troubleshooting.
107+
108+
**5. Monitor recovery:**
109+
```bash
# Watch cluster status
kubectl get clusters -n cloudnative-pg -w

# Watch recovery pod logs
kubectl logs -n cloudnative-pg -l cnpg.io/cluster=immich-database -f
```
117+
118+
Recovery typically takes 1-5 minutes depending on backup size.
119+
120+
**6. Verify data:**
121+
```bash
kubectl exec -n cloudnative-pg immich-database-1 -- \
  psql -U postgres -d immich -c "SELECT email FROM \"user\" LIMIT 5;"
```
126+
127+
**7. Revert to normal operation:**
128+
129+
In `cluster.yaml`:
130+
- Uncomment `initdb` bootstrap
131+
- Comment out `recovery` bootstrap + `externalClusters`
132+
- Keep the new `serverName` in the backup section (e.g. `immich-database-v3`)
133+
- Update the commented recovery `externalClusters.serverName` to match the new backup serverName
134+
```bash
git add infrastructure/database/cloudnative-pg/immich/cluster.yaml
git commit -m "CNPG: revert immich to initdb after successful recovery"
git push
```
140+
141+
ArgoCD syncs. CNPG ignores `initdb` bootstrap on existing clusters — your data is safe.
142+
143+
## Troubleshooting
144+
145+
### "Expected empty archive"
146+
147+
**Cause**: `backup.barmanObjectStore.serverName` matches old backup path (WALs already exist).
148+
149+
**Fix**: Bump `serverName` to next version (e.g. `-v2` → `-v3`).
150+
151+
### "no target backup found"
152+
153+
**Cause**: `externalClusters[].barmanObjectStore.serverName` is wrong or missing.
154+
155+
**Fix**: Set it to the serverName that the old backups were written under. Check S3:
```bash
aws --endpoint-url http://192.168.10.133:30293 s3 ls s3://postgres-backups/cnpg/immich/
# Lists subdirectories like: immich-database/, immich-database-v2/
```
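When several versioned subdirectories exist, picking the highest one by eye is error-prone. The filter below is a sketch that reads `aws s3 ls` output on stdin and prints the highest-versioned name. `latest_server_name` is a hypothetical helper, and it assumes that the highest version is the one holding the most recent backups, which will not hold if a failed recovery attempt left an empty newer directory behind.

```bash
# Hypothetical helper (not in the repo): pick the highest "-vN" prefix from
# `aws s3 ls` output, where prefixes are listed as "PRE <name>/".
latest_server_name() {
  awk '
    $1 == "PRE" {
      name = $2
      sub(/\/$/, "", name)                 # drop the trailing slash
      version = 1                          # unversioned name counts as v1
      if (match(name, /-v[0-9]+$/))
        version = substr(name, RSTART + 2) + 0
      if (version > best) { best = version; pick = name }
    }
    END { if (pick != "") print pick }
  '
}
```

Usage would be to pipe the listing above into it: `aws --endpoint-url http://192.168.10.133:30293 s3 ls s3://postgres-backups/cnpg/immich/ | latest_server_name`. It is still worth confirming that the chosen directory contains a `base/` backup before pointing `externalClusters` at it.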
160+
161+
### ArgoCD recreates cluster before manual apply
162+
163+
**Cause**: `selfHeal: true` in ApplicationSet template.
164+
165+
**Fix**: Use `delete --wait=false; sleep 15; kubectl create` in rapid succession. The sleep gives PVCs time to terminate.
166+
167+
### Recovery pod stuck in Pending
168+
169+
**Cause**: Old PVCs from previous cluster still terminating (Longhorn cleanup).
170+
171+
**Fix**: Wait 15-30 seconds for PVCs to fully delete, then recreate the cluster.
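Instead of a fixed wait, the old PVCs can be polled until they are gone. The function below is a sketch under assumptions: `wait_for_pvcs_gone` is a hypothetical name, and it assumes the PVCs carry the same `cnpg.io/cluster` label that the pod log command in this document selects on. Keep in mind that ArgoCD may recreate the cluster while this loop is running.

```bash
# Hypothetical helper (not in the repo): poll until no PVCs labelled for the
# given cluster remain, or give up after a timeout (seconds).
wait_for_pvcs_gone() {
  cluster=$1
  namespace=${2:-cloudnative-pg}
  timeout=${3:-60}
  elapsed=0
  while [ -n "$(kubectl get pvc -n "$namespace" \
      -l "cnpg.io/cluster=$cluster" -o name 2>/dev/null)" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "PVCs for $cluster still present after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "no PVCs left for $cluster"
}
```

For example, `wait_for_pvcs_gone immich-database cloudnative-pg 30` returns as soon as the PVCs disappear and fails after 30 seconds otherwise.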
172+
173+
### "Only one bootstrap method can be specified"
174+
175+
**Cause**: Both `initdb` and `recovery` present in manifest (ArgoCD SSA merged them).
176+
177+
**Fix**: Don't use `kubectl apply`. Use `kubectl create` to bypass SSA.
178+
179+
## Verifying Backups Are Running
180+
```bash
# Check scheduled backups
kubectl get scheduledbackup -n cloudnative-pg

# Check latest backup timestamp
kubectl get backup -n cloudnative-pg --sort-by=.metadata.creationTimestamp | tail -5

# Check WAL archiving status (first recoverability point per cluster)
kubectl get cluster -n cloudnative-pg -o jsonpath='{range .items[*]}{.metadata.name}: {.status.firstRecoverabilityPoint}{"\n"}{end}'

# Check S3 for actual backup files
kubectl run -it --rm barman-ls --image=amazon/aws-cli:latest \
  --restart=Never --namespace=cloudnative-pg --overrides='{...}'
```
195+
196+
## Two Backup Systems Summary
197+
```
┌──────────────────────────────────┐    ┌──────────────────────────────────┐
│  PVC BACKUPS (App Data)          │    │  DATABASE BACKUPS (CNPG)         │
│                                  │    │                                  │
│  Tool: VolSync + Kopia           │    │  Tool: CNPG + Barman             │
│  Dest: TrueNAS NFS               │    │  Dest: RustFS S3                 │
│  Auto-restore: YES               │    │  Auto-restore: NO                │
│    (PVC Plumber + Kyverno)       │    │    (manual kubectl create)       │
│  Trigger: PVC label              │    │  Trigger: ScheduledBackup CRD    │
│  Schedule: hourly/daily          │    │  Schedule: hourly/daily + WAL    │
│                                  │    │                                  │
│  Covers:                         │    │  Covers:                         │
│  - App configs                   │    │  - User accounts                 │
│  - Thumbnails/previews           │    │  - Metadata (albums, tags)       │
│  - ML model caches               │    │  - Search indexes                │
│  - Home automation data          │    │  - App state                     │
│                                  │    │                                  │
└──────────────────────────────────┘    └──────────────────────────────────┘
```
