Commit 42514b9

mitchross and claude committed
ci: land kopiur backup-coverage gate (supersedes PR #1490) + script prune

- scripts/validate-kopiur-coverage.py: hard-fails on a backed-up PVC missing its
  kopiur Restore dataSourceRef (empty-in-DR) or a backed-up ns missing the
  cluster-kopia repo label; warns on missing mover securityContext /
  uncovered+unexempt PVCs / unqualified backup-exempt. Wired into
  render-and-schema on the rendered stream. Added a YAML value-tag constructor
  so kube-prometheus-stack's '=' enum CRD doesn't crash the parser (the gap that
  would have red-failed #1490's first CI run). Verified on full local render:
  22/22 policies/restores, 18 backed namespaces, 0 failures.
- Retire backup-exempt-contract job + validate-backup-exempt-keys.sh (folded
  into the coverage check as a warning).
- Prune resolved one-off test-temporal-seed-timeout.sh.
- validate-cluster-health.sh: swap retired VolSync probes for kopiur snapshot
  checks.
- gitignore __pycache__/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

1 parent d9e34ac commit 42514b9

6 files changed

Lines changed: 184 additions & 138 deletions

.github/workflows/cluster-ci.yml

Lines changed: 18 additions & 16 deletions
@@ -117,22 +117,24 @@ jobs:
             -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
             /tmp/filtered-manifests.yaml
 
-  backup-exempt-contract:
-    name: Backup-Exempt Annotation Contract
-    # Catches the class of bug found 2026-05-19 (and again 2026-06-09 in
-    # monitoring/ Helm values): a PVC labeled backup-exempt:"true" using
-    # the bare `backup-exempt-reason` key instead of the fully-qualified
-    # `storage.vanillax.dev/...` key. There is no runtime admission gate
-    # anymore (pvc-plumber is gone) — the bare key silently fails to record
-    # the reason, and the kopiur backup-coverage check can't attribute the
-    # exemption. This CI gate catches it at PR time, not during DR.
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Validate backup-exempt annotation keys
-        run: bash ./scripts/validate-backup-exempt-keys.sh
+      - name: Validate kopiur backup coverage
+        # Replaces the retired pvc-plumber validate-restore-contract.sh and the
+        # backup-exempt-contract job. Runs on the RENDERED stream so Helm-rendered
+        # PVCs (gitea, tubesync) are covered. Hard-fails when a backed-up PVC is
+        # missing its dataSourceRef (recreates EMPTY in DR) or a backed-up namespace
+        # lacks the kopiur.home-operations.com/repo label (repo creds won't fan in);
+        # warns on missing mover securityContext, uncovered+unexempt PVCs, and
+        # backup-exempt PVCs missing the qualified reason annotation.
+        run: |
+          set -euo pipefail
+          python3 -c "import yaml" 2>/dev/null || pip3 install --quiet pyyaml
+          python3 ./scripts/validate-kopiur-coverage.py /tmp/all-manifests.yaml
+
+  # backup-exempt-contract job RETIRED 2026-06-27 (pvc-plumber removed). It
+  # hard-failed on the bare `backup-exempt-reason` key because pvc-plumber denied
+  # such PVCs on CREATE; with pvc-plumber gone, nothing enforces the key, so the
+  # check is folded into validate-kopiur-coverage.py as a WARNING (grep-consistency
+  # only). The coverage check runs as a step in the render-and-schema job above.
 
   otel-collector-validate:
     name: OpenTelemetry Collector Config Validation

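The retired job's concern, a backup-exempt PVC that carries the bare `backup-exempt-reason` key instead of the qualified one, can be illustrated with a small Python sketch. This is not the committed check: the script in this commit only warns when the qualified annotation is absent, whereas the hypothetical helper `exempt_reason_status` below also tells the bare-key case apart so the distinction is visible. The label and annotation keys are taken from the diff; the fixture PVC is made up.

```python
"""Sketch: classify a backup-exempt PVC by which reason annotation it carries."""

QUALIFIED_KEY = "storage.vanillax.dev/backup-exempt-reason"
BARE_KEY = "backup-exempt-reason"  # unqualified form the retired job rejected


def exempt_reason_status(pvc: dict) -> str:
    """Return 'not-exempt', 'ok', 'bare-key', or 'missing' for one PVC manifest."""
    metadata = pvc.get("metadata") or {}
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}

    if labels.get("backup-exempt") != "true":
        return "not-exempt"
    if annotations.get(QUALIFIED_KEY):
        return "ok"
    if annotations.get(BARE_KEY):
        return "bare-key"  # a reason was given, but under the wrong key
    return "missing"


# Hypothetical fixture: exempt PVC that only carries the bare key.
example = {
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "name": "scratch",
        "namespace": "demo",
        "labels": {"backup-exempt": "true"},
        "annotations": {BARE_KEY: "regenerable cache"},
    },
}

print(exempt_reason_status(example))
```

Both `bare-key` and `missing` would surface as the same `[exempt]` warning in the committed script, since it looks only for the qualified key.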
.gitignore

Lines changed: 4 additions & 0 deletions
@@ -130,3 +130,7 @@ credentials.base64
 /scripts/k8s-capacity-estimate-omni-prod-talos-prod-cluster-talos-prod-sa-baseline-20260504T234238Z
 /scripts/k8s-talos-gpu-audit-omni-prod-talos-prod-cluster-talos-prod-sa_-20260504T231835Z
 scripts/k8s-talos-gpu-audit-omni-prod-talos-prod-cluster-talos-prod-sa_-20260504T231835Z.tar.gz
+
+# Python bytecode
+__pycache__/
+*.pyc

scripts/test-temporal-seed-timeout.sh

Lines changed: 0 additions & 50 deletions
This file was deleted.

scripts/validate-backup-exempt-keys.sh

Lines changed: 0 additions & 66 deletions
This file was deleted.

scripts/validate-cluster-health.sh

Lines changed: 6 additions & 6 deletions
@@ -143,14 +143,14 @@ run "Longhorn volume summary (state x robustness)" -- bash -c '
     | sort | uniq -c | sort -rn || echo "(longhorn CRDs not available)"
 '
 
-run "VolSync replication sources" -- bash -c '
-  kubectl get replicationsource -A 2>/dev/null \
-    || echo "(volsync CRDs not available)"
+run "kopiur snapshot policies + schedules" -- bash -c '
+  kubectl get snapshotpolicy,snapshotschedule -A 2>/dev/null \
+    || echo "(kopiur CRDs not available)"
 '
 
-run "VolSync replication destinations" -- bash -c '
-  kubectl get replicationdestination -A 2>/dev/null \
-    || echo "(volsync CRDs not available)"
+run "kopiur snapshots (latest per source)" -- bash -c '
+  kubectl get snapshot.kopiur.home-operations.com -A 2>/dev/null \
+    || echo "(kopiur CRDs not available)"
 '
 
 # ─────────────────────────────────────────────

scripts/validate-kopiur-coverage.py

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
+#!/usr/bin/env python3
+"""Validate kopiur backup coverage against a RENDERED manifest stream.
+
+Replaces the retired pvc-plumber `validate-restore-contract.sh` and folds in the
+now-dead `backup-exempt-contract` job. kopiur has no `/audit` ledger, so this is
+the CI guard that catches the silent gaps that ledger used to surface.
+
+Runs on the rendered kustomize stream (so Helm-rendered PVCs — gitea, tubesync —
+are covered, which a static grep of *.yaml cannot do).
+
+    python3 scripts/validate-kopiur-coverage.py /tmp/all-manifests.yaml
+
+HARD FAILS (exit 1):
+  [dsr]     A backed-up PVC (target of a kopiur SnapshotPolicy) whose
+            spec.dataSourceRef does NOT point at a kopiur Restore → it recreates
+            EMPTY in DR. The single most dangerous silent gap.
+  [nslabel] A namespace containing a SnapshotPolicy that lacks the
+            `kopiur.home-operations.com/repo: cluster-kopia` label → the
+            ClusterExternalSecret won't fan the repo creds in; the mover can't auth.
+
+WARNINGS (printed, exit 0):
+  [mover]   A SnapshotPolicy/Restore with no spec.mover security context (neither
+            securityContext nor inheritSecurityContextFrom) → likely PermissionDenied
+            (the #1 kopiur gotcha — see docs/domains/storage/kopiur-mover-permissions.md).
+  [gap]     A longhorn PVC that is neither backed up nor backup-exempt → review.
+  [exempt]  A backup-exempt PVC missing the fully-qualified reason annotation
+            (kept for grep-ability now that pvc-plumber no longer enforces it).
+"""
+import sys
+
+try:
+    import yaml
+except ImportError:
+    sys.stderr.write("pyyaml required: pip3 install pyyaml\n")
+    sys.exit(2)
+
+KOPIUR_GROUP = "kopiur.home-operations.com"
+REPO_LABEL = "kopiur.home-operations.com/repo"
+REPO_LABEL_VAL = "cluster-kopia"
+EXEMPT_LABEL = "backup-exempt"
+EXEMPT_REASON = "storage.vanillax.dev/backup-exempt-reason"
+SYSTEM_NS = {
+    "kube-system", "argocd", "longhorn-system", "kopiur-system", "cert-manager",
+    "external-secrets", "kube-node-lease", "kube-public", "monitoring", "gateway",
+    "1passwordconnect", "volsync-system",
+}
+
+
+def meta(d, key):
+    return (d.get("metadata") or {}).get(key)
+
+
+def labels_of(d):
+    return (d.get("metadata") or {}).get("labels") or {}
+
+
+def anns_of(d):
+    return (d.get("metadata") or {}).get("annotations") or {}
+
+
+def has_mover_sc(d):
+    mover = (d.get("spec") or {}).get("mover") or {}
+    return bool(mover.get("securityContext") or mover.get("inheritSecurityContextFrom"))
+
+
+def main():
+    if len(sys.argv) != 2:
+        sys.stderr.write("usage: validate-kopiur-coverage.py <rendered-manifests.yaml>\n")
+        return 2
+
+    # kube-prometheus-stack CRDs contain a bare `=` enum value (AlertManager
+    # matchType), which PyYAML maps to the special value-tag and otherwise fails
+    # to construct. Treat it as a literal scalar so the rendered stream parses.
+    yaml.SafeLoader.add_constructor(
+        "tag:yaml.org,2002:value", lambda loader, node: loader.construct_scalar(node)
+    )
+
+    with open(sys.argv[1]) as fh:
+        docs = [d for d in yaml.safe_load_all(fh) if isinstance(d, dict) and d.get("kind")]
+
+    pvcs, namespaces, policies, restores = {}, {}, [], []
+    for d in docs:
+        kind = d.get("kind")
+        group = (d.get("apiVersion") or "").split("/")[0]
+        if kind == "PersistentVolumeClaim":
+            pvcs[(meta(d, "namespace"), meta(d, "name"))] = d
+        elif kind == "Namespace":
+            namespaces[meta(d, "name")] = d
+        elif group == KOPIUR_GROUP and kind == "SnapshotPolicy":
+            policies.append(d)
+        elif group == KOPIUR_GROUP and kind == "Restore":
+            restores.append(d)
+
+    fails, warns = [], []
+    backed_pvcs, backed_namespaces = set(), set()
+
+    for p in policies:
+        pns, pname = meta(p, "namespace"), meta(p, "name")
+        backed_namespaces.add(pns)
+        if not has_mover_sc(p):
+            warns.append(f"[mover] SnapshotPolicy {pns}/{pname}: no spec.mover security context (set the data-owner uid:gid)")
+        for src in ((p.get("spec") or {}).get("sources") or []):
+            pvcname = (src.get("pvc") or {}).get("name")
+            if not pvcname:
+                continue
+            backed_pvcs.add((pns, pvcname))
+            pvc = pvcs.get((pns, pvcname))
+            if pvc is None:
+                fails.append(f"[dsr] SnapshotPolicy {pns}/{pname} backs up PVC '{pvcname}' but no such PVC was rendered")
+                continue
+            dsr = (pvc.get("spec") or {}).get("dataSourceRef") or {}
+            if dsr.get("apiGroup") != KOPIUR_GROUP or dsr.get("kind") != "Restore":
+                fails.append(f"[dsr] PVC {pns}/{pvcname} is backed up but dataSourceRef is not a kopiur Restore → recreates EMPTY in DR (got: {dsr or 'none'})")
+
+    for r in restores:
+        if not has_mover_sc(r):
+            warns.append(f"[mover] Restore {meta(r, 'namespace')}/{meta(r, 'name')}: no spec.mover security context")
+
+    for ns in sorted(backed_namespaces):
+        nd = namespaces.get(ns)
+        if nd is None:
+            warns.append(f"[nslabel] namespace '{ns}' has kopiur stubs but no Namespace object rendered (can't verify repo label)")
+        elif labels_of(nd).get(REPO_LABEL) != REPO_LABEL_VAL:
+            fails.append(f"[nslabel] namespace '{ns}' is backed up but missing label {REPO_LABEL}={REPO_LABEL_VAL} → repo creds won't fan in")
+
+    for (pns, pname), pvc in sorted(pvcs.items(), key=lambda kv: (kv[0][0] or "", kv[0][1] or "")):
+        if pns in SYSTEM_NS:
+            continue
+        if (pvc.get("spec") or {}).get("storageClassName") != "longhorn":
+            continue
+        lbls = labels_of(pvc)
+        if any(k.startswith("cnpg.io/") for k in lbls):  # CNPG = Barman, not kopiur
+            continue
+        if (pns, pname) in backed_pvcs:
+            continue
+        if lbls.get(EXEMPT_LABEL) == "true":
+            if not anns_of(pvc).get(EXEMPT_REASON):
+                warns.append(f"[exempt] PVC {pns}/{pname} is backup-exempt but missing {EXEMPT_REASON} annotation")
+            continue
+        warns.append(f"[gap] PVC {pns}/{pname} (longhorn) is neither backed up nor backup-exempt → review")
+
+    print("== kopiur backup coverage ==")
+    print(f"  policies={len(policies)} restores={len(restores)} pvcs={len(pvcs)} backed-namespaces={len(backed_namespaces)}")
+    for w in warns:
+        print(f"  WARN {w}")
+    for f in fails:
+        print(f"  FAIL {f}")
+    if fails:
+        print(f"\n{len(fails)} hard failure(s): a backup would silently fail or a PVC would recreate empty in DR.")
+        return 1
+    print(f"\nOK — coverage intact ({len(warns)} warning(s), 0 failures).")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
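
The two hard-fail conditions in the script above can be exercised in isolation on already-parsed manifests. The sketch below restates them as two small predicates. The function names `restores_from_kopiur` and `has_repo_label` and the fixture objects (including the Restore name `data-restore`) are hypothetical, while the API group and label constants are copied from the diff.

```python
"""Sketch: the [dsr] and [nslabel] hard-fail predicates on parsed manifests."""

KOPIUR_GROUP = "kopiur.home-operations.com"
REPO_LABEL = "kopiur.home-operations.com/repo"
REPO_LABEL_VAL = "cluster-kopia"


def restores_from_kopiur(pvc: dict) -> bool:
    """True when the PVC's dataSourceRef points at a kopiur Restore."""
    dsr = (pvc.get("spec") or {}).get("dataSourceRef") or {}
    return dsr.get("apiGroup") == KOPIUR_GROUP and dsr.get("kind") == "Restore"


def has_repo_label(namespace: dict) -> bool:
    """True when the Namespace carries the repo label with the expected value."""
    labels = (namespace.get("metadata") or {}).get("labels") or {}
    return labels.get(REPO_LABEL) == REPO_LABEL_VAL


# Hypothetical fixtures: one PVC wired for restore, one that would come back empty.
good_pvc = {
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data", "namespace": "demo"},
    "spec": {
        "dataSourceRef": {
            "apiGroup": KOPIUR_GROUP,
            "kind": "Restore",
            "name": "data-restore",
        }
    },
}
empty_in_dr_pvc = {
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data", "namespace": "demo"},
    "spec": {},
}
labelled_ns = {"kind": "Namespace", "metadata": {"name": "demo", "labels": {REPO_LABEL: REPO_LABEL_VAL}}}
unlabelled_ns = {"kind": "Namespace", "metadata": {"name": "demo"}}

print(restores_from_kopiur(good_pvc), restores_from_kopiur(empty_in_dr_pvc))
print(has_repo_label(labelled_ns), has_repo_label(unlabelled_ns))
```

Both predicates use the same `or {}` chaining as the script, so a manifest with `spec: null` or no `labels` key yields `False` instead of raising.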

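The value-tag workaround described in the script's comment can also be scoped to a loader subclass. The sketch below is a variation, not the committed approach: the commit registers the constructor on `yaml.SafeLoader` itself, while the hypothetical `make_loader` helper here registers it on a subclass so other `safe_load` callers in the same process are unaffected. It assumes PyYAML is installed and degrades to an error message otherwise.

```python
"""Sketch: parse a bare `=` scalar by registering a value-tag constructor."""

try:
    import yaml  # third-party: pip3 install pyyaml
except ImportError:
    yaml = None


def make_loader():
    """Return a SafeLoader subclass that reads the YAML value tag as a plain string."""
    if yaml is None:
        raise RuntimeError("PyYAML is required: pip3 install pyyaml")

    class ValueTagLoader(yaml.SafeLoader):
        """SafeLoader with one extra constructor; the parent class stays untouched."""

    # PyYAML resolves a lone `=` to tag:yaml.org,2002:value, for which the
    # safe constructor has no handler. Construct it as an ordinary scalar.
    ValueTagLoader.add_constructor(
        "tag:yaml.org,2002:value",
        lambda loader, node: loader.construct_scalar(node),
    )
    return ValueTagLoader


def load_docs(text: str) -> list:
    """Parse every document in `text` with the value-tag-aware loader."""
    return list(yaml.load_all(text, Loader=make_loader()))
```

With PyYAML available, `load_docs("matchType: =\n")` should return a single mapping whose `matchType` value is the string `=`, whereas plain `yaml.safe_load` on the same input raises a `ConstructorError` in the PyYAML releases I am aware of.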