RijksICTGilde
diff --git a/docs/openproject-on-zad.md b/docs/openproject-on-zad.md
Lines changed: 17 additions & 1 deletion
diff --git a/features/backup-retention-sweep.md b/features/backup-retention-sweep.md
Lines changed: 84 additions & 0 deletions
diff --git a/features/service-orphan-reconciliation.md b/features/service-orphan-reconciliation.md
Lines changed: 120 additions & 0 deletions
diff --git a/features/sops-skip-unchanged-reencryption.md b/features/sops-skip-unchanged-reencryption.md
Lines changed: 91 additions & 0 deletions
@@ -189,7 +189,23 @@ components:
 ## Replacing the bypass image with a real Enterprise Token
 
 Once you have a valid OpenProject Enterprise Token, the wrap image is no
-longer needed. Two ways to install the token:
+longer needed.
+
+**Note on requesting the token**: OpenProject EE tokens (v2.0+) are bound
+to a specific hostname. OpenProject's sales/onboarding team will ask for
+the deployment URL — it ends up embedded in the token and validated at
+runtime against `Setting.host_name` (= our `OPENPROJECT_HOST__NAME` env
+var, set to `$PUBLIC_HOSTNAME`). Implications:
+
+- Sandbox needs its own token (e.g. for `productie-openp-7lh.sandbox.rijksapp.dev`)
+- Production needs a separate token for its final hostname
+- Changing the hostname after the fact → request a new token
+- Wildcard / multi-domain tokens exist but must be explicitly requested
+
+Source: `app/models/enterprise_token.rb` — `invalid_domain?` calls
+`token_object.valid_domain?(Setting.host_name)`.
+
+Two ways to install the token:
 
 1. **Via env var** (declarative, survives DB resets):
 
 
@@ -0,0 +1,84 @@
+# Backup Retention Sweep
+
+## What it is
+
+A daily background job in the Operations Manager that deletes **orphaned**
+backup snapshots — snapshots that no backup run will ever clean up again.
+
+Kopia retention is applied by the backup pod at the end of each backup run,
+scoped to that run's source identity. That means retention silently stops for:
+
+- deployments that were deleted (e.g. PR previews),
+- deployments whose `backup.schedule` was removed from the project file,
+- projects with `backup.enabled: false`,
+- legacy snapshots written under a broken Kopia source identity
+  (`<uid>@<pod-name>` instead of the stable `opi-backup@...`), which no
+  `kopia snapshot expire` ever matched.
+
+Without the sweep, those snapshots live forever. With it, a source that stops
+being backed up ages out to zero after a grace period.
+
+## How it works
+
+The sweep runs once per day from the backup scheduler loop, on the first tick
+at or after 06:00 Europe/Amsterdam (after the default 02:00 backups and their
+catch-up window). Per project on this cluster it lists all Kopia snapshots and
+classifies each one (first matching rule wins):
+
+| Rule | Verdict |
+|---|---|
+| `trigger:manual` tag, or source host ends in `-manual` | **Protected** — never touched. Manual backups are only removed explicitly by an operator. |
+| Source identity or timestamp missing/unparseable | **Unclassifiable** — skipped with a warning. The sweep only deletes what it can positively classify. |
+| Identity `opi-backup@...` and the deployment currently has a `backup.schedule` | **Active** — left to the backup pod's per-run retention. |
+| Anything else, newer than the grace period | **Young orphan** — kept for now. |
+| Anything else, older than the grace period | **Orphan** — deleted (or logged in dry-run). |
+
+Note that the classification is identity-aware: a legacy `uid@podname`
+snapshot is treated as an orphan even when its deployment tag points at an
+actively scheduled deployment, because per-run retention can never match it.
+
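The rule table above can be sketched as a single first-match function. This is a minimal illustration, not the Operations Manager's code: the `Snapshot` fields, the `classify` name, the tag representation as a dict, and the handling of the exact grace boundary (`>=`) are all assumptions. The verdict labels reuse the ones visible in the sweep log output.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Snapshot:
    # Hypothetical shape; the real sweep reads these from Kopia's snapshot list.
    identity: Optional[str]          # e.g. "opi-backup@production" or "<uid>@<pod-name>"
    timestamp: Optional[datetime]
    deployment: Optional[str] = None
    tags: dict = field(default_factory=dict)


def classify(snap: Snapshot, scheduled: set, now: datetime, grace_days: int = 30) -> str:
    """Return the verdict of the first matching rule."""
    user, _, host = (snap.identity or "").partition("@")

    # Rule 1: manual backups are never touched.
    if snap.tags.get("trigger") == "manual" or host.endswith("-manual"):
        return "protected"

    # Rule 2: only delete what can be positively classified.
    if not user or not host or snap.timestamp is None:
        return "unclassifiable"

    # Rule 3: stable identity plus a live schedule means per-run retention applies.
    if user == "opi-backup" and snap.deployment in scheduled:
        return "active"

    # Rules 4 and 5: everything else is an orphan, split by the grace period.
    if now - snap.timestamp >= timedelta(days=grace_days):
        return "orphan-expired"
    return "orphan-young"
```

Note how a legacy `<uid>@<pod-name>` snapshot falls through rule 3 even when its deployment is scheduled, which matches the identity-aware behaviour described above.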
+Whole-project deletion is out of scope: deleting a project already marks its
+entire backup prefix for deferred deletion.
+
+> [!WARNING]
+> Setting `backup.enabled: false` on a project that still exists makes the
+> sweep treat **all** of that project's scheduled snapshots as orphans. They
+> are protected only by the grace period: every snapshot older than
+> `BACKUP_ORPHAN_RETENTION_DAYS` (default 30) is deleted on the first sweep
+> after the flag is flipped, and the rest age out as they cross the boundary.
+> If you intend to keep historical backups while pausing new ones, do not rely
+> on this flag — the snapshots are not retained indefinitely. Keep the first
+> production sweep in dry-run and review the manifest before arming deletion.
+
+## Configuration
+
+| Setting | Default | Meaning |
+|---|---|---|
+| `BACKUP_SWEEP_ENABLED` | `true` | Master switch for the daily sweep. |
+| `BACKUP_SWEEP_DRY_RUN` | `true` | Log a manifest of what would be deleted, delete nothing. |
+| `BACKUP_ORPHAN_RETENTION_DAYS` | `30` | Grace period: orphans younger than this are kept. Mirrors `BACKUP_RETENTION_KEEP_DAILY`, so an orphaned source's backups survive as long as an active one's would. |
+
+## Rollout
+
+The sweep ships with `BACKUP_SWEEP_DRY_RUN=true`. Review at least one sweep
+manifest in the OPI logs (`grep "Sweep "` / `grep "would delete"`), confirm
+the candidates are right, then set `BACKUP_SWEEP_DRY_RUN=false`.
+
+Example log output:
+
+```
+Backup retention sweep starting (cluster=odcn-production, grace=30d, dry_run=True)
+Sweep wies/rig-prd-wies: 226 snapshots — {'active': 55, 'orphan-expired': 54, 'orphan-young': 117}
+Sweep wies/rig-prd-wies: would delete orphan snapshot 3f1f... (1001730000@db-backup-production-postgresq-20260422-012810, deployment=production, ts=2026-04-22T01:28:10Z)
+...
+Backup retention sweep finished (would delete 54 snapshots)
+```
+
+## Dependencies
+
+- Backup scheduler (`opi/core/backup_scheduler.py`) — hosts the daily trigger.
+- Kopia CLI in the OPI image — the sweep queries and deletes directly, no
+  pods are spawned.
+- Namespace SOPS key — repository passwords are derived per namespace, so the
+  sweep covers any namespace that still exists. Snapshots of fully deleted
+  projects are handled by the project-deletion flow instead.
@@ -0,0 +1,120 @@
+# Service-orphan reconciliation (report-first cleanup of databases, Keycloak clients, MinIO buckets)
+
+## Problem
+
+Deletes from the pre-#123 era reported success while service resources were
+left behind. The idempotent delete API ("already absent", judged from the
+project file) never cleans up those historical orphans. Inventory taken on
+production, 2026-06-11:
+
+- **Keycloak realm `regel-k4c-odcn-production`: 74 PR-numbered clients vs 11
+  active previews → ~63 orphans, all `public` clients** (PRs from the 200-300
+  range, dead for months). Mildly security-relevant: live OIDC entry points
+  with redirect URIs pointing at dead preview hosts.
+- Realm `wies-odcn-production`: 50 clients, same pattern, not yet audited.
+- rig-db: 7 orphan databases for regel-k4c (`regel_k4c_pr104` through `pr128`) +
+  `regel_k4c_pr748_v1` (clone leftover); cluster total 55 databases, including
+  the known `marked_for_deletion` backlog, which has no purge scheduler.
+- MinIO buckets: not yet inventoried, same suspicion.
+
+The current delete flow HAS been clean since #123 (evidence: pr777 on
+2026-06-11, where the ArgoCD app, pods, database and Keycloak clients were all
+removed and verified). This is purely about historical leftovers.
+
+## Design requirements
+
+1. **Report first, never purge directly.** `waggl_9et_productie` (a live prod
+   DB!) was once wrongly flagged as marked_for_deletion; a blind purge would
+   have dropped it. The sweep produces a report; deletion happens only from a
+   confirmed list.
+2. Inventory per service: pg_database (rig-db), Keycloak clients per realm
+   (admin API or read-only SQL on the keycloak DB), MinIO buckets (mc ls).
+3. Match against the source of truth: live deployments from all project files
+   (zad-projects repo), the same source the delete API uses.
+4. Watch for naming variants: `_v1` suffixes (clone generations),
+   `-public`/`-private` client pairs, deployment vs component names.
+5. The existing orphan detection is a stub (see memory note rig-db
+   reconciliation); this replaces/implements it.
+6. Reuse the verification approach from #123: the k8s/service API is ground
+   truth, not our own bookkeeping.
+
+## Inventory queries (the specification)
+
+```bash
+# Databases
+kubectl -n rig-prd-operations exec rig-db-1 -- psql -U postgres -tAc \
+  "SELECT datname FROM pg_database WHERE datistemplate=false ORDER BY datname"
+
+# Keycloak clients per realm (read-only)
+kubectl -n rig-prd-operations exec rig-db-1 -- psql -U postgres -d keycloak -tAc \
+  "SELECT r.name, c.client_id, c.public_client FROM client c
+   JOIN realm r ON c.realm_id=r.id ORDER BY r.name, c.client_id"
+```
+
+## Relation to existing work
+
+- #123 (honest delete): the flow going forward is clean; this cleans up the past.
+- Memory: `project_rigdb_restarts_reconciliation` (purge gap, waggl warning),
+  `project_incident_20260610_netpol` (context from this week).
+- A nightly cleaner already exists for deployments; this is the service layer beneath it.
+
+## Implementation (2026-06-11)
+
+### Sweep (report, zero mutations)
+
+```bash
+curl -X GET "https://<opi>/api/v2/admin/orphans/report" -H "X-API-Key: $ADMIN_API_KEY"
+```
+
+Inventories databases (rig-db), Keycloak realms/clients and MinIO buckets, and
+classifies them against the live project files:
+
+| Classification | Meaning | Deletable |
+|---|---|---|
+| `expected` | belongs to a live deployment | no |
+| `system` | platform infrastructure (system DBs, Keycloak built-ins, backup buckets) | no |
+| `orphan_candidate` | project naming, no live deployment | **only via confirm** |
+| `in_use_anomaly` | looks like an orphan but has active connections | no (investigate) |
+| `unknown` | matches no naming convention | no |
+
+The report also contains `stale_marks`: marks whose resource is already gone
+or is back in the expected set.
+
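The classification in the table can be sketched roughly as follows. This is an illustration under assumptions, not the sweep's implementation: the `classify_resource` name, its parameters, the order in which the checks run, and the naming convention (a resource name starts with a live project's name, with hyphens and underscores treated as equivalent) are all hypothetical.

```python
def classify_resource(
    name: str,
    expected: set,
    system: set,
    live_projects: set,
    active_connections: int = 0,
) -> str:
    """Classify one inventoried resource against the live project files."""
    if name in expected:
        return "expected"
    if name in system:
        return "system"

    # Assumed convention: "<project>_<deployment>" for databases and
    # "<project>-<deployment>-..." for Keycloak clients.
    normalized = name.replace("-", "_")
    claimed = any(
        normalized.startswith(project.replace("-", "_") + "_")
        for project in live_projects
    )
    if not claimed:
        # No project file claims this name, so it is never auto-deletable.
        return "unknown"

    if active_connections > 0:
        # Looks orphaned but something still uses it: investigate, do not delete.
        return "in_use_anomaly"
    return "orphan_candidate"
```

With `live_projects={"regel-k4c"}`, a database such as `regel_k4c_pr104` that is absent from the expected set would come out as `orphan_candidate`, while a name whose project has no project file would come out as `unknown`.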
+### Confirm (starts the grace period)
+
+```bash
+curl -X POST "https://<opi>/api/v2/admin/orphans/confirm" \
+  -H "X-API-Key: $ADMIN_API_KEY" -H "Content-Type: application/json" \
+  -d '{"items": [{"type": "postgresql_database", "name": "regel_k4c_pr104"},
+                 {"type": "keycloak_client", "name": "regel-k4c-pr250-public", "realm": "regel-k4c-odcn-production"}]}'
+```
+
+The sweep is re-run server-side; only items that are `orphan_candidate` at
+that moment are accepted. After that the normal grace period applies
+(`DELETION_GRACE_PERIOD_DAYS`, 7 days), and
+`POST /api/v2/admin/reconciliation/trigger?dry_run=false` removes them permanently.
+
+### Safety layers (lessons from waggl-9et)
+
+1. `_build_expected_resources` now reads schema v1 as well as v2/v2.2 (catalog
+   components + legacy deployment-level blocks + clone generations).
+2. Purge re-checks the expected set at deletion time: a mark whose resource
+   is back in the YAML gets unmarked.
+3. A database with active connections is NEVER dropped by the purge
+   (refused + reported); only explicit project deletion may terminate
+   connections.
+4. `cleanup/trigger` (project-scoped) has the same protection as the
+   full reconcile.
+
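Layers 2 and 3, combined with the grace period, amount to a small decision made at purge time. The sketch below is hypothetical: `purge_decision`, its parameters, the return labels and the order of the checks are assumptions made for illustration, and only the 7-day default comes from the text above.

```python
from datetime import datetime, timedelta, timezone


def purge_decision(
    name: str,
    marked_at: datetime,
    expected_now: set,
    active_connections: int,
    now: datetime,
    grace_days: int = 7,
) -> str:
    """Decide what the purge does with one marked resource."""
    # Layer 2: the expected set is rebuilt at purge time; a resource that is
    # back in the YAML loses its mark instead of being deleted.
    if name in expected_now:
        return "unmark"

    # The grace period has to elapse before anything is removed.
    if now - marked_at < timedelta(days=grace_days):
        return "wait"

    # Layer 3: never drop something that is still in use; refuse and report.
    if active_connections > 0:
        return "refuse"

    return "purge"
```

Checking the expected set first means a wrongly marked live resource, as happened with waggl-9et, is unmarked regardless of how old the mark is.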
+### Known limitation (parked)
+
+Resources of **fully deleted projects** (no project file left) classify as
+`unknown` and therefore cannot be cleaned up via confirm: there is no current
+source of truth that claims the name. Concretely, on production:
+`bouwm_75c_main`, `bouwm_75c_production`, `bouwm_nr1_main` (+ the associated
+Keycloak realms) of the deleted projects bouwm-75c/bouwm-nr1
+(pre-#123 deletes; the deletion is demonstrable in the zad-projects git history:
+`git log --diff-filter=D --name-only -- projects/`).
+
+Parked idea: feed that git history to the sweep as an additional source of
+truth, so that resources of demonstrably deleted projects become
+`orphan_candidate` with the delete commit as evidence. Until then: manual cleanup.
@@ -0,0 +1,91 @@
+# SOPS Skip-If-Unchanged Re-encryption
+
+## What it is
+
+An optimization in the Operations Manager's SOPS encryption step that **keeps
+the existing ciphertext when a secret's plaintext has not changed**, instead of
+re-encrypting it on every deployment.
+
+SOPS encryption is non-deterministic: every run generates fresh nonces, a new
+MAC, and a new `lastmodified` timestamp. So re-encrypting an unchanged secret
+produces a byte-different `*.sops.yaml` file every time. Before this change,
+every deployment rewrote every generated secret, which churned the GitOps repos
+(`zad-deployments`, ArgoCD repository secrets) with meaningless diffs and grew
+the commit history with noise on each push.
+
+With this change, a secret whose decrypted content is identical to what's
+already committed is left untouched — no re-encryption, no git diff.
+
+## How it works
+
+`encrypt_to_sops_files()` in `opi/utils/sops.py` now takes an optional
+`private_key`. When the key is provided, for each `*.to-sops.yaml` it:
+
+1. Locates the existing `*.sops.yaml` output (if any).
+2. Decrypts it with the private key.
+3. Parses **both** the existing decrypted content and the freshly generated
+   plaintext as YAML and compares the parsed documents.
+4. If they are equal, it keeps the existing ciphertext verbatim, deletes only
+   the plaintext `*.to-sops.yaml` source (so nothing leaks), and skips
+   encryption. Otherwise it re-encrypts as before.
+
+The comparison is on **parsed YAML**, not raw bytes, so key-order or formatting
+differences in the generated plaintext never count as a change.
+
+It **fails safe — re-encrypting — on any doubt**:
+
+- no existing `*.sops.yaml` (first deployment),
+- a decrypt failure (e.g. after AGE key rotation, or a wrong key),
+- unparseable YAML on either side,
+- no `private_key` passed at all (the original always-re-encrypt behaviour).
+
+So a missing or mismatched key can never cause a stale secret to be kept — at
+worst it re-encrypts unnecessarily, exactly as before.
+
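The skip decision and its fail-safe cases can be sketched as one predicate. This is not the code in `opi/utils/sops.py`: `should_keep_ciphertext` is a hypothetical name, and the decrypt and parse steps are injected as callables so the sketch runs without the `sops` binary or a YAML library. In the demo, `json.loads` stands in for the YAML parser and `demo_decrypt` for SOPS decryption.

```python
import json
from typing import Callable, Optional


def should_keep_ciphertext(
    existing_ciphertext: Optional[str],
    new_plaintext: str,
    private_key: Optional[str],
    decrypt: Callable[[str, str], str],
    parse: Callable[[str], object],
) -> bool:
    """Return True only when the existing ciphertext provably matches."""
    # No key or no previous output: fall back to always re-encrypting.
    if private_key is None or existing_ciphertext is None:
        return False
    try:
        old_doc = parse(decrypt(existing_ciphertext, private_key))
        new_doc = parse(new_plaintext)
    except Exception:
        # Decrypt failure or unparseable content on either side: re-encrypt.
        return False
    # Compare parsed documents so key order and formatting do not matter.
    return old_doc == new_doc


def demo_decrypt(ciphertext: str, key: str) -> str:
    """Demo stand-in: 'decrypts' by stripping a prefix, fails on a wrong key."""
    if key != "right-key":
        raise ValueError("decrypt failed")
    return ciphertext[len("ENC:"):]
```

Every uncertain path returns `False`, so the worst outcome is an unnecessary re-encryption rather than a stale secret being kept.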
+## Configuration
+
+There is nothing to configure. The behaviour is enabled automatically wherever a
+private key is available:
+
+- **Project secrets** (deployment secrets, helm values, helmfile values,
+  infrastructure secrets) use the project's own AGE key, resolved via
+  `ProjectManager._sops_private_key_for()`, which returns `None` for legacy
+  projects without an `age-private-key` (those fall back to always
+  re-encrypting).
+- **ArgoCD repository secrets** use the cluster's `SOPS_AGE_PRIVATE_KEY` from
+  settings.
+
+`encrypt_to_sops_files_or_fail()` (the fail-closed wrapper that all managers
+use) threads the `private_key` straight through.
+
+## Examples
+
+```python
+# Skip-if-unchanged enabled (project key resolved per project):
+encrypt_to_sops_files_or_fail(
+    target_path,
+    public_key,
+    f"secrets voor deployment '{deployment_name}' (namespace '{prefixed_namespace}')",
+    private_key=await self._sops_private_key_for(project_data),
+)
+
+# Always re-encrypt (no key) — original behaviour:
+encrypt_to_sops_files(directory, public_key)
+```
+
+## Dependencies
+
+- `sops` and `age` binaries (already required by the Operations Manager image).
+- `get_decoded_project_private_key()` in `opi/utils/age.py` for the project key.
+- `settings.SOPS_AGE_PRIVATE_KEY` for ArgoCD repository secrets.
+
+## Tests
+
+- `tests/test_sops_skip_unchanged.py` — real round-trip tests using the actual
+  `sops`/`age` binaries with a throwaway keypair: byte-identical ciphertext on
+  unchanged input, re-encryption on change, semantics-vs-formatting, no-key and
+  wrong-key fall back to re-encrypt, and first-time encryption. Skipped when the
+  binaries are absent.
+- `tests/test_sops_fail_abort.py` — includes a guard test asserting the managers
+  never call the bare `encrypt_to_sops_files` (the fail-closed wrapper must
+  always be used).