Skip to content

Add nice URL feature with dot-separated pattern#3

Merged
uittenbroekrobbert merged 19 commits into
mainfrom
feature/nice-urls
Jan 28, 2026
Merged

Add nice URL feature with dot-separated pattern#3
uittenbroekrobbert merged 19 commits into
mainfrom
feature/nice-urls

Conversation

@anneschuth

@anneschuth anneschuth commented Jan 23, 2026

Copy link
Copy Markdown
Member

Nice-URLs Feature

Deze PR voegt "nice-url" modus toe waarmee projecten eigen subdomeinen kunnen gebruiken in plaats van de standaard cluster URLs.

URL Structuur

Nice-URL modus:

  • Component URLs: https://{component}.{subdomain}.{base_domain}
  • Root URL (optioneel): https://{subdomain}.{base_domain} → wijst naar root component

Voorbeeld: Project "mijn-app" met subdomain "mijn-app" op "rijks.app":

  • https://frontend.mijn-app.rijks.app
  • https://api.mijn-app.rijks.app
  • https://mijn-app.rijks.app (root, wijst naar frontend)

Nieuwe Functionaliteit

Subdomain Registry

  • Globale uniciteit: Subdomeinen zijn uniek per base domain (bijv. myapp.rijks.app)
  • Database-backed: PostgreSQL tabel subdomain_registry voor persistentie
  • Reserved subdomains: 100+ gereserveerde namen (www, api, admin, mail, etc.)
  • Atomic operations: TOCTOU-bescherming met INSERT...ON CONFLICT DO NOTHING

Domain Mode Selection

Vier URL modes beschikbaar per deployment:

  1. component-specific: {component}-{deployment}-{project}.{cluster-domain}
  2. deployment-name: {component}.{deployment}.{cluster-domain}
  3. custom: {component}.{custom-subdomain}.{cluster-domain}
  4. nice-url: {component}.{subdomain}.{base-domain} (nieuw)

Edit Domain Settings

  • Modal voor wijzigen domain settings van bestaande deployments
  • Real-time subdomain availability check
  • URL preview tijdens configuratie
  • Automatische Let's Encrypt bij nice-url mode

Security

Rate Limiting

  • Multi-factor identificatie: IP + browser fingerprint + session ID
  • Per-IP secondary limit: Voorkomt bypass door fingerprint rotatie
  • New client rate limiting: Detecteert identifier flooding attacks
  • Gradual purge: Verwijdert alleen stale entries (>60s), niet willekeurige helft

Information Disclosure Prevention

  • Generieke foutmeldingen: "Niet beschikbaar" voor zowel reserved als taken subdomains
  • Geen project info lekken: Error messages vermelden geen project namen

Atomic Subdomain Changes

  • Database transacties: Subdomain wijzigingen in één transactie
  • Rollback bij failure: Als nieuwe subdomain registratie faalt, blijft oude behouden
  • Audit logging: Alle operaties naar opi.audit.subdomain

Input Validation

  • DNS-compliant: Subdomain validatie volgens RFC 1123
  • Log injection preventie: sanitize_for_log() voor alle user input
  • CSRF protection: Token + Origin/Referer validatie

Internet.nl Compliance (Security Headers)

  • HSTS: max-age=31536000; includeSubDomains; preload
  • X-Content-Type-Options: nosniff
  • X-Frame-Options: DENY
  • Referrer-Policy: strict-origin-when-cross-origin
  • Permissions-Policy: geolocation=(), microphone=(), camera=()

🌐 Internet.nl Compliance

✅ Geïmplementeerd in Ingress Template

Header Waarde internet.nl Vereiste
HSTS max-age=31536000; includeSubDomains; preload ✅ Verplicht (min 1 jaar)
X-Content-Type-Options nosniff ✅ Aanbevolen
X-Frame-Options DENY ✅ Optioneel
Referrer-Policy strict-origin-when-cross-origin ✅ Aanbevolen
Permissions-Policy geolocation=(), microphone=(), camera=() ✅ Aanbevolen

📋 Handmatige Configuratie Vereist

1. DNSSEC (TransIP)

DNSSEC is automatisch ingeschakeld voor domeinen met TransIP nameservers.

2. CAA Records (TransIP Control Panel)

@    CAA    0    issue       "letsencrypt.org"
@    CAA    0    issuewild   "letsencrypt.org"
@    CAA    0    iodef       "mailto:security@rijksoverheid.nl"

3. IPv6 Support

Voor 100% score moet cluster IPv6 ondersteunen.

4. security.txt

Applicaties moeten /.well-known/security.txt serveren.

5. Content-Security-Policy

Applicaties moeten zelf CSP header configureren (applicatie-specifiek).


DNS Configuratie (Infrastructuur)

Geen wildcards nodig. External-dns maakt automatisch individuele A-records aan per Ingress hostname.

Vereisten

  1. Domeinen geregistreerd bij TransIP met nameservers naar TransIP
  2. DNSSEC ingeschakeld (automatisch bij TransIP nameservers)
  3. CAA records geconfigureerd (zie boven)
  4. API toegang ingeschakeld via TransIP Control Panel → Account → API
  5. External-dns geïnstalleerd (zie configuratie hieronder)

External-DNS Deployment

apiVersion: v1
kind: Secret
metadata:
  name: transip-credentials
  namespace: external-dns
type: Opaque
stringData:
  transip-api-key: |
    -----BEGIN RSA PRIVATE KEY-----
    ... (private key content) ...
    -----END RSA PRIVATE KEY-----
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: external-dns
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: registry.k8s.io/external-dns/external-dns:v0.14.0
        args:
        - --source=ingress
        - --provider=transip
        - --transip-account=TRANSIP_ACCOUNT_NAME
        - --transip-keyfile=/transip/transip-api-key
        - --domain-filter=rijks.app
        - --domain-filter=rijksapps.nl
        - --policy=upsert-only
        - --txt-owner-id=rig-cluster
        - --interval=1m
        volumeMounts:
        - name: transip-api-key
          mountPath: /transip
          readOnly: true
      volumes:
      - name: transip-api-key
        secret:
          secretName: transip-credentials
          items:
          - key: transip-api-key
            path: transip-api-key

RBAC Configuratie

apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: external-dns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: external-dns

⚠️ Belangrijke Waarschuwingen

Startup volgorde: External-dns moet starten NADAT Ingresses bestaan. Zie issue #1238.

Let's Encrypt rate limits: 50 certificates per registered domain per week.


Test Instructies

# Unit tests
pytest tests/test_rate_limiter.py tests/test_subdomain_connector.py tests/test_nice_url_naming.py -v

# Volledige test suite
pytest tests/

Verificatie Commands

# Security headers
curl -I https://frontend.myproject.rijks.app 2>/dev/null | grep -iE "(strict-transport|x-content-type|x-frame|referrer-policy)"

# DNS records
dig frontend.myproject.rijks.app +short
dig CAA rijks.app +short
dig +dnssec rijks.app

# internet.nl test
open "https://internet.nl/site/frontend.myproject.rijks.app/"

Referenties

@anneschuth anneschuth marked this pull request as draft January 23, 2026 10:33
@anneschuth anneschuth marked this pull request as ready for review January 23, 2026 13:01
anneschuth added a commit that referenced this pull request Jan 23, 2026
Security fixes:
- Fix #1: Race condition in register_or_update_for_deployment - now uses
  atomic database transaction to prevent subdomain loss on failed updates
- Fix #2: Information disclosure - reserved subdomains now return generic
  "not available" message to prevent enumeration attacks
- Fix #3: API rate limiter - added per-IP secondary limit that cannot be
  bypassed by rotating browser fingerprints
- Fix #4: Rate limiter emergency purge - now uses gradual purge that only
  removes stale entries (>60s old) instead of arbitrary half, preventing
  attackers from evicting legitimate users

Additional improvements:
- Added new client rate limiting to detect identifier flooding attacks
- Updated frontend error code translations for consistent generic messages
- Updated tests to reflect new generic error messages
anneschuth and others added 19 commits January 28, 2026 12:50
Add new opt-in domain mode "nice-url" that generates hostnames using a
dot-separated pattern (component.deployment.base_domain) instead of the
default dash-separated pattern. This provides cleaner, more readable URLs
like frontend.prod.rijks.app.

Features:
- New domain mode "nice-url" alongside existing modes
- Optional include-project-name flag for URLs like frontend.prod.myapp.rijks.app
- Support for rijks.app and rijksapps.nl domains in production
- Support for kind and local domains in local development
- Full UI support in self-service portal
- Backward compatible - all existing domain modes continue to work

Files changed:
- naming.py: Add generate_nice_url_hostname() and update ingress functions
- cluster_config.py: Add nice_url config with supported domains per cluster
- project_manager.py: Pass domain-mode to ingress generation
- keycloak_manager.py: Pass domain-mode to hostname generation
- router.py (api): Add include_project_name to SelfServiceProjectRequest
- self-service-portal.html.j2: Add nice-url option with UI controls
- project_utils.py: Add domain-mode to YAML generation
- router.py (web): Extract include-project-name from form data
- test_nice_url_naming.py: Add comprehensive tests
- Add SubdomainConnector with PostgreSQL-backed registry
- Change nice URL pattern from component.deployment.base_domain
  to component.subdomain.base_domain with globally unique subdomains
- Add root URL support (subdomain.base_domain) for component with root: true
- Add API endpoints for subdomain availability checking
- Update self-service portal with subdomain input and availability check
- Remove include_project_name field in favor of explicit subdomain
- P0: Register subdomain in project creation flow (create_application_manifests)
- P0: Clean up subdomain registrations on project deletion (delete_project)
- P0: Create root ingress for component marked with root: true
- P1: Add @validate_api_token to /api/subdomains list endpoint
- P2: Add subdomain format validation (DNS-safe chars, 1-63 chars)
- P3: Add reserved subdomain blocklist (www, api, admin, etc.)
- Update frontend to display validation errors in real-time
- Add 11 new tests for validation (total: 52 tests pass)
Functional fixes:
- P0: Fail deployment if subdomain registration fails (don't silently continue)
- P0: Add subdomain cleanup to deployment deletion (not just project deletion)
- P0: Validate only ONE component has root: true (prevent ingress conflicts)

Security fixes:
- P1: Add IP-based rate limiting to /api/subdomains/check (30 req/min)
- P2: Remove registered_to from response (prevent project name disclosure)

UX fixes:
- P2: Translate all validation errors to Dutch for consistency
- Update frontend to not show "in gebruik door" (no longer exposed)

New features:
- Add delete_by_deployment() method to SubdomainConnector
- Add IPRateLimiter class for HTTP endpoint rate limiting
- Validation messages support nl/en language parameter

Tests: 52 tests pass
P1 - Race condition: Use INSERT ... ON CONFLICT DO NOTHING for atomic
registration, handling race between availability check and register

P1 - Subdomain update handling: Add register_or_update_for_deployment
method that handles subdomain changes by deleting old registration
before creating new one

P2 - Base domain validation: Add validate_base_domain function that
checks against cluster-configured supported domains

P2 - Root component validation: Ensure root component has publish-on-web
enabled (required for root ingress creation)

P3 - Remove registered_to from API: Remove from SubdomainCheckResponse
schema entirely since it was always null for privacy

P3 - Form submit validation: Re-validate subdomain availability on
form submit to handle stale availability checks

Added tests for:
- validate_base_domain function
- get_supported_base_domains function
- BaseDomainValidationError exception
- register_or_update_for_deployment method
- Race condition handling in register method
P0 - CRITICAL: Add base-domain selector for nice-url mode
- Added base-domain dropdown in self-service portal
- Options filtered by cluster (local: kind/local, prod: rijks.app/rijksapps.nl)
- Base-domain is now required for nice-url mode
- Updated availability check to require base-domain selection
- Updated form submit validation to require base-domain
- Added cluster change listener to update base-domain options

P2 - Rate limiter memory leak fix
- Added periodic cleanup call in is_allowed() when >1000 IPs tracked
- Calls cleanup_old_entries() to remove stale entries

P3 - Code cleanup
- Removed unused datetime import from subdomain.py
- Made all error messages consistently Dutch:
  - "is already registered" -> "is al geregistreerd"
  - Updated test assertion accordingly
- Replace hardcoded cluster options in template with server-provided data
- Replace hardcoded clusterBaseDomains JS object with server config
- Remove non-existent odcn-staging cluster from UI
- Add helper functions in router_self_service.py to generate options
- Cluster options and base domains now come from single source: CLUSTER_CONFIG
Feature implementation:
- Add GET/POST endpoints for deployment domain settings
- Add modal UI for editing domain mode, subdomain, base domain
- Add root component selection for nice-url mode
- Real-time subdomain availability checking

Security hardening:
- CSRF protection via Origin/Referer validation (localhost only in DEBUG)
- Path traversal protection for project filenames
- XSS protection via DOM-safe message rendering
- Empty host header validation
- Sanitized commit messages

Robustness improvements:
- Transaction-like rollback for multi-step operations
- Proper rollback order (git first, then subdomain)
- Subdomain connector reuse to avoid multiple connections
- Complete rollback coverage including deleted subdomains

UX improvements:
- Modal with proper ARIA attributes (role, aria-modal, aria-labelledby)
- Escape key closes modal
- Backdrop click protection during submission
- Error messages placed before form actions
- Root component cleared when switching away from nice-url mode
When domain-mode is 'nice-url' with a base-domain, automatically set
issuer: letsencrypt to enable HTTPS by default. This applies to:

- Self-service form (web router)
- Edit domain settings (web router)
- API self-service endpoint

The auto-issuer logic:
- Only triggers when issuer is not explicitly set
- Preserves explicit issuer values (e.g., letsencrypt-staging)
- Removes issuer when switching away from nice-url mode

Added 8 tests for auto-issuer behavior.
P1 Critical fixes:
- Add subdomain rollback on project creation failure
- Validate base domain against cluster before Let's Encrypt auto-enable
- Remove project name leak from subdomain registration errors

P2 Significant fixes:
- Add pagination support to subdomain list endpoint (limit/offset)
- Fix X-Forwarded-For handling in rate limiter for proxy environments
- Fix potential memory leak in rate limiter with bounded tracking
- Correct misleading URL examples in nice-url help text

Security improvements:
- Rate limiter now properly extracts client IP behind reverse proxies
- Emergency purge mechanism prevents memory exhaustion under attack
- Error messages no longer reveal which project owns a subdomain
Security fixes:
- Rate limiter: Add multi-factor client identification (IP + browser
  fingerprint + session) to prevent X-Forwarded-For spoofing bypass
- Subdomain check API: Add base_domain validation to prevent probing
  of arbitrary/unsupported domains
- CSRF: Harden DEBUG localhost bypass to only work when request host
  is actually localhost, preventing accidental production weakening
- Reserved subdomains: Add ~100 security-critical names including
  RFC 2142 mailbox names (postmaster, abuse, security), auto-discovery
  endpoints (autoconfig, autodiscover, wpad), and payment-related names

Documentation:
- Add comprehensive comments explaining TOCTOU protection strategy
  (check_availability for UX + INSERT ON CONFLICT for atomicity)

Tests:
- Add 7 new tests for client identification and fingerprinting
Security:
- Add SSO authentication to subdomain availability check endpoint
- Move endpoint from /api/subdomains/check to /subdomains/check (web router)
- Prevents unauthenticated enumeration of registered subdomains

Functional fixes:
- Update Keycloak redirect URIs when domain settings change (F1)
- Validate all publish-on-web components have ports in nice-url mode (F2)
- Move validation before subdomain registration to prevent orphaned entries

UX improvements:
- Replace "Mooi URL" with "Eigen subdomein" (proper Dutch)
- Add example URLs to each domain mode option
- Show typing indicator during subdomain check debounce
- Return error codes from backend for robust translation
- Change root component selection from checkbox to radio button
- Add rate limiting to SSO-protected /subdomains/check web endpoint
  to prevent enumeration abuse by authenticated users
- Add subdomain format validation for "custom" domain mode to ensure
  DNS-compatible names and block reserved subdomains
- Add confirmation dialog when switching away from nice-url mode to
  warn users about permanent subdomain registration deletion
- Add tests for web subdomain rate limiter
Security fixes:
- Fix #1: Race condition in register_or_update_for_deployment - now uses
  atomic database transaction to prevent subdomain loss on failed updates
- Fix #2: Information disclosure - reserved subdomains now return generic
  "not available" message to prevent enumeration attacks
- Fix #3: API rate limiter - added per-IP secondary limit that cannot be
  bypassed by rotating browser fingerprints
- Fix #4: Rate limiter emergency purge - now uses gradual purge that only
  removes stale entries (>60s old) instead of arbitrary half, preventing
  attackers from evicting legitimate users

Additional improvements:
- Added new client rate limiting to detect identifier flooding attacks
- Updated frontend error code translations for consistent generic messages
- Updated tests to reflect new generic error messages
Security fix: Changed "door een ander project" (by another project) error
messages to generic "is niet beschikbaar" (is not available) to prevent
information disclosure. Attackers cannot determine whether a subdomain
is taken by another project or simply reserved.
Security fix: Updated all user-facing error messages to use generic
"Subdomein niet beschikbaar" instead of revealing whether a subdomain
is taken by another project or reserved. This prevents enumeration
attacks in both self-service portal and project details templates.
Security headers added for internet.nl compliance:
- HSTS: max-age=31536000; includeSubDomains; preload (both HAProxy and NGINX)
- X-Content-Type-Options: nosniff
- X-Frame-Options: DENY
- Referrer-Policy: strict-origin-when-cross-origin
- Permissions-Policy: geolocation=(), microphone=(), camera=()

These headers ensure websites hosted on the platform achieve high scores
on internet.nl tests for modern internet standards compliance.

References:
- https://internet.nl/faqs/https/
- https://internet.nl/faqs/appsecpriv/
@uittenbroekrobbert uittenbroekrobbert merged commit a5c1d67 into main Jan 28, 2026
3 of 4 checks passed
uittenbroekrobbert added a commit that referenced this pull request May 5, 2026
The single-deployment endpoint stays strict (503 on any fetch failure
— there's one resource, partial truth misleads). But the list endpoint
gets lenient: when one deployment's fetch raises, that deployment is
returned with status=null, status_reason=Unavailable, and the others
come back normally. Whole-backend-down still 503s.

This keeps a CLI's `deployment list` working through partial outages —
without that, one broken deployment in a project would 503 the whole
list and make it impossible to render the others.

The status_reason field also disambiguates the existing "status: null"
overload: it now means either Pending (cluster doesn't have an
Application for this deployment yet) or Unavailable (fetch raised in
lenient list mode). status_reason is null when status is set.

- Add status_reason: StatusReason | None on DeploymentDetail
- Split _fetch_one_deployment_status into the strict per-deployment
  helper (used by single endpoint) and a lenient batch wrapper (used
  by list endpoint)
- _fetch_one_deployment_status returns (status, reason) so callers
  can attribute null status to the right cause
- Add test_partial_failure_marks_one_unavailable_returns_others
- Update test_app_not_yet_known to assert status_reason=Pending
- Update feature doc to describe the two-tier failure model
uittenbroekrobbert added a commit that referenced this pull request May 6, 2026
…#57)

* Complete issue #51 — deployment read endpoints (status, errors, logs)

Initial implementation in 430eeec shipped components/images/URLs but
omitted the status fields the issue asked for and any debug data for
stuck deployments.

This commit adds:

- ArgoCD-sourced reconciliation status on the response: sync_status,
  health_status, revision (full SHA), last_synced_at. Returned as a
  nested DeploymentStatus sub-object so source-of-truth is encoded
  in the schema. status is null when the cluster has no Application
  for the deployment yet.
- When health_status != Healthy, the response also includes:
  - errors[]: aggregated from Argo (resources, resource tree,
    conditions, syncResult) and namespace events (kubectl)
  - logs{}: per-component tail (default 50 lines, capped at 500
    via hidden log_lines query param)
- Drop the image_pull_policy field that was added without being asked
  for.
- Backend-neutral naming (status, DeploymentStatus, "Deployment status
  backend is unreachable") so the source of truth can change without
  breaking callers.

503 is returned when the status backend is unreachable or any
per-deployment fetch raises — partial state would be misleading.

Implementation:

- New module opi/services/deployment_diagnostics.py with two pure-data
  helpers (gather_deployment_errors, gather_component_logs). No
  FastAPI/Pydantic deps.
- opi/api/v2/router.py uses the helpers; healthy deployments still
  pay only the single Argo status call.
- opi/web/router.py was duplicating the entire error-gather logic;
  refactored to call the shared helper, deleting ~92 lines. Dutch
  presentation (interpret_argocd_errors, age strings) stays UI-side.

Tests: 17 new for deployment_diagnostics, 4 new V2 endpoint tests
(unhealthy populates diagnostics, log_lines cap, etc.).

* PR feedback #1: drop status.logs and log_lines query param

Logs aren't status. The existing GET /api/logs/{project_name} (HTTP)
and WS /api/logs/stream/{project_name} endpoints already serve general
log access — embedding up to 500 lines × N components in every read of
an unhealthy deployment is wasted bytes for polling clients (CLI tab
completion, dashboards). Errors stay (cluster events are cheap and
distinct from logs); logs go to their own endpoint where they belong.

- Drop logs field from DeploymentStatus
- Drop log_lines query param from both endpoints
- Drop gather_component_logs helper and tests
- Drop kubectl log mocks from V2 tests
- Update feature doc to point at the existing /api/logs endpoints

* PR feedback #2: type sync_status/health_status as enums

Replace free-string sync_status and health_status with StrEnum types
so OpenAPI emits enum: [...] (consumers can validate, generate typed
clients, and know what to expect). Also add StatusReason enum for use
with the lenient-list change in the next commit.

PascalCase values mirror what ArgoCD already returns (Synced, Healthy,
etc.), keeping the wire format identical.

Safe extractors (_safe_sync_status, _safe_health_status) map any
unknown value Argo might emit in the future to "Unknown" rather than
raising — so a new Argo state doesn't break our response.

- Add SyncStatus, HealthStatus, StatusReason in opi/api/v2/models.py
- DeploymentStatus.sync_status/health_status now use the enum types
- Add safe extractors in opi/api/v2/router.py
- StatusReason is unused this commit; lenient-list commit consumes it

* PR feedback #2 + #3: lenient list endpoint + status_reason

The single-deployment endpoint stays strict (503 on any fetch failure
— there's one resource, partial truth misleads). But the list endpoint
gets lenient: when one deployment's fetch raises, that deployment is
returned with status=null, status_reason=Unavailable, and the others
come back normally. Whole-backend-down still 503s.

This keeps a CLI's `deployment list` working through partial outages —
without that, one broken deployment in a project would 503 the whole
list and make it impossible to render the others.

The status_reason field also disambiguates the existing "status: null"
overload: it now means either Pending (cluster doesn't have an
Application for this deployment yet) or Unavailable (fetch raised in
lenient list mode). status_reason is null when status is set.

- Add status_reason: StatusReason | None on DeploymentDetail
- Split _fetch_one_deployment_status into the strict per-deployment
  helper (used by single endpoint) and a lenient batch wrapper (used
  by list endpoint)
- _fetch_one_deployment_status returns (status, reason) so callers
  can attribute null status to the right cause
- Add test_partial_failure_marks_one_unavailable_returns_others
- Update test_app_not_yet_known to assert status_reason=Pending
- Update feature doc to describe the two-tier failure model

* PR feedback #4: typed ErrorCategory + human explanation on StatusError

Today, clients have to regex over the message field to tell an
ImagePullBackOff from a probe timeout from an OOMKill. Server-side,
the categorization is straightforward — and once expressed as a typed
enum it lets CLIs and dashboards filter, group, and colorize without
string matching.

Two new fields on StatusError:

- category: ErrorCategory — for automation. One of ImagePull, CrashLoop,
  OutOfMemory, HealthCheck, SyncFailed, ComparisonError, Unknown.
  PascalCase to match the rest of the V2 enums.
- explanation: str | None — for humans. Static, curated remediation
  guidance per category. Null for Unknown.

Categories are intentionally broader than literal K8s reasons (e.g.
ImagePull covers ImagePullBackOff, ErrImagePull, manifest-unknown
pulls) so app-level categories can be added later without binding the
contract to specific K8s state names.

Categorization logic lives in opi/services/deployment_diagnostics.py
next to gather_deployment_errors, keeping the data layer co-located.

- Add ErrorCategory enum
- Add category + explanation fields to StatusError
- Add categorize_error() helper with explanation lookup table
- Wire categorizer into _fetch_one_deployment_status
- 14 new categorizer tests + assertions in V2 endpoint test
- Update feature doc with new fields + example

* PR feedback #5: clarify last_synced_at is the last attempt, not last success

Today's value comes from operationState.finishedAt with a fallback to
status.reconciledAt — both of which fire on every sync, regardless of
outcome. For a Degraded deployment that means last_synced_at can be
the timestamp of a failed sync attempt, which is misleading if read as
"last successful deploy."

Doc-only fix here. The proper field split (last_attempt_at +
last_success_at) is a follow-up; this commit just makes the existing
field's semantics honest.

* PR feedback: collapse status into a single enum

The previous design had `status: DeploymentStatus | null` with a
separate `status_reason: StatusReason | null` (Pending/Unavailable).
Two fields, two-level nullability, and a sub-object containing typed
sync_status + health_status enums. Consumers had to switch on a tree.

Replace with one top-level `status` enum that subsumes everything:

  Healthy, Degraded, Progressing, OutOfSync, Suspended, Missing,
  Pending, Unavailable, Unknown

Argo's two orthogonal dimensions (sync, health) are collapsed using a
worst-of-both priority: Degraded/Suspended/Missing > OutOfSync >
Progressing > Healthy > Unknown. Pending and Unavailable replace the
old nullable+reason combo for "we have no data" cases.

What's lost: callers can't distinguish "OutOfSync + Degraded" from
plain Degraded anymore. For the user-facing question "what state is
my deployment in?", Degraded is what matters. If anyone needs both
dimensions later, a debug field can be added — YAGNI for now.

DeploymentDetail flattens to:
  status (enum, always set)
  sync_revision (was status.revision)
  last_synced_at
  errors[]

- Drop SyncStatus, HealthStatus, StatusReason enums; keep
  DeploymentStatus as the single StrEnum and ErrorCategory unchanged
- Drop nested DeploymentStatus model (became the enum)
- Add internal _LiveStatus NamedTuple for orchestrator/builder hand-off
- _collapse_argo_status() does the priority mapping
- Tests assert on flat shape: data["status"] == "Healthy" etc.
- Doc explains the collapse rules and the value table
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants