Skip to content

bug: service_keys.identity_id FK declared CASCADE in migration 006 but legacy DBs carry non-cascading constraint #196

@saucam

Description

@saucam

Summary

migrations/006_service_keys.up.sql:15 declares:

identity_id     UUID REFERENCES identities(id) ON DELETE CASCADE,

But the migration uses CREATE TABLE IF NOT EXISTS, so on any deployment where service_keys already existed before this constraint was added (the dev1 / stage1 / prod lineage), the table keeps its non-cascading FK. CREATE TABLE IF NOT EXISTS is a no-op when the table is present — it does not reconcile column / constraint differences.

Fresh deployments come up with the cascading FK as declared. Legacy deployments do not. The declared schema lies about live state.

How this surfaced

highflame-authn#109DELETE /agents/registry/{id} (and /identities/{id}) 500'd with:

update or delete on table "identities" violates foreign key constraint
"service_keys_identity_id_fkey" on table "service_keys" (SQLSTATE=23503)

because the hard-delete handler tripped the non-cascading FK on any identity with a service key — and every registered agent gets a bootstrap key, so this broke every agent delete from Studio's registry on legacy DBs.

zeroid#187 fixed the user-visible bug by switching both user-facing DELETE handlers to soft delete (DeactivateIdentity), which sidesteps the FK entirely and matches the platform "never hard DELETE" convention. That was the right immediate call: no risky migration, audit trail preserved, no user-visible regression.

This issue is the latent schema bug the workaround left behind.

Why it still matters

The hard-delete path is not gone — it is now reserved for PurgeIdentity, used as the compensating rollback of a half-created identity in AgentService.RegisterAgent. That path runs before any service key is persisted (CreateKey writes its row last), so it works today by timing — there's no service-key row to trip the FK during rollback.

That's fragile. A future refactor that ever inserts a key row before the rollback site, or any other internal caller of PurgeIdentity, will silently break on legacy DBs and pass on fresh ones — exactly the drift pattern that caused authn#109.

Also: any future compliance / GDPR-erasure path that does a controlled hard-delete (separate from the day-to-day Studio "Delete" button — see the corresponding Studio UX consolidation work) will need the cascade in place.

Proposed fix

Add a new migration (019_service_keys_fk_cascade or whatever the next free number is — confirm against migrations/ at branch time) that reconciles the constraint on legacy DBs:

-- 019_service_keys_fk_cascade.up.sql
-- Reconcile service_keys.identity_id FK to match migration 006's declared
-- ON DELETE CASCADE on legacy deployments where CREATE TABLE IF NOT EXISTS
-- was a no-op (the table predates the cascade and the constraint never got
-- re-applied). Fresh deployments are a no-op (constraint flip to identical
-- shape). See highflame-authn#109 / zeroid#187 for the user-visible incident
-- this drift originally produced.
ALTER TABLE service_keys
    DROP CONSTRAINT IF EXISTS service_keys_identity_id_fkey;

ALTER TABLE service_keys
    ADD CONSTRAINT service_keys_identity_id_fkey
    FOREIGN KEY (identity_id) REFERENCES identities(id) ON DELETE CASCADE;

And a matching 019_service_keys_fk_cascade.down.sql that flips back to ON DELETE NO ACTION (or whatever the legacy default was — verify on a dev1 snapshot before finalising) so rollback is symmetric.

Locking / safety analysis

  • Both ALTERs take ACCESS EXCLUSIVE on service_keys for the duration of the constraint flip.
  • service_keys is small (one row per issued API key — order of thousands at most in current deployments), so the validation scan is fast and the exclusive hold should be sub-second.
  • DROP CONSTRAINT is metadata-only; no rewrite. ADD CONSTRAINT does a full-table validation scan against identities to verify the FK invariant holds (which it must, since the existing constraint already enforces referential integrity — the only change is the cascade behaviour on parent delete).
  • No risk to in-flight reads: ACCESS EXCLUSIVE blocks concurrent writes on service_keys only. identities is untouched.
  • Run migration-analyzer before merge for an independent locking review.

Acceptance

  • Migration drops + re-adds the FK with ON DELETE CASCADE on dev1 / stage1 / prod (verify via \d service_keys post-migration that the constraint matches migration 006's declaration).
  • PurgeIdentity is exercised end-to-end against a dev1 snapshot with a service-key-bearing identity and succeeds (today it succeeds only because the rollback site writes the key row last; this confirms the FK itself allows the cascade).
  • No live request path observable regression (this isn't a user-facing change — DELETE /agents/registry/{id} continues to soft-delete via DeactivateIdentity).
  • Down migration round-trips cleanly on a fresh DB and on a legacy snapshot.

Out of scope

  • Studio UX consolidation (Delete button is functionally redundant post-soft-delete). Tracked separately.
  • Any user-facing hard-delete / GDPR-erasure path. That's a separate compliance ticket with elevated auth scoping; this issue is purely about closing the declared-vs-actual schema drift.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions