Summary
migrations/006_service_keys.up.sql:15 declares:
identity_id UUID REFERENCES identities(id) ON DELETE CASCADE,
But the migration uses CREATE TABLE IF NOT EXISTS, so on any deployment where service_keys already existed before this constraint was added (the dev1 / stage1 / prod lineage), the table keeps its non-cascading FK. CREATE TABLE IF NOT EXISTS is a no-op when the table is present — it does not reconcile column / constraint differences.
Fresh deployments come up with the cascading FK as declared. Legacy deployments do not. The declared schema lies about live state.
How this surfaced
highflame-authn#109 — DELETE /agents/registry/{id} (and /identities/{id}) 500'd with:
update or delete on table "identities" violates foreign key constraint
"service_keys_identity_id_fkey" on table "service_keys" (SQLSTATE=23503)
because the hard-delete handler tripped the non-cascading FK on any identity with a service key — and every registered agent gets a bootstrap key, so this broke every agent delete from Studio's registry on legacy DBs.
zeroid#187 fixed the user-visible bug by switching both user-facing DELETE handlers to soft delete (DeactivateIdentity), which sidesteps the FK entirely and matches the platform "never hard DELETE" convention. That was the right immediate call: no risky migration, audit trail preserved, no user-visible regression.
This issue is the latent schema bug the workaround left behind.
Why it still matters
The hard-delete path is not gone — it is now reserved for PurgeIdentity, used as the compensating rollback of a half-created identity in AgentService.RegisterAgent. That path runs before any service key is persisted (CreateKey writes its row last), so it works today by timing — there's no service-key row to trip the FK during rollback.
That's fragile. A future refactor that ever inserts a key row before the rollback site, or any other internal caller of PurgeIdentity, will silently break on legacy DBs and pass on fresh ones — exactly the drift pattern that caused authn#109.
Also: any future compliance / GDPR-erasure path that does a controlled hard-delete (separate from the day-to-day Studio "Delete" button — see the corresponding Studio UX consolidation work) will need the cascade in place.
Proposed fix
Add a new migration (019_service_keys_fk_cascade or whatever the next free number is — confirm against migrations/ at branch time) that reconciles the constraint on legacy DBs:
-- 019_service_keys_fk_cascade.up.sql
-- Reconcile service_keys.identity_id FK to match migration 006's declared
-- ON DELETE CASCADE on legacy deployments where CREATE TABLE IF NOT EXISTS
-- was a no-op (the table predates the cascade and the constraint never got
-- re-applied). Fresh deployments are a no-op (constraint flip to identical
-- shape). See highflame-authn#109 / zeroid#187 for the user-visible incident
-- this drift originally produced.
ALTER TABLE service_keys
DROP CONSTRAINT IF EXISTS service_keys_identity_id_fkey;
ALTER TABLE service_keys
ADD CONSTRAINT service_keys_identity_id_fkey
FOREIGN KEY (identity_id) REFERENCES identities(id) ON DELETE CASCADE;
And a matching 019_service_keys_fk_cascade.down.sql that flips back to ON DELETE NO ACTION (or whatever the legacy default was — verify on a dev1 snapshot before finalising) so rollback is symmetric.
Locking / safety analysis
- Both ALTERs take
ACCESS EXCLUSIVE on service_keys for the duration of the constraint flip.
service_keys is small (one row per issued API key — order of thousands at most in current deployments), so the validation scan is fast and the exclusive hold should be sub-second.
- DROP CONSTRAINT is metadata-only; no rewrite. ADD CONSTRAINT does a full-table validation scan against
identities to verify the FK invariant holds (which it must, since the existing constraint already enforces referential integrity — the only change is the cascade behaviour on parent delete).
- No risk to in-flight reads: ACCESS EXCLUSIVE blocks concurrent writes on
service_keys only. identities is untouched.
- Run
migration-analyzer before merge for an independent locking review.
Acceptance
Out of scope
- Studio UX consolidation (Delete button is functionally redundant post-soft-delete). Tracked separately.
- Any user-facing hard-delete / GDPR-erasure path. That's a separate compliance ticket with elevated auth scoping; this issue is purely about closing the declared-vs-actual schema drift.
Summary
migrations/006_service_keys.up.sql:15declares:But the migration uses
CREATE TABLE IF NOT EXISTS, so on any deployment whereservice_keysalready existed before this constraint was added (the dev1 / stage1 / prod lineage), the table keeps its non-cascading FK.CREATE TABLE IF NOT EXISTSis a no-op when the table is present — it does not reconcile column / constraint differences.Fresh deployments come up with the cascading FK as declared. Legacy deployments do not. The declared schema lies about live state.
How this surfaced
highflame-authn#109—DELETE /agents/registry/{id}(and/identities/{id}) 500'd with:because the hard-delete handler tripped the non-cascading FK on any identity with a service key — and every registered agent gets a bootstrap key, so this broke every agent delete from Studio's registry on legacy DBs.
zeroid#187 fixed the user-visible bug by switching both user-facing DELETE handlers to soft delete (
DeactivateIdentity), which sidesteps the FK entirely and matches the platform "never hard DELETE" convention. That was the right immediate call: no risky migration, audit trail preserved, no user-visible regression.This issue is the latent schema bug the workaround left behind.
Why it still matters
The hard-delete path is not gone — it is now reserved for
PurgeIdentity, used as the compensating rollback of a half-created identity inAgentService.RegisterAgent. That path runs before any service key is persisted (CreateKeywrites its row last), so it works today by timing — there's no service-key row to trip the FK during rollback.That's fragile. A future refactor that ever inserts a key row before the rollback site, or any other internal caller of
PurgeIdentity, will silently break on legacy DBs and pass on fresh ones — exactly the drift pattern that caused authn#109.Also: any future compliance / GDPR-erasure path that does a controlled hard-delete (separate from the day-to-day Studio "Delete" button — see the corresponding Studio UX consolidation work) will need the cascade in place.
Proposed fix
Add a new migration (
019_service_keys_fk_cascadeor whatever the next free number is — confirm againstmigrations/at branch time) that reconciles the constraint on legacy DBs:And a matching
019_service_keys_fk_cascade.down.sqlthat flips back toON DELETE NO ACTION(or whatever the legacy default was — verify on a dev1 snapshot before finalising) so rollback is symmetric.Locking / safety analysis
ACCESS EXCLUSIVEonservice_keysfor the duration of the constraint flip.service_keysis small (one row per issued API key — order of thousands at most in current deployments), so the validation scan is fast and the exclusive hold should be sub-second.identitiesto verify the FK invariant holds (which it must, since the existing constraint already enforces referential integrity — the only change is the cascade behaviour on parent delete).service_keysonly.identitiesis untouched.migration-analyzerbefore merge for an independent locking review.Acceptance
ON DELETE CASCADEon dev1 / stage1 / prod (verify via\d service_keyspost-migration that the constraint matches migration 006's declaration).PurgeIdentityis exercised end-to-end against a dev1 snapshot with a service-key-bearing identity and succeeds (today it succeeds only because the rollback site writes the key row last; this confirms the FK itself allows the cascade).DELETE /agents/registry/{id}continues to soft-delete viaDeactivateIdentity).Out of scope