[CRITICAL] Deleting all agents leaves messages behind — verify FK cascade runs on existing DBs vs. silent UI refresh failure #43

@MelbourneDeveloper

Summary

Deleting all agents from the VSIX did not remove all messages — message rows/UI entries survived after every agent was gone. Either (A) the DB is not actually cascade-deleting messages on the live database, or (B) the UI is not reflecting the deletion (stale view / silent refresh failure). Both are plausible given the evidence below. Investigate and confirm which before fixing. Per request: do NOT fix yet — this issue is to nail the root cause.


Hypothesis A — FK cascade isn't actually running on the live DB

The schema declares the right cascades, and a fresh DB passes E2E — but an existing data.db may not, because of how SQLite stores FK actions.

Schema is correct (packages/too-many-cooks/prisma/schema.prisma):

  • Message.from (from_agent) and Message.to (to_agent) both onDelete: Cascade (lines ~47-48)
  • Lock.identity onDelete: Cascade (line ~34), Plan.identity onDelete: Cascade (line ~73)

Pragma is set per connection (packages/too-many-cooks/src/db-sqlite.ts:153): db.pragma("foreign_keys = ON").

Delete relies purely on cascade (db-sqlite.ts:917-945): adminDeleteAgent runs a single DELETE FROM identity WHERE agent_name = ? and trusts the DB to cascade.

Fresh-DB E2E passes (too_many_cooks_vscode_extension/test/suite/deleteAllAgents.test.ts): the test proves cascade works on a newly created DB.

Why an existing DB can still leak messages

The messages.to_agent cascade was added only recently — migration packages/too-many-cooks/prisma/migrations/20260525000000_add_to_agent_fk_cascade/migration.sql. SQLite bakes FK actions into the table DDL at CREATE time; PRAGMA foreign_keys = ON cannot retroactively add a cascade. That migration is a full table rebuild (RedefineTables: create new_messages with both FKs → copy → drop → rename). So whether a given data.db cascades inbound messages depends entirely on that migration having actually run against it.

Drift risk between two schema-apply paths:

  • Boot path uses prisma migrate deploy: db-sqlite.ts:148 (applyMigrations), error string "Prisma migrate deploy failed", log "Schema applied via prisma migrate deploy".
  • But packages/too-many-cooks/src/migrate.ts uses prisma db push --accept-data-loss.

A DB ever created via db push has no _prisma_migrations history; a later migrate deploy can then fail/skip applying 20260525..., leaving the old messages table without the to_agent cascade. Result: deleting an agent removes the agent but orphans every message addressed to it — exactly the reported symptom (messages survive agent deletion). CLAUDE.md states there is no legacy DB migration support ("delete the stale DB and recreate"), which makes a pre-cascade data.db a live hazard rather than a handled case.
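
The migration-history question can be reduced to a small decision function, sketched below. It assumes the rows come from SELECT migration_name, finished_at, rolled_back_at FROM _prisma_migrations, and that the caller passes null when that table does not exist (the expected state for a DB created only via db push). The names MigrationRow, MigrationStatus, and cascadeMigrationStatus are hypothetical and not part of the codebase.

```typescript
// Subset of the columns Prisma records in _prisma_migrations.
interface MigrationRow {
  migration_name: string;
  finished_at: string | number | null; // null while a migration has not completed
  rolled_back_at: string | number | null; // non-null if it was marked rolled back
}

type MigrationStatus = "applied" | "incomplete" | "missing" | "no-history";

// rows === null means the _prisma_migrations table itself is absent.
function cascadeMigrationStatus(
  rows: MigrationRow[] | null,
  prefix: string = "20260525",
): MigrationStatus {
  if (rows === null) return "no-history";

  const matches = rows.filter((r) => r.migration_name.startsWith(prefix));
  if (matches.length === 0) return "missing";

  // Count the migration as applied only if some attempt finished
  // and was not rolled back afterwards.
  const finished = matches.some(
    (r) => r.finished_at !== null && r.rolled_back_at === null,
  );
  return finished ? "applied" : "incomplete";
}
```

Any result other than "applied" would point at the drift scenario described above, though the messages DDL itself remains the authoritative evidence.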

Investigation steps (read-only)

On an affected data.db:

SELECT sql FROM sqlite_master WHERE name = 'messages';   -- does to_agent FK say ON DELETE CASCADE?
PRAGMA foreign_key_list('messages');                     -- both from_agent AND to_agent present with cascade?
SELECT * FROM _prisma_migrations WHERE migration_name LIKE '20260525%';  -- was the cascade migration applied?
PRAGMA foreign_keys;                                     -- is it ON for this connection?

If the messages DDL lacks ON DELETE CASCADE on to_agent (or the migration row is missing) → Hypothesis A confirmed: the live schema, not the code, is the bug, and the migrate deploy vs db push drift is the cause.
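
The foreign_key_list check above could be automated along these lines. This is a minimal sketch: it assumes the rows are the output of PRAGMA foreign_key_list('messages'), which SQLite returns with from, to, table, and on_delete columns among others. The ForeignKeyRow type and columnsMissingCascade helper are hypothetical names introduced here for illustration.

```typescript
// Subset of the columns returned by PRAGMA foreign_key_list(<table>).
interface ForeignKeyRow {
  table: string; // referenced (parent) table
  from: string; // FK column in the child table
  to: string | null; // referenced column in the parent table
  on_delete: string; // e.g. "CASCADE", "NO ACTION", "RESTRICT"
}

// Returns the required FK columns that do NOT have ON DELETE CASCADE,
// either because the FK is absent or because it uses another action.
function columnsMissingCascade(
  rows: ForeignKeyRow[],
  required: string[],
): string[] {
  return required.filter(
    (column) =>
      !rows.some(
        (r) => r.from === column && r.on_delete.toUpperCase() === "CASCADE",
      ),
  );
}
```

Calling columnsMissingCascade(rows, ["from_agent", "to_agent"]) on an affected DB and getting a non-empty array back would support Hypothesis A. An empty array means both cascades are baked into the table definition.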


Hypothesis B — UI not reactive / silent refresh failure

The delete path doesn't mutate messages locally; it refetches server truth — but that refetch can silently no-op, leaving stale messages on screen.

Delete path (too_many_cooks_vscode_extension/src/services/storeManager.ts:238-257):
deleteAgent / deleteAllAgents POST /admin/delete-agent (once per agent), then call refreshStatus().

refreshStatus() swallows failures silently (storeManager.ts:196-227):

  • Line 204 & 210: a refreshSeq race guard early-returns if a newer refresh started — if requests overlap, an in-flight refresh can bail without ever dispatching SetMessages.
  • Lines 205-208: a non-ok HTTP response is logged and swallowed: the function hits return; with no error surfaced and no state update. The UI keeps showing the pre-delete messages and the user sees no indication anything failed.
    Only on the happy path does it dispatch({ messages, type: 'SetMessages' }) (line 225) with server truth.
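
To make the three exits of refreshStatus() distinguishable, its decision logic could be modelled as a function that returns an explicit outcome instead of a bare return. The sketch below is not the current implementation. StatusResponse, RefreshOutcome, and resolveRefresh are hypothetical names, and the response shape (ok, status, messages) is an assumption.

```typescript
// Assumed shape of the already-parsed status response.
interface StatusResponse<T> {
  ok: boolean;
  status: number;
  messages: T[];
}

// Every exit path is named, so the caller has to handle each one.
type RefreshOutcome<T> =
  | { kind: "applied"; messages: T[] } // safe to dispatch SetMessages
  | { kind: "superseded" } // a newer refresh owns the update
  | { kind: "failed"; status: number }; // must be surfaced, not swallowed

function resolveRefresh<T>(
  response: StatusResponse<T>,
  startedSeq: number, // refreshSeq captured when this refresh began
  latestSeq: number, // refreshSeq at the time the response arrived
): RefreshOutcome<T> {
  if (startedSeq !== latestSeq) return { kind: "superseded" };
  if (!response.ok) return { kind: "failed", status: response.status };
  return { kind: "applied", messages: response.messages };
}
```

With this shape, a caller could dispatch SetMessages on "applied" and report "failed" to the user, for example through vscode.window.showErrorMessage, so a stale tree is never shown without explanation.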

Latent reducer landmine (too_many_cooks_vscode_extension/src/state/store.ts:13-25):
the RemoveAgent reducer filters agents, locks, and plans for the removed agent but not messages. It's currently never dispatched (dead branch — grep finds no dispatch({ type: 'RemoveAgent' })), so it isn't the active cause, but if anyone later wires optimistic single-agent removal to it, it will leave orphaned messages in the store. Should be fixed for consistency.
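
For reference, a RemoveAgent reducer that also drops the agent's messages might look like the sketch below. The state shape and field names (agentName, fromAgent, toAgent) are assumptions made for illustration and may not match the real types in store.ts.

```typescript
// Hypothetical state shape; the real store's field names may differ.
interface AgentMessage {
  fromAgent: string;
  toAgent: string;
  content: string;
}

interface AppState {
  agents: { agentName: string }[];
  locks: { agentName: string; filePath: string }[];
  plans: { agentName: string; goal: string }[];
  messages: AgentMessage[];
}

// Immutable update: returns a new state with every trace of the agent
// removed, mirroring both DB cascades (from_agent and to_agent).
function removeAgent(state: AppState, agentName: string): AppState {
  return {
    ...state,
    agents: state.agents.filter((a) => a.agentName !== agentName),
    locks: state.locks.filter((l) => l.agentName !== agentName),
    plans: state.plans.filter((p) => p.agentName !== agentName),
    messages: state.messages.filter(
      (m) => m.fromAgent !== agentName && m.toAgent !== agentName,
    ),
  };
}
```

Filtering on both sender and recipient keeps the optimistic store consistent with what the server would return after the cascade.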


State-architecture audit (re: "is everything on screen using signals / centralized state?")

  • State IS centralized in a single immutable store — src/state/store.ts (Store class, getState/dispatch/subscribe, immutable spread updates). Single source of truth. ✅
  • It is NOT signal-based — it's a hand-rolled Redux-style EventEmitter. Tree views subscribe and re-render on every dispatch: e.g. MessagesTreeProvider fires onDidChangeTreeData on any store change (src/ui/tree/messagesTreeProvider.ts:27-30) and re-reads selectMessages(state) in getChildren. So reactivity wiring is present and centralized. ✅
  • The gap is not scattered/global mutable UI state; it's (1) the silent refreshStatus failure path and (2) the incomplete RemoveAgent reducer. So if the symptom is UI-side, the root cause is a silent refresh no-op, not a missing-signal problem.

Repro plan (do NOT fix yet)

  1. Reproduce against an existing data.db that predates 20260525... (or one created via db push). Send messages between agents A→B and B→A, delete all agents, then query the DB directly (Hypothesis A queries above) and observe the VSIX message tree. Compare DB rows vs. UI.
  2. If DB still has message rows → A (cascade not applied on this DB / migrate-deploy-vs-db-push drift).
  3. If DB rows are gone but UI still shows them → B (refresh silently failed or didn't fire). Check the extension log for refreshStatus: response not ok / a swallowed return.
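
The decision in steps 2 and 3 can be written down as a small helper for a repro script. This is only a sketch: classifyRepro and its return labels are hypothetical, and the two counts are assumed to be taken after deleting all agents (message rows in the DB, and messages shown in the VSIX tree).

```typescript
type ReproResult = "A" | "B" | "not-reproduced";

// dbMessageRows: SELECT COUNT(*) FROM messages after the delete.
// uiMessageCount: number of entries still visible in the message tree.
function classifyRepro(
  dbMessageRows: number,
  uiMessageCount: number,
): ReproResult {
  // Rows survived in the DB: the cascade did not run on this database.
  if (dbMessageRows > 0) return "A";
  // DB is clean but the UI still shows messages: the refresh did not land.
  if (uiMessageCount > 0) return "B";
  return "not-reproduced";
}
```

A result of "A" does not rule out that the refresh path is also failing, so the extension log is still worth checking in that case.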

Acceptance criteria (for the eventual fix)

  • Deleting all agents leaves zero message rows in the DB and zero messages in the UI, verified on a DB that predates the to_agent cascade migration (not just a fresh DB).
  • A single source-of-truth for schema application (no migrate deploy vs db push divergence) OR an explicit guard that detects a pre-cascade messages table and rebuilds it.
  • refreshStatus surfaces failures instead of swallowing them (no silent stale UI).
  • RemoveAgent reducer also filters messages (consistency, even though currently unused).
  • Regression tests covering both a fresh DB and a simulated pre-cascade DB.

Do not fix in this issue — confirm the root cause first. Marked critical: deleting agents leaving live message rows is a data-integrity / privacy concern.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), critical (Critical: data integrity, security, or correctness)
