feat(dedup): opt-in near-duplicate observation dedup — Tier-0 auto-merge + Tier-1 candidates (#3038) by crippledgeek · Pull Request #3063 · thedotmack/claude-mem

crippledgeek · 2026-06-26T00:07:56Z

Opt-in, off-by-default near-duplicate dedup for observations. content_hash only
catches byte-identical rows, so paraphrased observations of the same recurring work
accumulate as noise. This adds two deterministic tiers, validated against a real
7,651-observation DB before any code was written (see the diagnostic on #3038).

This is the conservative interpretation of #3038 (exact auto-merge + fuzzy
candidate-flagging, not fuzzy auto-merge). The data showed fuzzy lexical
auto-merge is unsafe — happy to switch to auto-merge if you'd prefer; I asked on
the issue. The LLM-adjudication tier that could safely resolve the residual cases
is deliberately deferred to a follow-up PR.

What it does (all gated on `CLAUDE_MEM_DEDUP_ENABLED`, default `false`)

- Tier 0: exact-normalized-title → silent auto-merge. A new observation whose title
  matches an existing one after lowercasing, whitespace collapse, and punctuation stripping
  is collapsed onto the existing row (cross-session), and that row's occurrence_count is
  bumped. O(1) via an indexed title_norm_key (sha256(project + normalizeTitle)), mirroring
  the content_hash pattern. The key is NULL for empty, emoji-only, or punctuation-only
  titles, so they never collapse into each other.
- Tier 1: IDF-weighted TF-IDF cosine + an IDF "veto" → review-only candidates. Records
  near-dup candidates in observation_dedup_candidates (never auto-merged). The veto
  (a Fellegi–Sunter blocking key) rejects pairs that differ only in a rare discriminating
  token (rdlp-api vs rdlp-plugin, ffmpeg-7.1 vs 6.1), which plain string similarity
  wrongly scores ~0.9.
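
For readers who want a concrete picture of the Tier-0 key, here is a minimal sketch. The function names `normalizeTitle` and `computeTitleNormKey` come from this PR, but the bodies below are assumptions: the exact regexes, the choice to turn punctuation into a space, and the NUL separator between project and title are illustrative only.

```typescript
import { createHash } from "node:crypto";

// Assumed normalization rules: lowercase, replace every run of characters
// that is not a letter, digit, or whitespace with a space, then collapse
// whitespace. Emoji and punctuation therefore vanish.
function normalizeTitle(title: string | null | undefined): string {
  return (title ?? "")
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]+/gu, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Returns null when the title normalizes to "", so empty, emoji-only, and
// punctuation-only titles never share a key (the data-loss guard).
// The "\u0000" separator is an assumption, not taken from the PR.
function computeTitleNormKey(
  project: string,
  title: string | null | undefined,
): string | null {
  const norm = normalizeTitle(title);
  if (norm === "") return null;
  return createHash("sha256").update(`${project}\u0000${norm}`).digest("hex");
}
```

Because the key is precomputed in application code, the database only needs an ordinary index on the stored string to make the lookup O(1).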

Why this shape (validation evidence)

On the real DB at production thresholds: SimHash-over-narrative was useless (false-positive
dominated); the signal lives in the title; exact-normalized-title found 35–47 real
redundancies content_hash misses with zero false positives; and fuzzy lexical auto-merge
would have destroyed genuinely-distinct work that differs in one token — hence Tier-1 is
review-only. Semantic paraphrases (different words, same meaning) need embeddings and are out
of scope here.

Surfaces

- Migration v33 (pure SQL, idempotent): occurrence_count, token_df (IDF model),
  dedup_meta, observation_dedup_candidates.
- POST /api/dedup/scan: opt-in idempotent backfill (so existing DBs participate) + bounded
  inverted-index corpus sweep. Gated on the flag + single-flight + row-cap.
- GET /api/dedup/candidates?project= (read-only candidate list).
- 7 CLAUDE_MEM_DEDUP_* settings (documented in configuration.mdx).

Disabled ⇒ byte-identical legacy behavior (regression-tested).

Tests

Strict TDD, ~20 commits. 175 dedup/sqlite/settings tests green; tsc --noEmit clean; full
bun test green. Coverage includes the empty-title data-loss guard, cross-project isolation,
cross-session + intra-batch Tier-0, cold-start gate, retry-idempotency (token_df/doc_count stay
flat on a merge or redelivery), and a real-DB-derived golden fixture.

Reviews

Both ran over the full BASE..HEAD diff, fixes applied, then re-reviewed:

- security-reviewer → PASS. No injection / auth bypass / data-loss. Initial MEDIUM findings
  (scan concurrency guard; prepared-statement hardening) + LOWs (backfill row cap; trust-model
  doc) all resolved and re-confirmed.
- code-reviewer → Approve. Initial findings (occurrence_count retry-idempotency, a
  token_df-flat regression test, scan-when-disabled gating, IDF single-load perf, NaN-config
  clamp, DRY) all resolved with a genuine regression test; re-confirmed merge-ready.

Trust model

Like all worker routes, /api/dedup/* is guarded only by the 127.0.0.1 binding + localhost
CORS (consistent with DataRoutes/SearchRoutes/MemoryRoutes).

Follow-up (separate PR, only if this is accepted)

An async LLM-adjudication consolidation tier (with a tombstone/audit subsystem) to resolve the
common-token cases lexical methods can't (Code vs Security review) — kept out of this PR
because a wrong auto-merge in an append-only store is unrecoverable, so it needs its own
reversibility machinery.

…#3038 Tier-0 exact-normalized-title normalization (punctuation/case/ws-insensitive) and a whitespace-only tokenizer that preserves compound identifiers (rdlp-api, ffmpeg-7.1.conf) as single tokens for IDF-veto correctness.

idf(df,N)=log(1+N/(df+0.5)) + buildIdfFn; rare tokens weigh high so a difference in a discriminating token dominates cosine + veto.
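
A small sketch of the smoothed IDF above. The formula is the one stated in the commit message; the signature of `buildIdfFn` (a document-frequency map plus a corpus size) is an assumption for illustration.

```typescript
// idf(df, N) = log(1 + N / (df + 0.5)), as stated in the commit message.
// The +0.5 keeps the division finite when df is 0, and an empty corpus
// (N = 0) yields log(1) = 0 instead of NaN.
function idf(df: number, n: number): number {
  return Math.log(1 + n / (df + 0.5));
}

// Hypothetical shape: closes over a per-project DF table and corpus size.
// Tokens never seen before are treated as df = 0, i.e. maximally rare.
function buildIdfFn(
  dfByToken: Map<string, number>,
  docCount: number,
): (token: string) => number {
  return (token) => idf(dfByToken.get(token) ?? 0, docCount);
}
```

With N = 100, a token seen in 2 documents gets a much larger weight than one seen in 50, which is what lets a single discriminating token dominate the score.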

Rare discriminating tokens inflate the norm without contributing to the dot product, pulling cosine below threshold for distinct-but-similar titles (rdlp-api vs rdlp-plugin) that plain token-sort wrongly scores ~0.9.
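
The effect described above can be reproduced with a minimal IDF-weighted cosine. This is a sketch under stated assumptions: `tokenize`, `weightVector`, and `idfCosine` are hypothetical names, the tokenizer splits on whitespace only (so `rdlp-api` stays one token), and weights are term count times IDF.

```typescript
type IdfFn = (token: string) => number;

// Whitespace-only tokenizer: compound identifiers stay intact.
function tokenize(title: string): string[] {
  return title.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Sparse vector of term-count * idf weights.
function weightVector(tokens: string[], idfOf: IdfFn): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  const vec = new Map<string, number>();
  counts.forEach((count, t) => vec.set(t, count * idfOf(t)));
  return vec;
}

// Cosine similarity; returns 0 (not NaN) when either vector is empty.
function idfCosine(a: string, b: string, idfOf: IdfFn): number {
  const va = weightVector(tokenize(a), idfOf);
  const vb = weightVector(tokenize(b), idfOf);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  va.forEach((w, t) => {
    normA += w * w;
    const other = vb.get(t);
    if (other !== undefined) dot += w * other;
  });
  vb.forEach((w) => {
    normB += w * w;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

For two four-token titles that differ in one rare token, uniform weights give 3/4 = 0.75, while IDF weights push the score far lower because the rare token's squared weight dominates both norms.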

Fellegi-Sunter blocking-key: a rare token present on only one side vetoes the merge regardless of cosine. Test documents the known limitation that common-token discriminators (code vs security) need the Branch-2 LLM tier.
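
A minimal sketch of the veto rule, assuming "rare" means a document frequency at or below the veto cutoff. The name `rareTokenVeto` and the `<=` comparison are assumptions, not taken from the PR.

```typescript
// Veto when any token that appears on only one side is rare in the corpus.
// dfOf returns the document frequency of a token; vetoDf is the cutoff
// (the PR's default for the corresponding knob is 10).
function rareTokenVeto(
  tokensA: string[],
  tokensB: string[],
  dfOf: (token: string) => number,
  vetoDf: number,
): boolean {
  const a = new Set(tokensA);
  const b = new Set(tokensB);
  const isRare = (t: string): boolean => dfOf(t) <= vetoDf;
  return (
    tokensA.some((t) => !b.has(t) && isRare(t)) ||
    tokensB.some((t) => !a.has(t) && isRare(t))
  );
}
```

The same sketch shows the documented limitation: when the only differing tokens are common ones such as `code` and `security`, nothing is rare enough to trigger the veto.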

…edotmack#3038 Tier-0 exact-normalized-title (safe auto-merge) / Tier-1 candidate (cosine>=threshold AND !veto). Golden fixture encodes the real-DB-validated cases incl. the common-token-discriminator limitation (code/security never auto-merged) and the true recurring-dup that IS flagged.
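
The decision rule in this commit can be written as a small pure function. `classifyPair` is named elsewhere in this PR, but the signature below, which takes precomputed signals, is a hypothetical simplification.

```typescript
type Classification = "exact" | "candidate" | "none";

interface PairSignals {
  normA: string; // normalized title of the new observation
  normB: string; // normalized title of the existing observation
  cosine: number; // IDF-weighted cosine of the two titles
  vetoed: boolean; // true when a rare one-sided token was found
}

function classifyPair(
  s: PairSignals,
  cosineThreshold: number = 0.8,
): Classification {
  // Tier-0: exact match, guarded on a non-empty normal form.
  if (s.normA !== "" && s.normA === s.normB) return "exact";
  // Tier-1: candidate only when similar enough AND not vetoed.
  if (s.cosine >= cosineThreshold && !s.vetoed) return "candidate";
  return "none";
}
```

Only `"exact"` leads to an auto-merge; `"candidate"` is recorded for review and never merged.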

…hedotmack#3038) Code review caught a data-loss defect: null/empty/whitespace/punctuation-only/emoji-only titles all normalize to '' and would collapse into each other as Tier-0 'exact' (silent merge of distinct observations). This project uses emoji titles (🔵/✅), so it's real. Guard the exact branch on a non-empty normal form; Tier-1 fall-through is already safe (empty → cosine 0 → none). Adds 5 empty/symbolic negative guards, normalize empty-collapse preconditions, and an N=0 empty-corpus cosine test (no NaN on first observation).

Research Q-D: short-title cosine is jumpy — a single shared rare token can dominate it to ~1.0 even for otherwise-disjoint titles. Gate Tier-1 on a configurable minimum shared-token count (default 2) to kill sparse-vector noise before scoring.
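
A minimal sketch of the shared-token gate, assuming tokens are counted as distinct values (a token repeated in one title counts once). The helper names are hypothetical.

```typescript
// Number of distinct tokens that appear in both titles.
function sharedTokenCount(a: string[], b: string[]): number {
  const setB = new Set(b);
  let shared = 0;
  new Set(a).forEach((t) => {
    if (setB.has(t)) shared += 1;
  });
  return shared;
}

// Gate applied before any cosine scoring; the PR's default minimum is 2.
function passesSharedTokenGate(
  a: string[],
  b: string[],
  minSharedTokens: number = 2,
): boolean {
  return sharedTokenCount(a, b) >= minSharedTokens;
}
```

Two titles that share only `ffmpeg-7.1` are rejected here, no matter how heavily that one token would have weighted the cosine.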

…nv keys The SettingsDefaultsManager `loadFromFile` edge-case tests assert `loadFromFile(<empty|corrupt file>)` deepEquals `getAllDefaults()`. But `loadFromFile` applies env overrides on top of file/defaults by default, while `getAllDefaults()` returns pure defaults, so any settings-default key present in `process.env` makes the two diverge.

The suite already stripped `CLAUDE_MEM_DATA_DIR` (pinned by the preload tripwire) for this reason, but only that one key. On a contributor machine with a running claude-mem install, other keys are exported too (e.g. `CLAUDE_MEM_API_TIMEOUT_MS=120000`), which silently failed 9 of these tests locally while CI (a clean env) stayed green.

Generalize the isolation: snapshot and delete EVERY `getAllDefaults()` key from `process.env` in beforeEach, restore in afterEach. Robust to whichever CLAUDE_MEM_* vars the host exports.

No production code changes: `loadFromFile`'s env-override behavior is correct and already covered by the "environment variable overrides" describe block; this only fixes test isolation.

Before: full suite 2159 pass / 9 fail on a dev box exporting CLAUDE_MEM_* vars. After: 2159 pass / 0 fail in the same env.

(cherry picked from commit 7ea4880)
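
The snapshot-and-restore pattern can be sketched as a helper that works on any env-like object. `isolateEnv` is a hypothetical name; the actual test change may be structured differently.

```typescript
type Env = Record<string, string | undefined>;

// Removes every listed key from env and returns a function that puts the
// original values back (keys that were absent stay absent).
function isolateEnv(env: Env, keys: readonly string[]): () => void {
  const snapshot = new Map<string, string | undefined>();
  for (const key of keys) {
    snapshot.set(key, env[key]);
    delete env[key];
  }
  return () => {
    snapshot.forEach((value, key) => {
      if (value === undefined) delete env[key];
      else env[key] = value;
    });
  };
}

// Intended use in a test file (assumed wiring, shown as a comment):
//   let restore: () => void;
//   beforeEach(() => { restore = isolateEnv(process.env, defaultKeys); });
//   afterEach(() => restore());
// where defaultKeys is the list of keys returned by getAllDefaults().
```

Deriving the key list from the defaults, rather than hard-coding it, is what keeps the isolation correct when new settings are added.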

…dotmack#3038 Six opt-in knobs: ENABLED=false, COSINE_THRESHOLD=0.80 (empirical short-title sweet spot), IDF_VETO_DF=10, MIN_SHARED_TOKENS=2, MIN_PROJECT_DOCS=10 (cold-start gate), MAX_SCAN=2000. Env/settings.json only (no viewer UI wiring needed for a server-side off-by-default flag).

…didates (thedotmack#3038) Pure-SQL migration (repo norm — no JS backfill in migrations): adds observations.occurrence_count (default 1), token_df (per-project IDF model, filled forward / rebuilt by dedup-scan), dedup_meta (cold-start + drift tracking), and observation_dedup_candidates (Tier-1 review-only, mirrors observation_feedback; UNIQUE(observation_id,duplicate_of_id) + method/status CHECKs). schema.sql updated to match.

…rt gate (thedotmack#3038) dedup-store.ts (plain fns over Database, keeps SessionStore lean): bumpTokenDf (forward DF/doc_count maintenance, unique tokens only), buildProjectIdf (project-scoped idf + corpus size), isFuzzyReady (cold-start gate, research Q-B). Called on real inserts only — a Tier-0 merge adds no document.
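
An in-memory sketch of the forward maintenance described here. `bumpTokenDf` and `isFuzzyReady` are names from this commit, but the `ProjectDf` shape and these signatures are assumptions; the real functions operate on SQLite tables.

```typescript
interface ProjectDf {
  docCount: number; // number of real inserts seen for the project
  df: Map<string, number>; // token -> number of documents containing it
}

// Called once per real insert. Each distinct token bumps its DF by one,
// so a token repeated inside one title still counts as one document.
function bumpTokenDf(model: ProjectDf, tokens: string[]): void {
  new Set(tokens).forEach((t) => {
    model.df.set(t, (model.df.get(t) ?? 0) + 1);
  });
  model.docCount += 1;
}

// Cold-start gate: fuzzy matching waits until the project has enough
// documents for the IDF statistics to mean something (default 10).
function isFuzzyReady(model: ProjectDf, minProjectDocs: number = 10): boolean {
  return model.docCount >= minProjectDocs;
}
```

Since a Tier-0 merge adds no document, the merge path would simply skip `bumpTokenDf`, which keeps `token_df` and `doc_count` flat on merges and redeliveries.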

…mack#3038) Research: SQLite can't express \p{L}-aware normalization (ASCII-only lower(), no regexp_replace, no bun:sqlite custom fns), so precompute the key in app code and index it — the content_hash pattern. computeTitleNormKey = sha256(project + normalizeTitle), NULL when title normalizes to empty (data-loss guard reused at the persistence layer). findTier0Canonical does the O(1) lookup. NON-unique index keeps dedup app-gated on the flag (off = byte-identical).

…toreObservation(s) (thedotmack#3038) Both the single and batch (ResponseProcessor) write paths, fully gated on CLAUDE_MEM_DEDUP_ENABLED (off = byte-identical legacy behavior):
- Tier-0: O(1) title_norm_key lookup -> bump occurrence_count + reuse id, cross-session and intra-batch (in-transaction visibility); mirrors the content_hash ON CONFLICT semantics.
- Forward token_df/doc_count maintenance on real inserts only.
- Tier-1 (>= MIN_PROJECT_DOCS): capped recent-window classifyPair scan -> persist review-only candidates (INSERT OR IGNORE). Full-corpus sweep is the upcoming dedup-scan.

7 integration tests (cross-session/cross-project/intra-batch/cold-start/disabled); typecheck clean; 80 sqlite tests green.
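
The Tier-0 write-path behavior can be modeled with an in-memory store. Everything here is a hypothetical stand-in for the SQLite-backed path: `InMemoryObservations`, `titleKey`, and the un-hashed key are illustrative only.

```typescript
interface Row {
  id: number;
  title: string;
  occurrenceCount: number;
}

// Un-hashed stand-in for title_norm_key; null when the title normalizes
// to empty, so such titles never merge.
function titleKey(project: string, title: string): string | null {
  const norm = title
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]+/gu, " ")
    .replace(/\s+/g, " ")
    .trim();
  return norm === "" ? null : `${project}\u0000${norm}`;
}

class InMemoryObservations {
  private readonly dedupEnabled: boolean;
  private readonly byId = new Map<number, Row>();
  private readonly byKey = new Map<string, Row>();
  private nextId = 1;

  constructor(dedupEnabled: boolean) {
    this.dedupEnabled = dedupEnabled;
  }

  // Returns the id of the row that now represents this observation.
  store(project: string, title: string): number {
    const key = this.dedupEnabled ? titleKey(project, title) : null;
    const existing = key === null ? undefined : this.byKey.get(key);
    if (existing !== undefined) {
      existing.occurrenceCount += 1; // merge: bump the count, reuse the id
      return existing.id;
    }
    const row: Row = { id: this.nextId++, title, occurrenceCount: 1 };
    this.byId.set(row.id, row);
    if (key !== null) this.byKey.set(key, row);
    return row.id;
  }

  get(id: number): Row | undefined {
    return this.byId.get(id);
  }
}
```

With the flag off, the key is never computed, so every call inserts a new row exactly as before.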

…tmack#3038) Opt-in, idempotent (research Q-A/Q-C): backfillProjectDedup recomputes title_norm_key for every row + DELETE/INSERT-rebuilds token_df + resets dedup_meta in one transaction (this is how an EXISTING DB joins dedup and how DF drift is reclaimed). sweepProjectCandidates finds existing near-dups via a bounded inverted index (postings on df in 2..~4*sqrt(N) tokens, pairs sharing >=minSharedTokens classified) — not O(N^2). runDedupScan covers all projects. Persists review-only candidates (INSERT OR IGNORE, idempotent).
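
A minimal sketch of the bounded inverted-index pairing, under stated assumptions: `candidatePairs` is a hypothetical name, the upper DF bound is taken as floor(4·sqrt(N)), and the returned pairs would still need cosine and veto classification before being persisted.

```typescript
interface Doc {
  id: number;
  tokens: string[];
}

interface PairCount {
  a: number;
  b: number;
  shared: number;
}

function candidatePairs(docs: Doc[], minSharedTokens: number = 2): PairCount[] {
  // token -> ids of documents containing it (each doc listed once)
  const postings = new Map<string, number[]>();
  for (const doc of docs) {
    new Set(doc.tokens).forEach((t) => {
      const list = postings.get(t);
      if (list) list.push(doc.id);
      else postings.set(t, [doc.id]);
    });
  }
  // Skip singletons (df < 2 pairs with nothing) and very common tokens
  // (df above the bound would generate near-quadratic pair lists).
  const maxDf = Math.max(2, Math.floor(4 * Math.sqrt(docs.length)));
  const pairs = new Map<string, PairCount>();
  postings.forEach((ids) => {
    if (ids.length < 2 || ids.length > maxDf) return;
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const a = Math.min(ids[i], ids[j]);
        const b = Math.max(ids[i], ids[j]);
        const key = `${a}:${b}`;
        const entry = pairs.get(key);
        if (entry) entry.shared += 1;
        else pairs.set(key, { a, b, shared: 1 });
      }
    }
  });
  const out: PairCount[] = [];
  pairs.forEach((p) => {
    if (p.shared >= minSharedTokens) out.push(p);
  });
  return out;
}
```

Only documents that co-occur in some bounded posting list are ever compared, which is what avoids the all-pairs O(N^2) scan.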

… & scan (thedotmack#3038) GET /api/dedup/candidates (read-only, joined to both titles, project-scoped) and POST /api/dedup/scan (opt-in backfill+sweep all projects) — thin glue over tested SessionStore.listDedupCandidates / runDedupScan. Registered in worker-service. Gives the Tier-1 candidates table a standalone consumer and the dedup-scan its callable surface (CLI/viewer deferred).

Security:
- M1: single-flight guard on POST /api/dedup/scan (409 on overlap)
- M2: two explicit prepared statements in listDedupCandidates (drop ${where} interpolation + the as-any cast)
- L1: document the localhost-only trust model on DedupRoutes
- L2: CLAUDE_MEM_DEDUP_MAX_BACKFILL_ROWS cap (skip+warn oversized projects)

Code review:
- C1: occurrence_count retry-idempotency: a redelivered (session,content_hash) no longer bumps the count (single + batch paths)
- C2: regression test: Tier-0 merge AND content_hash retry leave token_df/doc_count flat (maintenance = real inserts only)
- C3: annotate dedup_meta drift columns as reserved (no delete hook yet)
- C4: POST /api/dedup/scan gated on CLAUDE_MEM_DEDUP_ENABLED (no row mutation when disabled)
- C5: buildProjectIdf loads project DF once into a Map (no per-token round-trip)
- C7: dedupConfig clamps NaN config values to safe defaults
- C8: shared candidateInsert() helper (DRY)

typecheck clean; 175 dedup/sqlite/settings tests green.
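
The single-flight guard in M1 can be sketched as follows. `SingleFlight` and `handleScan` are hypothetical names, and the 200 status and response bodies are assumptions; only the 409-on-overlap behavior is taken from the commit message.

```typescript
class SingleFlight {
  private busy = false;

  // Returns true if the caller now owns the slot, false if one is running.
  tryAcquire(): boolean {
    if (this.busy) return false;
    this.busy = true;
    return true;
  }

  release(): void {
    this.busy = false;
  }
}

interface ScanResponse {
  status: number;
  body: Record<string, unknown>;
}

// Framework-agnostic handler: rejects with 409 while a scan is in flight
// and always releases the slot, even when the scan throws.
async function handleScan(
  guard: SingleFlight,
  scan: () => Promise<number>,
): Promise<ScanResponse> {
  if (!guard.tryAcquire()) {
    return { status: 409, body: { error: "dedup scan already running" } };
  }
  try {
    return { status: 200, body: { candidates: await scan() } };
  } finally {
    guard.release();
  }
}
```

Releasing in `finally` matters: without it, a single failed scan would leave the endpoint returning 409 until the worker restarts.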

…tmack#3038)

…3038 maxScan is bound as a SQL LIMIT — keep a fractional misconfig from reaching the binding as a float.
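
A sketch of how the clamp and truncation fixes might look for settings read as strings. `intKnob` and `ratioKnob` are hypothetical helpers, and the lower bound of 1 and the [0, 1] range are assumptions; the PR only states that NaN values fall back to safe defaults and that integer knobs are truncated.

```typescript
// Parses an integer setting. Non-numeric, blank, or out-of-range input
// falls back to the default; fractional input is truncated so a value
// bound as a SQL LIMIT is always a whole number.
function intKnob(
  raw: string | undefined,
  fallback: number,
  min: number = 1,
): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isFinite(n)) return fallback;
  const truncated = Math.trunc(n);
  return truncated >= min ? truncated : fallback;
}

// Parses a similarity threshold, accepted only inside [0, 1].
function ratioKnob(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 && n <= 1 ? n : fallback;
}
```

For example, a misconfigured scan cap of `"1500.9"` becomes 1500, and `"abc"` becomes the default instead of propagating NaN into a query.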

greptile-apps · 2026-06-26T00:12:39Z

Greptile Summary

This PR adds opt-in near-duplicate deduplication for observations. The main changes are:

Tier-0 normalized-title auto-merge with occurrence_count tracking.
Tier-1 review-only candidate detection using IDF-weighted cosine scoring and a rare-token veto.
SQLite schema additions for title keys, IDF metadata, and candidate storage.
Worker endpoints for running a gated dedup scan and listing candidates.
New settings, documentation, fixtures, and tests for the dedup pipeline.

Confidence Score: 5/5

The changes appear merge-safe based on the scoped review of the opt-in deduplication pipeline.

No blocking correctness or security issues were identified, and the feature is gated off by default with tests covering the new normalization, scoring, persistence, settings, and worker-route behavior.

T-Rex Logs

What T-Rex did

Compared the base Tier-0 dedup state with the after state to confirm changes in occurrence_count and title normalization handling.
Verified the dedup API routing and behavior across disabled and enabled scans, and observed idempotent candidate results across runs.
Observed schema and data model changes after head: added occurrence_count, title_norm_key, idx_observations_title_norm, and dedup tables; confirmed Tier-0 merge returns id=1 with occurrence_count=2 and related data remains consistent.
Confirmed the Tier-1 scan feature was added by enabling the module and config, showing backfill documentation, and reporting one pending candidate after sweep with observed sweep results.
Compared dedup-config before and after: defaults/docs entries expanded, opt-in behavior introduced, and cross-session normalization merging demonstrated when opt-in is enabled.

Ran code and verified through T-Rex

Reviews (2): Last reviewed commit: "fix(dedup): count only newly-persisted c..."

…hedotmack#3038) Greptile review: recordTier1Candidates/sweepProjectCandidates did count++ on every INSERT OR IGNORE that classified as a candidate, so a re-run/redelivery reported phantom counts for rows that were actually ignored by the UNIQUE(observation_id,duplicate_of_id) guard. Count via .run().changes instead — a second idempotent sweep now correctly returns 0. Tightened the idempotency test to assert the return value, not just row count.

crippledgeek · 2026-06-26T00:16:40Z

Thanks @greptile-apps — good catch, fixed in 4c67a64.

recordTier1Candidates / sweepProjectCandidates were doing count++ on every INSERT OR IGNORE that classified as a candidate, so a re-run or redelivery reported a phantom count for rows that the UNIQUE(observation_id, duplicate_of_id) guard actually ignored. They now accumulate .run().changes instead, so the returned count reflects only newly-persisted rows — a second idempotent sweep correctly returns 0 (the persisted data was always idempotent; only the reported number was off).

Tightened the idempotency test to assert the return value (not just the row count) — it fails against the old code and passes now. 140 dedup/sqlite tests green, typecheck clean.
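
To make the counting fix concrete, here is a self-contained sketch that replaces the real prepared statement with a fake. `FakeInsertOrIgnore` is a stand-in that mimics an INSERT OR IGNORE against a UNIQUE(observation_id, duplicate_of_id) constraint, and `recordCandidates` is a hypothetical simplification of the real function.

```typescript
interface RunResult {
  changes: number;
}

// Stand-in for a prepared INSERT OR IGNORE: reports changes = 1 for a new
// (observationId, duplicateOfId) pair and changes = 0 for a repeat.
class FakeInsertOrIgnore {
  private readonly seen = new Set<string>();

  run(observationId: number, duplicateOfId: number): RunResult {
    const key = `${observationId}:${duplicateOfId}`;
    if (this.seen.has(key)) return { changes: 0 };
    this.seen.add(key);
    return { changes: 1 };
  }
}

// Sums the rows actually written instead of counting classified pairs.
function recordCandidates(
  stmt: FakeInsertOrIgnore,
  pairs: Array<[number, number]>,
): number {
  let count = 0;
  for (const [observationId, duplicateOfId] of pairs) {
    count += stmt.run(observationId, duplicateOfId).changes;
  }
  return count;
}
```

Running the same pair list twice against one statement returns the number of pairs the first time and 0 the second, which is the property the tightened test asserts.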

greptile-apps · 2026-06-26T00:17:16Z

The fix is correct and well-executed. The key change in dedup-store.ts:

// Before (buggy):
count++; // incremented for every classifyPair 'candidate', including INSERT OR IGNORE no-ops

// After (correct):
count += ins.run(...).changes; // only increments when the row was actually inserted (changes=1), 0 on UNIQUE conflict

And the test tightening is exactly right — expect(secondReturned).toBe(0) now fails against the old code and proves the fix. The assertion that firstReturned === after1 (return value equals row count on first run) is also a nice addition that covers the forward case.

LGTM. The fix is minimal, precise, and the test proves both the positive and idempotent paths. No concerns.


crippledgeek added 18 commits June 26, 2026 00:38

feat(dedup): add smoothed IDF math for thedotmack#3038

9f25f5d

idf(df,N)=log(1+N/(df+0.5)) + buildIdfFn; rare tokens weigh high so a difference in a discriminating token dominates cosine + veto.

feat(dedup): add IDF-weighted TF-IDF cosine for thedotmack#3038

65e653c

Rare discriminating tokens inflate the norm without contributing to the dot product, pulling cosine below threshold for distinct-but-similar titles (rdlp-api vs rdlp-plugin) that plain token-sort wrongly scores ~0.9.

docs(dedup): document CLAUDE_MEM_DEDUP_* settings + dedup-scan (thedo…

5ea4038

…tmack#3038)

fix(dedup): truncate integer config knobs (review N1) for thedotmack#…

c075529

…3038 maxScan is bound as a SQL LIMIT — keep a fractional misconfig from reaching the binding as a float.

greptile-apps Bot reviewed Jun 26, 2026

Comment thread src/services/sqlite/dedup-store.ts Outdated

crippledgeek wants to merge 19 commits into thedotmack:main from crippledgeek:feature/fuzzy-near-dup-dedup
