feat(dedup): opt-in near-duplicate observation dedup — Tier-0 auto-merge + Tier-1 candidates (#3038)#3063
Conversation
…#3038 Tier-0 exact-normalized-title normalization (punctuation/case/ws-insensitive) and a whitespace-only tokenizer that preserves compound identifiers (rdlp-api, ffmpeg-7.1.conf) as single tokens for IDF-veto correctness.
idf(df,N)=log(1+N/(df+0.5)) + buildIdfFn; rare tokens weigh high so a difference in a discriminating token dominates cosine + veto.
Rare discriminating tokens inflate the norm without contributing to the dot product, pulling cosine below threshold for distinct-but-similar titles (rdlp-api vs rdlp-plugin) that plain token-sort wrongly scores ~0.9.
Fellegi-Sunter blocking-key: a rare token present on only one side vetoes the merge regardless of cosine. Test documents the known limitation that common-token discriminators (code vs security) need the Branch-2 LLM tier.
…edotmack#3038 Tier-0 exact-normalized-title (safe auto-merge) / Tier-1 candidate (cosine>=threshold AND !veto). Golden fixture encodes the real-DB-validated cases incl. the common-token-discriminator limitation (code/security never auto-merged) and the true recurring-dup that IS flagged.
…hedotmack#3038) Code review caught a data-loss defect: null/empty/whitespace/punctuation-only/ emoji-only titles all normalize to '' and would collapse into each other as Tier-0 'exact' (silent merge of distinct observations). This project uses emoji titles (🔵/✅), so it's real. Guard the exact branch on a non-empty normal form; Tier-1 fall-through is already safe (empty → cosine 0 → none). Adds 5 empty/symbolic negative guards, normalize empty-collapse preconditions, and an N=0 empty-corpus cosine test (no NaN on first observation).
Research Q-D: short-title cosine is jumpy — a single shared rare token can dominate it to ~1.0 even for otherwise-disjoint titles. Gate Tier-1 on a configurable minimum shared-token count (default 2) to kill sparse-vector noise before scoring.
…nv keys The SettingsDefaultsManager `loadFromFile` edge-case tests assert `loadFromFile(<empty|corrupt file>)` deepEquals `getAllDefaults()`. But `loadFromFile` applies env overrides on top of file/defaults by default, while `getAllDefaults()` returns pure defaults — so any settings-default key present in `process.env` makes the two diverge. The suite already stripped `CLAUDE_MEM_DATA_DIR` (pinned by the preload tripwire) for this reason, but only that one key. On a contributor machine with a running claude-mem install, other keys are exported too (e.g. `CLAUDE_MEM_API_TIMEOUT_MS=120000`), which silently failed 9 of these tests locally while CI — a clean env — stayed green. Generalize the isolation: snapshot and delete EVERY `getAllDefaults()` key from `process.env` in beforeEach, restore in afterEach. Robust to whichever CLAUDE_MEM_* vars the host exports. No production code changes — `loadFromFile`'s env-override behavior is correct and already covered by the "environment variable overrides" describe block; this only fixes test isolation. Before: full suite 2159 pass / 9 fail on a dev box exporting CLAUDE_MEM_* vars. After: 2159 pass / 0 fail in the same env. (cherry picked from commit 7ea4880)
…dotmack#3038 Six opt-in knobs: ENABLED=false, COSINE_THRESHOLD=0.80 (empirical short-title sweet spot), IDF_VETO_DF=10, MIN_SHARED_TOKENS=2, MIN_PROJECT_DOCS=10 (cold-start gate), MAX_SCAN=2000. Env/settings.json only (no viewer UI wiring needed for a server-side off-by-default flag).
…didates (thedotmack#3038) Pure-SQL migration (repo norm — no JS backfill in migrations): adds observations.occurrence_count (default 1), token_df (per-project IDF model, filled forward / rebuilt by dedup-scan), dedup_meta (cold-start + drift tracking), and observation_dedup_candidates (Tier-1 review-only, mirrors observation_feedback; UNIQUE(observation_id,duplicate_of_id) + method/status CHECKs). schema.sql updated to match.
…rt gate (thedotmack#3038) dedup-store.ts (plain fns over Database, keeps SessionStore lean): bumpTokenDf (forward DF/doc_count maintenance, unique tokens only), buildProjectIdf (project-scoped idf + corpus size), isFuzzyReady (cold-start gate, research Q-B). Called on real inserts only — a Tier-0 merge adds no document.
…mack#3038) Research: SQLite can't express \p{L}-aware normalization (ASCII-only lower(), no regexp_replace, no bun:sqlite custom fns), so precompute the key in app code and index it — the content_hash pattern. computeTitleNormKey = sha256(project + normalizeTitle), NULL when title normalizes to empty (data-loss guard reused at the persistence layer). findTier0Canonical does the O(1) lookup. NON-unique index keeps dedup app-gated on the flag (off = byte-identical).
…toreObservation(s) (thedotmack#3038) Both the single and batch (ResponseProcessor) write paths, fully gated on CLAUDE_MEM_DEDUP_ENABLED (off = byte-identical legacy behavior): - Tier-0: O(1) title_norm_key lookup -> bump occurrence_count + reuse id, cross-session and intra-batch (in-transaction visibility); mirrors the content_hash ON CONFLICT semantics. - Forward token_df/doc_count maintenance on real inserts only. - Tier-1 (>= MIN_PROJECT_DOCS): capped recent-window classifyPair scan -> persist review-only candidates (INSERT OR IGNORE). Full-corpus sweep is the upcoming dedup-scan. 7 integration tests (cross-session/cross-project/intra-batch/cold-start/disabled); typecheck clean; 80 sqlite tests green.
…tmack#3038) Opt-in, idempotent (research Q-A/Q-C): backfillProjectDedup recomputes title_norm_key for every row + DELETE/INSERT-rebuilds token_df + resets dedup_meta in one transaction (this is how an EXISTING DB joins dedup and how DF drift is reclaimed). sweepProjectCandidates finds existing near-dups via a bounded inverted index (postings on df in 2..~4*sqrt(N) tokens, pairs sharing >=minSharedTokens classified) — not O(N^2). runDedupScan covers all projects. Persists review-only candidates (INSERT OR IGNORE, idempotent).
… & scan (thedotmack#3038) GET /api/dedup/candidates (read-only, joined to both titles, project-scoped) and POST /api/dedup/scan (opt-in backfill+sweep all projects) — thin glue over tested SessionStore.listDedupCandidates / runDedupScan. Registered in worker-service. Gives the Tier-1 candidates table a standalone consumer and the dedup-scan its callable surface (CLI/viewer deferred).
Security:
- M1: single-flight guard on POST /api/dedup/scan (409 on overlap)
- M2: two explicit prepared statements in listDedupCandidates (drop ${where}
interpolation + the as-any cast)
- L1: document the localhost-only trust model on DedupRoutes
- L2: CLAUDE_MEM_DEDUP_MAX_BACKFILL_ROWS cap (skip+warn oversized projects)
Code review:
- C1: occurrence_count retry-idempotency — a redelivered (session,content_hash)
no longer bumps the count (single + batch paths)
- C2: regression test — Tier-0 merge AND content_hash retry leave token_df/
doc_count flat (maintenance = real inserts only)
- C3: annotate dedup_meta drift columns as reserved (no delete hook yet)
- C4: POST /api/dedup/scan gated on CLAUDE_MEM_DEDUP_ENABLED (no row mutation
when disabled)
- C5: buildProjectIdf loads project DF once into a Map (no per-token round-trip)
- C7: dedupConfig clamps NaN config values to safe defaults
- C8: shared candidateInsert() helper (DRY)
typecheck clean; 175 dedup/sqlite/settings tests green.
…3038 maxScan is bound as a SQL LIMIT — keep a fractional misconfig from reaching the binding as a float.
Greptile SummaryThis PR adds opt-in near-duplicate deduplication for observations. The main changes are:
Confidence Score: 5/5The changes appear merge-safe based on the scoped review of the opt-in deduplication pipeline. No blocking correctness or security issues were identified, and the feature is gated off by default with tests covering the new normalization, scoring, persistence, settings, and worker-route behavior.
What T-Rex did
Reviews (2): Last reviewed commit: "fix(dedup): count only newly-persisted c..." | Re-trigger Greptile |
…hedotmack#3038) Greptile review: recordTier1Candidates/sweepProjectCandidates did count++ on every INSERT OR IGNORE that classified as a candidate, so a re-run/redelivery reported phantom counts for rows that were actually ignored by the UNIQUE(observation_id,duplicate_of_id) guard. Count via .run().changes instead — a second idempotent sweep now correctly returns 0. Tightened the idempotency test to assert the return value, not just row count.
|
Thanks @greptile-apps — good catch, fixed in 4c67a64.
Tightened the idempotency test to assert the return value (not just the row count) — it fails against the old code and passes now. 140 dedup/sqlite tests green, typecheck clean. |
|
The fix is correct and well-executed. The key change in // Before (buggy):
count++; // incremented for every classifyPair 'candidate', including INSERT OR IGNORE no-ops
// After (correct):
count += ins.run(...).changes; // only increments when the row was actually inserted (changes=1), 0 on UNIQUE conflictAnd the test tightening is exactly right — LGTM. The fix is minimal, precise, and the test proves both the positive and idempotent paths. No concerns. Tip: You can customize Greptile's behavior for this repo with |
Refs #3038.
Opt-in, off-by-default near-duplicate dedup for observations.
content_hashonlycatches byte-identical rows, so paraphrased observations of the same recurring work
accumulate as noise. This adds two deterministic tiers, validated against a real
7,651-observation DB before any code was written (see the diagnostic on #3038).
What it does (all gated on
CLAUDE_MEM_DEDUP_ENABLED, defaultfalse)matches an existing one after lowercase/whitespace-collapse/punctuation-strip is collapsed
onto the existing row (cross-session) and its new
occurrence_countis bumped. O(1) via anindexed
title_norm_key(sha256(project + normalizeTitle)), mirroring thecontent_hashpattern. NULL key for empty/emoji/punctuation-only titles so they never collapse into
each other.
near-dup candidates in
observation_dedup_candidates(never auto-merged). The veto(a Fellegi–Sunter blocking key) rejects pairs that differ only in a rare discriminating
token —
rdlp-apivsrdlp-plugin,ffmpeg-7.1vs6.1— which plain string similaritywrongly scores ~0.9.
Why this shape (validation evidence)
On the real DB at production thresholds: SimHash-over-narrative was useless (false-positive
dominated); the signal lives in the title; exact-normalized-title found 35–47 real
redundancies
content_hashmisses with zero false positives; and fuzzy lexical auto-mergewould have destroyed genuinely-distinct work that differs in one token — hence Tier-1 is
review-only. Semantic paraphrases (different words, same meaning) need embeddings and are out
of scope here.
Surfaces
occurrence_count,token_df(IDF model),dedup_meta,observation_dedup_candidates.POST /api/dedup/scan— opt-in idempotent backfill (so existing DBs participate) + boundedinverted-index corpus sweep. Gated on the flag + single-flight + row-cap.
GET /api/dedup/candidates?project=— read-only candidate list.CLAUDE_MEM_DEDUP_*settings (documented inconfiguration.mdx).Disabled ⇒ byte-identical legacy behavior (regression-tested).
Tests
Strict TDD, ~20 commits. 175 dedup/sqlite/settings tests green;
tsc --noEmitclean; fullbun testgreen. Coverage includes the empty-title data-loss guard, cross-project isolation,cross-session + intra-batch Tier-0, cold-start gate, retry-idempotency (token_df/doc_count stay
flat on a merge or redelivery), and a real-DB-derived golden fixture.
Reviews
Both ran over the full
BASE..HEADdiff, fixes applied, then re-reviewed:(scan concurrency guard; prepared-statement hardening) + LOWs (backfill row cap; trust-model
doc) all resolved and re-confirmed.
token_df-flat regression test, scan-when-disabled gating, IDF single-load perf, NaN-config
clamp, DRY) all resolved with a genuine regression test; re-confirmed merge-ready.
Trust model
Like all worker routes,
/api/dedup/*is guarded only by the 127.0.0.1 binding + localhostCORS (consistent with
DataRoutes/SearchRoutes/MemoryRoutes).Follow-up (separate PR, only if this is accepted)
An async LLM-adjudication consolidation tier (with a tombstone/audit subsystem) to resolve the
common-token cases lexical methods can't (
CodevsSecurityreview) — kept out of this PRbecause a wrong auto-merge in an append-only store is unrecoverable, so it needs its own
reversibility machinery.