Skip to content

feat(dedup): opt-in near-duplicate observation dedup — Tier-0 auto-merge + Tier-1 candidates (#3038)#3063

Open
crippledgeek wants to merge 19 commits into
thedotmack:mainfrom
crippledgeek:feature/fuzzy-near-dup-dedup
Open

feat(dedup): opt-in near-duplicate observation dedup — Tier-0 auto-merge + Tier-1 candidates (#3038)#3063
crippledgeek wants to merge 19 commits into
thedotmack:mainfrom
crippledgeek:feature/fuzzy-near-dup-dedup

Conversation

@crippledgeek

Copy link
Copy Markdown

Refs #3038.

Opt-in, off-by-default near-duplicate dedup for observations. content_hash only
catches byte-identical rows, so paraphrased observations of the same recurring work
accumulate as noise. This adds two deterministic tiers, validated against a real
7,651-observation DB before any code was written (see the diagnostic on #3038).

This is the conservative interpretation of #3038 (exact auto-merge + fuzzy
candidate-flagging, not fuzzy auto-merge). The data showed fuzzy lexical
auto-merge is unsafe — happy to switch to auto-merge if you'd prefer; I asked on
the issue. The LLM-adjudication tier that could safely resolve the residual cases
is deliberately deferred to a follow-up PR.

What it does (all gated on CLAUDE_MEM_DEDUP_ENABLED, default false)

  • Tier 0 — exact-normalized-title → silent auto-merge. A new observation whose title
    matches an existing one after lowercase/whitespace-collapse/punctuation-strip is collapsed
    onto the existing row (cross-session) and its new occurrence_count is bumped. O(1) via an
    indexed title_norm_key (sha256(project + normalizeTitle)), mirroring the content_hash
    pattern. NULL key for empty/emoji/punctuation-only titles so they never collapse into
    each other.
  • Tier 1 — IDF-weighted TF-IDF cosine + an IDF "veto" → review-only candidates. Records
    near-dup candidates in observation_dedup_candidates (never auto-merged). The veto
    (a Fellegi–Sunter blocking key) rejects pairs that differ only in a rare discriminating
    token — rdlp-api vs rdlp-plugin, ffmpeg-7.1 vs 6.1 — which plain string similarity
    wrongly scores ~0.9.

Why this shape (validation evidence)

On the real DB at production thresholds: SimHash-over-narrative was useless (false-positive
dominated); the signal lives in the title; exact-normalized-title found 35–47 real
redundancies content_hash misses with zero false positives
; and fuzzy lexical auto-merge
would have destroyed genuinely-distinct work that differs in one token — hence Tier-1 is
review-only. Semantic paraphrases (different words, same meaning) need embeddings and are out
of scope here.

Surfaces

  • Migration v33 (pure SQL, idempotent): occurrence_count, token_df (IDF model),
    dedup_meta, observation_dedup_candidates.
  • POST /api/dedup/scan — opt-in idempotent backfill (so existing DBs participate) + bounded
    inverted-index corpus sweep. Gated on the flag + single-flight + row-cap.
  • GET /api/dedup/candidates?project= — read-only candidate list.
  • 7 CLAUDE_MEM_DEDUP_* settings (documented in configuration.mdx).

Disabled ⇒ byte-identical legacy behavior (regression-tested).

Tests

Strict TDD, ~20 commits. 175 dedup/sqlite/settings tests green; tsc --noEmit clean; full
bun test green. Coverage includes the empty-title data-loss guard, cross-project isolation,
cross-session + intra-batch Tier-0, cold-start gate, retry-idempotency (token_df/doc_count stay
flat on a merge or redelivery), and a real-DB-derived golden fixture.

Reviews

Both ran over the full BASE..HEAD diff, fixes applied, then re-reviewed:

  • security-reviewer → PASS. No injection / auth bypass / data-loss. Initial MEDIUM findings
    (scan concurrency guard; prepared-statement hardening) + LOWs (backfill row cap; trust-model
    doc) all resolved and re-confirmed.
  • code-reviewer → Approve. Initial findings (occurrence_count retry-idempotency, a
    token_df-flat regression test, scan-when-disabled gating, IDF single-load perf, NaN-config
    clamp, DRY) all resolved with a genuine regression test; re-confirmed merge-ready.

Trust model

Like all worker routes, /api/dedup/* is guarded only by the 127.0.0.1 binding + localhost
CORS (consistent with DataRoutes/SearchRoutes/MemoryRoutes).

Follow-up (separate PR, only if this is accepted)

An async LLM-adjudication consolidation tier (with a tombstone/audit subsystem) to resolve the
common-token cases lexical methods can't (Code vs Security review) — kept out of this PR
because a wrong auto-merge in an append-only store is unrecoverable, so it needs its own
reversibility machinery.

…#3038

Tier-0 exact-normalized-title normalization (punctuation/case/ws-insensitive)
and a whitespace-only tokenizer that preserves compound identifiers
(rdlp-api, ffmpeg-7.1.conf) as single tokens for IDF-veto correctness.
idf(df,N)=log(1+N/(df+0.5)) + buildIdfFn; rare tokens weigh high so a
difference in a discriminating token dominates cosine + veto.
Rare discriminating tokens inflate the norm without contributing to the
dot product, pulling cosine below threshold for distinct-but-similar
titles (rdlp-api vs rdlp-plugin) that plain token-sort wrongly scores ~0.9.
Fellegi-Sunter blocking-key: a rare token present on only one side vetoes
the merge regardless of cosine. Test documents the known limitation that
common-token discriminators (code vs security) need the Branch-2 LLM tier.
…edotmack#3038

Tier-0 exact-normalized-title (safe auto-merge) / Tier-1 candidate
(cosine>=threshold AND !veto). Golden fixture encodes the real-DB-validated
cases incl. the common-token-discriminator limitation (code/security never
auto-merged) and the true recurring-dup that IS flagged.
…hedotmack#3038)

Code review caught a data-loss defect: null/empty/whitespace/punctuation-only/
emoji-only titles all normalize to '' and would collapse into each other as
Tier-0 'exact' (silent merge of distinct observations). This project uses
emoji titles (🔵/✅), so it's real. Guard the exact branch on a non-empty
normal form; Tier-1 fall-through is already safe (empty → cosine 0 → none).

Adds 5 empty/symbolic negative guards, normalize empty-collapse preconditions,
and an N=0 empty-corpus cosine test (no NaN on first observation).
Research Q-D: short-title cosine is jumpy — a single shared rare token can
dominate it to ~1.0 even for otherwise-disjoint titles. Gate Tier-1 on a
configurable minimum shared-token count (default 2) to kill sparse-vector
noise before scoring.
…nv keys

The SettingsDefaultsManager `loadFromFile` edge-case tests assert
`loadFromFile(<empty|corrupt file>)` deepEquals `getAllDefaults()`. But
`loadFromFile` applies env overrides on top of file/defaults by default, while
`getAllDefaults()` returns pure defaults — so any settings-default key present
in `process.env` makes the two diverge.

The suite already stripped `CLAUDE_MEM_DATA_DIR` (pinned by the preload
tripwire) for this reason, but only that one key. On a contributor machine with
a running claude-mem install, other keys are exported too (e.g.
`CLAUDE_MEM_API_TIMEOUT_MS=120000`), which silently failed 9 of these tests
locally while CI — a clean env — stayed green.

Generalize the isolation: snapshot and delete EVERY `getAllDefaults()` key from
`process.env` in beforeEach, restore in afterEach. Robust to whichever
CLAUDE_MEM_* vars the host exports. No production code changes — `loadFromFile`'s
env-override behavior is correct and already covered by the "environment
variable overrides" describe block; this only fixes test isolation.

Before: full suite 2159 pass / 9 fail on a dev box exporting CLAUDE_MEM_* vars.
After: 2159 pass / 0 fail in the same env.
(cherry picked from commit 7ea4880)
…dotmack#3038

Six opt-in knobs: ENABLED=false, COSINE_THRESHOLD=0.80 (empirical short-title
sweet spot), IDF_VETO_DF=10, MIN_SHARED_TOKENS=2, MIN_PROJECT_DOCS=10
(cold-start gate), MAX_SCAN=2000. Env/settings.json only (no viewer UI wiring
needed for a server-side off-by-default flag).
…didates (thedotmack#3038)

Pure-SQL migration (repo norm — no JS backfill in migrations): adds
observations.occurrence_count (default 1), token_df (per-project IDF model,
filled forward / rebuilt by dedup-scan), dedup_meta (cold-start + drift
tracking), and observation_dedup_candidates (Tier-1 review-only, mirrors
observation_feedback; UNIQUE(observation_id,duplicate_of_id) + method/status
CHECKs). schema.sql updated to match.
…rt gate (thedotmack#3038)

dedup-store.ts (plain fns over Database, keeps SessionStore lean):
bumpTokenDf (forward DF/doc_count maintenance, unique tokens only),
buildProjectIdf (project-scoped idf + corpus size), isFuzzyReady
(cold-start gate, research Q-B). Called on real inserts only — a Tier-0
merge adds no document.
…mack#3038)

Research: SQLite can't express \p{L}-aware normalization (ASCII-only lower(),
no regexp_replace, no bun:sqlite custom fns), so precompute the key in app code
and index it — the content_hash pattern. computeTitleNormKey = sha256(project +
normalizeTitle), NULL when title normalizes to empty (data-loss guard reused at
the persistence layer). findTier0Canonical does the O(1) lookup. NON-unique
index keeps dedup app-gated on the flag (off = byte-identical).
…toreObservation(s) (thedotmack#3038)

Both the single and batch (ResponseProcessor) write paths, fully gated on
CLAUDE_MEM_DEDUP_ENABLED (off = byte-identical legacy behavior):
- Tier-0: O(1) title_norm_key lookup -> bump occurrence_count + reuse id,
  cross-session and intra-batch (in-transaction visibility); mirrors the
  content_hash ON CONFLICT semantics.
- Forward token_df/doc_count maintenance on real inserts only.
- Tier-1 (>= MIN_PROJECT_DOCS): capped recent-window classifyPair scan ->
  persist review-only candidates (INSERT OR IGNORE). Full-corpus sweep is the
  upcoming dedup-scan.
7 integration tests (cross-session/cross-project/intra-batch/cold-start/disabled);
typecheck clean; 80 sqlite tests green.
…tmack#3038)

Opt-in, idempotent (research Q-A/Q-C): backfillProjectDedup recomputes
title_norm_key for every row + DELETE/INSERT-rebuilds token_df + resets
dedup_meta in one transaction (this is how an EXISTING DB joins dedup and how
DF drift is reclaimed). sweepProjectCandidates finds existing near-dups via a
bounded inverted index (postings on df in 2..~4*sqrt(N) tokens, pairs sharing
>=minSharedTokens classified) — not O(N^2). runDedupScan covers all projects.
Persists review-only candidates (INSERT OR IGNORE, idempotent).
… & scan (thedotmack#3038)

GET /api/dedup/candidates (read-only, joined to both titles, project-scoped)
and POST /api/dedup/scan (opt-in backfill+sweep all projects) — thin glue over
tested SessionStore.listDedupCandidates / runDedupScan. Registered in
worker-service. Gives the Tier-1 candidates table a standalone consumer and the
dedup-scan its callable surface (CLI/viewer deferred).
Security:
- M1: single-flight guard on POST /api/dedup/scan (409 on overlap)
- M2: two explicit prepared statements in listDedupCandidates (drop ${where}
  interpolation + the as-any cast)
- L1: document the localhost-only trust model on DedupRoutes
- L2: CLAUDE_MEM_DEDUP_MAX_BACKFILL_ROWS cap (skip+warn oversized projects)
Code review:
- C1: occurrence_count retry-idempotency — a redelivered (session,content_hash)
  no longer bumps the count (single + batch paths)
- C2: regression test — Tier-0 merge AND content_hash retry leave token_df/
  doc_count flat (maintenance = real inserts only)
- C3: annotate dedup_meta drift columns as reserved (no delete hook yet)
- C4: POST /api/dedup/scan gated on CLAUDE_MEM_DEDUP_ENABLED (no row mutation
  when disabled)
- C5: buildProjectIdf loads project DF once into a Map (no per-token round-trip)
- C7: dedupConfig clamps NaN config values to safe defaults
- C8: shared candidateInsert() helper (DRY)
typecheck clean; 175 dedup/sqlite/settings tests green.
…3038

maxScan is bound as a SQL LIMIT — keep a fractional misconfig from reaching
the binding as a float.
@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds opt-in near-duplicate deduplication for observations. The main changes are:

  • Tier-0 normalized-title auto-merge with occurrence_count tracking.
  • Tier-1 review-only candidate detection using IDF-weighted cosine scoring and a rare-token veto.
  • SQLite schema additions for title keys, IDF metadata, and candidate storage.
  • Worker endpoints for running a gated dedup scan and listing candidates.
  • New settings, documentation, fixtures, and tests for the dedup pipeline.

Confidence Score: 5/5

The changes appear merge-safe based on the scoped review of the opt-in deduplication pipeline.

No blocking correctness or security issues were identified, and the feature is gated off by default with tests covering the new normalization, scoring, persistence, settings, and worker-route behavior.

T-Rex T-Rex Logs

What T-Rex did

  • Compared the base Tier-0 dedup state with the after state to confirm changes in occurrence_count and title normalization handling.
  • Verified the dedup API routing and behavior across disabled and enabled scans, and observed idempotent candidate results across runs.
  • Observed schema and data model changes after head: added occurrence_count, title_norm_key, idx_observations_title_norm, and dedup tables; confirmed Tier-0 merge returns id=1 with occurrence_count=2 and related data remains consistent.
  • Confirmed the Tier-1 scan feature was added by enabling the module and config, showing backfill documentation, and reporting one pending candidate after sweep with observed sweep results.
  • Compared dedup-config before and after: defaults/docs entries expanded, opt-in behavior introduced, and cross-session normalization merging demonstrated when opt-in is enabled.

View all artifacts

T-Rex Ran code and verified through T-Rex

Reviews (2): Last reviewed commit: "fix(dedup): count only newly-persisted c..." | Re-trigger Greptile

Comment thread src/services/sqlite/dedup-store.ts Outdated
…hedotmack#3038)

Greptile review: recordTier1Candidates/sweepProjectCandidates did count++ on
every INSERT OR IGNORE that classified as a candidate, so a re-run/redelivery
reported phantom counts for rows that were actually ignored by the
UNIQUE(observation_id,duplicate_of_id) guard. Count via .run().changes instead
— a second idempotent sweep now correctly returns 0. Tightened the idempotency
test to assert the return value, not just row count.
@crippledgeek

Copy link
Copy Markdown
Author

Thanks @greptile-apps — good catch, fixed in 4c67a64.

recordTier1Candidates / sweepProjectCandidates were doing count++ on every INSERT OR IGNORE that classified as a candidate, so a re-run or redelivery reported a phantom count for rows that the UNIQUE(observation_id, duplicate_of_id) guard actually ignored. They now accumulate .run().changes instead, so the returned count reflects only newly-persisted rows — a second idempotent sweep correctly returns 0 (the persisted data was always idempotent; only the reported number was off).

Tightened the idempotency test to assert the return value (not just the row count) — it fails against the old code and passes now. 140 dedup/sqlite tests green, typecheck clean.

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

The fix is correct and well-executed. The key change in dedup-store.ts:

// Before (buggy):
count++; // incremented for every classifyPair 'candidate', including INSERT OR IGNORE no-ops

// After (correct):
count += ins.run(...).changes; // only increments when the row was actually inserted (changes=1), 0 on UNIQUE conflict

And the test tightening is exactly right — expect(secondReturned).toBe(0) now fails against the old code and proves the fix. The assertion that firstReturned === after1 (return value equals row count on first run) is also a nice addition that covers the forward case.

LGTM. The fix is minimal, precise, and the test proves both the positive and idempotent paths. No concerns.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant