
docs: improve doc parsing#116

Merged
mikr13 merged 18 commits into main from improve-doc-parsing
Mar 17, 2026

Conversation

@pratham-kpatil
Contributor

No description provided.

@pratham-kpatil requested review from Copilot and mikr13 and removed the request for Copilot on March 14, 2026 at 10:32
@coderabbitai

coderabbitai Bot commented Mar 14, 2026

Walkthrough

This PR adds duplicate detection and deduplication to the document import workflow. It enriches parsed items with existing memory references, provides UI controls to compare and select between new vs existing duplicates, strengthens memory storage with duplicate detection logic, and introduces cancellation support for document parsing operations.

Changes

Cohort / File(s) | Summary
Changeset Documentation
.changeset/fifty-words-lick.md, .changeset/huge-suns-prove.md, .changeset/tender-months-wish.md
Three changelog entries documenting patch, minor, and major version bumps; noting observability additions, prompt updates for deduplication, and UI preview of duplicate memory comparisons.
Document Parsing & Cancellation
src/lib/document/document-parser.ts
Introduces AbortSignal-based cancellation, caches pdfjs-dist module with worker setup, expands parseDocument API with options (requestId, onStageChange, signal), adds stage callbacks and telemetry hooks, implements deduplication and normalization (normalizeTags, normalizeKey, deduplicateItems), updates prompt guidance for memory merging.
Document Import Dialog
src/components/features/document/document-import-dialog.tsx
Adds memoized handleFileSelect with useCallback, debounces duplicate imports via lastImportKeyRef/lastImportTimeRef, integrates AbortController for cancellation, enriches parsed items with existingDuplicate metadata via findDuplicates lookup, adds toast error notifications, implements lifecycle-aware cleanup.
Import Dialog UI & Duplication Display
src/components/features/import/import-dialog-shared.tsx
Displays duplicate count badge in header, adds conditional styling for duplicate items, includes "New" vs "Currently saved" comparison sections, adds "Use new" and "Keep existing" action buttons to toggle between values, preserves existing selection behavior.
Memory Deduplication Logic
src/lib/storage/memories.ts
Adds deduplication via contentHash and secondary answerKey matching, introduces uniqueTags helper for normalization, updates addEntry to merge on duplicate detection (preserves ID, updates hash/tags/confidence/metadata), adds batched deduplication in addEntries, exports new findDuplicates function for candidate-to-existing matching.
Content Hash & Answer Key
src/lib/storage/content-hash.ts
Adds buildAnswerKey function that normalizes and concatenates answer and category with || delimiter for secondary duplicate detection key.
Import Type Extensions
src/types/import.ts
Extends BaseImportItem with optional existingDuplicate?: MemoryEntry field, imported MemoryEntry type from memory definitions.
Import Hook Logging
src/hooks/use-import-dialog.ts
Enhances success and error messages with item counts, introduces finally block to reliably reset isSaving state.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ImportDialog as Document<br/>Import Dialog
    participant Parser as Document<br/>Parser
    participant Memory as Memory<br/>Storage
    participant UI as Import Dialog<br/>UI

    User->>ImportDialog: Select PDF/TXT file
    ImportDialog->>ImportDialog: Debounce check<br/>(lastImportKey, time)
    ImportDialog->>ImportDialog: Create AbortController
    ImportDialog->>Parser: parseDocument(file, options)
    activate Parser
    Parser->>Parser: Read file (stage: reading)
    Parser->>Parser: Extract text (PDF/TXT)
    Parser->>Parser: Call AI parsing<br/>(stage: parsing)
    Parser->>Parser: Deduplicate & normalize<br/>items
    Parser-->>ImportDialog: Return items[]
    deactivate Parser
    
    ImportDialog->>Memory: findDuplicates(items, existing)
    activate Memory
    Memory->>Memory: Build index by<br/>contentHash & answerKey
    Memory-->>ImportDialog: Map<index, MemoryEntry>
    deactivate Memory
    
    ImportDialog->>ImportDialog: Enrich items with<br/>existingDuplicate
    ImportDialog-->>UI: Display items with<br/>duplicate metadata
    
    User->>UI: Compare & select<br/>"Use new" or "Keep existing"
    UI->>Memory: Save selection

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • feat(SUP-5): scraping of linkedin & PDF #69: Both PRs modify the document import/parsing stack—adding PDF parsing infrastructure (pdfjs-dist integration), the parseDocument API, and import dialog components with duplicate detection support.
  • feat: new features for v2 #45: Both PRs modify memory storage and deduplication surface (src/lib/storage/memories.ts, type extensions) to support duplicate detection and entry merging.
🚥 Pre-merge checks | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (❓ Inconclusive): No pull request description was provided by the author, making it impossible to assess relevance to the changeset. Resolution: add a detailed pull request description explaining the objectives, key changes, and how duplicate detection and deduplication improve the document parsing workflow.
  • Title check (❓ Inconclusive): The PR title is vague and generic, using the non-descriptive term 'improve' without specifying what aspects of doc parsing are being improved. Resolution: provide a more specific title, such as 'Add duplicate detection and memory deduplication in document import' or 'Add observability and duplicate handling to document parser'.





@coderabbitai Bot left a comment


Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.changeset/fifty-words-lick.md:
- Line 5: Replace the misspelled word "obersvability" with "observability" in
the release note text (the line reading "add obersvability in the document
parser") so the sentence reads "add observability in the document parser".

In @.changeset/huge-suns-prove.md:
- Line 5: Replace the misspelled word "redeundant" with "redundant" in the
changeset note — specifically update the phrase "Updated doc parser prompt to
stop redeundant memory creation and added logic for deduplication of memory" to
"Updated doc parser prompt to stop redundant memory creation and added logic for
deduplication of memory" so the release note reads correctly.

In @.changeset/tender-months-wish.md:
- Line 5: Fix the typo in the changeset description by replacing the misspelled
word "exisitng" with "existing" in the changeset text (the phrase "preview of
the new and exisitng duplicate memory in the UI"); update that exact string so
the sentence reads "preview of the new and existing duplicate memory in the UI".

In `@src/components/features/document/document-import-dialog.tsx`:
- Around line 150-151: The file input value is not cleared on all early-exit
paths in the try/catch around the parseDocument(file, ...) call in
DocumentImportDialog, which prevents re-selecting the same file; update the
handler so that every exit path (success, AbortError/cancellation, other errors,
and any early returns) clears the file input and resets any selectedFile state.
Concretely, locate the function containing parseDocument(file, ...) and ensure
you either call a shared cleanup helper (e.g., resetFileInput()) or directly set
fileInputRef.current.value = '' and setSelectedFile(null) in a finally block or
immediately before each return/throw (including AbortError handling), so the
input is always reset.
- Around line 176-183: After awaiting findDuplicates, the handler can overwrite
UI if a newer import started; capture the current request identifier before the
await (e.g., const currentRequestId = importRequestIdRef.current or a local
requestId variable used by the component) and immediately after the await check
that the global/latest request id still equals currentRequestId; if it does not
match, return early and do not call setImportItems or setStatus. Apply this
guard around the block that computes duplicatesMap, enrichedItems and calls
setImportItems/setStatus (referencing findDuplicates, duplicatesMap,
enrichedItems, setImportItems, setStatus).

In `@src/components/features/import/import-dialog-shared.tsx`:
- Around line 173-174: Remove the non-essential inline JSX comments in the
render block of the ImportDialogShared component: delete comments like {/* New
value from document */} and the similar comments around the adjacent divs so the
JSX is not cluttered; keep only comments that explain tricky/hacky/TODO behavior
and remove descriptive comments at the three locations mentioned (the comment
currently in the div with the new value and the other two nearby inline
comments).
- Around line 199-230: The two toggle-like buttons rendered near the "Use new" /
"Keep existing" labels don't expose their pressed state for assistive tech;
update the button elements in this component (the buttons that call
onToggleItem(item.id) and use the item.selected condition) to include
aria-pressed set to a boolean reflecting the current choice (e.g.,
aria-pressed={item.selected} for "Use new" and aria-pressed={!item.selected} for
"Keep existing"), leaving existing onClick logic intact so screen readers can
see which option is pressed.

In `@src/lib/document/document-parser.ts`:
- Around line 431-441: normalizeKey currently only trims/lowers and includes
label, which diverges from storage-side dedupe (which normalizes
Unicode/whitespace and collapses by answer+category), causing preview-time and
save-time mismatch; update normalizeKey (and the other similar functions around
lines 443-466 and 471-477) to reuse the exact storage normalization and keying
logic (call the shared storage normalizer or import the storage module) so the
preview dedupe uses the same Unicode/whitespace normalization and the same
composite key (answer+category, not label/question) that addEntries uses,
ensuring consistent duplicate detection between preview and save.
- Around line 80-90: The PDF and AI parsing functions don't regularly observe
the provided AbortSignal, so long-running work continues after cancellation;
update extractTextFromPDF to check signal.aborted inside the per-page loop and
abort immediately (throw an AbortError/DOMException) when observed, and update
parseDocumentWithAI (and callers like parseDocument) to re-check the signal
after any awaited model call and before returning success so a mid-flight AI
call doesn't cause a false success result; reference the AbortSignal parameter
(or add one) in extractTextFromPDF, parseDocumentWithAI, and parseDocument and
ensure all long-running loops and post-await paths throw on abort.

In `@src/lib/storage/content-hash.ts`:
- Around line 10-11: The current buildAnswerKey function concatenates
normalizePart(answer) and normalizePart(category) with "||", which can collide
if either part contains that delimiter; change buildAnswerKey to produce a
collision-safe key by encoding or delimiting parts unambiguously (e.g.,
base64-encode each normalized part or use a length-prefixed or JSON-based
encoding) and join the encoded pieces instead of raw strings; update the
implementation that references buildAnswerKey and ensure normalizePart remains
applied before encoding so lookups use the same deterministic encoding format.

In `@src/lib/storage/memories.ts`:
- Around line 192-220: Wrap the entire async body of findDuplicates in a
try/catch so failures from computeContentHash (called for existing entries and
candidates) do not reject the whole flow; inside the catch, log the error with
contextual info (e.g. candidates.length and which loop phase) and return an
empty Map as a safe fallback (or rethrow a new Error with that context if you
prefer failing loudly). Ensure you protect both loops that build
hashToEntry/answerKeyToEntry and the candidate iteration that computes
hash/answerKey, and include the function name findDuplicates and symbols
computeContentHash, hashToEntry, answerKeyToEntry in your log/context to aid
debugging.
- Around line 42-49: The secondary dedupe fallback using buildAnswerKey(answer,
category) can incorrectly collide unrelated memories that share an answer;
update the logic in the places referencing answerKey and existingIndex (used by
addEntry/addEntries and findDuplicates) so the fallback only applies to
legacy/questionless entries (e.g., where e.question is null/empty or a legacy
flag exists) or include a normalized question component when computing the key;
specifically, change the existingIndex computation to first match by
contentHash, then if not found only match by buildAnswerKey(e.answer,
e.category) when e.question is falsy (or otherwise normalize e.question and
include it in buildAnswerKey) to avoid destructive overwrites.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b1dfe4cd-9a90-4c5e-bafe-aa1669d10709

📥 Commits

Reviewing files that changed from the base of the PR and between 16b4d71 and 5acf4bb.

📒 Files selected for processing (10)
  • .changeset/fifty-words-lick.md
  • .changeset/huge-suns-prove.md
  • .changeset/tender-months-wish.md
  • src/components/features/document/document-import-dialog.tsx
  • src/components/features/import/import-dialog-shared.tsx
  • src/hooks/use-import-dialog.ts
  • src/lib/document/document-parser.ts
  • src/lib/storage/content-hash.ts
  • src/lib/storage/memories.ts
  • src/types/import.ts

Comment thread .changeset/fifty-words-lick.md Outdated
Comment thread .changeset/huge-suns-prove.md Outdated
Comment thread .changeset/tender-months-wish.md Outdated
Comment on lines +150 to +151
try {
const result = await parseDocument(file, {

⚠️ Potential issue | 🟠 Major

Ensure file input is reset on all exit paths.

There are early returns in the try/catch flow (e.g., cancellation / AbortError) before Line [203], so the file input may not be cleared. That can prevent selecting the same file again.

🔧 Suggested fix
-try {
+try {
   // existing logic
 } catch (err) {
   // existing logic
+} finally {
+  if (fileInputRef.current) {
+    fileInputRef.current.value = "";
+  }
 }
-
-if (fileInputRef.current) {
-  fileInputRef.current.value = "";
-}

Also applies to: 163-170, 190-205

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/components/features/document/document-import-dialog.tsx` around lines 150
- 151, The file input value is not cleared on all early-exit paths in the
try/catch around the parseDocument(file, ...) call in DocumentImportDialog,
which prevents re-selecting the same file; update the handler so that every exit
path (success, AbortError/cancellation, other errors, and any early returns)
clears the file input and resets any selectedFile state. Concretely, locate the
function containing parseDocument(file, ...) and ensure you either call a shared
cleanup helper (e.g., resetFileInput()) or directly set
fileInputRef.current.value = '' and setSelectedFile(null) in a finally block or
immediately before each return/throw (including AbortError handling), so the
input is always reset.

Comment on lines +176 to +183
const duplicatesMap = await findDuplicates(items, existingMemories);
const enrichedItems = items.map((item, i) => {
const duplicate = duplicatesMap.get(i);
return duplicate ? { ...item, existingDuplicate: duplicate } : item;
});

setImportItems(enrichedItems);
setStatus("success");

⚠️ Potential issue | 🟠 Major

Add stale-request guard after duplicate lookup.

findDuplicates(...) is awaited, but there’s no request-id recheck afterward. A newer import can start during that await, and this older request can still overwrite UI state.

🔧 Suggested fix
 const duplicatesMap = await findDuplicates(items, existingMemories);
+if (requestIdRef.current !== currentRequestId) return;

 const enrichedItems = items.map((item, i) => {
   const duplicate = duplicatesMap.get(i);
   return duplicate ? { ...item, existingDuplicate: duplicate } : item;
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/components/features/document/document-import-dialog.tsx` around lines 176
- 183, After awaiting findDuplicates, the handler can overwrite UI if a newer
import started; capture the current request identifier before the await (e.g.,
const currentRequestId = importRequestIdRef.current or a local requestId
variable used by the component) and immediately after the await check that the
global/latest request id still equals currentRequestId; if it does not match,
return early and do not call setImportItems or setStatus. Apply this guard
around the block that computes duplicatesMap, enrichedItems and calls
setImportItems/setStatus (referencing findDuplicates, duplicatesMap,
enrichedItems, setImportItems, setStatus).

Comment on lines 80 to 90
 export async function extractTextFromPDF(file: File): Promise<string> {
-  const pdfjsLib = await import("pdfjs-dist");
-
-  pdfjsLib.GlobalWorkerOptions.workerSrc =
-    browser.runtime.getURL("/pdf.worker.mjs");
+  // Cache pdfjs module + worker URL so it's only initialized once per session
+  if (!_pdfjsLib) {
+    _pdfjsLib = await import("pdfjs-dist");
+    _pdfjsLib.GlobalWorkerOptions.workerSrc =
+      browser.runtime.getURL("/pdf.worker.mjs");
+  }
+  const pdfjsLib = _pdfjsLib;

   const arrayBuffer = await file.arrayBuffer();
   const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;

⚠️ Potential issue | 🟠 Major

Abort support still misses the longest-running work.

Cancellation is only checked at phase boundaries. During PDF parsing, extractTextFromPDF(...) keeps walking every page after the caller aborts; during AI extraction, a mid-flight abort is ignored entirely because parseDocumentWithAI(...) never re-checks the signal after the awaited model call, so parseDocument(...) can still return success: true after the dialog was cancelled.

Possible fix
-export async function extractTextFromPDF(file: File): Promise<string> {
+export async function extractTextFromPDF(
+  file: File,
+  signal?: AbortSignal,
+): Promise<string> {
   // Cache pdfjs module + worker URL so it's only initialized once per session
   if (!_pdfjsLib) {
     _pdfjsLib = await import("pdfjs-dist");
     _pdfjsLib.GlobalWorkerOptions.workerSrc =
       browser.runtime.getURL("/pdf.worker.mjs");
@@
   for (let i = 1; i <= pdf.numPages; i++) {
+    throwIfAborted(signal);
     const page = await pdf.getPage(i);
     const textContent = await page.getTextContent();
@@
     const result = await generateObject({
       model,
       schema: ExtractedInfoSchema,
@@
     });
     object = result.object;
+    throwIfAborted(signal);
@@
-      text = await extractTextFromPDF(file);
+      text = await extractTextFromPDF(file, signal);

Also applies to: 95-146, 210-211, 251-269, 351-358, 387-397

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/document/document-parser.ts` around lines 80 - 90, The PDF and AI
parsing functions don't regularly observe the provided AbortSignal, so
long-running work continues after cancellation; update extractTextFromPDF to
check signal.aborted inside the per-page loop and abort immediately (throw an
AbortError/DOMException) when observed, and update parseDocumentWithAI (and
callers like parseDocument) to re-check the signal after any awaited model call
and before returning success so a mid-flight AI call doesn't cause a false
success result; reference the AbortSignal parameter (or add one) in
extractTextFromPDF, parseDocumentWithAI, and parseDocument and ensure all
long-running loops and post-await paths throw on abort.

Comment on lines +431 to +441
function normalizeKey(item: ExtractedItem): string {
const safe = (value: string) => value.trim().toLowerCase();
return [
safe(item.label),
safe(item.question),
safe(item.answer),
item.category,
]
.map((part) => part || "")
.join("|");
}

⚠️ Potential issue | 🟠 Major

Parser-side dedupe key does not match storage-side duplicate matching.

normalizeKey(...) only trims/lowercases and also includes label, while storage dedupe normalizes Unicode/whitespace differently and can collapse records by answer + category even when labels/questions differ. That means the import preview can keep rows that addEntries(...) later merges, or flag a different duplicate than storage will. Reuse the same normalization/keying strategy as the storage layer so preview-time dedupe and save-time dedupe stay consistent.

Also applies to: 443-466, 471-477

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/document/document-parser.ts` around lines 431 - 441, normalizeKey
currently only trims/lowers and includes label, which diverges from storage-side
dedupe (which normalizes Unicode/whitespace and collapses by answer+category),
causing preview-time and save-time mismatch; update normalizeKey (and the other
similar functions around lines 443-466 and 471-477) to reuse the exact storage
normalization and keying logic (call the shared storage normalizer or import the
storage module) so the preview dedupe uses the same Unicode/whitespace
normalization and the same composite key (answer+category, not label/question)
that addEntries uses, ensuring consistent duplicate detection between preview
and save.

Comment on lines +10 to +11
export const buildAnswerKey = (answer: string, category: string): string =>
`${normalizePart(answer)}||${normalizePart(category)}`;

⚠️ Potential issue | 🟠 Major

Use collision-safe encoding for buildAnswerKey.

At Line [11], joining normalized parts with || can collide when input values include the delimiter, causing false duplicate matches.

🔧 Suggested fix
 export const buildAnswerKey = (answer: string, category: string): string =>
-  `${normalizePart(answer)}||${normalizePart(category)}`;
+  JSON.stringify([normalizePart(answer), normalizePart(category)]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/storage/content-hash.ts` around lines 10 - 11, The current
buildAnswerKey function concatenates normalizePart(answer) and
normalizePart(category) with "||", which can collide if either part contains
that delimiter; change buildAnswerKey to produce a collision-safe key by
encoding or delimiting parts unambiguously (e.g., base64-encode each normalized
part or use a length-prefixed or JSON-based encoding) and join the encoded
pieces instead of raw strings; update the implementation that references
buildAnswerKey and ensure normalizePart remains applied before encoding so
lookups use the same deterministic encoding format.

Comment on lines +42 to +49
const answerKey = buildAnswerKey(entry.answer, entry.category);

const existingIndex =
currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
? currentEntries.findIndex((e) => e.contentHash === contentHash)
: currentEntries.findIndex(
(e) => buildAnswerKey(e.answer, e.category) === answerKey,
);

⚠️ Potential issue | 🔴 Critical

Secondary dedupe on answer + category can overwrite unrelated memories.

buildAnswerKey(...) ignores question, so two different facts in the same category that happen to share an answer will collide here. A common case is two location memories with "New York" (for example, birth city vs current city): addEntry/addEntries will overwrite one with the other, and findDuplicates will point the UI at the wrong existing record. This fallback needs to be restricted to truly questionless legacy rows, or it needs a normalized question component before it is used for destructive merges. Based on learnings, "Maintain backward compatibility with stored data when making changes".

Also applies to: 120-123, 136-137, 204-214

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/storage/memories.ts` around lines 42 - 49, The secondary dedupe
fallback using buildAnswerKey(answer, category) can incorrectly collide
unrelated memories that share an answer; update the logic in the places
referencing answerKey and existingIndex (used by addEntry/addEntries and
findDuplicates) so the fallback only applies to legacy/questionless entries
(e.g., where e.question is null/empty or a legacy flag exists) or include a
normalized question component when computing the key; specifically, change the
existingIndex computation to first match by contentHash, then if not found only
match by buildAnswerKey(e.answer, e.category) when e.question is falsy (or
otherwise normalize e.question and include it in buildAnswerKey) to avoid
destructive overwrites.

Comment on lines +192 to +220
export const findDuplicates = async (
candidates: Array<{ question: string; answer: string; category: string }>,
existing: MemoryEntry[],
): Promise<Map<number, MemoryEntry>> => {
const hashToEntry = new Map<string, MemoryEntry>();
const answerKeyToEntry = new Map<string, MemoryEntry>();

for (const entry of existing) {
const hash =
entry.contentHash ||
(await computeContentHash(entry.question, entry.answer, entry.category));
hashToEntry.set(hash, entry);
answerKeyToEntry.set(buildAnswerKey(entry.answer, entry.category), entry);
}

const result = new Map<number, MemoryEntry>();

for (let i = 0; i < candidates.length; i++) {
const c = candidates[i];
const hash = await computeContentHash(c.question, c.answer, c.category);
const answerKey = buildAnswerKey(c.answer, c.category);

const matched = hashToEntry.get(hash) ?? answerKeyToEntry.get(answerKey);
if (matched) {
result.set(i, matched);
}
}

return result;

⚠️ Potential issue | 🟡 Minor

findDuplicates is the only async storage helper here without local error handling.

This path awaits computeContentHash(...) for both existing and candidate entries, but any failure currently rejects the whole duplicate-detection flow without the logging/context the rest of this module provides. Wrap the body in try/catch and either rethrow with context or return a safe fallback if duplicate detection is best-effort.

Possible fix
 export const findDuplicates = async (
   candidates: Array<{ question: string; answer: string; category: string }>,
   existing: MemoryEntry[],
 ): Promise<Map<number, MemoryEntry>> => {
-  const hashToEntry = new Map<string, MemoryEntry>();
-  const answerKeyToEntry = new Map<string, MemoryEntry>();
-
-  for (const entry of existing) {
-    const hash =
-      entry.contentHash ||
-      (await computeContentHash(entry.question, entry.answer, entry.category));
-    hashToEntry.set(hash, entry);
-    answerKeyToEntry.set(buildAnswerKey(entry.answer, entry.category), entry);
-  }
-
-  const result = new Map<number, MemoryEntry>();
-
-  for (let i = 0; i < candidates.length; i++) {
-    const c = candidates[i];
-    const hash = await computeContentHash(c.question, c.answer, c.category);
-    const answerKey = buildAnswerKey(c.answer, c.category);
-
-    const matched = hashToEntry.get(hash) ?? answerKeyToEntry.get(answerKey);
-    if (matched) {
-      result.set(i, matched);
-    }
-  }
-
-  return result;
+  try {
+    const hashToEntry = new Map<string, MemoryEntry>();
+    const answerKeyToEntry = new Map<string, MemoryEntry>();
+
+    for (const entry of existing) {
+      const hash =
+        entry.contentHash ||
+        (await computeContentHash(entry.question, entry.answer, entry.category));
+      hashToEntry.set(hash, entry);
+      answerKeyToEntry.set(buildAnswerKey(entry.answer, entry.category), entry);
+    }
+
+    const result = new Map<number, MemoryEntry>();
+
+    for (let i = 0; i < candidates.length; i++) {
+      const c = candidates[i];
+      const hash = await computeContentHash(c.question, c.answer, c.category);
+      const answerKey = buildAnswerKey(c.answer, c.category);
+
+      const matched = hashToEntry.get(hash) ?? answerKeyToEntry.get(answerKey);
+      if (matched) {
+        result.set(i, matched);
+      }
+    }
+
+    return result;
+  } catch (error) {
+    logger.error("Failed to find duplicates:", error);
+    throw error;
+  }
 };

As per coding guidelines, "Handle errors gracefully with try-catch blocks and null checks in all async operations".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/storage/memories.ts` around lines 192 - 220, Wrap the entire async
body of findDuplicates in a try/catch so failures from computeContentHash
(called for existing entries and candidates) do not reject the whole flow;
inside the catch, log the error with contextual info (e.g. candidates.length and
which loop phase) and return an empty Map as a safe fallback (or rethrow a new
Error with that context if you prefer failing loudly). Ensure you protect both
loops that build hashToEntry/answerKeyToEntry and the candidate iteration that
computes hash/answerKey, and include the function name findDuplicates and
symbols computeContentHash, hashToEntry, answerKeyToEntry in your log/context to
aid debugging.

Removed comments related to import debouncing and cancellation.
Copilot AI review requested due to automatic review settings March 14, 2026 10:42
mikr13 added 5 commits March 14, 2026 16:13
Removed commented-out code for clarity.
Remove comment about cached pdfjs module.
Removed telemetry-related debug code from document parser.
Remove comment about secondary dedup key in content-hash.ts
Contributor

Copilot AI left a comment

Pull request overview

This PR improves the document import flow by reducing duplicate memories, previewing potential overwrites in the UI, and adding more robust parsing/observability controls around document-to-memory extraction.

Changes:

  • Add deduplication + duplicate-detection helpers in storage (contentHash primary, answer+category fallback) and surface duplicates in the import UI.
  • Enhance document parsing with stage reporting, cancellation, pdfjs module caching, prompt updates, and optional telemetry hooks.
  • Update import dialog UI to highlight duplicates and allow “Use new / Keep existing” selection per item.
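The hash-primary / answer-key-fallback matching these bullets describe can be sketched as below. The normalization steps (trim, lowercase, whitespace collapse) are assumptions for illustration, not necessarily what the PR's `buildAnswerKey` implements:

```typescript
// Hypothetical sketch of an answer+category fallback key. The exact
// normalization in the PR's buildAnswerKey may differ; this shows why a
// fallback key catches duplicates whose content hashes (which may also
// cover the question) do not match.
const buildAnswerKeySketch = (answer: string, category: string): string => {
  // Normalize: trim, lowercase, collapse runs of whitespace.
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  return `${norm(category)}::${norm(answer)}`;
};

const a = buildAnswerKeySketch("  New York ", "address");
const b = buildAnswerKeySketch("new york", "Address");
// a and b collapse to the same key ("address::new york"), so the two
// entries would be flagged as duplicates despite surface differences.
```

A lookup map keyed by this value then serves as the secondary check after the content-hash map misses.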

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/types/import.ts Adds existingDuplicate?: MemoryEntry to import items so UI can render duplicate previews.
src/lib/storage/memories.ts Updates addEntry/addEntries to merge on duplicates; adds findDuplicates helper; normalizes/uniquifies tags.
src/lib/storage/content-hash.ts Introduces buildAnswerKey(answer, category) as a secondary deduplication key.
src/lib/document/document-parser.ts Adds parse options (requestId/stage/cancel), pdfjs caching, telemetry hooks, prompt tweaks, and extracted-item deduplication.
src/hooks/use-import-dialog.ts Improves import success/error logging details.
src/components/features/import/import-dialog-shared.tsx Renders duplicate badges, highlights duplicate rows, and shows “new vs currently saved” previews with toggle buttons.
src/components/features/document/document-import-dialog.tsx Wires up new parse options, cancellation, duplicate detection, and error toasts for document imports.
.changeset/tender-months-wish.md Release note for dedup + duplicate preview (contains typos).
.changeset/huge-suns-prove.md Release note for prompt + dedup work (contains typos).
.changeset/fifty-words-lick.md Release note for observability work (contains typos).


Comment thread .changeset/huge-suns-prove.md Outdated
"superfill.ai": minor
---

Updated doc parser prompt to stop redeundant memory creation and added logic for deduplication of memory
Comment thread .changeset/fifty-words-lick.md Outdated
"superfill.ai": patch
---

add obersvability in the document parser
Comment on lines +134 to +138
const answerKey = buildAnswerKey(entry.answer, entry.category);

const existingIndex =
  hashToIndex.get(contentHash) ?? answerKeyToIndex.get(answerKey);


const logger = createLogger("document-parser");

let _pdfjsLib: typeof PdfjsDist | null = null;
Comment on lines +42 to +49
const answerKey = buildAnswerKey(entry.answer, entry.category);

const existingIndex =
  currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
    ? currentEntries.findIndex((e) => e.contentHash === contentHash)
    : currentEntries.findIndex(
        (e) => buildAnswerKey(e.answer, e.category) === answerKey,
      );
Comment on lines +44 to +46
const existingIndex =
  currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
    ? currentEntries.findIndex((e) => e.contentHash === contentHash)
Comment on lines +199 to +230
        e.preventDefault();
        e.stopPropagation();
        if (!item.selected) onToggleItem(item.id);
      }}
      className={cn(
        "text-[11px] px-2 py-0.5 rounded border font-medium transition-colors",
        item.selected
          ? "bg-primary text-primary-foreground border-primary"
          : "bg-background text-muted-foreground border-border hover:border-primary hover:text-primary",
      )}
    >
      Use new
    </button>
    <button
      type="button"
      onClick={(e) => {
        e.preventDefault();
        e.stopPropagation();
        if (item.selected) onToggleItem(item.id);
      }}
      className={cn(
        "text-[11px] px-2 py-0.5 rounded border font-medium transition-colors",
        !item.selected
          ? "bg-primary text-primary-foreground border-primary"
          : "bg-background text-muted-foreground border-border hover:border-primary hover:text-primary",
      )}
    >
      Keep existing
    </button>
  </div>
</>
)}
Comment thread .changeset/tender-months-wish.md Outdated
"superfill.ai": major
---

Updated the deduplication logic to give the preview of the new and exisitng duplicate memory in the UI
Copilot AI review requested due to automatic review settings March 14, 2026 10:48
mikr13 added 2 commits March 14, 2026 16:20
Removed unused documentation comment for findDuplicates function.
Remove comment about existingDuplicate property.
Contributor

Copilot AI left a comment

Pull request overview

Enhances document import parsing and memory deduplication, including a UI preview for duplicates and improved observability/cancellation behavior during document parsing.

Changes:

  • Add answer+category fallback keying (buildAnswerKey) and apply consistent dedupe/merge behavior in memory storage (single + bulk adds) plus a findDuplicates helper for UI preview.
  • Improve document parsing flow with staged progress (“reading”/“parsing”), request-scoped logging/telemetry, cancellation handling, and deduplication of extracted items.
  • Update import UI to highlight duplicates and show “new vs currently saved” content, plus import dialog logging updates.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/types/import.ts Adds existingDuplicate?: MemoryEntry to support duplicate previews in import UIs.
src/lib/storage/content-hash.ts Introduces buildAnswerKey(answer, category) for fallback dedup matching.
src/lib/storage/memories.ts Implements merge-on-duplicate in addEntry/addEntries and adds findDuplicates for preview.
src/lib/document/document-parser.ts Adds staged parsing options, telemetry/logging improvements, partial cancellation support, and extracted-item deduping.
src/components/features/document/document-import-dialog.tsx Adds progress stages, abort handling, duplicate preview enrichment, and duplicate-import suppression.
src/components/features/import/import-dialog-shared.tsx Adds duplicate highlighting and “Use new / Keep existing” controls in the list UI.
src/hooks/use-import-dialog.ts Improves import success/error logging context.
.changeset/tender-months-wish.md Adds a changeset for duplicate preview/dedupe updates (currently marked major).
.changeset/huge-suns-prove.md Adds a changeset for doc parser prompt + dedupe behavior.
.changeset/fifty-words-lick.md Adds a changeset for document parser observability.


  </p>
  <p className="text-xs text-muted-foreground/70 line-clamp-2 italic">
    {item.existingDuplicate.answer}
  </p>
Comment on lines +44 to +46
const existingIndex =
  currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
    ? currentEntries.findIndex((e) => e.contentHash === contentHash)
Comment on lines +238 to +260
try {
  const result = await generateObject({
    model,
    schema: ExtractedInfoSchema,
    schemaName: "ExtractedInfo",
    schemaDescription:
      "Information extracted from a document for form filling",
    system: DOCUMENT_PARSING_PROMPT,
    prompt: `Extract all useful personal and professional information from this document:\n\n${text}`,
    temperature: 0.1,
    experimental_telemetry: {
      isEnabled: DEBUG,
      functionId: "document-parsing",
      metadata: {
        provider: selectedProvider,
        textLength: text.length,
      },
    },
  });
  object = result.object;
} catch (error) {
  throw error;
}
Comment on lines +196 to +228
<button
  type="button"
  onClick={(e) => {
    e.preventDefault();
    e.stopPropagation();
    if (!item.selected) onToggleItem(item.id);
  }}
  className={cn(
    "text-[11px] px-2 py-0.5 rounded border font-medium transition-colors",
    item.selected
      ? "bg-primary text-primary-foreground border-primary"
      : "bg-background text-muted-foreground border-border hover:border-primary hover:text-primary",
  )}
>
  Use new
</button>
<button
  type="button"
  onClick={(e) => {
    e.preventDefault();
    e.stopPropagation();
    if (item.selected) onToggleItem(item.id);
  }}
  className={cn(
    "text-[11px] px-2 py-0.5 rounded border font-medium transition-colors",
    !item.selected
      ? "bg-primary text-primary-foreground border-primary"
      : "bg-background text-muted-foreground border-border hover:border-primary hover:text-primary",
  )}
>
  Keep existing
</button>
</div>
Comment thread .changeset/tender-months-wish.md Outdated
"superfill.ai": major
---

Updated the deduplication logic to give the preview of the new and exisitng duplicate memory in the UI
@@ -0,0 +1,5 @@
---
"superfill.ai": major
Comment thread .changeset/huge-suns-prove.md Outdated
"superfill.ai": minor
---

Updated doc parser prompt to stop redeundant memory creation and added logic for deduplication of memory
Comment thread .changeset/fifty-words-lick.md Outdated
"superfill.ai": patch
---

add obersvability in the document parser
Copilot AI review requested due to automatic review settings March 14, 2026 10:53
Contributor

Copilot AI left a comment

Pull request overview

This PR improves document parsing/import by reducing duplicate memory creation, enhancing deduplication/merge behavior in storage, and surfacing duplicate previews + richer parsing telemetry/cancellation hooks in the UI.

Changes:

  • Add answer+category–based dedup fallback (buildAnswerKey) and extend memory add/bulk-add to merge duplicates (tags/confidence/updatedAt).
  • Update document parser prompt + add item deduplication, parsing stages, request-scoped logging, and cancellation plumbing.
  • Enhance import UI to highlight duplicates and show “new vs currently saved” previews.
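The merge-on-duplicate behavior the first bullet mentions (tags/confidence/updatedAt) can be sketched minimally as below. The field names come from the review text, but the policy shown (union tags, keep higher confidence, bump updatedAt) is an assumption — the actual logic in `addEntry`/`addEntries` may differ:

```typescript
interface EntrySketch {
  answer: string;
  tags: string[];
  confidence: number;
  updatedAt: number;
}

// Hypothetical merge applied when an incoming entry matches an existing
// one by content hash or answer key: union the tag sets, keep the higher
// confidence score, and refresh the update timestamp.
const mergeDuplicate = (
  existing: EntrySketch,
  incoming: EntrySketch,
  now: number,
): EntrySketch => ({
  ...existing,
  tags: [...new Set([...existing.tags, ...incoming.tags])],
  confidence: Math.max(existing.confidence, incoming.confidence),
  updatedAt: now,
});
```

Merging in place, rather than appending, is what keeps a re-imported document from inflating the memory store with near-identical rows.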

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/types/import.ts Extends import item type to carry an optional existing-duplicate MemoryEntry.
src/lib/storage/memories.ts Implements dedup/merge on add/addEntries and adds findDuplicates helper.
src/lib/storage/content-hash.ts Adds buildAnswerKey for answer+category fallback matching.
src/lib/document/document-parser.ts Improves parsing prompt, adds cancellation checks, telemetry, and dedupes extracted items.
src/hooks/use-import-dialog.ts Improves import logging on success/error.
src/components/features/import/import-dialog-shared.tsx UI: highlights duplicates and shows “new vs existing” details with actions.
src/components/features/document/document-import-dialog.tsx Adds staged progress (“reading/parsing”), abort handling, duplicate detection, and toasts.
.changeset/tender-months-wish.md Changeset entry for dedup/duplicate-preview UI (contains a typo).
.changeset/huge-suns-prove.md Changeset entry for doc parser prompt + dedup logic.
.changeset/fifty-words-lick.md Changeset entry for parser observability.


Comment on lines +44 to +46
const existingIndex =
  currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
    ? currentEntries.findIndex((e) => e.contentHash === contentHash)
Comment on lines +105 to +118
for (let index = 0; index < updatedEntries.length; index++) {
  const existing = updatedEntries[index];
  const existingHash =
    existing.contentHash ||
    (await computeContentHash(
      existing.question,
      existing.answer,
      existing.category,
    ));

  if (!existing.contentHash) {
    updatedEntries[index] = { ...existing, contentHash: existingHash };
  }

Comment on lines +193 to +205
for (const entry of existing) {
  const hash =
    entry.contentHash ||
    (await computeContentHash(entry.question, entry.answer, entry.category));
  hashToEntry.set(hash, entry);
  answerKeyToEntry.set(buildAnswerKey(entry.answer, entry.category), entry);
}

const result = new Map<number, MemoryEntry>();

for (let i = 0; i < candidates.length; i++) {
  const c = candidates[i];
  const hash = await computeContentHash(c.question, c.answer, c.category);
}

const pdfjsLib = _pdfjsLib;
Comment on lines +119 to +122
logger.error(
  `Import error — attempted: ${importItems.filter((i) => i.selected).length} items —`,
  err,
);
"superfill.ai": major
---

Updated the deduplication logic to give the preview of the new and existing duplicate memory in the UI
Comment on lines +238 to +260
try {
  const result = await generateObject({
    model,
    schema: ExtractedInfoSchema,
    schemaName: "ExtractedInfo",
    schemaDescription:
      "Information extracted from a document for form filling",
    system: DOCUMENT_PARSING_PROMPT,
    prompt: `Extract all useful personal and professional information from this document:\n\n${text}`,
    temperature: 0.1,
    experimental_telemetry: {
      isEnabled: DEBUG,
      functionId: "document-parsing",
      metadata: {
        provider: selectedProvider,
        textLength: text.length,
      },
    },
  });
  object = result.object;
} catch (error) {
  throw error;
}
Comment on lines +239 to +243
const result = await generateObject({
  model,
  schema: ExtractedInfoSchema,
  schemaName: "ExtractedInfo",
  schemaDescription:
Comment on lines +148 to +184
const result = await parseDocument(file, {
  requestId,
  signal: controller.signal,
  onStageChange: (stage) => {
    if (requestIdRef.current !== currentRequestId) return;
    setStatus(stage);
  },
});

if (requestIdRef.current !== currentRequestId) return;

if (!result.success || !result.items) {
  if (result.error === "cancelled") return;

  const errorMsg =
    result.error || "Failed to extract data from document";
  setStatus("error");
  setError(errorMsg);
  toast.error(errorMsg);
  return;
}

const items = convertToImportItems(result.items);
const duplicatesMap = await findDuplicates(items, existingMemories);
const enrichedItems = items.map((item, i) => {
  const duplicate = duplicatesMap.get(i);
  return duplicate ? { ...item, existingDuplicate: duplicate } : item;
});

setImportItems(enrichedItems);
setStatus("success");

logger.debug(
  `[req:${requestId}] Successfully extracted document data:`,
  items.length,
  "items",
);
Comment on lines +133 to +137
lastImportKeyRef.current = importKey;
lastImportTimeRef.current = now;
abortControllerRef.current?.abort();
const controller = new AbortController();
abortControllerRef.current = controller;
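The debounce-and-abort pattern this snippet hints at can be sketched in isolation. The gate function below is hypothetical — the dialog uses refs inside a React component rather than a closure — and the debounce window is an assumed parameter:

```typescript
// Hypothetical import gate mirroring lastImportKeyRef/lastImportTimeRef/
// abortControllerRef: a repeat of the same import key inside the window is
// ignored; an accepted import aborts any in-flight parse and hands back a
// fresh AbortSignal for the new one.
const makeImportGate = (windowMs: number) => {
  let lastKey: string | null = null;
  let lastTime = 0;
  let controller: AbortController | null = null;
  return (key: string, now: number): AbortSignal | null => {
    if (key === lastKey && now - lastTime < windowMs) {
      return null; // duplicate import suppressed
    }
    lastKey = key;
    lastTime = now;
    controller?.abort(); // cancel the previous in-flight parse
    controller = new AbortController();
    return controller.signal;
  };
};
```

A key built from file name plus size or mtime would distinguish genuinely different files while still catching accidental double-clicks of the same one.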
@mikr13 mikr13 changed the title Improve doc parsing docs: improve doc parsing Mar 17, 2026
@mikr13 mikr13 merged commit eb872ae into main Mar 17, 2026
5 of 7 checks passed
@mikr13 mikr13 deleted the improve-doc-parsing branch March 17, 2026 17:06