
docs: improve doc parsing#116

Merged
mikr13 merged 18 commits into main from improve-doc-parsing
Mar 17, 2026

Conversation

@pratham-kpatil
Contributor

No description provided.

@pratham-kpatil requested review from Copilot and mikr13 and removed the request for Copilot on March 14, 2026 at 10:32
@coderabbitai

coderabbitai Bot commented Mar 14, 2026

Walkthrough

This PR adds duplicate detection and deduplication to the document import workflow. It enriches parsed items with existing memory references, provides UI controls to compare and select between new vs existing duplicates, strengthens memory storage with duplicate detection logic, and introduces cancellation support for document parsing operations.

Changes

Cohort / File(s) | Summary
Changeset Documentation
.changeset/fifty-words-lick.md, .changeset/huge-suns-prove.md, .changeset/tender-months-wish.md
Three changelog entries documenting patch, minor, and major version bumps; noting observability additions, prompt updates for deduplication, and UI preview of duplicate memory comparisons.
Document Parsing & Cancellation
src/lib/document/document-parser.ts
Introduces AbortSignal-based cancellation, caches pdfjs-dist module with worker setup, expands parseDocument API with options (requestId, onStageChange, signal), adds stage callbacks and telemetry hooks, implements deduplication and normalization (normalizeTags, normalizeKey, deduplicateItems), updates prompt guidance for memory merging.
Document Import Dialog
src/components/features/document/document-import-dialog.tsx
Adds memoized handleFileSelect with useCallback, debounces duplicate imports via lastImportKeyRef/lastImportTimeRef, integrates AbortController for cancellation, enriches parsed items with existingDuplicate metadata via findDuplicates lookup, adds toast error notifications, implements lifecycle-aware cleanup.
Import Dialog UI & Duplication Display
src/components/features/import/import-dialog-shared.tsx
Displays duplicate count badge in header, adds conditional styling for duplicate items, includes "New" vs "Currently saved" comparison sections, adds "Use new" and "Keep existing" action buttons to toggle between values, preserves existing selection behavior.
Memory Deduplication Logic
src/lib/storage/memories.ts
Adds deduplication via contentHash and secondary answerKey matching, introduces uniqueTags helper for normalization, updates addEntry to merge on duplicate detection (preserves ID, updates hash/tags/confidence/metadata), adds batched deduplication in addEntries, exports new findDuplicates function for candidate-to-existing matching.
Content Hash & Answer Key
src/lib/storage/content-hash.ts
Adds buildAnswerKey function that normalizes and concatenates answer and category with || delimiter for secondary duplicate detection key.
Import Type Extensions
src/types/import.ts
Extends BaseImportItem with optional existingDuplicate?: MemoryEntry field, imported MemoryEntry type from memory definitions.
Import Hook Logging
src/hooks/use-import-dialog.ts
Enhances success and error messages with item counts, introduces finally block to reliably reset isSaving state.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ImportDialog as Document<br/>Import Dialog
    participant Parser as Document<br/>Parser
    participant Memory as Memory<br/>Storage
    participant UI as Import Dialog<br/>UI

    User->>ImportDialog: Select PDF/TXT file
    ImportDialog->>ImportDialog: Debounce check<br/>(lastImportKey, time)
    ImportDialog->>ImportDialog: Create AbortController
    ImportDialog->>Parser: parseDocument(file, options)
    activate Parser
    Parser->>Parser: Read file (stage: reading)
    Parser->>Parser: Extract text (PDF/TXT)
    Parser->>Parser: Call AI parsing<br/>(stage: parsing)
    Parser->>Parser: Deduplicate & normalize<br/>items
    Parser-->>ImportDialog: Return items[]
    deactivate Parser
    
    ImportDialog->>Memory: findDuplicates(items, existing)
    activate Memory
    Memory->>Memory: Build index by<br/>contentHash & answerKey
    Memory-->>ImportDialog: Map<index, MemoryEntry>
    deactivate Memory
    
    ImportDialog->>ImportDialog: Enrich items with<br/>existingDuplicate
    ImportDialog-->>UI: Display items with<br/>duplicate metadata
    
    User->>UI: Compare & select<br/>"Use new" or "Keep existing"
    UI->>Memory: Save selection

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • feat(SUP-5): scraping of linkedin & PDF #69: Both PRs modify the document import/parsing stack—adding PDF parsing infrastructure (pdfjs-dist integration), the parseDocument API, and import dialog components with duplicate detection support.
  • feat: new features for v2 #45: Both PRs modify memory storage and deduplication surface (src/lib/storage/memories.ts, type extensions) to support duplicate detection and entry merging.
🚥 Pre-merge checks | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (❓ Inconclusive): No pull request description was provided by the author, making it impossible to assess relevance to the changeset. Resolution: add a detailed pull request description explaining the objectives, key changes, and how duplicate detection and deduplication improve the document parsing workflow.
  • Title check (❓ Inconclusive): The PR title is vague and generic, using the non-descriptive term 'improve' without specifying what aspects of doc parsing are being improved. Resolution: provide a more specific title, such as 'Add duplicate detection and memory deduplication in document import' or 'Add observability and duplicate handling to document parser'.





@coderabbitai Bot left a comment


Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.changeset/fifty-words-lick.md:
- Line 5: Replace the misspelled word "obersvability" with "observability" in
the release note text (the line reading "add obersvability in the document
parser") so the sentence reads "add observability in the document parser".

In @.changeset/huge-suns-prove.md:
- Line 5: Replace the misspelled word "redeundant" with "redundant" in the
changeset note — specifically update the phrase "Updated doc parser prompt to
stop redeundant memory creation and added logic for deduplication of memory" to
"Updated doc parser prompt to stop redundant memory creation and added logic for
deduplication of memory" so the release note reads correctly.

In @.changeset/tender-months-wish.md:
- Line 5: Fix the typo in the changeset description by replacing the misspelled
word "exisitng" with "existing" in the changeset text (the phrase "preview of
the new and exisitng duplicate memory in the UI"); update that exact string so
the sentence reads "preview of the new and existing duplicate memory in the UI".

In `@src/components/features/document/document-import-dialog.tsx`:
- Around line 150-151: The file input value is not cleared on all early-exit
paths in the try/catch around the parseDocument(file, ...) call in
DocumentImportDialog, which prevents re-selecting the same file; update the
handler so that every exit path (success, AbortError/cancellation, other errors,
and any early returns) clears the file input and resets any selectedFile state.
Concretely, locate the function containing parseDocument(file, ...) and ensure
you either call a shared cleanup helper (e.g., resetFileInput()) or directly set
fileInputRef.current.value = '' and setSelectedFile(null) in a finally block or
immediately before each return/throw (including AbortError handling), so the
input is always reset.
- Around line 176-183: After awaiting findDuplicates, the handler can overwrite
UI if a newer import started; capture the current request identifier before the
await (e.g., const currentRequestId = importRequestIdRef.current or a local
requestId variable used by the component) and immediately after the await check
that the global/latest request id still equals currentRequestId; if it does not
match, return early and do not call setImportItems or setStatus. Apply this
guard around the block that computes duplicatesMap, enrichedItems and calls
setImportItems/setStatus (referencing findDuplicates, duplicatesMap,
enrichedItems, setImportItems, setStatus).

In `@src/components/features/import/import-dialog-shared.tsx`:
- Around line 173-174: Remove the non-essential inline JSX comments in the
render block of the ImportDialogShared component: delete comments like {/* New
value from document */} and the similar comments around the adjacent divs so the
JSX is not cluttered; keep only comments that explain tricky/hacky/TODO behavior
and remove descriptive comments at the three locations mentioned (the comment
currently in the div with the new value and the other two nearby inline
comments).
- Around line 199-230: The two toggle-like buttons rendered near the "Use new" /
"Keep existing" labels don't expose their pressed state for assistive tech;
update the button elements in this component (the buttons that call
onToggleItem(item.id) and use the item.selected condition) to include
aria-pressed set to a boolean reflecting the current choice (e.g.,
aria-pressed={item.selected} for "Use new" and aria-pressed={!item.selected} for
"Keep existing"), leaving existing onClick logic intact so screen readers can
see which option is pressed.

In `@src/lib/document/document-parser.ts`:
- Around line 431-441: normalizeKey currently only trims/lowers and includes
label, which diverges from storage-side dedupe (which normalizes
Unicode/whitespace and collapses by answer+category), causing preview-time and
save-time mismatch; update normalizeKey (and the other similar functions around
lines 443-466 and 471-477) to reuse the exact storage normalization and keying
logic (call the shared storage normalizer or import the storage module) so the
preview dedupe uses the same Unicode/whitespace normalization and the same
composite key (answer+category, not label/question) that addEntries uses,
ensuring consistent duplicate detection between preview and save.
- Around line 80-90: The PDF and AI parsing functions don't regularly observe
the provided AbortSignal, so long-running work continues after cancellation;
update extractTextFromPDF to check signal.aborted inside the per-page loop and
abort immediately (throw an AbortError/DOMException) when observed, and update
parseDocumentWithAI (and callers like parseDocument) to re-check the signal
after any awaited model call and before returning success so a mid-flight AI
call doesn't cause a false success result; reference the AbortSignal parameter
(or add one) in extractTextFromPDF, parseDocumentWithAI, and parseDocument and
ensure all long-running loops and post-await paths throw on abort.

In `@src/lib/storage/content-hash.ts`:
- Around line 10-11: The current buildAnswerKey function concatenates
normalizePart(answer) and normalizePart(category) with "||", which can collide
if either part contains that delimiter; change buildAnswerKey to produce a
collision-safe key by encoding or delimiting parts unambiguously (e.g.,
base64-encode each normalized part or use a length-prefixed or JSON-based
encoding) and join the encoded pieces instead of raw strings; update the
implementation that references buildAnswerKey and ensure normalizePart remains
applied before encoding so lookups use the same deterministic encoding format.

In `@src/lib/storage/memories.ts`:
- Around line 192-220: Wrap the entire async body of findDuplicates in a
try/catch so failures from computeContentHash (called for existing entries and
candidates) do not reject the whole flow; inside the catch, log the error with
contextual info (e.g. candidates.length and which loop phase) and return an
empty Map as a safe fallback (or rethrow a new Error with that context if you
prefer failing loudly). Ensure you protect both loops that build
hashToEntry/answerKeyToEntry and the candidate iteration that computes
hash/answerKey, and include the function name findDuplicates and symbols
computeContentHash, hashToEntry, answerKeyToEntry in your log/context to aid
debugging.
- Around line 42-49: The secondary dedupe fallback using buildAnswerKey(answer,
category) can incorrectly collide unrelated memories that share an answer;
update the logic in the places referencing answerKey and existingIndex (used by
addEntry/addEntries and findDuplicates) so the fallback only applies to
legacy/questionless entries (e.g., where e.question is null/empty or a legacy
flag exists) or include a normalized question component when computing the key;
specifically, change the existingIndex computation to first match by
contentHash, then if not found only match by buildAnswerKey(e.answer,
e.category) when e.question is falsy (or otherwise normalize e.question and
include it in buildAnswerKey) to avoid destructive overwrites.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b1dfe4cd-9a90-4c5e-bafe-aa1669d10709

📥 Commits

Reviewing files that changed from the base of the PR and between 16b4d71 and 5acf4bb.

📒 Files selected for processing (10)
  • .changeset/fifty-words-lick.md
  • .changeset/huge-suns-prove.md
  • .changeset/tender-months-wish.md
  • src/components/features/document/document-import-dialog.tsx
  • src/components/features/import/import-dialog-shared.tsx
  • src/hooks/use-import-dialog.ts
  • src/lib/document/document-parser.ts
  • src/lib/storage/content-hash.ts
  • src/lib/storage/memories.ts
  • src/types/import.ts

Comment thread .changeset/fifty-words-lick.md Outdated
Comment thread .changeset/huge-suns-prove.md Outdated
Comment thread .changeset/tender-months-wish.md Outdated
Comment on lines +150 to +151
try {
const result = await parseDocument(file, {

⚠️ Potential issue | 🟠 Major

Ensure file input is reset on all exit paths.

There are early returns in the try/catch flow (e.g., cancellation / AbortError) before Line [203], so the file input may not be cleared. That can prevent selecting the same file again.

🔧 Suggested fix
-try {
+try {
   // existing logic
 } catch (err) {
   // existing logic
+} finally {
+  if (fileInputRef.current) {
+    fileInputRef.current.value = "";
+  }
 }
-
-if (fileInputRef.current) {
-  fileInputRef.current.value = "";
-}

Also applies to: 163-170, 190-205

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/components/features/document/document-import-dialog.tsx` around lines 150
- 151, The file input value is not cleared on all early-exit paths in the
try/catch around the parseDocument(file, ...) call in DocumentImportDialog,
which prevents re-selecting the same file; update the handler so that every exit
path (success, AbortError/cancellation, other errors, and any early returns)
clears the file input and resets any selectedFile state. Concretely, locate the
function containing parseDocument(file, ...) and ensure you either call a shared
cleanup helper (e.g., resetFileInput()) or directly set
fileInputRef.current.value = '' and setSelectedFile(null) in a finally block or
immediately before each return/throw (including AbortError handling), so the
input is always reset.

Comment on lines +176 to +183
const duplicatesMap = await findDuplicates(items, existingMemories);
const enrichedItems = items.map((item, i) => {
const duplicate = duplicatesMap.get(i);
return duplicate ? { ...item, existingDuplicate: duplicate } : item;
});

setImportItems(enrichedItems);
setStatus("success");

⚠️ Potential issue | 🟠 Major

Add stale-request guard after duplicate lookup.

findDuplicates(...) is awaited, but there’s no request-id recheck afterward. A newer import can start during that await, and this older request can still overwrite UI state.

🔧 Suggested fix
 const duplicatesMap = await findDuplicates(items, existingMemories);
+if (requestIdRef.current !== currentRequestId) return;

 const enrichedItems = items.map((item, i) => {
   const duplicate = duplicatesMap.get(i);
   return duplicate ? { ...item, existingDuplicate: duplicate } : item;
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/components/features/document/document-import-dialog.tsx` around lines 176
- 183, After awaiting findDuplicates, the handler can overwrite UI if a newer
import started; capture the current request identifier before the await (e.g.,
const currentRequestId = importRequestIdRef.current or a local requestId
variable used by the component) and immediately after the await check that the
global/latest request id still equals currentRequestId; if it does not match,
return early and do not call setImportItems or setStatus. Apply this guard
around the block that computes duplicatesMap, enrichedItems and calls
setImportItems/setStatus (referencing findDuplicates, duplicatesMap,
enrichedItems, setImportItems, setStatus).

Comment on lines 80 to 90
 export async function extractTextFromPDF(file: File): Promise<string> {
-  const pdfjsLib = await import("pdfjs-dist");
-
-  pdfjsLib.GlobalWorkerOptions.workerSrc =
-    browser.runtime.getURL("/pdf.worker.mjs");
+  // Cache pdfjs module + worker URL so it's only initialized once per session
+  if (!_pdfjsLib) {
+    _pdfjsLib = await import("pdfjs-dist");
+    _pdfjsLib.GlobalWorkerOptions.workerSrc =
+      browser.runtime.getURL("/pdf.worker.mjs");
+  }
+  const pdfjsLib = _pdfjsLib;

   const arrayBuffer = await file.arrayBuffer();
   const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;

⚠️ Potential issue | 🟠 Major

Abort support still misses the longest-running work.

Cancellation is only checked at phase boundaries. During PDF parsing, extractTextFromPDF(...) keeps walking every page after the caller aborts; during AI extraction, a mid-flight abort is ignored entirely because parseDocumentWithAI(...) never re-checks the signal after the awaited model call, so parseDocument(...) can still return success: true after the dialog was cancelled.

Possible fix
-export async function extractTextFromPDF(file: File): Promise<string> {
+export async function extractTextFromPDF(
+  file: File,
+  signal?: AbortSignal,
+): Promise<string> {
   // Cache pdfjs module + worker URL so it's only initialized once per session
   if (!_pdfjsLib) {
     _pdfjsLib = await import("pdfjs-dist");
     _pdfjsLib.GlobalWorkerOptions.workerSrc =
       browser.runtime.getURL("/pdf.worker.mjs");
@@
   for (let i = 1; i <= pdf.numPages; i++) {
+    throwIfAborted(signal);
     const page = await pdf.getPage(i);
     const textContent = await page.getTextContent();
@@
     const result = await generateObject({
       model,
       schema: ExtractedInfoSchema,
@@
     });
     object = result.object;
+    throwIfAborted(signal);
@@
-      text = await extractTextFromPDF(file);
+      text = await extractTextFromPDF(file, signal);

Also applies to: 95-146, 210-211, 251-269, 351-358, 387-397

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/document/document-parser.ts` around lines 80 - 90, The PDF and AI
parsing functions don't regularly observe the provided AbortSignal, so
long-running work continues after cancellation; update extractTextFromPDF to
check signal.aborted inside the per-page loop and abort immediately (throw an
AbortError/DOMException) when observed, and update parseDocumentWithAI (and
callers like parseDocument) to re-check the signal after any awaited model call
and before returning success so a mid-flight AI call doesn't cause a false
success result; reference the AbortSignal parameter (or add one) in
extractTextFromPDF, parseDocumentWithAI, and parseDocument and ensure all
long-running loops and post-await paths throw on abort.

Comment on lines +431 to +441
function normalizeKey(item: ExtractedItem): string {
const safe = (value: string) => value.trim().toLowerCase();
return [
safe(item.label),
safe(item.question),
safe(item.answer),
item.category,
]
.map((part) => part || "")
.join("|");
}

⚠️ Potential issue | 🟠 Major

Parser-side dedupe key does not match storage-side duplicate matching.

normalizeKey(...) only trims/lowercases and also includes label, while storage dedupe normalizes Unicode/whitespace differently and can collapse records by answer + category even when labels/questions differ. That means the import preview can keep rows that addEntries(...) later merges, or flag a different duplicate than storage will. Reuse the same normalization/keying strategy as the storage layer so preview-time dedupe and save-time dedupe stay consistent.

Also applies to: 443-466, 471-477

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/document/document-parser.ts` around lines 431 - 441, normalizeKey
currently only trims/lowers and includes label, which diverges from storage-side
dedupe (which normalizes Unicode/whitespace and collapses by answer+category),
causing preview-time and save-time mismatch; update normalizeKey (and the other
similar functions around lines 443-466 and 471-477) to reuse the exact storage
normalization and keying logic (call the shared storage normalizer or import the
storage module) so the preview dedupe uses the same Unicode/whitespace
normalization and the same composite key (answer+category, not label/question)
that addEntries uses, ensuring consistent duplicate detection between preview
and save.

Comment on lines +10 to +11
export const buildAnswerKey = (answer: string, category: string): string =>
`${normalizePart(answer)}||${normalizePart(category)}`;

⚠️ Potential issue | 🟠 Major

Use collision-safe encoding for buildAnswerKey.

At Line [11], joining normalized parts with || can collide when input values include the delimiter, causing false duplicate matches.

🔧 Suggested fix
 export const buildAnswerKey = (answer: string, category: string): string =>
-  `${normalizePart(answer)}||${normalizePart(category)}`;
+  JSON.stringify([normalizePart(answer), normalizePart(category)]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/storage/content-hash.ts` around lines 10 - 11, The current
buildAnswerKey function concatenates normalizePart(answer) and
normalizePart(category) with "||", which can collide if either part contains
that delimiter; change buildAnswerKey to produce a collision-safe key by
encoding or delimiting parts unambiguously (e.g., base64-encode each normalized
part or use a length-prefixed or JSON-based encoding) and join the encoded
pieces instead of raw strings; update the implementation that references
buildAnswerKey and ensure normalizePart remains applied before encoding so
lookups use the same deterministic encoding format.

Comment on lines +42 to +49
const answerKey = buildAnswerKey(entry.answer, entry.category);

const existingIndex =
currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
? currentEntries.findIndex((e) => e.contentHash === contentHash)
: currentEntries.findIndex(
(e) => buildAnswerKey(e.answer, e.category) === answerKey,
);

⚠️ Potential issue | 🔴 Critical

Secondary dedupe on answer + category can overwrite unrelated memories.

buildAnswerKey(...) ignores question, so two different facts in the same category that happen to share an answer will collide here. A common case is two location memories with "New York" (for example, birth city vs current city): addEntry/addEntries will overwrite one with the other, and findDuplicates will point the UI at the wrong existing record. This fallback needs to be restricted to truly questionless legacy rows, or it needs a normalized question component before it is used for destructive merges. Based on learnings, "Maintain backward compatibility with stored data when making changes".

Also applies to: 120-123, 136-137, 204-214

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/storage/memories.ts` around lines 42 - 49, The secondary dedupe
fallback using buildAnswerKey(answer, category) can incorrectly collide
unrelated memories that share an answer; update the logic in the places
referencing answerKey and existingIndex (used by addEntry/addEntries and
findDuplicates) so the fallback only applies to legacy/questionless entries
(e.g., where e.question is null/empty or a legacy flag exists) or include a
normalized question component when computing the key; specifically, change the
existingIndex computation to first match by contentHash, then if not found only
match by buildAnswerKey(e.answer, e.category) when e.question is falsy (or
otherwise normalize e.question and include it in buildAnswerKey) to avoid
destructive overwrites.

Comment on lines +192 to +220
export const findDuplicates = async (
candidates: Array<{ question: string; answer: string; category: string }>,
existing: MemoryEntry[],
): Promise<Map<number, MemoryEntry>> => {
const hashToEntry = new Map<string, MemoryEntry>();
const answerKeyToEntry = new Map<string, MemoryEntry>();

for (const entry of existing) {
const hash =
entry.contentHash ||
(await computeContentHash(entry.question, entry.answer, entry.category));
hashToEntry.set(hash, entry);
answerKeyToEntry.set(buildAnswerKey(entry.answer, entry.category), entry);
}

const result = new Map<number, MemoryEntry>();

for (let i = 0; i < candidates.length; i++) {
const c = candidates[i];
const hash = await computeContentHash(c.question, c.answer, c.category);
const answerKey = buildAnswerKey(c.answer, c.category);

const matched = hashToEntry.get(hash) ?? answerKeyToEntry.get(answerKey);
if (matched) {
result.set(i, matched);
}
}

return result;

⚠️ Potential issue | 🟡 Minor

findDuplicates is the only async storage helper here without local error handling.

This path awaits computeContentHash(...) for both existing and candidate entries, but any failure currently rejects the whole duplicate-detection flow without the logging/context the rest of this module provides. Wrap the body in try/catch and either rethrow with context or return a safe fallback if duplicate detection is best-effort.

Possible fix
 export const findDuplicates = async (
   candidates: Array<{ question: string; answer: string; category: string }>,
   existing: MemoryEntry[],
 ): Promise<Map<number, MemoryEntry>> => {
-  const hashToEntry = new Map<string, MemoryEntry>();
-  const answerKeyToEntry = new Map<string, MemoryEntry>();
-
-  for (const entry of existing) {
-    const hash =
-      entry.contentHash ||
-      (await computeContentHash(entry.question, entry.answer, entry.category));
-    hashToEntry.set(hash, entry);
-    answerKeyToEntry.set(buildAnswerKey(entry.answer, entry.category), entry);
-  }
-
-  const result = new Map<number, MemoryEntry>();
-
-  for (let i = 0; i < candidates.length; i++) {
-    const c = candidates[i];
-    const hash = await computeContentHash(c.question, c.answer, c.category);
-    const answerKey = buildAnswerKey(c.answer, c.category);
-
-    const matched = hashToEntry.get(hash) ?? answerKeyToEntry.get(answerKey);
-    if (matched) {
-      result.set(i, matched);
-    }
-  }
-
-  return result;
+  try {
+    const hashToEntry = new Map<string, MemoryEntry>();
+    const answerKeyToEntry = new Map<string, MemoryEntry>();
+
+    for (const entry of existing) {
+      const hash =
+        entry.contentHash ||
+        (await computeContentHash(entry.question, entry.answer, entry.category));
+      hashToEntry.set(hash, entry);
+      answerKeyToEntry.set(buildAnswerKey(entry.answer, entry.category), entry);
+    }
+
+    const result = new Map<number, MemoryEntry>();
+
+    for (let i = 0; i < candidates.length; i++) {
+      const c = candidates[i];
+      const hash = await computeContentHash(c.question, c.answer, c.category);
+      const answerKey = buildAnswerKey(c.answer, c.category);
+
+      const matched = hashToEntry.get(hash) ?? answerKeyToEntry.get(answerKey);
+      if (matched) {
+        result.set(i, matched);
+      }
+    }
+
+    return result;
+  } catch (error) {
+    logger.error("Failed to find duplicates:", error);
+    throw error;
+  }
 };

As per coding guidelines, "Handle errors gracefully with try-catch blocks and null checks in all async operations".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/storage/memories.ts` around lines 192 - 220, Wrap the entire async
body of findDuplicates in a try/catch so failures from computeContentHash
(called for existing entries and candidates) do not reject the whole flow;
inside the catch, log the error with contextual info (e.g. candidates.length and
which loop phase) and return an empty Map as a safe fallback (or rethrow a new
Error with that context if you prefer failing loudly). Ensure you protect both
loops that build hashToEntry/answerKeyToEntry and the candidate iteration that
computes hash/answerKey, and include the function name findDuplicates and
symbols computeContentHash, hashToEntry, answerKeyToEntry in your log/context to
aid debugging.

Removed comments related to import debouncing and cancellation.
Copilot AI review requested due to automatic review settings March 14, 2026 10:42
mikr13 added 5 commits March 14, 2026 16:13
Removed commented-out code for clarity.
Remove comment about cached pdfjs module.
Removed telemetry-related debug code from document parser.
Remove comment about secondary dedup key in content-hash.ts
Contributor

Copilot AI left a comment

Pull request overview

This PR improves the document import flow by reducing duplicate memories, previewing potential overwrites in the UI, and adding more robust parsing/observability controls around document-to-memory extraction.

Changes:

  • Add deduplication + duplicate-detection helpers in storage (contentHash primary, answer+category fallback) and surface duplicates in the import UI.
  • Enhance document parsing with stage reporting, cancellation, pdfjs module caching, prompt updates, and optional telemetry hooks.
  • Update import dialog UI to highlight duplicates and allow “Use new / Keep existing” selection per item.
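The hash-primary / answer-key-fallback matching these bullets describe can be sketched as below. The normalization steps (trim, lowercase, whitespace collapse) are assumptions for illustration, not necessarily what the PR's `buildAnswerKey` implements:

```typescript
// Hypothetical sketch of an answer+category fallback key. The exact
// normalization in the PR's buildAnswerKey may differ; this shows why a
// fallback key catches duplicates whose content hashes (which may also
// cover the question) do not match.
const buildAnswerKeySketch = (answer: string, category: string): string => {
  // Normalize: trim, lowercase, collapse runs of whitespace.
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  return `${norm(category)}::${norm(answer)}`;
};

const a = buildAnswerKeySketch("  New York ", "address");
const b = buildAnswerKeySketch("new york", "Address");
// a and b collapse to the same key ("address::new york"), so the two
// entries would be flagged as duplicates despite surface differences.
```

A lookup map keyed by this value then serves as the secondary check after the content-hash map misses.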

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/types/import.ts Adds existingDuplicate?: MemoryEntry to import items so UI can render duplicate previews.
src/lib/storage/memories.ts Updates addEntry/addEntries to merge on duplicates; adds findDuplicates helper; normalizes/uniquifies tags.
src/lib/storage/content-hash.ts Introduces buildAnswerKey(answer, category) as a secondary deduplication key.
src/lib/document/document-parser.ts Adds parse options (requestId/stage/cancel), pdfjs caching, telemetry hooks, prompt tweaks, and extracted-item deduplication.
src/hooks/use-import-dialog.ts Improves import success/error logging details.
src/components/features/import/import-dialog-shared.tsx Renders duplicate badges, highlights duplicate rows, and shows “new vs currently saved” previews with toggle buttons.
src/components/features/document/document-import-dialog.tsx Wires up new parse options, cancellation, duplicate detection, and error toasts for document imports.
.changeset/tender-months-wish.md Release note for dedup + duplicate preview (contains typos).
.changeset/huge-suns-prove.md Release note for prompt + dedup work (contains typos).
.changeset/fifty-words-lick.md Release note for observability work (contains typos).


Comment thread .changeset/huge-suns-prove.md Outdated
"superfill.ai": minor
---

Updated doc parser prompt to stop redeundant memory creation and added logic for deduplication of memory
Comment thread .changeset/fifty-words-lick.md Outdated
"superfill.ai": patch
---

add obersvability in the document parser
Comment on lines +134 to +138
const answerKey = buildAnswerKey(entry.answer, entry.category);

const existingIndex =
  hashToIndex.get(contentHash) ?? answerKeyToIndex.get(answerKey);


const logger = createLogger("document-parser");

let _pdfjsLib: typeof PdfjsDist | null = null;
Comment on lines +42 to +49
const answerKey = buildAnswerKey(entry.answer, entry.category);

const existingIndex =
  currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
    ? currentEntries.findIndex((e) => e.contentHash === contentHash)
    : currentEntries.findIndex(
        (e) => buildAnswerKey(e.answer, e.category) === answerKey,
      );
Comment on lines +44 to +46
const existingIndex =
  currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
    ? currentEntries.findIndex((e) => e.contentHash === contentHash)
Comment on lines +199 to +230
        e.preventDefault();
        e.stopPropagation();
        if (!item.selected) onToggleItem(item.id);
      }}
      className={cn(
        "text-[11px] px-2 py-0.5 rounded border font-medium transition-colors",
        item.selected
          ? "bg-primary text-primary-foreground border-primary"
          : "bg-background text-muted-foreground border-border hover:border-primary hover:text-primary",
      )}
    >
      Use new
    </button>
    <button
      type="button"
      onClick={(e) => {
        e.preventDefault();
        e.stopPropagation();
        if (item.selected) onToggleItem(item.id);
      }}
      className={cn(
        "text-[11px] px-2 py-0.5 rounded border font-medium transition-colors",
        !item.selected
          ? "bg-primary text-primary-foreground border-primary"
          : "bg-background text-muted-foreground border-border hover:border-primary hover:text-primary",
      )}
    >
      Keep existing
    </button>
  </div>
</>
)}
Comment thread .changeset/tender-months-wish.md Outdated
"superfill.ai": major
---

Updated the deduplication logic to give the preview of the new and exisitng duplicate memory in the UI
Copilot AI review requested due to automatic review settings March 14, 2026 10:48
mikr13 added 2 commits March 14, 2026 16:20
Removed unused documentation comment for findDuplicates function.
Remove comment about existingDuplicate property.
Contributor

Copilot AI left a comment

Pull request overview

Enhances document import parsing and memory deduplication, including a UI preview for duplicates and improved observability/cancellation behavior during document parsing.

Changes:

  • Add answer+category fallback keying (buildAnswerKey) and apply consistent dedupe/merge behavior in memory storage (single + bulk adds) plus a findDuplicates helper for UI preview.
  • Improve document parsing flow with staged progress (“reading”/“parsing”), request-scoped logging/telemetry, cancellation handling, and deduplication of extracted items.
  • Update import UI to highlight duplicates and show “new vs currently saved” content, plus import dialog logging updates.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/types/import.ts Adds existingDuplicate?: MemoryEntry to support duplicate previews in import UIs.
src/lib/storage/content-hash.ts Introduces buildAnswerKey(answer, category) for fallback dedup matching.
src/lib/storage/memories.ts Implements merge-on-duplicate in addEntry/addEntries and adds findDuplicates for preview.
src/lib/document/document-parser.ts Adds staged parsing options, telemetry/logging improvements, partial cancellation support, and extracted-item deduping.
src/components/features/document/document-import-dialog.tsx Adds progress stages, abort handling, duplicate preview enrichment, and duplicate-import suppression.
src/components/features/import/import-dialog-shared.tsx Adds duplicate highlighting and “Use new / Keep existing” controls in the list UI.
src/hooks/use-import-dialog.ts Improves import success/error logging context.
.changeset/tender-months-wish.md Adds a changeset for duplicate preview/dedupe updates (currently marked major).
.changeset/huge-suns-prove.md Adds a changeset for doc parser prompt + dedupe behavior.
.changeset/fifty-words-lick.md Adds a changeset for document parser observability.


  </p>
  <p className="text-xs text-muted-foreground/70 line-clamp-2 italic">
    {item.existingDuplicate.answer}
  </p>
Comment on lines +44 to +46
const existingIndex =
  currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
    ? currentEntries.findIndex((e) => e.contentHash === contentHash)
Comment on lines +238 to +260
try {
  const result = await generateObject({
    model,
    schema: ExtractedInfoSchema,
    schemaName: "ExtractedInfo",
    schemaDescription:
      "Information extracted from a document for form filling",
    system: DOCUMENT_PARSING_PROMPT,
    prompt: `Extract all useful personal and professional information from this document:\n\n${text}`,
    temperature: 0.1,
    experimental_telemetry: {
      isEnabled: DEBUG,
      functionId: "document-parsing",
      metadata: {
        provider: selectedProvider,
        textLength: text.length,
      },
    },
  });
  object = result.object;
} catch (error) {
  throw error;
}
Comment on lines +196 to +228
<button
  type="button"
  onClick={(e) => {
    e.preventDefault();
    e.stopPropagation();
    if (!item.selected) onToggleItem(item.id);
  }}
  className={cn(
    "text-[11px] px-2 py-0.5 rounded border font-medium transition-colors",
    item.selected
      ? "bg-primary text-primary-foreground border-primary"
      : "bg-background text-muted-foreground border-border hover:border-primary hover:text-primary",
  )}
>
  Use new
</button>
<button
  type="button"
  onClick={(e) => {
    e.preventDefault();
    e.stopPropagation();
    if (item.selected) onToggleItem(item.id);
  }}
  className={cn(
    "text-[11px] px-2 py-0.5 rounded border font-medium transition-colors",
    !item.selected
      ? "bg-primary text-primary-foreground border-primary"
      : "bg-background text-muted-foreground border-border hover:border-primary hover:text-primary",
  )}
>
  Keep existing
</button>
</div>
Comment thread .changeset/tender-months-wish.md Outdated
"superfill.ai": major
---

Updated the deduplication logic to give the preview of the new and exisitng duplicate memory in the UI
@@ -0,0 +1,5 @@
---
"superfill.ai": major
Comment thread .changeset/huge-suns-prove.md Outdated
"superfill.ai": minor
---

Updated doc parser prompt to stop redeundant memory creation and added logic for deduplication of memory
Comment thread .changeset/fifty-words-lick.md Outdated
"superfill.ai": patch
---

add obersvability in the document parser
Copilot AI review requested due to automatic review settings March 14, 2026 10:53
Contributor

Copilot AI left a comment

Pull request overview

This PR improves document parsing/import by reducing duplicate memory creation, enhancing deduplication/merge behavior in storage, and surfacing duplicate previews + richer parsing telemetry/cancellation hooks in the UI.

Changes:

  • Add answer+category–based dedup fallback (buildAnswerKey) and extend memory add/bulk-add to merge duplicates (tags/confidence/updatedAt).
  • Update document parser prompt + add item deduplication, parsing stages, request-scoped logging, and cancellation plumbing.
  • Enhance import UI to highlight duplicates and show “new vs currently saved” previews.
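The merge-on-duplicate behavior the first bullet mentions (tags/confidence/updatedAt) can be sketched minimally as below. The field names come from the review text, but the policy shown (union tags, keep higher confidence, bump updatedAt) is an assumption — the actual logic in `addEntry`/`addEntries` may differ:

```typescript
interface EntrySketch {
  answer: string;
  tags: string[];
  confidence: number;
  updatedAt: number;
}

// Hypothetical merge applied when an incoming entry matches an existing
// one by content hash or answer key: union the tag sets, keep the higher
// confidence score, and refresh the update timestamp.
const mergeDuplicate = (
  existing: EntrySketch,
  incoming: EntrySketch,
  now: number,
): EntrySketch => ({
  ...existing,
  tags: [...new Set([...existing.tags, ...incoming.tags])],
  confidence: Math.max(existing.confidence, incoming.confidence),
  updatedAt: now,
});
```

Merging in place, rather than appending, is what keeps a re-imported document from inflating the memory store with near-identical rows.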

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/types/import.ts Extends import item type to carry an optional existing-duplicate MemoryEntry.
src/lib/storage/memories.ts Implements dedup/merge on add/addEntries and adds findDuplicates helper.
src/lib/storage/content-hash.ts Adds buildAnswerKey for answer+category fallback matching.
src/lib/document/document-parser.ts Improves parsing prompt, adds cancellation checks, telemetry, and dedupes extracted items.
src/hooks/use-import-dialog.ts Improves import logging on success/error.
src/components/features/import/import-dialog-shared.tsx UI: highlights duplicates and shows “new vs existing” details with actions.
src/components/features/document/document-import-dialog.tsx Adds staged progress (“reading/parsing”), abort handling, duplicate detection, and toasts.
.changeset/tender-months-wish.md Changeset entry for dedup/duplicate-preview UI (contains a typo).
.changeset/huge-suns-prove.md Changeset entry for doc parser prompt + dedup logic.
.changeset/fifty-words-lick.md Changeset entry for parser observability.


Comment on lines +44 to +46
const existingIndex =
  currentEntries.findIndex((e) => e.contentHash === contentHash) !== -1
    ? currentEntries.findIndex((e) => e.contentHash === contentHash)
Comment on lines +105 to +118
for (let index = 0; index < updatedEntries.length; index++) {
  const existing = updatedEntries[index];
  const existingHash =
    existing.contentHash ||
    (await computeContentHash(
      existing.question,
      existing.answer,
      existing.category,
    ));

  if (!existing.contentHash) {
    updatedEntries[index] = { ...existing, contentHash: existingHash };
  }

Comment on lines +193 to +205
for (const entry of existing) {
  const hash =
    entry.contentHash ||
    (await computeContentHash(entry.question, entry.answer, entry.category));
  hashToEntry.set(hash, entry);
  answerKeyToEntry.set(buildAnswerKey(entry.answer, entry.category), entry);
}

const result = new Map<number, MemoryEntry>();

for (let i = 0; i < candidates.length; i++) {
  const c = candidates[i];
  const hash = await computeContentHash(c.question, c.answer, c.category);
}

const pdfjsLib = _pdfjsLib;
Comment on lines +119 to +122
logger.error(
  `Import error — attempted: ${importItems.filter((i) => i.selected).length} items —`,
  err,
);
"superfill.ai": major
---

Updated the deduplication logic to give the preview of the new and existing duplicate memory in the UI
Comment on lines +238 to +260
try {
  const result = await generateObject({
    model,
    schema: ExtractedInfoSchema,
    schemaName: "ExtractedInfo",
    schemaDescription:
      "Information extracted from a document for form filling",
    system: DOCUMENT_PARSING_PROMPT,
    prompt: `Extract all useful personal and professional information from this document:\n\n${text}`,
    temperature: 0.1,
    experimental_telemetry: {
      isEnabled: DEBUG,
      functionId: "document-parsing",
      metadata: {
        provider: selectedProvider,
        textLength: text.length,
      },
    },
  });
  object = result.object;
} catch (error) {
  throw error;
}
Comment on lines +239 to +243
const result = await generateObject({
  model,
  schema: ExtractedInfoSchema,
  schemaName: "ExtractedInfo",
  schemaDescription:
Comment on lines +148 to +184
const result = await parseDocument(file, {
  requestId,
  signal: controller.signal,
  onStageChange: (stage) => {
    if (requestIdRef.current !== currentRequestId) return;
    setStatus(stage);
  },
});

if (requestIdRef.current !== currentRequestId) return;

if (!result.success || !result.items) {
  if (result.error === "cancelled") return;

  const errorMsg =
    result.error || "Failed to extract data from document";
  setStatus("error");
  setError(errorMsg);
  toast.error(errorMsg);
  return;
}

const items = convertToImportItems(result.items);
const duplicatesMap = await findDuplicates(items, existingMemories);
const enrichedItems = items.map((item, i) => {
  const duplicate = duplicatesMap.get(i);
  return duplicate ? { ...item, existingDuplicate: duplicate } : item;
});

setImportItems(enrichedItems);
setStatus("success");

logger.debug(
  `[req:${requestId}] Successfully extracted document data:`,
  items.length,
  "items",
);
Comment on lines +133 to +137
lastImportKeyRef.current = importKey;
lastImportTimeRef.current = now;
abortControllerRef.current?.abort();
const controller = new AbortController();
abortControllerRef.current = controller;
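The debounce-and-abort pattern this snippet hints at can be sketched in isolation. The gate function below is hypothetical — the dialog uses refs inside a React component rather than a closure — and the debounce window is an assumed parameter:

```typescript
// Hypothetical import gate mirroring lastImportKeyRef/lastImportTimeRef/
// abortControllerRef: a repeat of the same import key inside the window is
// ignored; an accepted import aborts any in-flight parse and hands back a
// fresh AbortSignal for the new one.
const makeImportGate = (windowMs: number) => {
  let lastKey: string | null = null;
  let lastTime = 0;
  let controller: AbortController | null = null;
  return (key: string, now: number): AbortSignal | null => {
    if (key === lastKey && now - lastTime < windowMs) {
      return null; // duplicate import suppressed
    }
    lastKey = key;
    lastTime = now;
    controller?.abort(); // cancel the previous in-flight parse
    controller = new AbortController();
    return controller.signal;
  };
};
```

A key built from file name plus size or mtime would distinguish genuinely different files while still catching accidental double-clicks of the same one.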
@mikr13 mikr13 changed the title Improve doc parsing docs: improve doc parsing Mar 17, 2026
@mikr13 mikr13 merged commit eb872ae into main Mar 17, 2026
5 of 7 checks passed
@mikr13 mikr13 deleted the improve-doc-parsing branch March 17, 2026 17:06