fix: deduplicate imports on retry and clean up error sources after import#242

Open
Andre-4711 wants to merge 1 commit into teng-lin:main from Andre-4711:fix/dedup-import-cleanup-error-sources

Conversation


@Andre-4711 Andre-4711 commented Apr 3, 2026

Summary

Fixes three related issues with `research wait --import-all` that cause duplicate and broken sources in notebooks.

Closes #241

Changes

1. Dedup on retry (import_with_retry in helpers.py)

Before each RPCTimeoutError retry, fetch the existing notebook sources and filter already-imported URLs out of pending_sources. Previously, every retry re-imported the full source list, resulting in e.g. 6× duplicates for 6 retry attempts (414 sources instead of ~67).
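The dedup step can be sketched as a pure function (the helper name `dedup_pending` and the example URLs are made up; the real code works on the client's source objects):

```python
def dedup_pending(pending_sources: list[dict], existing_urls: set[str]) -> list[dict]:
    """Keep only sources whose URL is not already present in the notebook."""
    return [s for s in pending_sources if s.get("url") not in existing_urls]


pending = [{"url": "https://example.org/a"}, {"url": "https://example.org/b"}]
already_imported = {"https://example.org/a"}
print(dedup_pending(pending, already_imported))  # [{'url': 'https://example.org/b'}]
```

Running this filter before every retry is what keeps a 6-attempt import from multiplying the source list sixfold.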

2. Pre-filter unsupported sources (filter_unsupported_sources in helpers.py)

Skip direct PDF/document URLs and known download path patterns (/fileadmin/, /SharedDocs/Downloads/, /sites/default/files/, /download/) before sending sources to NotebookLM. These consistently result in error status since NotebookLM cannot fetch and parse them as web pages.
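The filtering rule can be sketched as follows; the function name and the exact extension tuple are illustrative, but the blocked path patterns are the ones listed above. Checking the URL path (rather than the full URL) keeps query strings and mixed-case paths from hiding matches:

```python
from urllib.parse import urlsplit

# Illustrative subset of document/archive extensions; patterns from the PR description
DOC_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
BLOCKED_PATTERNS = ("/fileadmin/", "/shareddocs/downloads/", "/sites/default/files/", "/download/")


def is_unsupported(url: str) -> bool:
    # Compare against the lowercased path component only
    path = urlsplit(url).path.lower()
    return path.endswith(DOC_EXTENSIONS) or any(p in path for p in BLOCKED_PATTERNS)


print(is_unsupported("https://example.org/report.pdf?dl=1"))        # True
print(is_unsupported("https://example.org/SharedDocs/Downloads/x"))  # True
print(is_unsupported("https://example.org/article.html"))            # False
```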

3. Post-import cleanup (cleanup_error_sources in helpers.py)

After import completes, automatically delete any sources that ended up with error status. Keeps the notebook clean without requiring manual intervention.
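The selection step can be sketched like this; `Source` here is a stand-in dataclass for the client's source objects (which expose attributes, not dict keys), and the status string is an assumption:

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Stand-in for the client's Source object; attribute access, not dict access."""
    id: str
    status: str


def find_error_source_ids(sources: list[Source]) -> list[str]:
    """Collect IDs of sources that ended up in error status after import."""
    return [s.id for s in sources if s.status == "error" and s.id]


batch = [Source("s1", "completed"), Source("s2", "error"), Source("s3", "error")]
print(find_error_source_ids(batch))  # ['s2', 's3']
```

The real helper would then delete each returned ID via the client API.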

Behavior change

`research wait --import-all` now prints:

Skipping 12 unsupported source(s) (PDFs/downloads) before import
Imported 55 sources
Removed 3 failed source(s) after import

JSON output gains an additional error_sources_removed field.

Testing

Reproduced with a deep research query returning 67 sources, 6 timeout retries, and ~12 PDF/download URLs.

Summary by CodeRabbit

Release Notes

  • Improvements
    • Enhanced source import handling with automatic filtering of unsupported file types during notebook creation.
    • Added automatic cleanup of sources that fail to import.
    • Improved import retry mechanism to prevent accidental duplicate source additions.

…port

Three related improvements to `research wait --import-all`:

1. **Dedup on retry** (`import_with_retry`): Before each RPCTimeout retry,
   fetch existing notebook sources and filter out already-imported URLs from
   `pending_sources`. Prevents 6x duplicates when multiple retries occur.

2. **Pre-filter unsupported sources** (`filter_unsupported_sources`): Skip
   direct PDF/document URLs and known download paths (e.g. /fileadmin/,
   /SharedDocs/Downloads/) before sending to NotebookLM. These consistently
   fail with error status since NotebookLM cannot parse them as web pages.

3. **Post-import cleanup** (`cleanup_error_sources`): After import completes,
   automatically delete any sources that ended up in error status. Keeps the
   notebook clean without manual intervention.

Fixes teng-lin#241

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough


The changes add source validation and cleanup utilities while enhancing the import retry mechanism to prevent duplicate sources when RPC timeouts occur. The retry now maintains a deduplicated list of pending sources by checking against already-imported URLs before each retry attempt.

Changes

  • Source Management Helpers (src/notebooklm/cli/helpers.py): Added filter_unsupported_sources() to pre-filter sources by removing URLs ending with document/archive extensions or containing blocked patterns. Added cleanup_error_sources() to delete sources with error status from a notebook.
  • Deduplication on Timeout Retry (src/notebooklm/cli/helpers.py): Enhanced import_with_retry() to maintain a pending_sources list that deduplicates against already-imported sources on RPCTimeoutError; attempts to fetch existing source URLs and removes duplicates before retry, falling back gracefully if listing fails.
  • Research Command Integration (src/notebooklm/cli/research.py): Updated research wait --import-all to filter sources before import and clean up error sources afterward; integrates the new helper functions in both JSON and non-JSON output paths.

Sequence Diagram

sequenceDiagram
    participant Client as Client/CLI
    participant Importer as import_with_retry
    participant Server as NotebookLM API
    participant DB as Notebook Sources

    Client->>Importer: Call with sources list
    Importer->>Server: import_sources(pending_sources)
    Server--xImporter: RPCTimeoutError
    
    rect rgba(255, 100, 100, 0.5)
        Note over Importer: Deduplication on Retry
        Importer->>Server: list existing sources
        Server->>DB: fetch all sources
        DB-->>Server: existing sources + URLs
        Server-->>Importer: existing sources
        Importer->>Importer: Build URL set<br/>Filter pending_sources
    end
    
    Importer->>Server: import_sources(filtered_pending)
    Server->>DB: add deduplicated sources
    DB-->>Server: success
    Server-->>Importer: result
    Importer-->>Client: imported sources

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 When timeouts arrive with a timeout's harsh call,
We fetch what exists and don't reimport all!
Filter, dedupe, and gracefully retry,
No more 414 sources—just 67! Hooray! 🎉
Sources now filtered with wisdom and care

🚥 Pre-merge checks: ✅ 5 passed
  • Description Check (Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (Passed): The title clearly and concisely summarizes the primary changes, deduplicating imports on retry and cleaning up error sources after import, matching the main objectives.
  • Linked Issues Check (Passed): The pull request implements all coding requirements from issue #241: deduplication logic in import_with_retry, maintaining a pending_sources list, and filtering on RPCTimeoutError with safe fallback, plus additional features (pre-filtering unsupported sources and post-import cleanup).
  • Out of Scope Changes Check (Passed): All changes are directly related to fixing the duplicate import issue and enhancing source handling. Pre-filtering unsupported sources and cleaning error sources are aligned extensions of the core deduplication objective.
  • Docstring Coverage (Passed): Docstring coverage is 83.33%, above the required threshold of 80.00%.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces source filtering and cleanup mechanisms to the NotebookLM CLI. It adds logic to pre-filter unsupported file types and blocked domains, and a cleanup function to remove sources that failed during import. Additionally, import_with_retry was updated to prevent duplicate imports by checking existing sources before retrying. Review feedback identifies critical runtime errors where source objects are treated as dictionaries, specifically when accessing IDs and URLs. There is also a suggestion to improve the robustness of file extension detection by accounting for URL query parameters.

    """
    try:
        sources = await client.sources.list(notebook_id)
        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]

Severity: high

client.sources.list(notebook_id) returns a list of Source objects, which do not have a .get() method. This line will raise an AttributeError at runtime. You should access attributes directly (e.g., s.id). Additionally, to maintain API consistency within the class, use the is_error property instead of comparing the status to a string literal or enum, following the established pattern of properties like is_completed or is_processing.

Suggested change
-        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]
+        error_ids = [s.id for s in sources if s.is_error and s.id]
References
  1. Maintain API consistency within a class. New properties should follow established patterns (e.g., is_failed) rather than requiring direct enum comparisons.

            # Filter out sources already imported to avoid duplicates on retry
            try:
                existing = await client.sources.list(notebook_id)
                existing_urls = {s.get("url") for s in existing if s.get("url")}

Severity: high

Similar to the issue in cleanup_error_sources, existing is a list of Source objects. Calling .get("url") will result in an AttributeError. Use the .url attribute instead.

Suggested change
-                existing_urls = {s.get("url") for s in existing if s.get("url")}
+                existing_urls = {s.url for s in existing if s.url}

    skipped = []
    for s in sources:
        url = (s.get("url") or "").lower()
        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):

Severity: medium

The check url.endswith(PDF_EXTENSIONS) will fail to identify files if the URL contains query parameters (e.g., https://example.com/file.pdf?dl=1). For robust file matching, parse the URL and check the filename directly (e.g., using pathlib.Path(path).name) instead of checking for a substring or suffix in the full path.

References
  1. For robust file matching, check the filename directly (e.g., path.name) instead of checking for a substring in the full path.
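A quick illustration of the failure mode described above (the URL is made up):

```python
from urllib.parse import urlsplit

url = "https://example.com/files/report.pdf?dl=1"
print(url.lower().endswith(".pdf"))                 # False: the query string hides the suffix
print(urlsplit(url).path.lower().endswith(".pdf"))  # True: a path-only check catches it
```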


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/notebooklm/cli/helpers.py`:
- Around line 144-148: The loop currently returns the raw result of
client.research.import_sources(notebook_id, task_id, pending_sources) which may
omit items that were imported during a timed-out attempt; instead, after each
import_sources call fetch the canonical state with
client.sources.list(notebook_id) and reconcile that against the original URL set
to build the final imported_sources list, update pending_sources by removing
URLs now present, and short-circuit/return once pending_sources is empty; apply
the same reconciliation logic to the similar loop at lines 156-160 so the CLI
and JSON output reflect the true imported set rather than only the last
import_sources response.
- Around line 156-162: client.sources.list() returns objects, not dicts, so
using s.get("url") raises and the except swallows the error leaving
pending_sources unchanged; change the existing set construction to use attribute
access (e.g., existing_urls = {getattr(s, "url", None) for s in existing if
getattr(s, "url", None)}) and then filter pending_sources with the existing_urls
(pending_sources = [s for s in pending_sources if s.get("url") not in
existing_urls]); also avoid silently swallowing errors from
client.sources.list() — at minimum log the exception in the except block so
failures are visible.
- Around line 109-121: client.sources.list() yields objects with attributes, not
dicts, so change the filtering to iterate sources and use attribute access
(e.g., source.status == "error" and source.id) instead of .get(...) and when
calling client.sources.delete only delete sources that belong to the current
import batch by comparing the source's batch identifier property (e.g.,
source.import_batch_id or source.batch_id) to the current batch id passed into
the function; perform deletion inside a try/except per-source, increment the
removed count only on successful deletes, and return that successful-deletes
count (use client.sources.list, source.status, source.id,
source.<batch_id_field>, and client.sources.delete to locate changes).
- Around line 85-92: The current filter loop uses the full lowercased URL string
(`url`) to check PDF_EXTENSIONS and BLOCKED_PATTERNS which misses cases like
query strings and mixed-case paths; modify the loop in helpers.py to parse each
source URL (from `s.get("url")`) with a URL parser (e.g., urlparse), extract the
path component, lowercase that path once into a variable (e.g., `path` or
`normalized_path`), then test path.endswith(PDF_EXTENSIONS) and any(p in path
for p in map(str.lower, BLOCKED_PATTERNS)); update the conditional that
currently references `url` to use the normalized path instead and keep the rest
of the filtering logic the same.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4c6c3664-d0ce-4b8d-abf2-069caa037551

📥 Commits

Reviewing files that changed from the base of the PR and between 7a4c3f7 and 359c91e.

📒 Files selected for processing (2)
  • src/notebooklm/cli/helpers.py
  • src/notebooklm/cli/research.py

Comment on lines +85 to +92
    PDF_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
    BLOCKED_PATTERNS = ("/fileadmin/", "/download/", "/sites/default/files/", "/SharedDocs/Downloads/")

    filtered = []
    skipped = []
    for s in sources:
        url = (s.get("url") or "").lower()
        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):

⚠️ Potential issue | 🟠 Major

Match against a normalized URL path here.

This misses common unsupported URLs: endswith(...) runs on the full string, so file.pdf?download=1 is kept, and the lowercased URL will never match "/SharedDocs/Downloads/". Parse the path once, lowercase it, and compare against lowercase patterns.

Proposed fix
+from urllib.parse import urlsplit
+
 def filter_unsupported_sources(sources: list[dict], *, json_output: bool = False) -> list[dict]:
@@
-    PDF_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
-    BLOCKED_PATTERNS = ("/fileadmin/", "/download/", "/sites/default/files/", "/SharedDocs/Downloads/")
+    document_extensions = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
+    blocked_patterns = (
+        "/fileadmin/",
+        "/download/",
+        "/sites/default/files/",
+        "/shareddocs/downloads/",
+    )
@@
-        url = (s.get("url") or "").lower()
-        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):
+        raw_url = s.get("url") or ""
+        path = urlsplit(raw_url).path.lower()
+        if path.endswith(document_extensions) or any(p in path for p in blocked_patterns):

Comment on lines +109 to +121
    try:
        sources = await client.sources.list(notebook_id)
        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]
        for source_id in error_ids:
            try:
                await client.sources.delete(notebook_id, source_id)
            except Exception:
                pass
        if error_ids and not json_output:
            console.print(f"[dim]Removed {len(error_ids)} failed source(s) after import[/dim]")
        return len(error_ids)
    except Exception:
        return 0

⚠️ Potential issue | 🔴 Critical

client.sources.list() isn't returning dicts here.

Elsewhere in this module the same API is consumed via item.id / item.title. These .get("status") / .get("id") calls will raise AttributeError, and the blanket except then turns cleanup into a silent no-op. After fixing that access, please also scope deletion to the current import batch and count only successful deletes; otherwise error_sources_removed can include older notebook errors and failed deletions.
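The "count only successful deletes" part of this suggestion can be sketched with a deletion callback standing in for client.sources.delete (names here are hypothetical):

```python
import logging


def delete_with_count(source_ids, delete_fn):
    """Attempt each delete; count only the ones that actually succeed."""
    removed = 0
    for sid in source_ids:
        try:
            delete_fn(sid)
            removed += 1
        except Exception:
            # Log instead of silently swallowing, so failures stay visible
            logging.warning("failed to delete source %s", sid)
    return removed


def flaky_delete(sid):
    if sid == "s2":
        raise RuntimeError("backend rejected delete")


print(delete_with_count(["s1", "s2", "s3"], flaky_delete))  # 2
```

This way error_sources_removed reflects what was really deleted, not what was merely attempted.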

🧰 Tools
🪛 Ruff (0.15.7)

[error] 115-116: try-except-pass detected, consider logging the exception

(S110)


[warning] 115-115: Do not catch blind exception: Exception

(BLE001)


[warning] 120-120: Do not catch blind exception: Exception

(BLE001)


Comment on lines +144 to +148
    pending_sources = list(sources)

    while True:
        try:
-            return await client.research.import_sources(notebook_id, task_id, sources)
+            return await client.research.import_sources(notebook_id, task_id, pending_sources)

⚠️ Potential issue | 🟠 Major

Don't return only the last import_sources() response.

src/notebooklm/_research.py:297-350 already documents that import_sources() may return fewer items than were actually imported. After a timeout, pending_sources only holds the leftovers, so this helper can under-report imports or even return [] when the first timed-out attempt already finished. Rebuild the final result from client.sources.list(notebook_id) for the original URL set, and short-circuit when pending_sources becomes empty; otherwise both the CLI count and JSON imported_sources stay wrong after retries.

Also applies to: 156-160
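The reconciliation this comment asks for can be sketched as follows, using plain dicts in place of the client's source objects (function and variable names are illustrative):

```python
def reconcile(original_urls, notebook_sources):
    """Derive imported/pending lists from the notebook's canonical source list."""
    present = {s["url"] for s in notebook_sources}
    imported = [u for u in original_urls if u in present]
    pending = [u for u in original_urls if u not in present]
    return imported, pending


original = ["https://a.example/1", "https://a.example/2", "https://a.example/3"]
notebook = [{"url": "https://a.example/1"}, {"url": "https://a.example/3"}]
imported, pending = reconcile(original, notebook)
print(imported)  # ['https://a.example/1', 'https://a.example/3']
print(pending)   # ['https://a.example/2']
```

The retry loop would return `imported` once `pending` is empty, rather than whatever the last import_sources() call happened to report.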


Comment on lines +156 to +162
            # Filter out sources already imported to avoid duplicates on retry
            try:
                existing = await client.sources.list(notebook_id)
                existing_urls = {s.get("url") for s in existing if s.get("url")}
                pending_sources = [s for s in pending_sources if s.get("url") not in existing_urls]
            except Exception:
                pass  # If listing fails, retry with original list

⚠️ Potential issue | 🔴 Critical

Retry dedup currently never removes anything.

client.sources.list() is used as an object list elsewhere in this file, so .get("url") throws here too. Because the exception is swallowed, pending_sources stays unchanged and every RPC timeout still retries the full batch, which recreates the duplicate-import bug this PR is trying to fix.

🧰 Tools
🪛 Ruff (0.15.7)

[error] 161-162: try-except-pass detected, consider logging the exception

(S110)


[warning] 161-161: Do not catch blind exception: Exception

(BLE001)


@teng-lin teng-lin added the bot-generated Likely AI/bot-generated contribution label Apr 5, 2026

Development

Successfully merging this pull request may close these issues.

Bug: research wait --import-all imports duplicate sources on timeout retry
