fix: deduplicate imports on retry and clean up error sources after import#242

Open
Andre-4711 wants to merge 1 commit into teng-lin:main from Andre-4711:fix/dedup-import-cleanup-error-sources

Conversation


@Andre-4711 Andre-4711 commented Apr 3, 2026

Summary

Fixes three related issues with `research wait --import-all` that cause duplicate and broken sources in notebooks.

Closes #241

Changes

1. Dedup on retry (import_with_retry in helpers.py)

Before each RPCTimeoutError retry, fetch the existing notebook sources and filter already-imported URLs out of pending_sources. Previously, every retry re-imported the full source list, resulting in e.g. 6× duplicates for 6 retry attempts (414 sources instead of ~67).
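The dedup step can be sketched as a pure function (the helper name `dedup_pending` and the example URLs are made up; the real code works on the client's source objects):

```python
def dedup_pending(pending_sources: list[dict], existing_urls: set[str]) -> list[dict]:
    """Keep only sources whose URL is not already present in the notebook."""
    return [s for s in pending_sources if s.get("url") not in existing_urls]


pending = [{"url": "https://example.org/a"}, {"url": "https://example.org/b"}]
already_imported = {"https://example.org/a"}
print(dedup_pending(pending, already_imported))  # [{'url': 'https://example.org/b'}]
```

Running this filter before every retry is what keeps a 6-attempt import from multiplying the source list sixfold.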

2. Pre-filter unsupported sources (filter_unsupported_sources in helpers.py)

Skip direct PDF/document URLs and known download path patterns (/fileadmin/, /SharedDocs/Downloads/, /sites/default/files/, /download/) before sending sources to NotebookLM. These consistently result in error status since NotebookLM cannot fetch and parse them as web pages.
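The filtering rule can be sketched as follows; the function name and the exact extension tuple are illustrative, but the blocked path patterns are the ones listed above. Checking the URL path (rather than the full URL) keeps query strings and mixed-case paths from hiding matches:

```python
from urllib.parse import urlsplit

# Illustrative subset of document/archive extensions; patterns from the PR description
DOC_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
BLOCKED_PATTERNS = ("/fileadmin/", "/shareddocs/downloads/", "/sites/default/files/", "/download/")


def is_unsupported(url: str) -> bool:
    # Compare against the lowercased path component only
    path = urlsplit(url).path.lower()
    return path.endswith(DOC_EXTENSIONS) or any(p in path for p in BLOCKED_PATTERNS)


print(is_unsupported("https://example.org/report.pdf?dl=1"))        # True
print(is_unsupported("https://example.org/SharedDocs/Downloads/x"))  # True
print(is_unsupported("https://example.org/article.html"))            # False
```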

3. Post-import cleanup (cleanup_error_sources in helpers.py)

After import completes, automatically delete any sources that ended up with error status. Keeps the notebook clean without requiring manual intervention.
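The selection step can be sketched like this; `Source` here is a stand-in dataclass for the client's source objects (which expose attributes, not dict keys), and the status string is an assumption:

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Stand-in for the client's Source object; attribute access, not dict access."""
    id: str
    status: str


def find_error_source_ids(sources: list[Source]) -> list[str]:
    """Collect IDs of sources that ended up in error status after import."""
    return [s.id for s in sources if s.status == "error" and s.id]


batch = [Source("s1", "completed"), Source("s2", "error"), Source("s3", "error")]
print(find_error_source_ids(batch))  # ['s2', 's3']
```

The real helper would then delete each returned ID via the client API.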

Behavior change

`research wait --import-all` now prints:

Skipping 12 unsupported source(s) (PDFs/downloads) before import
Imported 55 sources
Removed 3 failed source(s) after import

JSON output gains an additional error_sources_removed field.

Testing

Reproduced with a deep research query returning 67 sources, 6 timeout retries, and ~12 PDF/download URLs.

Summary by CodeRabbit

Release Notes

  • Improvements
    • Enhanced source import handling with automatic filtering of unsupported file types during notebook creation.
    • Added automatic cleanup of sources that fail to import.
    • Improved import retry mechanism to prevent accidental duplicate source additions.

…port

Three related improvements to `research wait --import-all`:

1. **Dedup on retry** (`import_with_retry`): Before each RPCTimeout retry,
   fetch existing notebook sources and filter out already-imported URLs from
   `pending_sources`. Prevents 6x duplicates when multiple retries occur.

2. **Pre-filter unsupported sources** (`filter_unsupported_sources`): Skip
   direct PDF/document URLs and known download paths (e.g. /fileadmin/,
   /SharedDocs/Downloads/) before sending to NotebookLM. These consistently
   fail with error status since NotebookLM cannot parse them as web pages.

3. **Post-import cleanup** (`cleanup_error_sources`): After import completes,
   automatically delete any sources that ended up in error status. Keeps the
   notebook clean without manual intervention.

Fixes teng-lin#241

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough


The changes add source validation and cleanup utilities while enhancing the import retry mechanism to prevent duplicate sources when RPC timeouts occur. The retry now maintains a deduplicated list of pending sources by checking against already-imported URLs before each retry attempt.

Changes

  • Source Management Helpers (src/notebooklm/cli/helpers.py): Added filter_unsupported_sources() to pre-filter sources by removing URLs ending with document/archive extensions or containing blocked patterns. Added cleanup_error_sources() to delete sources with error status from a notebook.
  • Deduplication on Timeout Retry (src/notebooklm/cli/helpers.py): Enhanced import_with_retry() to maintain a pending_sources list that deduplicates against already-imported sources on RPCTimeoutError; attempts to fetch existing source URLs and removes duplicates before retry, falling back gracefully if listing fails.
  • Research Command Integration (src/notebooklm/cli/research.py): Updated research wait --import-all to filter sources before import and clean up error sources afterward; integrates the new helper functions in both JSON and non-JSON output paths.

Sequence Diagram

sequenceDiagram
    participant Client as Client/CLI
    participant Importer as import_with_retry
    participant Server as NotebookLM API
    participant DB as Notebook Sources

    Client->>Importer: Call with sources list
    Importer->>Server: import_sources(pending_sources)
    Server--xImporter: RPCTimeoutError
    
    rect rgba(255, 100, 100, 0.5)
        Note over Importer: Deduplication on Retry
        Importer->>Server: list existing sources
        Server->>DB: fetch all sources
        DB-->>Server: existing sources + URLs
        Server-->>Importer: existing sources
        Importer->>Importer: Build URL set<br/>Filter pending_sources
    end
    
    Importer->>Server: import_sources(filtered_pending)
    Server->>DB: add deduplicated sources
    DB-->>Server: success
    Server-->>Importer: result
    Importer-->>Client: imported sources

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 When timeouts arrive with a timeout's harsh call,
We fetch what exists and don't reimport all!
Filter, dedupe, and gracefully retry,
No more 414 sources—just 67! Hooray! 🎉
Sources now filtered with wisdom and care

🚥 Pre-merge checks: ✅ 5 passed
  • Description Check (Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (Passed): The title clearly and concisely summarizes the primary changes, deduplicating imports on retry and cleaning up error sources after import, matching the main objectives.
  • Linked Issues Check (Passed): The pull request implements all coding requirements from issue #241: deduplication logic in import_with_retry, maintaining a pending_sources list, and filtering on RPCTimeoutError with safe fallback, plus additional features (pre-filtering unsupported sources and post-import cleanup).
  • Out of Scope Changes Check (Passed): All changes are directly related to fixing the duplicate import issue and enhancing source handling. Pre-filtering unsupported sources and cleaning error sources are aligned extensions of the core deduplication objective.
  • Docstring Coverage (Passed): Docstring coverage is 83.33%, above the required threshold of 80.00%.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces source filtering and cleanup mechanisms to the NotebookLM CLI. It adds logic to pre-filter unsupported file types and blocked domains, and a cleanup function to remove sources that failed during import. Additionally, import_with_retry was updated to prevent duplicate imports by checking existing sources before retrying. Review feedback identifies critical runtime errors where source objects are treated as dictionaries, specifically when accessing IDs and URLs. There is also a suggestion to improve the robustness of file extension detection by accounting for URL query parameters.

    """
    try:
        sources = await client.sources.list(notebook_id)
        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]

Severity: high

client.sources.list(notebook_id) returns a list of Source objects, which do not have a .get() method. This line will raise an AttributeError at runtime. You should access attributes directly (e.g., s.id). Additionally, to maintain API consistency within the class, use the is_error property instead of comparing the status to a string literal or enum, following the established pattern of properties like is_completed or is_processing.

Suggested change
-        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]
+        error_ids = [s.id for s in sources if s.is_error and s.id]
References
  1. Maintain API consistency within a class. New properties should follow established patterns (e.g., is_failed) rather than requiring direct enum comparisons.

            # Filter out sources already imported to avoid duplicates on retry
            try:
                existing = await client.sources.list(notebook_id)
                existing_urls = {s.get("url") for s in existing if s.get("url")}

Severity: high

Similar to the issue in cleanup_error_sources, existing is a list of Source objects. Calling .get("url") will result in an AttributeError. Use the .url attribute instead.

Suggested change
-                existing_urls = {s.get("url") for s in existing if s.get("url")}
+                existing_urls = {s.url for s in existing if s.url}

    skipped = []
    for s in sources:
        url = (s.get("url") or "").lower()
        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):

Severity: medium

The check url.endswith(PDF_EXTENSIONS) will fail to identify files if the URL contains query parameters (e.g., https://example.com/file.pdf?dl=1). For robust file matching, parse the URL and check the filename directly (e.g., using pathlib.Path(path).name) instead of checking for a substring or suffix in the full path.

References
  1. For robust file matching, check the filename directly (e.g., path.name) instead of checking for a substring in the full path.
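A quick illustration of the failure mode described above (the URL is made up):

```python
from urllib.parse import urlsplit

url = "https://example.com/files/report.pdf?dl=1"
print(url.lower().endswith(".pdf"))                 # False: the query string hides the suffix
print(urlsplit(url).path.lower().endswith(".pdf"))  # True: a path-only check catches it
```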


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/notebooklm/cli/helpers.py`:
- Around line 144-148: The loop currently returns the raw result of
client.research.import_sources(notebook_id, task_id, pending_sources) which may
omit items that were imported during a timed-out attempt; instead, after each
import_sources call fetch the canonical state with
client.sources.list(notebook_id) and reconcile that against the original URL set
to build the final imported_sources list, update pending_sources by removing
URLs now present, and short-circuit/return once pending_sources is empty; apply
the same reconciliation logic to the similar loop at lines 156-160 so the CLI
and JSON output reflect the true imported set rather than only the last
import_sources response.
- Around line 156-162: client.sources.list() returns objects, not dicts, so
using s.get("url") raises and the except swallows the error leaving
pending_sources unchanged; change the existing set construction to use attribute
access (e.g., existing_urls = {getattr(s, "url", None) for s in existing if
getattr(s, "url", None)}) and then filter pending_sources with the existing_urls
(pending_sources = [s for s in pending_sources if s.get("url") not in
existing_urls]); also avoid silently swallowing errors from
client.sources.list() — at minimum log the exception in the except block so
failures are visible.
- Around line 109-121: client.sources.list() yields objects with attributes, not
dicts, so change the filtering to iterate sources and use attribute access
(e.g., source.status == "error" and source.id) instead of .get(...) and when
calling client.sources.delete only delete sources that belong to the current
import batch by comparing the source's batch identifier property (e.g.,
source.import_batch_id or source.batch_id) to the current batch id passed into
the function; perform deletion inside a try/except per-source, increment the
removed count only on successful deletes, and return that successful-deletes
count (use client.sources.list, source.status, source.id,
source.<batch_id_field>, and client.sources.delete to locate changes).
- Around line 85-92: The current filter loop uses the full lowercased URL string
(`url`) to check PDF_EXTENSIONS and BLOCKED_PATTERNS which misses cases like
query strings and mixed-case paths; modify the loop in helpers.py to parse each
source URL (from `s.get("url")`) with a URL parser (e.g., urlparse), extract the
path component, lowercase that path once into a variable (e.g., `path` or
`normalized_path`), then test path.endswith(PDF_EXTENSIONS) and any(p in path
for p in map(str.lower, BLOCKED_PATTERNS)); update the conditional that
currently references `url` to use the normalized path instead and keep the rest
of the filtering logic the same.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4c6c3664-d0ce-4b8d-abf2-069caa037551

📥 Commits

Reviewing files that changed from the base of the PR and between 7a4c3f7 and 359c91e.

📒 Files selected for processing (2)
  • src/notebooklm/cli/helpers.py
  • src/notebooklm/cli/research.py

Comment on lines +85 to +92
    PDF_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
    BLOCKED_PATTERNS = ("/fileadmin/", "/download/", "/sites/default/files/", "/SharedDocs/Downloads/")

    filtered = []
    skipped = []
    for s in sources:
        url = (s.get("url") or "").lower()
        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):

⚠️ Potential issue | 🟠 Major

Match against a normalized URL path here.

This misses common unsupported URLs: endswith(...) runs on the full string, so file.pdf?download=1 is kept, and the lowercased URL will never match "/SharedDocs/Downloads/". Parse the path once, lowercase it, and compare against lowercase patterns.

Proposed fix
+from urllib.parse import urlsplit
+
 def filter_unsupported_sources(sources: list[dict], *, json_output: bool = False) -> list[dict]:
@@
-    PDF_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
-    BLOCKED_PATTERNS = ("/fileadmin/", "/download/", "/sites/default/files/", "/SharedDocs/Downloads/")
+    document_extensions = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
+    blocked_patterns = (
+        "/fileadmin/",
+        "/download/",
+        "/sites/default/files/",
+        "/shareddocs/downloads/",
+    )
@@
-        url = (s.get("url") or "").lower()
-        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):
+        raw_url = s.get("url") or ""
+        path = urlsplit(raw_url).path.lower()
+        if path.endswith(document_extensions) or any(p in path for p in blocked_patterns):

Comment on lines +109 to +121
    try:
        sources = await client.sources.list(notebook_id)
        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]
        for source_id in error_ids:
            try:
                await client.sources.delete(notebook_id, source_id)
            except Exception:
                pass
        if error_ids and not json_output:
            console.print(f"[dim]Removed {len(error_ids)} failed source(s) after import[/dim]")
        return len(error_ids)
    except Exception:
        return 0

⚠️ Potential issue | 🔴 Critical

client.sources.list() isn't returning dicts here.

Elsewhere in this module the same API is consumed via item.id / item.title. These .get("status") / .get("id") calls will raise AttributeError, and the blanket except then turns cleanup into a silent no-op. After fixing that access, please also scope deletion to the current import batch and count only successful deletes; otherwise error_sources_removed can include older notebook errors and failed deletions.
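The "count only successful deletes" part of this suggestion can be sketched with a deletion callback standing in for client.sources.delete (names here are hypothetical):

```python
import logging


def delete_with_count(source_ids, delete_fn):
    """Attempt each delete; count only the ones that actually succeed."""
    removed = 0
    for sid in source_ids:
        try:
            delete_fn(sid)
            removed += 1
        except Exception:
            # Log instead of silently swallowing, so failures stay visible
            logging.warning("failed to delete source %s", sid)
    return removed


def flaky_delete(sid):
    if sid == "s2":
        raise RuntimeError("backend rejected delete")


print(delete_with_count(["s1", "s2", "s3"], flaky_delete))  # 2
```

This way error_sources_removed reflects what was really deleted, not what was merely attempted.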

🧰 Tools
🪛 Ruff (0.15.7)

[error] 115-116: try-except-pass detected, consider logging the exception

(S110)


[warning] 115-115: Do not catch blind exception: Exception

(BLE001)


[warning] 120-120: Do not catch blind exception: Exception

(BLE001)


Comment on lines +144 to +148
    pending_sources = list(sources)

    while True:
        try:
-            return await client.research.import_sources(notebook_id, task_id, sources)
+            return await client.research.import_sources(notebook_id, task_id, pending_sources)

⚠️ Potential issue | 🟠 Major

Don't return only the last import_sources() response.

src/notebooklm/_research.py:297-350 already documents that import_sources() may return fewer items than were actually imported. After a timeout, pending_sources only holds the leftovers, so this helper can under-report imports or even return [] when the first timed-out attempt already finished. Rebuild the final result from client.sources.list(notebook_id) for the original URL set, and short-circuit when pending_sources becomes empty; otherwise both the CLI count and JSON imported_sources stay wrong after retries.

Also applies to: 156-160
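The reconciliation this comment asks for can be sketched as follows, using plain dicts in place of the client's source objects (function and variable names are illustrative):

```python
def reconcile(original_urls, notebook_sources):
    """Derive imported/pending lists from the notebook's canonical source list."""
    present = {s["url"] for s in notebook_sources}
    imported = [u for u in original_urls if u in present]
    pending = [u for u in original_urls if u not in present]
    return imported, pending


original = ["https://a.example/1", "https://a.example/2", "https://a.example/3"]
notebook = [{"url": "https://a.example/1"}, {"url": "https://a.example/3"}]
imported, pending = reconcile(original, notebook)
print(imported)  # ['https://a.example/1', 'https://a.example/3']
print(pending)   # ['https://a.example/2']
```

The retry loop would return `imported` once `pending` is empty, rather than whatever the last import_sources() call happened to report.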


Comment on lines +156 to +162
            # Filter out sources already imported to avoid duplicates on retry
            try:
                existing = await client.sources.list(notebook_id)
                existing_urls = {s.get("url") for s in existing if s.get("url")}
                pending_sources = [s for s in pending_sources if s.get("url") not in existing_urls]
            except Exception:
                pass  # If listing fails, retry with original list

⚠️ Potential issue | 🔴 Critical

Retry dedup currently never removes anything.

client.sources.list() is used as an object list elsewhere in this file, so .get("url") throws here too. Because the exception is swallowed, pending_sources stays unchanged and every RPC timeout still retries the full batch, which recreates the duplicate-import bug this PR is trying to fix.

🧰 Tools
🪛 Ruff (0.15.7)

[error] 161-162: try-except-pass detected, consider logging the exception

(S110)


[warning] 161-161: Do not catch blind exception: Exception

(BLE001)


@teng-lin teng-lin added the bot-generated Likely AI/bot-generated contribution label Apr 5, 2026

Development

Successfully merging this pull request may close these issues.

Bug: research wait --import-all imports duplicate sources on timeout retry
