fix: deduplicate imports on retry and clean up error sources after import#242
Andre-4711 wants to merge 1 commit into teng-lin:main
Conversation
Three related improvements to `research wait --import-all`:

1. **Dedup on retry** (`import_with_retry`): Before each `RPCTimeout` retry, fetch the existing notebook sources and filter already-imported URLs out of `pending_sources`. This prevents 6× duplicates when multiple retries occur.
2. **Pre-filter unsupported sources** (`filter_unsupported_sources`): Skip direct PDF/document URLs and known download paths (e.g. `/fileadmin/`, `/SharedDocs/Downloads/`) before sending them to NotebookLM. These consistently fail with error status since NotebookLM cannot parse them as web pages.
3. **Post-import cleanup** (`cleanup_error_sources`): After import completes, automatically delete any sources that ended up in error status. This keeps the notebook clean without manual intervention.

Fixes teng-lin#241
📝 Walkthrough

The changes add source validation and cleanup utilities while enhancing the import retry mechanism to prevent duplicate sources when RPC timeouts occur. The retry now maintains a deduplicated list of pending sources by checking against already-imported URLs before each retry attempt.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client/CLI
    participant Importer as import_with_retry
    participant Server as NotebookLM API
    participant DB as Notebook Sources
    Client->>Importer: Call with sources list
    Importer->>Server: import_sources(pending_sources)
    Server--xImporter: RPCTimeoutError
    rect rgba(255, 100, 100, 0.5)
        Note over Importer: Deduplication on Retry
        Importer->>Server: list existing sources
        Server->>DB: fetch all sources
        DB-->>Server: existing sources + URLs
        Server-->>Importer: existing sources
        Importer->>Importer: Build URL set<br/>Filter pending_sources
    end
    Importer->>Server: import_sources(filtered_pending)
    Server->>DB: add deduplicated sources
    DB-->>Server: success
    Server-->>Importer: result
    Importer-->>Client: imported sources
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces source filtering and cleanup mechanisms to the NotebookLM CLI. It adds logic to pre-filter unsupported file types and blocked domains, and a cleanup function to remove sources that failed during import. Additionally, import_with_retry was updated to prevent duplicate imports by checking existing sources before retrying. Review feedback identifies critical runtime errors where source objects are treated as dictionaries, specifically when accessing IDs and URLs. There is also a suggestion to improve the robustness of file extension detection by accounting for URL query parameters.
```python
    """
    try:
        sources = await client.sources.list(notebook_id)
        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]
```
client.sources.list(notebook_id) returns a list of Source objects, which do not have a .get() method. This line will raise an AttributeError at runtime. You should access attributes directly (e.g., s.id). Additionally, to maintain API consistency within the class, use the is_error property instead of comparing the status to a string literal or enum, following the established pattern of properties like is_completed or is_processing.
```diff
-        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]
+        error_ids = [s.id for s in sources if s.is_error and s.id]
```
References
- Maintain API consistency within a class. New properties should follow established patterns (e.g., is_failed) rather than requiring direct enum comparisons.
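The attribute-access pattern the review asks for can be sketched in isolation. The `Source` dataclass and its `is_error` property below are stand-ins for the client's actual model (whose real field names are not shown here), not the library's API:

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for the client's Source model."""
    id: str
    status: str

    @property
    def is_error(self) -> bool:
        # Mirrors the established is_completed / is_processing property pattern.
        return self.status == "error"


def error_source_ids(sources: list[Source]) -> list[str]:
    # Attribute access instead of dict-style .get(), as the review suggests.
    return [s.id for s in sources if s.is_error and s.id]


sources = [Source("a", "completed"), Source("b", "error"), Source("c", "error")]
print(error_source_ids(sources))  # → ['b', 'c']
```

Using an `is_error` property keeps string comparisons against the status value in one place, so callers never diverge on the literal.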
```python
            # Filter out sources already imported to avoid duplicates on retry
            try:
                existing = await client.sources.list(notebook_id)
                existing_urls = {s.get("url") for s in existing if s.get("url")}
```
Similar to the issue in cleanup_error_sources, existing is a list of Source objects. Calling .get("url") will result in an AttributeError. Use the .url attribute instead.
```diff
-                existing_urls = {s.get("url") for s in existing if s.get("url")}
+                existing_urls = {s.url for s in existing if s.url}
```
```python
    skipped = []
    for s in sources:
        url = (s.get("url") or "").lower()
        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):
```
The check url.endswith(PDF_EXTENSIONS) will fail to identify files if the URL contains query parameters (e.g., https://example.com/file.pdf?dl=1). For robust file matching, parse the URL and check the filename directly (e.g., using pathlib.Path(path).name) instead of checking for a substring or suffix in the full path.
References
- For robust file matching, check the filename directly (e.g., path.name) instead of checking for a substring in the full path.
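The path-based check both reviewers describe can be sketched with the standard library's `urlsplit`, which strips the query string before the extension test. The constant values mirror the snippet under review; the helper itself is illustrative:

```python
from urllib.parse import urlsplit

DOCUMENT_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
# Patterns pre-lowercased so they match the lowercased path.
BLOCKED_PATTERNS = ("/fileadmin/", "/download/", "/sites/default/files/", "/shareddocs/downloads/")


def is_unsupported(url: str) -> bool:
    # Test the path component only, so "file.pdf?dl=1" is still caught
    # and mixed-case segments like /SharedDocs/ still match.
    path = urlsplit(url).path.lower()
    return path.endswith(DOCUMENT_EXTENSIONS) or any(p in path for p in BLOCKED_PATTERNS)


print(is_unsupported("https://example.com/file.pdf?dl=1"))           # → True
print(is_unsupported("https://example.com/SharedDocs/Downloads/x"))  # → True
print(is_unsupported("https://example.com/article"))                 # → False
```

Checking `.endswith()` on a full URL fails on query strings, and lowercasing the URL without lowercasing the patterns makes `/SharedDocs/Downloads/` unmatchable — both failure modes disappear once the comparison is scoped to a normalized path.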
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/notebooklm/cli/helpers.py`:
- Around line 144-148: The loop currently returns the raw result of
client.research.import_sources(notebook_id, task_id, pending_sources) which may
omit items that were imported during a timed-out attempt; instead, after each
import_sources call fetch the canonical state with
client.sources.list(notebook_id) and reconcile that against the original URL set
to build the final imported_sources list, update pending_sources by removing
URLs now present, and short-circuit/return once pending_sources is empty; apply
the same reconciliation logic to the similar loop at lines 156-160 so the CLI
and JSON output reflect the true imported set rather than only the last
import_sources response.
- Around line 156-162: client.sources.list() returns objects, not dicts, so
using s.get("url") raises and the except swallows the error leaving
pending_sources unchanged; change the existing set construction to use attribute
access (e.g., existing_urls = {getattr(s, "url", None) for s in existing if
getattr(s, "url", None)}) and then filter pending_sources with the existing_urls
(pending_sources = [s for s in pending_sources if s.get("url") not in
existing_urls]); also avoid silently swallowing errors from
client.sources.list() — at minimum log the exception in the except block so
failures are visible.
- Around line 109-121: client.sources.list() yields objects with attributes, not
dicts, so change the filtering to iterate sources and use attribute access
(e.g., source.status == "error" and source.id) instead of .get(...) and when
calling client.sources.delete only delete sources that belong to the current
import batch by comparing the source's batch identifier property (e.g.,
source.import_batch_id or source.batch_id) to the current batch id passed into
the function; perform deletion inside a try/except per-source, increment the
removed count only on successful deletes, and return that successful-deletes
count (use client.sources.list, source.status, source.id,
source.<batch_id_field>, and client.sources.delete to locate changes).
- Around line 85-92: The current filter loop uses the full lowercased URL string
(`url`) to check PDF_EXTENSIONS and BLOCKED_PATTERNS which misses cases like
query strings and mixed-case paths; modify the loop in helpers.py to parse each
source URL (from `s.get("url")`) with a URL parser (e.g., urlparse), extract the
path component, lowercase that path once into a variable (e.g., `path` or
`normalized_path`), then test path.endswith(PDF_EXTENSIONS) and any(p in path
for p in map(str.lower, BLOCKED_PATTERNS)); update the conditional that
currently references `url` to use the normalized path instead and keep the rest
of the filtering logic the same.
📒 Files selected for processing (2)

- src/notebooklm/cli/helpers.py
- src/notebooklm/cli/research.py
```python
    PDF_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
    BLOCKED_PATTERNS = ("/fileadmin/", "/download/", "/sites/default/files/", "/SharedDocs/Downloads/")

    filtered = []
    skipped = []
    for s in sources:
        url = (s.get("url") or "").lower()
        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):
```
Match against a normalized URL path here.
This misses common unsupported URLs: endswith(...) runs on the full string, so file.pdf?download=1 is kept, and the lowercased URL will never match "/SharedDocs/Downloads/". Parse the path once, lowercase it, and compare against lowercase patterns.
Proposed fix

```diff
+from urllib.parse import urlsplit
+
 def filter_unsupported_sources(sources: list[dict], *, json_output: bool = False) -> list[dict]:
@@
-    PDF_EXTENSIONS = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
-    BLOCKED_PATTERNS = ("/fileadmin/", "/download/", "/sites/default/files/", "/SharedDocs/Downloads/")
+    document_extensions = (".pdf", ".docx", ".xlsx", ".pptx", ".zip")
+    blocked_patterns = (
+        "/fileadmin/",
+        "/download/",
+        "/sites/default/files/",
+        "/shareddocs/downloads/",
+    )
@@
-        url = (s.get("url") or "").lower()
-        if url.endswith(PDF_EXTENSIONS) or any(p in url for p in BLOCKED_PATTERNS):
+        raw_url = s.get("url") or ""
+        path = urlsplit(raw_url).path.lower()
+        if path.endswith(document_extensions) or any(p in path for p in blocked_patterns):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/notebooklm/cli/helpers.py` around lines 85 - 92, The current filter loop
uses the full lowercased URL string (`url`) to check PDF_EXTENSIONS and
BLOCKED_PATTERNS which misses cases like query strings and mixed-case paths;
modify the loop in helpers.py to parse each source URL (from `s.get("url")`)
with a URL parser (e.g., urlparse), extract the path component, lowercase that
path once into a variable (e.g., `path` or `normalized_path`), then test
path.endswith(PDF_EXTENSIONS) and any(p in path for p in map(str.lower,
BLOCKED_PATTERNS)); update the conditional that currently references `url` to
use the normalized path instead and keep the rest of the filtering logic the
same.
```python
    try:
        sources = await client.sources.list(notebook_id)
        error_ids = [s.get("id") for s in sources if s.get("status") == "error" and s.get("id")]
        for source_id in error_ids:
            try:
                await client.sources.delete(notebook_id, source_id)
            except Exception:
                pass
        if error_ids and not json_output:
            console.print(f"[dim]Removed {len(error_ids)} failed source(s) after import[/dim]")
        return len(error_ids)
    except Exception:
        return 0
```
client.sources.list() isn't returning dicts here.
Elsewhere in this module the same API is consumed via item.id / item.title. These .get("status") / .get("id") calls will raise AttributeError, and the blanket except then turns cleanup into a silent no-op. After fixing that access, please also scope deletion to the current import batch and count only successful deletes; otherwise error_sources_removed can include older notebook errors and failed deletions.
🧰 Tools
🪛 Ruff (0.15.7)
[error] 115-116: try-except-pass detected, consider logging the exception
(S110)
[warning] 115-115: Do not catch blind exception: Exception
(BLE001)
[warning] 120-120: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/notebooklm/cli/helpers.py` around lines 109 - 121, client.sources.list()
yields objects with attributes, not dicts, so change the filtering to iterate
sources and use attribute access (e.g., source.status == "error" and source.id)
instead of .get(...) and when calling client.sources.delete only delete sources
that belong to the current import batch by comparing the source's batch
identifier property (e.g., source.import_batch_id or source.batch_id) to the
current batch id passed into the function; perform deletion inside a try/except
per-source, increment the removed count only on successful deletes, and return
that successful-deletes count (use client.sources.list, source.status,
source.id, source.<batch_id_field>, and client.sources.delete to locate
changes).
```diff
     pending_sources = list(sources)

     while True:
         try:
-            return await client.research.import_sources(notebook_id, task_id, sources)
+            return await client.research.import_sources(notebook_id, task_id, pending_sources)
```
Don't return only the last import_sources() response.
src/notebooklm/_research.py:297-350 already documents that import_sources() may return fewer items than were actually imported. After a timeout, pending_sources only holds the leftovers, so this helper can under-report imports or even return [] when the first timed-out attempt already finished. Rebuild the final result from client.sources.list(notebook_id) for the original URL set, and short-circuit when pending_sources becomes empty; otherwise both the CLI count and JSON imported_sources stay wrong after retries.
Also applies to: 156-160
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/notebooklm/cli/helpers.py` around lines 144 - 148, The loop currently
returns the raw result of client.research.import_sources(notebook_id, task_id,
pending_sources) which may omit items that were imported during a timed-out
attempt; instead, after each import_sources call fetch the canonical state with
client.sources.list(notebook_id) and reconcile that against the original URL set
to build the final imported_sources list, update pending_sources by removing
URLs now present, and short-circuit/return once pending_sources is empty; apply
the same reconciliation logic to the similar loop at lines 156-160 so the CLI
and JSON output reflect the true imported set rather than only the last
import_sources response.
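The reconciliation this prompt describes reduces to set arithmetic over URLs; `reconcile_imported` below is a hypothetical helper illustrating the idea, not code from the PR:

```python
def reconcile_imported(original_urls: set[str], notebook_urls: set[str]) -> tuple[set[str], set[str]]:
    """Split the requested URLs into imported vs still-pending, based on the
    notebook's canonical source list rather than any single RPC response."""
    imported = original_urls & notebook_urls   # present in the notebook now
    pending = original_urls - notebook_urls    # still needs another attempt
    return imported, pending


requested = {"https://a.example", "https://b.example", "https://c.example"}
in_notebook = {"https://a.example", "https://b.example", "https://other.example"}
imported, pending = reconcile_imported(requested, in_notebook)
print(sorted(imported))  # → ['https://a.example', 'https://b.example']
print(sorted(pending))   # → ['https://c.example']
```

Driving the loop off `pending` (and returning once it is empty) makes the reported import count reflect the notebook's true state, even when a timed-out attempt partially succeeded.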
```python
            # Filter out sources already imported to avoid duplicates on retry
            try:
                existing = await client.sources.list(notebook_id)
                existing_urls = {s.get("url") for s in existing if s.get("url")}
                pending_sources = [s for s in pending_sources if s.get("url") not in existing_urls]
            except Exception:
                pass  # If listing fails, retry with original list
```
Retry dedup currently never removes anything.
client.sources.list() is used as an object list elsewhere in this file, so .get("url") throws here too. Because the exception is swallowed, pending_sources stays unchanged and every RPC timeout still retries the full batch, which recreates the duplicate-import bug this PR is trying to fix.
🧰 Tools
🪛 Ruff (0.15.7)
[error] 161-162: try-except-pass detected, consider logging the exception
(S110)
[warning] 161-161: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/notebooklm/cli/helpers.py` around lines 156 - 162, client.sources.list()
returns objects, not dicts, so using s.get("url") raises and the except swallows
the error leaving pending_sources unchanged; change the existing set
construction to use attribute access (e.g., existing_urls = {getattr(s, "url",
None) for s in existing if getattr(s, "url", None)}) and then filter
pending_sources with the existing_urls (pending_sources = [s for s in
pending_sources if s.get("url") not in existing_urls]); also avoid silently
swallowing errors from client.sources.list() — at minimum log the exception in
the except block so failures are visible.
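The mixed access pattern this prompt suggests — attribute access for the listed Source objects, dict access for the pending payloads — can be sketched with `SimpleNamespace` standing in for the client's Source objects:

```python
from types import SimpleNamespace


def dedupe_pending(pending_sources: list[dict], existing: list) -> list[dict]:
    # getattr() for Source objects (which have no .get()); .get() for the
    # pending dict payloads that were built from the research results.
    existing_urls = {getattr(s, "url", None) for s in existing if getattr(s, "url", None)}
    return [s for s in pending_sources if s.get("url") not in existing_urls]


existing = [SimpleNamespace(url="https://a.example"), SimpleNamespace(url=None)]
pending = [{"url": "https://a.example"}, {"url": "https://b.example"}]
print(dedupe_pending(pending, existing))  # → [{'url': 'https://b.example'}]
```

This is the shape of the dedup the PR intends: each retry shrinks `pending_sources` to only the URLs the notebook does not yet contain, instead of resending the full batch.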
Summary

Fixes three related issues with `research wait --import-all` that cause duplicate and broken sources in notebooks. Closes #241

Changes

1. Dedup on retry (`import_with_retry` in `helpers.py`)

Before each `RPCTimeoutError` retry, fetch the existing notebook sources and filter already-imported URLs out of `pending_sources`. Previously, every retry re-imported the full source list, resulting in e.g. 6× duplicates for 6 retry attempts (414 sources instead of ~67).

2. Pre-filter unsupported sources (`filter_unsupported_sources` in `helpers.py`)

Skip direct PDF/document URLs and known download path patterns (`/fileadmin/`, `/SharedDocs/Downloads/`, `/sites/default/files/`, `/download/`) before sending sources to NotebookLM. These consistently result in `error` status since NotebookLM cannot fetch and parse them as web pages.

3. Post-import cleanup (`cleanup_error_sources` in `helpers.py`)

After import completes, automatically delete any sources that ended up with `error` status. Keeps the notebook clean without requiring manual intervention.

Behavior change

`research wait --import-all` now prints:

JSON output gains an additional `error_sources_removed` field.

Testing

Reproduced with a deep research query returning 67 sources, 6 timeout retries, and ~12 PDF/download URLs.