feat(source): add ‘source clean’ command#261
Conversation
Adds `notebooklm source clean` to automatically remove junk sources from a notebook in bulk. Useful after failed deep-research runs, bot-blocked crawls, or bulk imports that left behind error or duplicate sources. Removes sources that match any of: - Status is 'error' or 'unknown' - Title matches gateway/anti-bot patterns (403, 404, Access Denied, Cloudflare 'Just a Moment', CAPTCHA, etc.) - URL is a duplicate of an already-seen source (keeps oldest) Flags: --dry-run Preview what would be deleted without deleting -y/--yes Skip confirmation prompt -n Target a specific notebook Deletions are batched in chunks of 10 with a 0.5s delay to avoid hitting rate limits on large notebooks.
📝 WalkthroughWalkthroughA new Changes
Sequence DiagramsequenceDiagram
actor User
participant CLI as CLI Handler
participant NotebookResolver as Notebook Service
participant SourceLister as Source Service
participant Validator as Validation Logic
participant Deleter as Deletion Service
participant Feedback as User Feedback
User->>CLI: source clean --notebook X [--dry-run] [--yes]
CLI->>NotebookResolver: resolve_notebook_id(X)
NotebookResolver-->>CLI: notebook_id
CLI->>SourceLister: list_sources(notebook_id)
SourceLister-->>CLI: sources[]
CLI->>Validator: identify_candidates_to_delete(sources)
Note over Validator: Filter by: error status,<br/>gateway patterns, duplicates
Validator-->>CLI: candidate_ids[]
alt dry_run mode
CLI->>Feedback: report count only
else normal mode
alt not --yes flag
CLI->>User: confirm deletion?
User-->>CLI: approved
end
CLI->>Deleter: delete_in_batches(candidate_ids, batch_size=10)
loop batch processing with delay
Deleter->>SourceLister: delete_source(id)
SourceLister-->>Deleter: success/failure
end
Deleter-->>CLI: results[]
CLI->>Feedback: report deletions successful
end
Feedback-->>User: operation complete
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/notebooklm/cli/source.py`:
- Around line 982-992: The deletion loop currently swallows all errors by
calling asyncio.gather(..., return_exceptions=True) and always prints success;
update the loop around client.sources.delete / asyncio.gather to inspect the
gathered results, count actual successful deletions vs exceptions, and log or
print any failures (include exception messages and the corresponding sid) as
they occur; after the loop, change the console.print call to report the real
number of successful deletions (and optionally failed count) using the counted
successes so users aren’t shown a false “Successfully cleaned” message for
delete_list, chunk_size, delete_tasks, and nb_id_resolved.
- Around line 945-950: The code incorrectly converts the enum s.status to a
string with str(...).lower(), so comparisons never match; replace that
conversion with the helper source_status_to_str(s.status) (handle None if
needed) and use that result to check if status is "error" or "unknown", then add
s.id to to_delete as before (look at variables s, to_delete, and function
source_status_to_str for where to change).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4ae7d06c-d332-4bb1-9a3a-f0412794bf36
📒 Files selected for processing (1)
src/notebooklm/cli/source.py
| status = str(s.status).lower() if s.status else "unknown" | ||
|
|
||
| # Remove sources in error or unknown state | ||
| if status in ["error", "unknown"]: | ||
| to_delete.add(s.id) | ||
| continue |
There was a problem hiding this comment.
Bug: Status comparison will never match error sources.
s.status is an integer enum (SourceStatus), not a string. Converting it with str(s.status).lower() produces values like "3" or "sourcestatus.error", not "error". The condition on line 948 will never be true, so error sources are never cleaned.
Use the existing source_status_to_str() function (already imported on line 27) which properly maps status codes to string representations.
🐛 Proposed fix
for s in sorted_sources:
title = (s.title or "").strip()
- status = str(s.status).lower() if s.status else "unknown"
+ status = source_status_to_str(s.status)
# Remove sources in error or unknown state
if status in ["error", "unknown"]:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| status = str(s.status).lower() if s.status else "unknown" | |
| # Remove sources in error or unknown state | |
| if status in ["error", "unknown"]: | |
| to_delete.add(s.id) | |
| continue | |
| status = source_status_to_str(s.status) | |
| # Remove sources in error or unknown state | |
| if status in ["error", "unknown"]: | |
| to_delete.add(s.id) | |
| continue |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/notebooklm/cli/source.py` around lines 945 - 950, The code incorrectly
converts the enum s.status to a string with str(...).lower(), so comparisons
never match; replace that conversion with the helper
source_status_to_str(s.status) (handle None if needed) and use that result to
check if status is "error" or "unknown", then add s.id to to_delete as before
(look at variables s, to_delete, and function source_status_to_str for where to
change).
| # Chunk deletions to avoid rate limiting | ||
| delete_list = list(to_delete) | ||
| chunk_size = 10 | ||
| for i in range(0, len(delete_list), chunk_size): | ||
| chunk = delete_list[i : i + chunk_size] | ||
| delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk] | ||
| await asyncio.gather(*delete_tasks, return_exceptions=True) | ||
| if i + chunk_size < len(delete_list): | ||
| await asyncio.sleep(0.5) | ||
|
|
||
| console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]") |
There was a problem hiding this comment.
Silent failures: deletions may fail without user notification.
Using asyncio.gather(..., return_exceptions=True) catches all exceptions (rate limits, auth failures, network errors) but the results are never inspected. The success message on line 992 will display even if every deletion failed.
🛠️ Proposed fix to count actual successes
# Chunk deletions to avoid rate limiting
delete_list = list(to_delete)
chunk_size = 10
+ deleted_count = 0
+ failed_count = 0
for i in range(0, len(delete_list), chunk_size):
chunk = delete_list[i : i + chunk_size]
delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk]
- await asyncio.gather(*delete_tasks, return_exceptions=True)
+ results = await asyncio.gather(*delete_tasks, return_exceptions=True)
+ for result in results:
+ if isinstance(result, Exception):
+ failed_count += 1
+ else:
+ deleted_count += 1
if i + chunk_size < len(delete_list):
await asyncio.sleep(0.5)
- console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]")
+ if failed_count:
+ console.print(
+ f"[yellow]Cleaned {deleted_count} source(s), {failed_count} failed.[/yellow]"
+ )
+ else:
+ console.print(f"[green]Successfully cleaned {deleted_count} source(s).[/green]")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Chunk deletions to avoid rate limiting | |
| delete_list = list(to_delete) | |
| chunk_size = 10 | |
| for i in range(0, len(delete_list), chunk_size): | |
| chunk = delete_list[i : i + chunk_size] | |
| delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk] | |
| await asyncio.gather(*delete_tasks, return_exceptions=True) | |
| if i + chunk_size < len(delete_list): | |
| await asyncio.sleep(0.5) | |
| console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]") | |
| # Chunk deletions to avoid rate limiting | |
| delete_list = list(to_delete) | |
| chunk_size = 10 | |
| deleted_count = 0 | |
| failed_count = 0 | |
| for i in range(0, len(delete_list), chunk_size): | |
| chunk = delete_list[i : i + chunk_size] | |
| delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk] | |
| results = await asyncio.gather(*delete_tasks, return_exceptions=True) | |
| for result in results: | |
| if isinstance(result, Exception): | |
| failed_count += 1 | |
| else: | |
| deleted_count += 1 | |
| if i + chunk_size < len(delete_list): | |
| await asyncio.sleep(0.5) | |
| if failed_count: | |
| console.print( | |
| f"[yellow]Cleaned {deleted_count} source(s), {failed_count} failed.[/yellow]" | |
| ) | |
| else: | |
| console.print(f"[green]Successfully cleaned {deleted_count} source(s).[/green]") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/notebooklm/cli/source.py` around lines 982 - 992, The deletion loop
currently swallows all errors by calling asyncio.gather(...,
return_exceptions=True) and always prints success; update the loop around
client.sources.delete / asyncio.gather to inspect the gathered results, count
actual successful deletions vs exceptions, and log or print any failures
(include exception messages and the corresponding sid) as they occur; after the
loop, change the console.print call to report the real number of successful
deletions (and optionally failed count) using the counted successes so users
aren’t shown a false “Successfully cleaned” message for delete_list, chunk_size,
delete_tasks, and nb_id_resolved.
There was a problem hiding this comment.
Code Review
This pull request introduces a new source clean command to the CLI, designed to automatically remove duplicate, error, and access-blocked sources from a NotebookLM notebook. The command identifies sources based on their status, title, and normalized URL, and includes options for dry-run and confirmation. A critical issue was identified in the status checking logic: the s.status enum is incorrectly converted to a string, which prevents sources in an error or unknown state from being properly detected and cleaned. The suggested fix is to use the source_status_to_str() helper function.
|
|
||
| for s in sorted_sources: | ||
| title = (s.title or "").strip() | ||
| status = str(s.status).lower() if s.status else "unknown" |
There was a problem hiding this comment.
The current logic for checking the source status is incorrect. The s.status attribute is an integer from the SourceStatus enum. Converting it to a string via str(s.status) will produce a digit string (e.g., '3'), not the word 'error'. As a result, the check status in ["error", "unknown"] will never be true, and sources in an error state won't be cleaned up.
You should use the source_status_to_str() helper function, which is designed for this purpose and already used elsewhere in the file.
| status = str(s.status).lower() if s.status else "unknown" | |
| status = source_status_to_str(s.status) |
Summary
After failed deep-research runs, bulk imports, or bot-blocked crawls, notebooks can accumulate junk sources: errored ones, gateway/anti-bot pages (Cloudflare, 403, 404), and duplicates. There is currently no way to clean these up in bulk — users must delete them one by one.
This PR adds
notebooklm source clean, which automatically identifies and removes junk sources in one command.What it removes
statusiserrororunknown403,404,Forbidden,Access Denied,Just a Moment,Attention Required,Security Check,CAPTCHAFlags
Implementation notes
asyncio.gatherpattern consistent with other bulk operations in the codebaseTest plan
--dry-runprints count and exits without deletingstatus=errorare removed"403 Forbidden") are removed"Notebook is already clean"-yskips confirmation promptSummary by CodeRabbit
notebooklm source cleanCLI command to automatically remove problematic sources from notebooks, including those with error status, gateway/security block patterns, and duplicate URLs.--dry-runflag to preview deletions and--yesflag to skip confirmation prompts for streamlined operations.