Skip to content

feat(source): add ‘source clean’ command#261

Open
Flosters wants to merge 2 commits intoteng-lin:mainfrom
Flosters:feat/source-clean
Open

feat(source): add ‘source clean’ command#261
Flosters wants to merge 2 commits intoteng-lin:mainfrom
Flosters:feat/source-clean

Conversation

@Flosters
Copy link
Copy Markdown

@Flosters Flosters commented Apr 9, 2026

Summary

After failed deep-research runs, bulk imports, or bot-blocked crawls, notebooks can accumulate junk sources: errored ones, gateway/anti-bot pages (Cloudflare, 403, 404), and duplicates. There is currently no way to clean these up in bulk — users must delete them one by one.

This PR adds notebooklm source clean, which automatically identifies and removes junk sources in one command.

What it removes

Category Criteria
Error/unknown status is error or unknown
Gateway / anti-bot Title matches patterns: 403, 404, Forbidden, Access Denied, Just a Moment, Attention Required, Security Check, CAPTCHA
Duplicates Same URL already seen (normalized: scheme+host+path, no query/fragment); keeps the oldest copy

Flags

notebooklm source clean             # interactive confirmation
notebooklm source clean --dry-run   # preview without deleting
notebooklm source clean -y          # skip confirmation
notebooklm source clean -n <id>     # target specific notebook

Implementation notes

  • Sources are sorted oldest-first before deduplication so the earliest copy is always kept
  • Deletions are chunked in batches of 10 with a 0.5s inter-chunk delay to avoid rate limiting on large notebooks
  • Uses existing asyncio.gather pattern consistent with other bulk operations in the codebase

Test plan

  • --dry-run prints count and exits without deleting
  • Sources with status=error are removed
  • Duplicate URLs: only the newest copy is deleted
  • Gateway-titled sources (e.g. title "403 Forbidden") are removed
  • Clean notebook prints "Notebook is already clean"
  • -y skips confirmation prompt
  • Deletions are batched (no 429 on 50+ junk sources)

Summary by CodeRabbit

  • New Features
    • Added notebooklm source clean CLI command to automatically remove problematic sources from notebooks, including those with error status, gateway/security block patterns, and duplicate URLs.
    • Includes --dry-run flag to preview deletions and --yes flag to skip confirmation prompts for streamlined operations.

agustinsilvazambrano added 2 commits April 8, 2026 21:27
Adds `notebooklm source clean` to automatically remove junk sources from
a notebook in bulk. Useful after failed deep-research runs, bot-blocked
crawls, or bulk imports that left behind error or duplicate sources.

Removes sources that match any of:
- Status is 'error' or 'unknown'
- Title matches gateway/anti-bot patterns (403, 404, Access Denied,
  Cloudflare 'Just a Moment', CAPTCHA, etc.)
- URL is a duplicate of an already-seen source (keeps oldest)

Flags:
  --dry-run   Preview what would be deleted without deleting
  -y/--yes    Skip confirmation prompt
  -n          Target a specific notebook

Deletions are batched in chunks of 10 with a 0.5s delay to avoid
hitting rate limits on large notebooks.
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

A new source clean CLI subcommand was added to remove sources from notebooks based on error status, gateway/anti-bot pattern matching in titles, and duplicate URL detection. The command supports dry-run mode and performs deletions in batched asynchronous operations.

Changes

Cohort / File(s) Summary
New Source Clean Command
src/notebooklm/cli/source.py
Added source_clean CLI subcommand with logic to identify problematic sources (error/unknown status, gateway/security patterns, duplicate URLs), support for --dry-run and --yes flags, and chunked async deletion with inter-batch delays. URL normalization added via locally imported urlparse/urlunparse.

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI Handler
    participant NotebookResolver as Notebook Service
    participant SourceLister as Source Service
    participant Validator as Validation Logic
    participant Deleter as Deletion Service
    participant Feedback as User Feedback

    User->>CLI: source clean --notebook X [--dry-run] [--yes]
    CLI->>NotebookResolver: resolve_notebook_id(X)
    NotebookResolver-->>CLI: notebook_id
    CLI->>SourceLister: list_sources(notebook_id)
    SourceLister-->>CLI: sources[]
    CLI->>Validator: identify_candidates_to_delete(sources)
    Note over Validator: Filter by: error status,<br/>gateway patterns, duplicates
    Validator-->>CLI: candidate_ids[]
    alt dry_run mode
        CLI->>Feedback: report count only
    else normal mode
        alt not --yes flag
            CLI->>User: confirm deletion?
            User-->>CLI: approved
        end
        CLI->>Deleter: delete_in_batches(candidate_ids, batch_size=10)
        loop batch processing with delay
            Deleter->>SourceLister: delete_source(id)
            SourceLister-->>Deleter: success/failure
        end
        Deleter-->>CLI: results[]
        CLI->>Feedback: report deletions successful
    end
    Feedback-->>User: operation complete
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A hop through the sources so messy and worn,
Deleting the broken, the gateways, forlorn,
Duplicates vanish in batches so neat,
With delays and confirmations, the cleanup's complete!
Our notebooks now sparkle, no errors in sight! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(source): add 'source clean' command' is directly related to the main change—the addition of a new 'source clean' CLI subcommand. It accurately summarizes the primary feature being introduced.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/notebooklm/cli/source.py`:
- Around line 982-992: The deletion loop currently swallows all errors by
calling asyncio.gather(..., return_exceptions=True) and always prints success;
update the loop around client.sources.delete / asyncio.gather to inspect the
gathered results, count actual successful deletions vs exceptions, and log or
print any failures (include exception messages and the corresponding sid) as
they occur; after the loop, change the console.print call to report the real
number of successful deletions (and optionally failed count) using the counted
successes so users aren’t shown a false “Successfully cleaned” message for
delete_list, chunk_size, delete_tasks, and nb_id_resolved.
- Around line 945-950: The code incorrectly converts the enum s.status to a
string with str(...).lower(), so comparisons never match; replace that
conversion with the helper source_status_to_str(s.status) (handle None if
needed) and use that result to check if status is "error" or "unknown", then add
s.id to to_delete as before (look at variables s, to_delete, and function
source_status_to_str for where to change).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ae7d06c-d332-4bb1-9a3a-f0412794bf36

📥 Commits

Reviewing files that changed from the base of the PR and between a997718 and da0b8bd.

📒 Files selected for processing (1)
  • src/notebooklm/cli/source.py

Comment on lines +945 to +950
status = str(s.status).lower() if s.status else "unknown"

# Remove sources in error or unknown state
if status in ["error", "unknown"]:
to_delete.add(s.id)
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Bug: Status comparison will never match error sources.

s.status is an integer enum (SourceStatus), not a string. Converting it with str(s.status).lower() produces values like "3" or "sourcestatus.error", not "error". The condition on line 948 will never be true, so error sources are never cleaned.

Use the existing source_status_to_str() function (already imported on line 27) which properly maps status codes to string representations.

🐛 Proposed fix
             for s in sorted_sources:
                 title = (s.title or "").strip()
-                status = str(s.status).lower() if s.status else "unknown"
+                status = source_status_to_str(s.status)

                 # Remove sources in error or unknown state
                 if status in ["error", "unknown"]:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
status = str(s.status).lower() if s.status else "unknown"
# Remove sources in error or unknown state
if status in ["error", "unknown"]:
to_delete.add(s.id)
continue
status = source_status_to_str(s.status)
# Remove sources in error or unknown state
if status in ["error", "unknown"]:
to_delete.add(s.id)
continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/notebooklm/cli/source.py` around lines 945 - 950, The code incorrectly
converts the enum s.status to a string with str(...).lower(), so comparisons
never match; replace that conversion with the helper
source_status_to_str(s.status) (handle None if needed) and use that result to
check if status is "error" or "unknown", then add s.id to to_delete as before
(look at variables s, to_delete, and function source_status_to_str for where to
change).

Comment on lines +982 to +992
# Chunk deletions to avoid rate limiting
delete_list = list(to_delete)
chunk_size = 10
for i in range(0, len(delete_list), chunk_size):
chunk = delete_list[i : i + chunk_size]
delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk]
await asyncio.gather(*delete_tasks, return_exceptions=True)
if i + chunk_size < len(delete_list):
await asyncio.sleep(0.5)

console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Silent failures: deletions may fail without user notification.

Using asyncio.gather(..., return_exceptions=True) catches all exceptions (rate limits, auth failures, network errors) but the results are never inspected. The success message on line 992 will display even if every deletion failed.

🛠️ Proposed fix to count actual successes
             # Chunk deletions to avoid rate limiting
             delete_list = list(to_delete)
             chunk_size = 10
+            deleted_count = 0
+            failed_count = 0
             for i in range(0, len(delete_list), chunk_size):
                 chunk = delete_list[i : i + chunk_size]
                 delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk]
-                await asyncio.gather(*delete_tasks, return_exceptions=True)
+                results = await asyncio.gather(*delete_tasks, return_exceptions=True)
+                for result in results:
+                    if isinstance(result, Exception):
+                        failed_count += 1
+                    else:
+                        deleted_count += 1
                 if i + chunk_size < len(delete_list):
                     await asyncio.sleep(0.5)

-            console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]")
+            if failed_count:
+                console.print(
+                    f"[yellow]Cleaned {deleted_count} source(s), {failed_count} failed.[/yellow]"
+                )
+            else:
+                console.print(f"[green]Successfully cleaned {deleted_count} source(s).[/green]")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Chunk deletions to avoid rate limiting
delete_list = list(to_delete)
chunk_size = 10
for i in range(0, len(delete_list), chunk_size):
chunk = delete_list[i : i + chunk_size]
delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk]
await asyncio.gather(*delete_tasks, return_exceptions=True)
if i + chunk_size < len(delete_list):
await asyncio.sleep(0.5)
console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]")
# Chunk deletions to avoid rate limiting
delete_list = list(to_delete)
chunk_size = 10
deleted_count = 0
failed_count = 0
for i in range(0, len(delete_list), chunk_size):
chunk = delete_list[i : i + chunk_size]
delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk]
results = await asyncio.gather(*delete_tasks, return_exceptions=True)
for result in results:
if isinstance(result, Exception):
failed_count += 1
else:
deleted_count += 1
if i + chunk_size < len(delete_list):
await asyncio.sleep(0.5)
if failed_count:
console.print(
f"[yellow]Cleaned {deleted_count} source(s), {failed_count} failed.[/yellow]"
)
else:
console.print(f"[green]Successfully cleaned {deleted_count} source(s).[/green]")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/notebooklm/cli/source.py` around lines 982 - 992, The deletion loop
currently swallows all errors by calling asyncio.gather(...,
return_exceptions=True) and always prints success; update the loop around
client.sources.delete / asyncio.gather to inspect the gathered results, count
actual successful deletions vs exceptions, and log or print any failures
(include exception messages and the corresponding sid) as they occur; after the
loop, change the console.print call to report the real number of successful
deletions (and optionally failed count) using the counted successes so users
aren’t shown a false “Successfully cleaned” message for delete_list, chunk_size,
delete_tasks, and nb_id_resolved.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new source clean command to the CLI, designed to automatically remove duplicate, error, and access-blocked sources from a NotebookLM notebook. The command identifies sources based on their status, title, and normalized URL, and includes options for dry-run and confirmation. A critical issue was identified in the status checking logic: the s.status enum is incorrectly converted to a string, which prevents sources in an error or unknown state from being properly detected and cleaned. The suggested fix is to use the source_status_to_str() helper function.


for s in sorted_sources:
title = (s.title or "").strip()
status = str(s.status).lower() if s.status else "unknown"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current logic for checking the source status is incorrect. The s.status attribute is an integer from the SourceStatus enum. Converting it to a string via str(s.status) will produce a digit string (e.g., '3'), not the word 'error'. As a result, the check status in ["error", "unknown"] will never be true, and sources in an error state won't be cleaned up.

You should use the source_status_to_str() helper function, which is designed for this purpose and already used elsewhere in the file.

Suggested change
status = str(s.status).lower() if s.status else "unknown"
status = source_status_to_str(s.status)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant