feat(source): add ‘source clean’ command by Flosters · Pull Request #261 · teng-lin/notebooklm-py

Flosters · 2026-04-09T01:54:05Z

Summary

After failed deep-research runs, bulk imports, or bot-blocked crawls, notebooks can accumulate junk sources: errored ones, gateway/anti-bot pages (Cloudflare, 403, 404), and duplicates. There is currently no way to clean these up in bulk — users must delete them one by one.

This PR adds notebooklm source clean, which automatically identifies and removes junk sources in one command.

What it removes

Category	Criteria
Error/unknown	`status` is `error` or `unknown`
Gateway / anti-bot	Title matches patterns: `403`, `404`, `Forbidden`, `Access Denied`, `Just a Moment`, `Attention Required`, `Security Check`, `CAPTCHA`
Duplicates	Same URL already seen (normalized: scheme+host+path, no query/fragment); keeps the oldest copy

Flags

notebooklm source clean             # interactive confirmation
notebooklm source clean --dry-run   # preview without deleting
notebooklm source clean -y          # skip confirmation
notebooklm source clean -n <id>     # target specific notebook

Implementation notes

Sources are sorted oldest-first before deduplication so the earliest copy is always kept
Deletions are chunked in batches of 10 with a 0.5s inter-chunk delay to avoid rate limiting on large notebooks
Uses existing asyncio.gather pattern consistent with other bulk operations in the codebase

Test plan

--dry-run prints count and exits without deleting
Sources with status=error are removed
Duplicate URLs: only the newest copy is deleted
Gateway-titled sources (e.g. title "403 Forbidden") are removed
Clean notebook prints "Notebook is already clean"
-y skips confirmation prompt
Deletions are batched (no 429 on 50+ junk sources)

Summary by CodeRabbit

New Features
- Added notebooklm source clean CLI command to automatically remove problematic sources from notebooks, including those with error status, gateway/security block patterns, and duplicate URLs.
- Includes --dry-run flag to preview deletions and --yes flag to skip confirmation prompts for streamlined operations.

Adds `notebooklm source clean` to automatically remove junk sources from a notebook in bulk. Useful after failed deep-research runs, bot-blocked crawls, or bulk imports that left behind error or duplicate sources. Removes sources that match any of: - Status is 'error' or 'unknown' - Title matches gateway/anti-bot patterns (403, 404, Access Denied, Cloudflare 'Just a Moment', CAPTCHA, etc.) - URL is a duplicate of an already-seen source (keeps oldest) Flags: --dry-run Preview what would be deleted without deleting -y/--yes Skip confirmation prompt -n Target a specific notebook Deletions are batched in chunks of 10 with a 0.5s delay to avoid hitting rate limits on large notebooks.

coderabbitai · 2026-04-09T01:54:23Z

📝 Walkthrough

Walkthrough

A new source clean CLI subcommand was added to remove sources from notebooks based on error status, gateway/anti-bot pattern matching in titles, and duplicate URL detection. The command supports dry-run mode and performs deletions in batched asynchronous operations.

Changes

Cohort / File(s)	Summary
New Source Clean Command `src/notebooklm/cli/source.py`	Added `source_clean` CLI subcommand with logic to identify problematic sources (error/unknown status, gateway/security patterns, duplicate URLs), support for `--dry-run` and `--yes` flags, and chunked async deletion with inter-batch delays. URL normalization added via locally imported `urlparse`/`urlunparse`.

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as CLI Handler
    participant NotebookResolver as Notebook Service
    participant SourceLister as Source Service
    participant Validator as Validation Logic
    participant Deleter as Deletion Service
    participant Feedback as User Feedback

    User->>CLI: source clean --notebook X [--dry-run] [--yes]
    CLI->>NotebookResolver: resolve_notebook_id(X)
    NotebookResolver-->>CLI: notebook_id
    CLI->>SourceLister: list_sources(notebook_id)
    SourceLister-->>CLI: sources[]
    CLI->>Validator: identify_candidates_to_delete(sources)
    Note over Validator: Filter by: error status,<br/>gateway patterns, duplicates
    Validator-->>CLI: candidate_ids[]
    alt dry_run mode
        CLI->>Feedback: report count only
    else normal mode
        alt not --yes flag
            CLI->>User: confirm deletion?
            User-->>CLI: approved
        end
        CLI->>Deleter: delete_in_batches(candidate_ids, batch_size=10)
        loop batch processing with delay
            Deleter->>SourceLister: delete_source(id)
            SourceLister-->>Deleter: success/failure
        end
        Deleter-->>CLI: results[]
        CLI->>Feedback: report deletions successful
    end
    Feedback-->>User: operation complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A hop through the sources so messy and worn,
Deleting the broken, the gateways, forlorn,
Duplicates vanish in batches so neat,
With delays and confirmations, the cleanup's complete!
Our notebooks now sparkle, no errors in sight! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(source): add 'source clean' command' is directly related to the main change—the addition of a new 'source clean' CLI subcommand. It accurately summarizes the primary feature being introduced.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/notebooklm/cli/source.py`:
- Around line 982-992: The deletion loop currently swallows all errors by
calling asyncio.gather(..., return_exceptions=True) and always prints success;
update the loop around client.sources.delete / asyncio.gather to inspect the
gathered results, count actual successful deletions vs exceptions, and log or
print any failures (include exception messages and the corresponding sid) as
they occur; after the loop, change the console.print call to report the real
number of successful deletions (and optionally failed count) using the counted
successes so users aren’t shown a false “Successfully cleaned” message for
delete_list, chunk_size, delete_tasks, and nb_id_resolved.
- Around line 945-950: The code incorrectly converts the enum s.status to a
string with str(...).lower(), so comparisons never match; replace that
conversion with the helper source_status_to_str(s.status) (handle None if
needed) and use that result to check if status is "error" or "unknown", then add
s.id to to_delete as before (look at variables s, to_delete, and function
source_status_to_str for where to change).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ae7d06c-d332-4bb1-9a3a-f0412794bf36

📥 Commits

Reviewing files that changed from the base of the PR and between a997718 and da0b8bd.

📒 Files selected for processing (1)

src/notebooklm/cli/source.py

coderabbitai · 2026-04-09T01:57:35Z

src/notebooklm/cli/source.py

+                status = str(s.status).lower() if s.status else "unknown"
+
+                # Remove sources in error or unknown state
+                if status in ["error", "unknown"]:
+                    to_delete.add(s.id)
+                    continue


⚠️ Potential issue | 🔴 Critical

Bug: Status comparison will never match error sources.

s.status is an integer enum (SourceStatus), not a string. Converting it with str(s.status).lower() produces values like "3" or "sourcestatus.error", not "error". The condition on line 948 will never be true, so error sources are never cleaned.

Use the existing source_status_to_str() function (already imported on line 27) which properly maps status codes to string representations.

🐛 Proposed fix

for s in sorted_sources: title = (s.title or "").strip() - status = str(s.status).lower() if s.status else "unknown" + status = source_status_to_str(s.status) # Remove sources in error or unknown state if status in ["error", "unknown"]:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

status = str(s.status).lower() if s.status else "unknown"

# Remove sources in error or unknown state

if status in ["error", "unknown"]:

to_delete.add(s.id)

continue

status = source_status_to_str(s.status)

# Remove sources in error or unknown state

if status in ["error", "unknown"]:

to_delete.add(s.id)

continue

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/notebooklm/cli/source.py` around lines 945 - 950, The code incorrectly converts the enum s.status to a string with str(...).lower(), so comparisons never match; replace that conversion with the helper source_status_to_str(s.status) (handle None if needed) and use that result to check if status is "error" or "unknown", then add s.id to to_delete as before (look at variables s, to_delete, and function source_status_to_str for where to change).

coderabbitai · 2026-04-09T01:57:35Z

src/notebooklm/cli/source.py

+            # Chunk deletions to avoid rate limiting
+            delete_list = list(to_delete)
+            chunk_size = 10
+            for i in range(0, len(delete_list), chunk_size):
+                chunk = delete_list[i : i + chunk_size]
+                delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk]
+                await asyncio.gather(*delete_tasks, return_exceptions=True)
+                if i + chunk_size < len(delete_list):
+                    await asyncio.sleep(0.5)
+
+            console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]")


⚠️ Potential issue | 🟠 Major

Silent failures: deletions may fail without user notification.

Using asyncio.gather(..., return_exceptions=True) catches all exceptions (rate limits, auth failures, network errors) but the results are never inspected. The success message on line 992 will display even if every deletion failed.

🛠️ Proposed fix to count actual successes

# Chunk deletions to avoid rate limiting delete_list = list(to_delete) chunk_size = 10 + deleted_count = 0 + failed_count = 0 for i in range(0, len(delete_list), chunk_size): chunk = delete_list[i : i + chunk_size] delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk] - await asyncio.gather(*delete_tasks, return_exceptions=True) + results = await asyncio.gather(*delete_tasks, return_exceptions=True) + for result in results: + if isinstance(result, Exception): + failed_count += 1 + else: + deleted_count += 1 if i + chunk_size < len(delete_list): await asyncio.sleep(0.5) - console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]") + if failed_count: + console.print( + f"[yellow]Cleaned {deleted_count} source(s), {failed_count} failed.[/yellow]" + ) + else: + console.print(f"[green]Successfully cleaned {deleted_count} source(s).[/green]")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Chunk deletions to avoid rate limiting

delete_list = list(to_delete)

chunk_size = 10

for i in range(0, len(delete_list), chunk_size):

chunk = delete_list[i : i + chunk_size]

delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk]

await asyncio.gather(*delete_tasks, return_exceptions=True)

if i + chunk_size < len(delete_list):

await asyncio.sleep(0.5)

console.print(f"[green]Successfully cleaned {len(to_delete)} source(s).[/green]")

# Chunk deletions to avoid rate limiting

delete_list = list(to_delete)

chunk_size = 10

deleted_count = 0

failed_count = 0

for i in range(0, len(delete_list), chunk_size):

chunk = delete_list[i : i + chunk_size]

delete_tasks = [client.sources.delete(nb_id_resolved, sid) for sid in chunk]

results = await asyncio.gather(*delete_tasks, return_exceptions=True)

for result in results:

if isinstance(result, Exception):

failed_count += 1

else:

deleted_count += 1

if i + chunk_size < len(delete_list):

await asyncio.sleep(0.5)

if failed_count:

console.print(

f"[yellow]Cleaned {deleted_count} source(s), {failed_count} failed.[/yellow]"

)

else:

console.print(f"[green]Successfully cleaned {deleted_count} source(s).[/green]")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/notebooklm/cli/source.py` around lines 982 - 992, The deletion loop currently swallows all errors by calling asyncio.gather(..., return_exceptions=True) and always prints success; update the loop around client.sources.delete / asyncio.gather to inspect the gathered results, count actual successful deletions vs exceptions, and log or print any failures (include exception messages and the corresponding sid) as they occur; after the loop, change the console.print call to report the real number of successful deletions (and optionally failed count) using the counted successes so users aren’t shown a false “Successfully cleaned” message for delete_list, chunk_size, delete_tasks, and nb_id_resolved.

gemini-code-assist

Code Review

This pull request introduces a new source clean command to the CLI, designed to automatically remove duplicate, error, and access-blocked sources from a NotebookLM notebook. The command identifies sources based on their status, title, and normalized URL, and includes options for dry-run and confirmation. A critical issue was identified in the status checking logic: the s.status enum is incorrectly converted to a string, which prevents sources in an error or unknown state from being properly detected and cleaned. The suggested fix is to use the source_status_to_str() helper function.

gemini-code-assist · 2026-04-09T02:10:55Z

src/notebooklm/cli/source.py

+
+            for s in sorted_sources:
+                title = (s.title or "").strip()
+                status = str(s.status).lower() if s.status else "unknown"


The current logic for checking the source status is incorrect. The s.status attribute is an integer from the SourceStatus enum. Converting it to a string via str(s.status) will produce a digit string (e.g., '3'), not the word 'error'. As a result, the check status in ["error", "unknown"] will never be true, and sources in an error state won't be cleaned up.

You should use the source_status_to_str() helper function, which is designed for this purpose and already used elsewhere in the file.

Suggested change

status = str(s.status).lower() if s.status else "unknown"

status = source_status_to_str(s.status)

agustinsilvazambrano added 2 commits April 8, 2026 21:27

style: ruff format source.py

da0b8bd

coderabbitai bot reviewed Apr 9, 2026

View reviewed changes

gemini-code-assist bot reviewed Apr 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(source): add ‘source clean’ command#261

feat(source): add ‘source clean’ command#261
Flosters wants to merge 2 commits intoteng-lin:mainfrom
Flosters:feat/source-clean

Flosters commented Apr 9, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 9, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Apr 9, 2026

Uh oh!

coderabbitai bot Apr 9, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	status = str(s.status).lower() if s.status else "unknown"
	status = source_status_to_str(s.status)

Conversation

Flosters commented Apr 9, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What it removes

Flags

Implementation notes

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Flosters commented Apr 9, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 9, 2026 •

edited

Loading