
fix: dedupe research import retries after timeout#257

Open
voidborne-d wants to merge 2 commits into teng-lin:main from voidborne-d:fix/import-all-dedup-on-timeout

Conversation

Contributor

@voidborne-d voidborne-d commented Apr 6, 2026

Summary

  • dedupe the retry batch in import_with_retry() by checking which source URLs are already present in the notebook after a timeout
  • stop retrying early when every source has already landed, instead of resubmitting the whole batch
  • add regression tests covering partial and complete retry deduplication

Testing

  • PYTHONPATH=src python3 -m pytest tests/unit/cli/test_helpers.py tests/unit/cli/test_research.py tests/unit/cli/test_source.py -q

Closes #241.

Summary by CodeRabbit

  • Bug Fixes

    • Smarter retry behavior for importing sources: deduplicates and incrementally shrinks retry batches by reconciling with existing notebook sources, stops retrying when nothing remains, and continues retrying if reconciliation fails.
  • Tests

    • Expanded async test coverage validating timeout/retry reconciliation, deduplication, early stop behavior, and resilience when source-listing fails.


coderabbitai bot commented Apr 6, 2026

📝 Walkthrough

import_with_retry() now maintains a mutable pending_sources list and, after each RPCTimeoutError, calls client.sources.list(notebook_id) to remove already-imported sources from pending_sources; it stops retrying if no pending sources remain. If listing fails, it retries with the current pending set.
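The loop described above can be sketched minimally as follows. This is an illustration of the walkthrough, not the actual helper: the client object and its research.import_sources / sources.list coroutines are stand-ins shaped after the names used in this PR, and RPCTimeoutError is redefined locally.

```python
class RPCTimeoutError(Exception):
    """Stand-in for the library's timeout error."""

async def import_with_retry_sketch(client, notebook_id, task_id, sources, max_attempts=3):
    # Mutable pending list, reconciled against the notebook after each timeout.
    pending = list(sources)
    for _ in range(max_attempts):
        try:
            return await client.research.import_sources(notebook_id, task_id, pending)
        except RPCTimeoutError:
            try:
                existing = await client.sources.list(notebook_id)
            except Exception:
                continue  # Listing failed: retry with the current pending set.
            # Drop any pending source whose URL already landed in the notebook.
            existing_urls = {s["url"] for s in existing if s.get("url")}
            pending = [s for s in pending if s.get("url") not in existing_urls]
            if not pending:
                return []  # Everything already landed; stop retrying.
    raise RPCTimeoutError("retries exhausted")
```

The key point is that the batch passed to import_sources shrinks across attempts instead of being resubmitted whole.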

Changes

Source Import Deduplication Logic (src/notebooklm/cli/helpers.py)
  Reworked import_with_retry() to initialize pending_sources from sources, pass pending_sources to client.research.import_sources(), and, on RPC timeout, call client.sources.list() to compute existing source identities and filter out already-imported items; returns early if nothing remains. Added helper functions _source_identity and _existing_source_identities.
Test Coverage for Retry Reconciliation (tests/unit/cli/test_helpers.py)
  Updated tests to stub client.sources.list and added multiple async tests verifying: pending-source filtering after timeout, early termination when all sources are present, behavior when sources.list fails, handling of report-like sources, and correct returned batch for mixed-source retries.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as import_with_retry()
    participant Client as client (research & sources)
    participant Notebook as Notebook Service

    Caller->>Client: import_sources(pending_sources)
    Note right of Caller: on RPCTimeoutError
    Caller->>Client: sources.list(notebook_id)
    Client->>Notebook: list request
    Notebook-->>Client: existing sources
    Client-->>Caller: existing sources
    Caller->>Caller: compute identities, filter pending_sources
    alt pending_sources not empty
        Caller->>Client: import_sources(filtered_pending_sources)
    else pending_sources empty
        Caller-->>Caller: return [] (stop retries)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped in to stop duplicates' spree,
I checked what's already by the tree,
Filtered the list, kept only what's new,
Retries now gentle, tidy, and true,
No more sixfold echoes — just one cozy me.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 30.77%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the title 'fix: dedupe research import retries after timeout' clearly and concisely summarizes the main change: deduplicating sources during retry logic after RPC timeouts.
  • Linked Issues Check ✅ Passed: the PR fully addresses issue #241 by implementing deduplication in import_with_retry() to prevent duplicate imports on timeout retry, filtering pending sources against already-imported URLs, and stopping retries when all sources exist.
  • Out of Scope Changes Check ✅ Passed: all changes are directly scoped to the deduplication fix: modifications to import_with_retry() logic, helper functions for identity matching, and comprehensive regression tests for the timeout reconciliation behavior.



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/unit/cli/test_helpers.py (1)

730-732: Consider adding a test for when client.sources.list raises an exception.

The implementation has a defensive except Exception block (lines 122-127 in helpers.py) that continues retrying with the current pending_sources if listing fails. This path isn't tested and could regress.

📝 Suggested test for the fallback path
@pytest.mark.asyncio
async def test_continues_retrying_when_sources_list_fails(self):
    """Verify retry continues with original batch if sources.list fails."""
    client = MagicMock()
    client.research.import_sources = AsyncMock(
        side_effect=[
            RPCTimeoutError("Timed out", timeout_seconds=30.0),
            [{"id": "src_1", "title": "Source 1"}],
        ]
    )
    client.sources.list = AsyncMock(side_effect=Exception("API error"))
    sources = [{"url": "https://example.com", "title": "Source 1"}]

    with patch("notebooklm.cli.helpers.asyncio.sleep", new_callable=AsyncMock):
        imported = await import_with_retry(client, "nb_123", "task_123", sources)

    assert imported == [{"id": "src_1", "title": "Source 1"}]
    # Should retry with original sources since listing failed
    assert client.research.import_sources.await_args_list[1].args[2] == sources

Also applies to: 751-752

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/cli/test_helpers.py` around lines 730 - 732, Add a unit test that
simulates client.sources.list raising an exception to cover the except Exception
path in import_with_retry: create a MagicMock client where
client.research.import_sources first raises RPCTimeoutError then returns the
expected imported list, and set client.sources.list to
AsyncMock(side_effect=Exception("API error")); call import_with_retry with a
single-source batch, patch asyncio.sleep to avoid delays, assert the final
returned imports match the expected result and verify the second import_sources
call received the original pending_sources (ensuring the retry used the original
batch when listing failed).
src/notebooklm/cli/helpers.py (1)

110-121: Consider distinguishing "all sources already imported" from "no sources imported".

When all sources are already present, returning [] is semantically ambiguous—callers like research.py (lines 190-201) will report "imported": 0 without indicating that sources were already present. This could confuse users who expect confirmation that their sources exist.

Consider returning a distinct signal (e.g., the list of already-present sources, or a different return type/tuple) so callers can provide clearer feedback. At minimum, the info-level log at line 117-120 helps with debugging, but the JSON output path has no visibility into this.
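One shape the distinct signal could take is a small result type carrying both lists. ImportOutcome and its summary() helper are hypothetical, sketched only to illustrate the reviewer's suggestion; they are not the library's current API, and adopting them would change the function's return contract.

```python
from dataclasses import dataclass, field

@dataclass
class ImportOutcome:
    """Hypothetical return type distinguishing the two zero-import cases."""
    imported: list = field(default_factory=list)
    already_present: list = field(default_factory=list)

    def summary(self) -> str:
        # Lets callers report "already present" instead of a bare zero.
        if not self.imported and self.already_present:
            return f"0 imported ({len(self.already_present)} already present)"
        return f"{len(self.imported)} imported"
```

A caller like research.py could then surface already-present sources in its JSON output rather than reporting an ambiguous "imported": 0.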

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/notebooklm/cli/helpers.py` around lines 110 - 121, The current timeout
branch in the function that calls client.sources.list(notebook_id) returns []
when pending_sources is empty, which hides the fact that all sources were
already present; change the function to return a distinct signal such as the
list of already-present source URLs or a tuple like (imported_count,
already_present_urls) so callers (e.g., research.py) can distinguish "0 imported
because already present" from "0 imported because none existed"; update the
branch that computes existing_urls and pending_sources and return the chosen
distinct value instead of [] and adjust callers to handle the new return shape
(or document the new contract) so logger.info remains for debugging but JSON
output can report already-present sources.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6e8d3980-00fd-4244-8f88-705faa5eb773

📥 Commits

Reviewing files that changed from the base of the PR and between a997718 and 15fb3fc.

📒 Files selected for processing (2)
  • src/notebooklm/cli/helpers.py
  • tests/unit/cli/test_helpers.py

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the source import retry logic by implementing a deduplication step after a timeout occurs. The import_with_retry function now fetches existing sources from the notebook and filters the pending list to avoid redundant imports. Review feedback points out that the deduplication logic currently relies solely on URLs, which will cause source types without URLs (such as report sources) to be retried indefinitely. Additionally, it was noted that the function's return value now only includes metadata for sources imported in the final successful batch, potentially losing information about sources that were successfully processed during previous attempts that timed out.

Comment on lines +112 to +115
existing_urls = {source.url for source in existing_sources if getattr(source, "url", None)}
pending_sources = [
    source for source in pending_sources if source.get("url") not in existing_urls
]

medium

The deduplication logic relies exclusively on the url field. However, research import batches can also contain "report sources" (result_type 5) which do not have a URL but instead contain report_markdown. Currently, any source without a URL will have source.get("url") return None, which is never present in existing_urls. This causes report sources to be kept in pending_sources indefinitely, leading to unnecessary retries and potential duplication. Per repository guidelines, please add comments to explain this complex logic, especially how it addresses varying source types.

References
  1. Add comments to explain complex logic, especially when it addresses varying API response formats or source types.
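A sketch of how the dedupe key could cover both source shapes. The field names (report_markdown) and the result_type 5 convention are taken from the comment above, not verified against the library, and pending sources are assumed to be plain dicts as elsewhere in this PR.

```python
def source_identity(source: dict):
    """Build a dedupe key for a pending research source.

    Web results are keyed on their URL; report-style results, which carry
    no URL, fall back to a (title, markdown) key so they can still be
    matched against sources already present in the notebook.
    """
    url = source.get("url")
    if url:
        return ("url", url)
    report = source.get("report_markdown")
    if report:
        return ("report", f"{source.get('title', '')}\n{report}")
    return None  # No usable identity; caller should keep the source pending.
```

With a key like this, report sources no longer fall through the URL-only filter and retry indefinitely.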

while True:
    try:
-       return await client.research.import_sources(notebook_id, task_id, sources)
+       return await client.research.import_sources(notebook_id, task_id, pending_sources)

medium

When a retry occurs with a subset of the original sources, the return value of import_with_retry will only contain the IDs and titles of the sources imported in that specific successful attempt. If some sources landed during a previous attempt that timed out, their information is lost. While reconciling the full list would be ideal, avoid making breaking API changes. Instead, document this existing behavior in tests and flag it for future consideration, as per repository rules.

References
  1. Avoid making breaking API changes. Instead, document the existing behavior in tests and flag it for future consideration.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/notebooklm/cli/helpers.py (1)

165-170: Consider narrowing the exception type or re-raising unexpected errors.

The broad except Exception is flagged by static analysis (BLE001). While the defensive fallback is reasonable for network/API errors, catching all exceptions could mask unexpected bugs (e.g., TypeError, AttributeError from code issues).

Consider either:

  1. Catching specific expected exceptions (e.g., network-related errors from the client)
  2. Re-raising after logging if the exception is unexpected

That said, the current approach is safe for reliability — the worst case is an extra retry with the original batch.

🔧 Optional: Narrow the exception handling
-            except Exception as e:  # pragma: no cover - defensive: retry original pending batch
+            except (OSError, ConnectionError, TimeoutError) as e:  # pragma: no cover - defensive
                 logger.debug(
                     "Failed to list existing sources before retrying research import for %s: %s",
                     notebook_id,
                     e,
                 )

Or, if the client raises specific exception types, catch those instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/notebooklm/cli/helpers.py` around lines 165 - 170, The broad `except
Exception` around the "Failed to list existing sources before retrying research
import" debug log should be narrowed: replace `except Exception as e` with
catching the specific client/network exceptions your listing call can raise
(e.g., the HTTP/grpc/client error type used by the client) and keep the
logger.debug for those; for any other unexpected exceptions, re-raise after
logging so real bugs (TypeError/AttributeError) aren't swallowed. Locate the
handler around the listing call that references `notebook_id` and the debug log
and update it to use a tuple of expected exception types and a separate fallback
that re-raises unexpected exceptions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b246a1cd-c276-4dae-9a32-c24632e6b27d

📥 Commits

Reviewing files that changed from the base of the PR and between 15fb3fc and aa11180.

📒 Files selected for processing (2)
  • src/notebooklm/cli/helpers.py
  • tests/unit/cli/test_helpers.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/cli/test_helpers.py

Comment on lines +100 to +113
def _existing_source_identities(existing_sources: Iterable[object]) -> set[tuple[str, str]]:
    """Build dedupe keys from notebook sources returned by the API."""
    identities: set[tuple[str, str]] = set()
    for source in existing_sources:
        url = getattr(source, "url", None)
        if isinstance(url, str) and url:
            identities.add(("url", url))

        result_type = getattr(source, "result_type", None)
        report_markdown = getattr(source, "report_markdown", None)
        if result_type == 5 and isinstance(report_markdown, str) and report_markdown:
            identities.add(("report", f"{getattr(source, 'title', '') or ''}\n{report_markdown}"))

    return identities

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if Source class has result_type or report_markdown attributes

echo "=== Source class definition ==="
ast-grep --pattern $'class Source:
  $$$
'

echo ""
echo "=== Check for result_type in types.py ==="
rg -n "result_type" src/notebooklm/types.py || echo "Not found"

echo ""
echo "=== Check for report_markdown in types.py ==="
rg -n "report_markdown" src/notebooklm/types.py || echo "Not found"

Repository: teng-lin/notebooklm-py

Length of output: 9685


Report deduplication logic will never execute due to missing attributes.

The Source dataclass in src/notebooklm/types.py:484-507 defines attributes id, title, url, _type_code, created_at, and status, but not result_type or report_markdown.

Lines 108–111 attempt to access these non-existent attributes:

  • getattr(source, "result_type", None) always returns None
  • The condition result_type == 5 never evaluates to True
  • Report-based identities are never added to the deduplication set

URL-based deduplication (lines 104–106) works correctly, but report-type sources will not be deduplicated against existing sources, risking duplicates on retry.
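Given the fields the verification above confirms on Source (url, _type_code, title), a working identity builder could look roughly like this. Keying report sources on title alone is an assumption, since the dataclass exposes no markdown field.

```python
def existing_source_identities_fixed(existing_sources):
    """Build dedupe keys using fields the Source dataclass actually defines."""
    identities = set()
    for source in existing_sources:
        url = getattr(source, "url", None)
        if isinstance(url, str) and url:
            identities.add(("url", url))
        # Report sources (_type_code == 5) carry no URL; key on the title.
        if getattr(source, "_type_code", None) == 5:
            title = getattr(source, "title", None)
            if isinstance(title, str) and title:
                identities.add(("report", title))
    return identities
```

The pending-side key would need the matching title-based fallback for the two sides to reconcile.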

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/notebooklm/cli/helpers.py` around lines 100 - 113, The dedupe branch
never runs because it checks non-existent attributes `result_type` and
`report_markdown`; update the function `_existing_source_identities` to use the
actual Source fields: read the source type via `getattr(source, "_type_code",
None)` (replace `result_type`) and, if you need report content, use an existing
field such as `getattr(source, "title", "")` (or another real field from the
Source dataclass) instead of `report_markdown`; ensure the conditional uses
`_type_code == 5` and only builds the ("report", ...) identity when the chosen
real field is a non-empty string so report-based deduplication can actually run.



Successfully merging this pull request may close these issues.

Bug: research wait --import-all imports duplicate sources on timeout retry
