fix: dedupe research import retries after timeout #257
voidborne-d wants to merge 2 commits into teng-lin:main from
Conversation
📝 Walkthrough
import_with_retry() now maintains a mutable pending_sources list and, after each RPCTimeoutError, calls client.sources.list() to drop sources that already landed in the notebook.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Caller as import_with_retry()
    participant Client as client (research & sources)
    participant Notebook as Notebook Service
    Caller->>Client: import_sources(pending_sources)
    Note right of Caller: on RPCTimeoutError
    Caller->>Client: sources.list(notebook_id)
    Client->>Notebook: list request
    Notebook-->>Client: existing sources
    Client-->>Caller: existing sources
    Caller->>Caller: compute identities, filter pending_sources
    alt pending_sources not empty
        Caller->>Client: import_sources(filtered_pending_sources)
    else pending_sources empty
        Caller-->>Caller: return [] (stop retries)
    end
```
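The flow in the diagram can be sketched in Python. Names like `import_with_retry`, `client.research.import_sources`, `client.sources.list`, and `RPCTimeoutError` mirror the PR, but the signatures, the `delay` parameter, and the URL-only dedupe key here are assumptions for illustration, not the merged implementation.

```python
import asyncio


class RPCTimeoutError(Exception):
    """Stand-in for the project's RPC timeout exception."""


async def import_with_retry(client, notebook_id, task_id, sources, delay=1.0):
    """Retry a research import, deduping against the notebook after each timeout."""
    pending_sources = list(sources)
    while True:
        try:
            return await client.research.import_sources(
                notebook_id, task_id, pending_sources
            )
        except RPCTimeoutError:
            try:
                existing = await client.sources.list(notebook_id)
                existing_urls = {
                    s.url for s in existing if getattr(s, "url", None)
                }
                # Keep only sources that have not already landed.
                pending_sources = [
                    s for s in pending_sources
                    if s.get("url") not in existing_urls
                ]
            except Exception:
                # Listing failed: retry the current batch unchanged.
                pass
            if not pending_sources:
                return []  # everything already landed; stop retrying
            await asyncio.sleep(delay)
```

Under these assumptions, a timed-out batch whose first URL already appears in `sources.list()` is retried with only the remaining URLs.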
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
🧹 Nitpick comments (2)
tests/unit/cli/test_helpers.py (1)
730-732: Consider adding a test for when client.sources.list raises an exception.
The implementation has a defensive except Exception block (lines 122-127 in helpers.py) that continues retrying with the current pending_sources if listing fails. This path isn't tested and could regress.
📝 Suggested test for the fallback path
```python
@pytest.mark.asyncio
async def test_continues_retrying_when_sources_list_fails(self):
    """Verify retry continues with original batch if sources.list fails."""
    client = MagicMock()
    client.research.import_sources = AsyncMock(
        side_effect=[
            RPCTimeoutError("Timed out", timeout_seconds=30.0),
            [{"id": "src_1", "title": "Source 1"}],
        ]
    )
    client.sources.list = AsyncMock(side_effect=Exception("API error"))
    sources = [{"url": "https://example.com", "title": "Source 1"}]
    with patch("notebooklm.cli.helpers.asyncio.sleep", new_callable=AsyncMock):
        imported = await import_with_retry(client, "nb_123", "task_123", sources)
    assert imported == [{"id": "src_1", "title": "Source 1"}]
    # Should retry with original sources since listing failed
    assert client.research.import_sources.await_args_list[1].args[2] == sources
```
Also applies to: 751-752
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/cli/test_helpers.py` around lines 730 - 732, Add a unit test that simulates client.sources.list raising an exception to cover the except Exception path in import_with_retry: create a MagicMock client where client.research.import_sources first raises RPCTimeoutError then returns the expected imported list, and set client.sources.list to AsyncMock(side_effect=Exception("API error")); call import_with_retry with a single-source batch, patch asyncio.sleep to avoid delays, assert the final returned imports match the expected result and verify the second import_sources call received the original pending_sources (ensuring the retry used the original batch when listing failed).

src/notebooklm/cli/helpers.py (1)
110-121: Consider distinguishing "all sources already imported" from "no sources imported".
When all sources are already present, returning [] is semantically ambiguous: callers like research.py (lines 190-201) will report "imported": 0 without indicating that sources were already present. This could confuse users who expect confirmation that their sources exist. Consider returning a distinct signal (e.g., the list of already-present sources, or a different return type/tuple) so callers can provide clearer feedback. At minimum, the info-level log at lines 117-120 helps with debugging, but the JSON output path has no visibility into this.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/notebooklm/cli/helpers.py` around lines 110 - 121, The current timeout branch in the function that calls client.sources.list(notebook_id) returns [] when pending_sources is empty, which hides the fact that all sources were already present; change the function to return a distinct signal such as the list of already-present source URLs or a tuple like (imported_count, already_present_urls) so callers (e.g., research.py) can distinguish "0 imported because already present" from "0 imported because none existed"; update the branch that computes existing_urls and pending_sources and return the chosen distinct value instead of [] and adjust callers to handle the new return shape (or document the new contract) so logger.info remains for debugging but JSON output can report already-present sources.
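One possible shape for the distinct signal suggested above. The `ImportOutcome` name and its fields are invented here for illustration and are not part of the codebase:

```python
from typing import NamedTuple


class ImportOutcome(NamedTuple):
    """Hypothetical richer return type for import_with_retry()."""
    imported: list[dict]        # metadata for sources imported by this call
    already_present: list[str]  # URLs found in the notebook before importing


# The empty-pending branch could then return something like:
outcome = ImportOutcome(imported=[], already_present=["https://example.com"])
```

With a shape like this, research.py could report "0 imported (1 already present)" in its JSON output instead of a bare zero, while existing callers that only read `imported` would still see an empty list.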
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6e8d3980-00fd-4244-8f88-705faa5eb773
📒 Files selected for processing (2)
- src/notebooklm/cli/helpers.py
- tests/unit/cli/test_helpers.py
Code Review
This pull request enhances the source import retry logic by implementing a deduplication step after a timeout occurs. The import_with_retry function now fetches existing sources from the notebook and filters the pending list to avoid redundant imports. Review feedback points out that the deduplication logic currently relies solely on URLs, which will cause source types without URLs (such as report sources) to be retried indefinitely. Additionally, it was noted that the function's return value now only includes metadata for sources imported in the final successful batch, potentially losing information about sources that were successfully processed during previous attempts that timed out.
src/notebooklm/cli/helpers.py
Outdated
```python
existing_urls = {source.url for source in existing_sources if getattr(source, "url", None)}
pending_sources = [
    source for source in pending_sources if source.get("url") not in existing_urls
]
```
The deduplication logic relies exclusively on the url field. However, research import batches can also contain "report sources" (result_type 5) which do not have a URL but instead contain report_markdown. Currently, any source without a URL will have source.get("url") return None, which is never present in existing_urls. This causes report sources to be kept in pending_sources indefinitely, leading to unnecessary retries and potential duplication. Per repository guidelines, please add comments to explain this complex logic, especially how it addresses varying source types.
References
- Add comments to explain complex logic, especially when it addresses varying API response formats or source types.
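The failure mode this comment describes can be reproduced with the filter in isolation. The dict keys below follow the comment's description of a report source (no `url`, only report content) and are illustrative data, not values from the codebase:

```python
existing_urls = {"https://a.example"}  # URLs already present in the notebook

# A report source per the comment: no "url" key, only report content.
report_source = {"result_type": 5, "report_markdown": "# Findings"}

# .get("url") yields None, and None is never a member of existing_urls,
# so the URL-only filter keeps the report source on every retry:
kept = [s for s in [report_source] if s.get("url") not in existing_urls]
assert kept == [report_source]
```

This is why a URL-only dedupe key cannot terminate retries for report sources: the filter never shrinks that part of the pending batch.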
```diff
  while True:
      try:
-         return await client.research.import_sources(notebook_id, task_id, sources)
+         return await client.research.import_sources(notebook_id, task_id, pending_sources)
```
When a retry occurs with a subset of the original sources, the return value of import_with_retry will only contain the IDs and titles of the sources imported in that specific successful attempt. If some sources landed during a previous attempt that timed out, their information is lost. While reconciling the full list would be ideal, avoid making breaking API changes. Instead, document this existing behavior in tests and flag it for future consideration, as per repository rules.
References
- Avoid making breaking API changes. Instead, document the existing behavior in tests and flag it for future consideration.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/notebooklm/cli/helpers.py (1)
165-170: Consider narrowing the exception type or re-raising unexpected errors.
The broad except Exception is flagged by static analysis (BLE001). While the defensive fallback is reasonable for network/API errors, catching all exceptions could mask unexpected bugs (e.g., TypeError, AttributeError from code issues).
Consider either:
- Catching specific expected exceptions (e.g., network-related errors from the client)
- Re-raising after logging if the exception is unexpected
That said, the current approach is safe for reliability — the worst case is an extra retry with the original batch.
🔧 Optional: Narrow the exception handling
```diff
- except Exception as e:  # pragma: no cover - defensive: retry original pending batch
+ except (OSError, ConnectionError, TimeoutError) as e:  # pragma: no cover - defensive
      logger.debug(
          "Failed to list existing sources before retrying research import for %s: %s",
          notebook_id,
          e,
      )
```
Or, if the client raises specific exception types, catch those instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/notebooklm/cli/helpers.py` around lines 165 - 170, The broad `except Exception` around the "Failed to list existing sources before retrying research import" debug log should be narrowed: replace `except Exception as e` with catching the specific client/network exceptions your listing call can raise (e.g., the HTTP/grpc/client error type used by the client) and keep the logger.debug for those; for any other unexpected exceptions, re-raise after logging so real bugs (TypeError/AttributeError) aren't swallowed. Locate the handler around the listing call that references `notebook_id` and the debug log and update it to use a tuple of expected exception types and a separate fallback that re-raises unexpected exceptions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b246a1cd-c276-4dae-9a32-c24632e6b27d
📒 Files selected for processing (2)
- src/notebooklm/cli/helpers.py
- tests/unit/cli/test_helpers.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/cli/test_helpers.py
```python
def _existing_source_identities(existing_sources: Iterable[object]) -> set[tuple[str, str]]:
    """Build dedupe keys from notebook sources returned by the API."""
    identities: set[tuple[str, str]] = set()
    for source in existing_sources:
        url = getattr(source, "url", None)
        if isinstance(url, str) and url:
            identities.add(("url", url))

        result_type = getattr(source, "result_type", None)
        report_markdown = getattr(source, "report_markdown", None)
        if result_type == 5 and isinstance(report_markdown, str) and report_markdown:
            identities.add(("report", f"{getattr(source, 'title', '') or ''}\n{report_markdown}"))

    return identities
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check if Source class has result_type or report_markdown attributes
echo "=== Source class definition ==="
ast-grep --pattern $'class Source:
  $$$
'
echo ""
echo "=== Check for result_type in types.py ==="
rg -n "result_type" src/notebooklm/types.py || echo "Not found"
echo ""
echo "=== Check for report_markdown in types.py ==="
rg -n "report_markdown" src/notebooklm/types.py || echo "Not found"
```
Repository: teng-lin/notebooklm-py
Length of output: 9685
Report deduplication logic will never execute due to missing attributes.
The Source dataclass in src/notebooklm/types.py:484-507 defines attributes id, title, url, _type_code, created_at, and status, but not result_type or report_markdown.
Lines 108–111 attempt to access these non-existent attributes:
- getattr(source, "result_type", None) always returns None
- The condition result_type == 5 never evaluates to True
- Report-based identities are never added to the deduplication set

URL-based deduplication (lines 104–106) works correctly, but report-type sources will not be deduplicated against existing sources, risking duplicates on retry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/notebooklm/cli/helpers.py` around lines 100 - 113, The dedupe branch
never runs because it checks non-existent attributes `result_type` and
`report_markdown`; update the function `_existing_source_identities` to use the
actual Source fields: read the source type via `getattr(source, "_type_code",
None)` (replace `result_type`) and, if you need report content, use an existing
field such as `getattr(source, "title", "")` (or another real field from the
Source dataclass) instead of `report_markdown`; ensure the conditional uses
`_type_code == 5` and only builds the ("report", ...) identity when the chosen
real field is a non-empty string so report-based deduplication can actually run.
Summary
- Dedupe retries in import_with_retry() by checking which source URLs are already present in the notebook after a timeout

Testing
PYTHONPATH=src python3 -m pytest tests/unit/cli/test_helpers.py tests/unit/cli/test_research.py tests/unit/cli/test_source.py -q

Closes #241.
Summary by CodeRabbit
Bug Fixes
Tests