
fix: extract YouTube source URL from src[2][5] in list() and from_api_response()#267

Open
octo-patch wants to merge 2 commits into teng-lin:main from octo-patch:fix/issue-265-youtube-url-extraction

Conversation


@octo-patch octo-patch commented Apr 11, 2026

Fixes #265

Problem

Source.url is always None for YouTube sources when retrieved via client.sources.list() or client.sources.get(). Web page and uploaded file sources return their URLs correctly.

The NotebookLM API stores YouTube URLs at src[2][5] as [url, video_id, channel_name], while web/PDF source URLs are stored at src[2][7]. The existing URL extraction logic only checked src[2][7], so YouTube source URLs were never extracted.
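For illustration, the two layouts might look like this. These are hypothetical shapes reconstructed only from the indices described above; every position other than [5] and [7] is a placeholder, not the real API schema:

```python
# Hypothetical source entries; only indices [2][5] and [2][7] reflect
# the layout described in this PR, all other positions are placeholders.
web_src = [
    "src_web", None,
    [None, "Example Page", None, None, None, None, None,
     ["https://example.com/article"]],  # web/PDF URL at src[2][7]
]
youtube_src = [
    "src_yt", None,
    [None, "Video Title", None, None, None,
     # YouTube data tuple at src[2][5]: [url, video_id, channel_name]
     ["https://www.youtube.com/watch?v=dcWU-qD8ISQ", "dcWU-qD8ISQ", "Channel Name"]],
]
```

Note that the YouTube entry's metadata list is too short to even have an index 7, which is why the old extraction logic returned None for it.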

Solution

Added a check for src[2][5] between the existing [7] and [0] checks in all URL extraction paths:

  • _sources.py list() method: added fallback to src[2][5] when src[2][7] is None
  • types.py Source.from_api_response(): added the same fallback in both the medium-nested and deeply-nested parsing paths

The fix validates that src[2][5] is a list with at least one element that is a string before using it as a URL.
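The full fallback chain can be sketched as follows. This is a minimal standalone sketch, not the library's actual code; extract_source_url and the placeholder payloads are illustrative:

```python
def extract_source_url(src):
    # Fallback order: src[2][7] (web/PDF), then src[2][5]
    # (YouTube [url, video_id, channel_name]), then src[2][0] (plain string).
    url = None
    meta = src[2]
    if len(meta) > 7 and isinstance(meta[7], list) and meta[7]:
        url = meta[7][0]
    if not url and len(meta) > 5:
        yt_data = meta[5]
        # Validate: a list with at least one element that is a string
        if isinstance(yt_data, list) and yt_data and isinstance(yt_data[0], str):
            url = yt_data[0]
    if not url and len(meta) > 0 and isinstance(meta[0], str) and meta[0].startswith("http"):
        url = meta[0]
    return url

# YouTube layout: URL tuple at index 5, metadata list too short for index 7
yt = [None, None, [None, None, None, None, None,
                   ["https://www.youtube.com/watch?v=abc123", "abc123", "Some Channel"]]]
# Web layout: URL list at index 7
web = [None, None, [None] * 7 + [["https://example.com/page"]]]
```

Because each branch only fires when the previous one produced nothing, adding the index-5 check cannot change the result for web or PDF sources.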

Testing

  • Added unit test in tests/unit/test_types.py: test_from_api_response_youtube_url_at_index_5 — verifies that Source.from_api_response() correctly extracts YouTube URLs stored at entry[2][5]
  • Added integration test in tests/integration/test_sources.py: test_list_sources_youtube_url_at_index_5 — verifies that sources.list() correctly extracts YouTube URLs from the src[2][5] position
  • All existing tests continue to pass (2016 passed)

Summary by CodeRabbit

  • Bug Fixes

    • Improved URL extraction for sources to handle multiple metadata layouts and correctly fall back to alternate URL locations (including YouTube), reducing missed or incorrect source links.
  • Tests

    • Added integration and unit tests covering YouTube and other sources with varied metadata positions to ensure reliable URL extraction across formats.


coderabbitai bot commented Apr 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5739ab41-2546-4013-b48e-c50a689b0e6e

📥 Commits

Reviewing files that changed from the base of the PR and between 3725052 and 56c00c5.

📒 Files selected for processing (3)
  • src/notebooklm/_sources.py
  • src/notebooklm/types.py
  • tests/unit/test_types.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unit/test_types.py
  • src/notebooklm/types.py
  • src/notebooklm/_sources.py

📝 Walkthrough

Walkthrough

Extended URL extraction logic in SourcesAPI.list() and Source.from_api_response() to include fallback checks for YouTube source URLs stored at alternative indices (src[2][5] and src[2][0]), along with corresponding integration and unit tests validating the new extraction paths.

Changes

URL Extraction Logic (src/notebooklm/_sources.py, src/notebooklm/types.py)
Added sequential fallback logic to check src[2][7] (web/PDF), then src[2][5] (YouTube data tuple), then src[2][0] (HTTP string) for URL extraction. Conditional guards ensure safe access to list and string indices across both files.

Test Coverage (tests/integration/test_sources.py, tests/unit/test_types.py)
Added an integration test for a YouTube source with URL located at src[2][5] and extended an existing integration test with a YouTube URL assertion. Added a unit test validating Source.from_api_response() extracts YouTube URLs from the new fallback index.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through tuples, indexes wide,

Found YouTube links where they did hide.
Now fallbacks hunt them, one, two, three—
No URL lost, hooray for me! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the main fix: adding YouTube URL extraction from src[2][5] in the list() and from_api_response() methods, matching the core changeset.
  • Linked Issues Check — ✅ Passed: the PR fully addresses issue #265 by implementing the src[2][5] fallback in _sources.py list() and types.py from_api_response(), with proper validation and comprehensive test coverage.
  • Out of Scope Changes Check — ✅ Passed: all changes are directly scoped to issue #265: YouTube URL extraction updates in source files and corresponding unit and integration tests.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the source parsing logic in _sources.py and types.py to correctly extract YouTube URLs from the API response, which are located at index 5 instead of index 7. It also introduces a fallback for URLs found at index 0 and includes comprehensive unit and integration tests. Review feedback suggests enhancing the robustness of the extraction by adding explicit string type checks for all URL indices and ensuring consistent logic across different parsing paths.

Comment on lines +108 to +118
if len(src[2]) > 7:
    url_list = src[2][7]
    if isinstance(url_list, list) and len(url_list) > 0:
        url = url_list[0]
if not url and len(src[2]) > 5:
    yt_data = src[2][5]
    if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
        url = yt_data[0]
if not url and len(src[2]) > 0:
    if isinstance(src[2][0], str) and src[2][0].startswith("http"):
        url = src[2][0]
Contributor


Severity: medium

The extraction logic for src[2][7] should include a type check for the first element (ensuring it is a string), consistent with the validation added for YouTube URLs at src[2][5]. This prevents potential type errors if the API returns non-string data. Additionally, per repository rules, please add a comment explaining this logic as it addresses varying API response formats.

Suggested change

-if len(src[2]) > 7:
-    url_list = src[2][7]
-    if isinstance(url_list, list) and len(url_list) > 0:
-        url = url_list[0]
-if not url and len(src[2]) > 5:
-    yt_data = src[2][5]
-    if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
-        url = yt_data[0]
-if not url and len(src[2]) > 0:
-    if isinstance(src[2][0], str) and src[2][0].startswith("http"):
-        url = src[2][0]
+# Handle varying API response formats for URL extraction.
+if len(src[2]) > 7 and isinstance(src[2][7], list) and len(src[2][7]) > 0 and isinstance(src[2][7][0], str):
+    url = src[2][7][0]
+if not url and len(src[2]) > 5 and isinstance(src[2][5], list) and len(src[2][5]) > 0 and isinstance(src[2][5][0], str):
+    url = src[2][5][0]
+if not url and len(src[2]) > 0:
+    if isinstance(src[2][0], str) and src[2][0].startswith("http"):
+        url = src[2][0]
References
  1. Add comments to explain complex logic, such as recursive ID extraction, especially when it addresses varying API response formats.
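To see why the extra string check matters, compare an unguarded and a guarded chain against a malformed payload. Both functions below are illustrative stand-ins, not the repository's code:

```python
def extract_unguarded(meta):
    # Accepts meta[7][0] without checking its type.
    url = None
    if len(meta) > 7 and isinstance(meta[7], list) and meta[7]:
        url = meta[7][0]
    if not url and len(meta) > 5 and isinstance(meta[5], list) and meta[5] and isinstance(meta[5][0], str):
        url = meta[5][0]
    return url

def extract_guarded(meta):
    # Only accepts meta[7][0] when it is a string.
    url = None
    if len(meta) > 7 and isinstance(meta[7], list) and meta[7] and isinstance(meta[7][0], str):
        url = meta[7][0]
    if not url and len(meta) > 5 and isinstance(meta[5], list) and meta[5] and isinstance(meta[5][0], str):
        url = meta[5][0]
    return url

# Malformed payload: a nested list where the URL string should be at
# index 7, plus a valid YouTube tuple at index 5.
meta = [None, None, None, None, None,
        ["https://www.youtube.com/watch?v=xyz", "xyz", "Chan"],
        None, [["not", "a", "url"]]]
```

The unguarded version assigns the nested list as the "URL" and the truthy value short-circuits the fallback chain; the guarded version skips it and falls through to the valid YouTube tuple.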

Comment on lines 588 to +593
if len(entry[2]) > 7 and isinstance(entry[2][7], list):
    url = entry[2][7][0] if entry[2][7] else None
if not url and len(entry[2]) > 5:
    yt_data = entry[2][5]
    if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
        url = yt_data[0]
Contributor

Severity: medium

The medium-nested parsing path is missing the fallback check for entry[2][0] starting with "http", which is present in the deeply-nested path and in _sources.py. Additionally, the extraction of entry[2][7] should be made more robust by verifying that the first element is a string. Please include a comment explaining this logic to handle varying API response formats.

Suggested change

-if len(entry[2]) > 7 and isinstance(entry[2][7], list):
-    url = entry[2][7][0] if entry[2][7] else None
-if not url and len(entry[2]) > 5:
-    yt_data = entry[2][5]
-    if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
-        url = yt_data[0]
+# Handle varying API response formats for URL extraction.
+if len(entry[2]) > 7 and isinstance(entry[2][7], list) and len(entry[2][7]) > 0 and isinstance(entry[2][7][0], str):
+    url = entry[2][7][0]
+if not url and len(entry[2]) > 5 and isinstance(entry[2][5], list) and len(entry[2][5]) > 0 and isinstance(entry[2][5][0], str):
+    url = entry[2][5][0]
+if not url and len(entry[2]) > 0 and isinstance(entry[2][0], str) and entry[2][0].startswith("http"):
+    url = entry[2][0]
References
  1. Add comments to explain complex logic, such as recursive ID extraction, especially when it addresses varying API response formats.
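The missing entry[2][0] fallback can be demonstrated with a small standalone sketch (the helper names and the payload shape are illustrative, not the repository's code):

```python
def medium_nested_url_without_index0(meta):
    # Mirrors the medium-nested path before the suggestion: no index-0 fallback.
    url = None
    if len(meta) > 7 and isinstance(meta[7], list):
        url = meta[7][0] if meta[7] else None
    if not url and len(meta) > 5 and isinstance(meta[5], list) and meta[5] and isinstance(meta[5][0], str):
        url = meta[5][0]
    return url

def medium_nested_url_with_index0(meta):
    # Same chain plus the index-0 fallback from the suggestion.
    url = medium_nested_url_without_index0(meta)
    if not url and len(meta) > 0 and isinstance(meta[0], str) and meta[0].startswith("http"):
        url = meta[0]
    return url

# Payload carrying only a plain URL string at index 0.
meta = ["https://example.com/doc", "Example Doc"]
```

Without the fallback, a payload of this shape silently yields None; with it, the URL is recovered.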


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/unit/test_types.py (1)

164-193: Add a medium-nested index-5 regression case

This test only exercises the deeply nested branch. Consider adding one medium-nested payload case so both Source.from_api_response() URL-extraction paths are locked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_types.py` around lines 164 - 193, Add a second test that
covers the medium-nested extraction path for YouTube URLs so both branches in
Source.from_api_response() are exercised: create a new test function (e.g.,
test_from_api_response_youtube_url_medium_nesting) that constructs a payload
with the medium nesting variant of the metadata array containing the YouTube
entry, call Source.from_api_response(data), and assert the resulting Source.id,
Source.url (exact YouTube URL), and Source.kind == SourceType.YOUTUBE to lock
the medium-nesting branch.
tests/integration/test_sources.py (1)

220-274: Optional: add explicit sources.get() regression for index-5 YouTube URLs

This new test validates list() well. Since the issue scope includes get(), a direct get() case would guard against future refactors that decouple get() from list().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/test_sources.py` around lines 220 - 274, Add a
complementary regression test that calls client.sources.get(...) to verify
YouTube URLs stored at src[2][5] are parsed the same as list(); create a new
async test (e.g., test_get_source_youtube_url_at_index_5) that builds the same
RPC GET_NOTEBOOK response used in test_list_sources_youtube_url_at_index_5,
mocks it via httpx_mock, opens NotebookLMClient(auth_tokens), calls await
client.sources.get("nb_123", "src_yt") (or the appropriate single-get
signature), and asserts the returned Source has id "src_yt", kind "youtube", and
url "https://www.youtube.com/watch?v=dcWU-qD8ISQ" to ensure get() handles
index-5 YouTube metadata.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/notebooklm/_sources.py`:
- Around line 108-111: The code assigns url = url_list[0] without verifying its
type; update the block where src is inspected (the branch that checks
len(src[2]) > 7 and sets url_list = src[2][7]) to validate that url_list is a
non-empty list and that its first element is a string (and optionally non-empty
after strip) before assigning to url; if the element is not a string, skip it
and continue the fallback chain (or leave url unset) so malformed payloads
cannot inject non-string values.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d42de255-271d-4422-89ce-fb4ec0dbfac6

📥 Commits

Reviewing files that changed from the base of the PR and between a997718 and 3725052.

📒 Files selected for processing (4)
  • src/notebooklm/_sources.py
  • src/notebooklm/types.py
  • tests/integration/test_sources.py
  • tests/unit/test_types.py

Comment on lines +108 to +111
if len(src[2]) > 7:
    url_list = src[2][7]
    if isinstance(url_list, list) and len(url_list) > 0:
        url = url_list[0]

⚠️ Potential issue | 🟡 Minor

Validate index-7 URL element type before assignment

At Line 110, url_list[0] is accepted without a string check. A malformed payload can set a non-string URL and bypass the fallback chain.

Proposed patch
-                    if len(src[2]) > 7:
-                        url_list = src[2][7]
-                        if isinstance(url_list, list) and len(url_list) > 0:
-                            url = url_list[0]
+                    if len(src[2]) > 7:
+                        url_list = src[2][7]
+                        if (
+                            isinstance(url_list, list)
+                            and len(url_list) > 0
+                            and isinstance(url_list[0], str)
+                        ):
+                            url = url_list[0]



Development

Successfully merging this pull request may close these issues.

Source.url is always None for YouTube sources in list() and from_api_response()

1 participant