fix: extract YouTube source URL from src[2][5] in list() and from_api_response() #267

octo-patch wants to merge 2 commits into teng-lin:main from
Conversation
…_response() (fixes teng-lin#265)
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)

📝 Walkthrough
Extended URL extraction logic in `_sources.py` and `types.py`.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ Passed checks (5 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Code Review
This pull request updates the source parsing logic in `_sources.py` and `types.py` to correctly extract YouTube URLs from the API response; for YouTube sources these are located at index 5 rather than index 7. It also introduces a fallback for URLs found at index 0 and includes unit and integration tests. Review feedback suggests hardening the extraction by adding explicit string type checks for all URL indices and keeping the logic consistent across the different parsing paths.
```python
if len(src[2]) > 7:
    url_list = src[2][7]
    if isinstance(url_list, list) and len(url_list) > 0:
        url = url_list[0]
if not url and len(src[2]) > 5:
    yt_data = src[2][5]
    if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
        url = yt_data[0]
if not url and len(src[2]) > 0:
    if isinstance(src[2][0], str) and src[2][0].startswith("http"):
        url = src[2][0]
```
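Read end to end, the snippet is a three-step fallback chain over the metadata list `src[2]`: index 7 (web/PDF URLs), then index 5 (YouTube metadata), then index 0 (a bare URL string). A self-contained sketch of that chain, where `extract_source_url` is an invented name for logic the PR keeps inline in `list()`:

```python
def extract_source_url(meta):
    """Return the first URL found in a source metadata list, or None.

    Mirrors the snippet above: check index 7 first, then index 5,
    then fall back to a bare URL string at index 0.
    """
    url = None
    if len(meta) > 7:
        url_list = meta[7]
        if isinstance(url_list, list) and len(url_list) > 0:
            url = url_list[0]
    if not url and len(meta) > 5:
        yt_data = meta[5]
        if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
            url = yt_data[0]
    if not url and len(meta) > 0:
        if isinstance(meta[0], str) and meta[0].startswith("http"):
            url = meta[0]
    return url

# YouTube metadata at index 5, nothing usable at index 7
meta = [None, None, None, None, None,
        ["https://www.youtube.com/watch?v=dcWU-qD8ISQ", "dcWU-qD8ISQ", "Channel"],
        None, None]
print(extract_source_url(meta))  # https://www.youtube.com/watch?v=dcWU-qD8ISQ
```

The sample payload shape here is illustrative only; the real API response nests this list inside larger structures.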
The extraction logic for src[2][7] should include a type check for the first element (ensuring it is a string), consistent with the validation added for YouTube URLs at src[2][5]. This prevents potential type errors if the API returns non-string data. Additionally, per repository rules, please add a comment explaining this logic as it addresses varying API response formats.
Suggested change:

```diff
-if len(src[2]) > 7:
-    url_list = src[2][7]
-    if isinstance(url_list, list) and len(url_list) > 0:
-        url = url_list[0]
-if not url and len(src[2]) > 5:
-    yt_data = src[2][5]
-    if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
-        url = yt_data[0]
-if not url and len(src[2]) > 0:
-    if isinstance(src[2][0], str) and src[2][0].startswith("http"):
-        url = src[2][0]
+# Handle varying API response formats for URL extraction.
+if len(src[2]) > 7 and isinstance(src[2][7], list) and len(src[2][7]) > 0 and isinstance(src[2][7][0], str):
+    url = src[2][7][0]
+if not url and len(src[2]) > 5 and isinstance(src[2][5], list) and len(src[2][5]) > 0 and isinstance(src[2][5][0], str):
+    url = src[2][5][0]
+if not url and len(src[2]) > 0:
+    if isinstance(src[2][0], str) and src[2][0].startswith("http"):
+        url = src[2][0]
```
References
- Add comments to explain complex logic, such as recursive ID extraction, especially when it addresses varying API response formats.
```python
if len(entry[2]) > 7 and isinstance(entry[2][7], list):
    url = entry[2][7][0] if entry[2][7] else None
if not url and len(entry[2]) > 5:
    yt_data = entry[2][5]
    if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
        url = yt_data[0]
```
The medium-nested parsing path is missing the fallback check for entry[2][0] starting with "http", which is present in the deeply-nested path and in _sources.py. Additionally, the extraction of entry[2][7] should be made more robust by verifying that the first element is a string. Please include a comment explaining this logic to handle varying API response formats.
Suggested change:

```diff
-if len(entry[2]) > 7 and isinstance(entry[2][7], list):
-    url = entry[2][7][0] if entry[2][7] else None
-if not url and len(entry[2]) > 5:
-    yt_data = entry[2][5]
-    if isinstance(yt_data, list) and len(yt_data) > 0 and isinstance(yt_data[0], str):
-        url = yt_data[0]
+# Handle varying API response formats for URL extraction.
+if len(entry[2]) > 7 and isinstance(entry[2][7], list) and len(entry[2][7]) > 0 and isinstance(entry[2][7][0], str):
+    url = entry[2][7][0]
+if not url and len(entry[2]) > 5 and isinstance(entry[2][5], list) and len(entry[2][5]) > 0 and isinstance(entry[2][5][0], str):
+    url = entry[2][5][0]
+if not url and len(entry[2]) > 0 and isinstance(entry[2][0], str) and entry[2][0].startswith("http"):
+    url = entry[2][0]
```
References
- Add comments to explain complex logic, such as recursive ID extraction, especially when it addresses varying API response formats.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/unit/test_types.py (1)

164-193: Add a medium-nested index-5 regression case

This test only exercises the deeply nested branch. Consider adding one medium-nested payload case so both `Source.from_api_response()` URL-extraction paths are locked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_types.py` around lines 164 - 193, add a second test that covers the medium-nested extraction path for YouTube URLs so both branches in Source.from_api_response() are exercised: create a new test function (e.g., test_from_api_response_youtube_url_medium_nesting) that constructs a payload with the medium nesting variant of the metadata array containing the YouTube entry, call Source.from_api_response(data), and assert the resulting Source.id, Source.url (exact YouTube URL), and Source.kind == SourceType.YOUTUBE to lock the medium-nesting branch.

tests/integration/test_sources.py (1)

220-274: Optional: add an explicit `sources.get()` regression for index-5 YouTube URLs

This new test validates `list()` well. Since the issue scope includes `get()`, a direct `get()` case would guard against future refactors that decouple `get()` from `list()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/test_sources.py` around lines 220 - 274, add a complementary regression test that calls client.sources.get(...) to verify YouTube URLs stored at src[2][5] are parsed the same as list(); create a new async test (e.g., test_get_source_youtube_url_at_index_5) that builds the same RPC GET_NOTEBOOK response used in test_list_sources_youtube_url_at_index_5, mocks it via httpx_mock, opens NotebookLMClient(auth_tokens), calls await client.sources.get("nb_123", "src_yt") (or the appropriate single-get signature), and asserts the returned Source has id "src_yt", kind "youtube", and url "https://www.youtube.com/watch?v=dcWU-qD8ISQ" to ensure get() handles index-5 YouTube metadata.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/notebooklm/_sources.py`:
- Around line 108-111: The code assigns url = url_list[0] without verifying its
type; update the block where src is inspected (the branch that checks
len(src[2]) > 7 and sets url_list = src[2][7]) to validate that url_list is a
non-empty list and that its first element is a string (and optionally non-empty
after strip) before assigning to url; if the element is not a string, skip it
and continue the fallback chain (or leave url unset) so malformed payloads
cannot inject non-string values.
---
Nitpick comments:
In `@tests/integration/test_sources.py`:
- Around line 220-274: Add a complementary regression test that calls
client.sources.get(...) to verify YouTube URLs stored at src[2][5] are parsed
the same as list(); create a new async test (e.g.,
test_get_source_youtube_url_at_index_5) that builds the same RPC GET_NOTEBOOK
response used in test_list_sources_youtube_url_at_index_5, mocks it via
httpx_mock, opens NotebookLMClient(auth_tokens), calls await
client.sources.get("nb_123", "src_yt") (or the appropriate single-get
signature), and asserts the returned Source has id "src_yt", kind "youtube", and
url "https://www.youtube.com/watch?v=dcWU-qD8ISQ" to ensure get() handles
index-5 YouTube metadata.
In `@tests/unit/test_types.py`:
- Around line 164-193: Add a second test that covers the medium-nested
extraction path for YouTube URLs so both branches in Source.from_api_response()
are exercised: create a new test function (e.g.,
test_from_api_response_youtube_url_medium_nesting) that constructs a payload
with the medium nesting variant of the metadata array containing the YouTube
entry, call Source.from_api_response(data), and assert the resulting Source.id,
Source.url (exact YouTube URL), and Source.kind == SourceType.YOUTUBE to lock
the medium-nesting branch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d42de255-271d-4422-89ce-fb4ec0dbfac6
📒 Files selected for processing (4)
- src/notebooklm/_sources.py
- src/notebooklm/types.py
- tests/integration/test_sources.py
- tests/unit/test_types.py
```python
if len(src[2]) > 7:
    url_list = src[2][7]
    if isinstance(url_list, list) and len(url_list) > 0:
        url = url_list[0]
```
Validate index-7 URL element type before assignment
At Line 110, url_list[0] is accepted without a string check. A malformed payload can set a non-string URL and bypass the fallback chain.
Proposed patch

```diff
-if len(src[2]) > 7:
-    url_list = src[2][7]
-    if isinstance(url_list, list) and len(url_list) > 0:
-        url = url_list[0]
+if len(src[2]) > 7:
+    url_list = src[2][7]
+    if (
+        isinstance(url_list, list)
+        and len(url_list) > 0
+        and isinstance(url_list[0], str)
+    ):
+        url = url_list[0]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/notebooklm/_sources.py` around lines 108 - 111, The code assigns url =
url_list[0] without verifying its type; update the block where src is inspected
(the branch that checks len(src[2]) > 7 and sets url_list = src[2][7]) to
validate that url_list is a non-empty list and that its first element is a
string (and optionally non-empty after strip) before assigning to url; if the
element is not a string, skip it and continue the fallback chain (or leave url
unset) so malformed payloads cannot inject non-string values.
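The fall-through behavior this prompt asks for can be exercised in isolation: a non-string element at index 7 is skipped rather than assigned, so the chain continues to index 5. A stand-in sketch (not the library code; `extract_url_hardened` is an invented name) with string checks at every index:

```python
def extract_url_hardened(meta):
    """Fallback chain with a string check at every URL index."""
    url = None
    if (len(meta) > 7 and isinstance(meta[7], list)
            and len(meta[7]) > 0 and isinstance(meta[7][0], str)):
        url = meta[7][0]
    if not url and (len(meta) > 5 and isinstance(meta[5], list)
                    and len(meta[5]) > 0 and isinstance(meta[5][0], str)):
        url = meta[5][0]
    if not url and len(meta) > 0 and isinstance(meta[0], str) and meta[0].startswith("http"):
        url = meta[0]
    return url

# Malformed payload: non-string at index 7 is skipped, index 5 wins
meta = [None] * 8
meta[7] = [12345]  # not a string, so this entry cannot become the URL
meta[5] = ["https://www.youtube.com/watch?v=dcWU-qD8ISQ", "dcWU-qD8ISQ", "Channel"]
print(extract_url_hardened(meta))  # https://www.youtube.com/watch?v=dcWU-qD8ISQ
```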
Fixes #265

Problem

`Source.url` is always `None` for YouTube sources when retrieved via `client.sources.list()` or `client.sources.get()`. Web page and uploaded file sources return their URLs correctly.

The NotebookLM API stores YouTube URLs at `src[2][5]` as `[url, video_id, channel_name]`, while web/PDF source URLs are stored at `src[2][7]`. The existing URL extraction logic only checked `src[2][7]`, so YouTube source URLs were never extracted.

Solution

Added a check for `src[2][5]` between the existing `[7]` and `[0]` checks in all URL extraction paths:

- `_sources.py`, `list()` method: added a fallback to `src[2][5]` when `src[2][7]` is `None`
- `types.py`, `Source.from_api_response()`: added the same fallback in both the medium-nested and deeply-nested parsing paths

The fix validates that `src[2][5]` is a list with at least one element that is a string before using it as a URL.

Testing

- `tests/unit/test_types.py`: `test_from_api_response_youtube_url_at_index_5` verifies that `Source.from_api_response()` correctly extracts YouTube URLs stored at `entry[2][5]`
- `tests/integration/test_sources.py`: `test_list_sources_youtube_url_at_index_5` verifies that `sources.list()` correctly extracts YouTube URLs from the `src[2][5]` position

Summary by CodeRabbit

Bug Fixes

Tests
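The `[url, video_id, channel_name]` layout described for `src[2][5]` reduces to a small guard before use. A minimal sketch (the helper name is hypothetical, not the library's API; only the first field matters for `Source.url`):

```python
def youtube_url_from_meta(yt_meta):
    # src[2][5] is described as [url, video_id, channel_name]; use only
    # the first element, and only if it is actually a string.
    if isinstance(yt_meta, list) and len(yt_meta) > 0 and isinstance(yt_meta[0], str):
        return yt_meta[0]
    return None

print(youtube_url_from_meta(
    ["https://www.youtube.com/watch?v=dcWU-qD8ISQ", "dcWU-qD8ISQ", "Some Channel"]
))  # https://www.youtube.com/watch?v=dcWU-qD8ISQ
print(youtube_url_from_meta(None))  # None
```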