feat: add markdown output format for source fulltext#223
feat: add markdown output format for source fulltext#223jaytxng wants to merge 9 commits intoteng-lin:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
📝 WalkthroughWalkthroughAdds a Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as "CLI (user)"
participant Client as "NotebookLM Client\nSourcesAPI.get_fulltext"
participant RPC as "Backend RPC\nGET_SOURCE"
participant Markdownify as "markdownify (optional)"
CLI->>Client: request fulltext(source_id, format)
Client->>RPC: GET_SOURCE(params depending on format)
RPC-->>Client: response (plaintext blocks OR HTML)
alt format == "text"
Client->>Client: extract and join text blocks
else format == "markdown"
Client->>Markdownify: convert HTML -> Markdown
Markdownify-->>Client: markdown content
end
Client-->>CLI: SourceFulltext { content, char_count, ... }
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the capability to retrieve source fulltext content in Markdown format, in addition to the existing plain text format. This feature is exposed via a new format parameter in the get_fulltext method of the Python API and a corresponding -f/--format option in the CLI. The Markdown conversion leverages the markdownify library, which is now an optional dependency. The review comments suggest improving error handling for the markdownify dependency, clarifying installation instructions in the docstrings, and adding input validation for the format parameter to ensure robust behavior.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/notebooklm/_sources.py (1)
669-698:⚠️ Potential issue | 🟡 MinorReject unsupported
formatvalues instead of silently returning text.Line 698 currently treats every value other than
"markdown"as"text", so typos like"md"quietly produce the wrong output. This is an edge-case correctness gap worth failing fast on.🩹 Suggested fix
+ if format not in {"text", "markdown"}: + raise ValidationError("format must be 'text' or 'markdown'") # [3],[3] returns HTML at result[4][1]; [2],[2] returns plaintext at result[3][0] params = [[source_id], [3], [3]] if format == "markdown" else [[source_id], [2], [2]]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/notebooklm/_sources.py` around lines 669 - 698, The get_fulltext method currently treats any format other than "markdown" as "text"; update get_fulltext to validate the format parameter (accept only "text" and "markdown") and fail fast by raising a clear exception (e.g., ValueError) when an unsupported value is passed. Locate the format parameter check in get_fulltext (where params is set) and add a guard before computing params that verifies format is one of the allowed strings, raising an error with a message referencing the invalid format.src/notebooklm/cli/source.py (1)
628-661:⚠️ Potential issue | 🟠 MajorDisable markup parsing for user-generated content in markdown output.
When using
--format markdown, thefulltext.contentpassed toconsole.print()on lines 683 and 688 contains markdown syntax with brackets (links, tables, etc.). Since Rich's Console defaults tomarkup=True, it parses[]as markup tags, corrupting the markdown output. Setmarkup=Falseon both calls to display the content faithfully.Fix
# Show first 2000 chars with truncation notice if len(fulltext.content) > 2000: - console.print(fulltext.content[:2000]) + console.print(fulltext.content[:2000], markup=False) console.print( f"\n[dim]... ({fulltext.char_count - 2000:,} more chars, use -o to save full content)[/dim]" ) else: - console.print(fulltext.content) + console.print(fulltext.content, markup=False)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/notebooklm/cli/source.py` around lines 628 - 661, The markdown output is being corrupted because Rich's Console defaults to markup=True when printing user-generated markdown; in source_fulltext, when content_format == "markdown" ensure the two console.print(...) calls that print fulltext.content set markup=False so Rich does not interpret brackets as markup (locate the prints that use fulltext.content in the source_fulltext function and add markup=False to both calls).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/cli-reference.md`:
- Line 100: The CLI docs row for "fulltext <id>" shows the example using `-f
markdown` but doesn't mention the optional dependency; update the table entry or
add a short parenthetical note stating that the `-f markdown` formatter requires
the optional "markdown" extra (from pyproject.toml) to be installed, e.g.,
instruct users to install the package with the `markdown` extra before using `-f
markdown` to avoid ModuleNotFoundError; reference the `fulltext <id>` command
and the `-f markdown` flag so readers can easily find the doc line to edit.
In `@docs/python-api.md`:
- Line 236: Update the docs row for get_fulltext to indicate the third parameter
is keyword-only: change the parameters column that currently reads "str, str,
str" to reflect "str, str, *, format: str" (or similar keyword-only notation) so
it matches the function signature and implementation of
get_fulltext(notebook_id, source_id, *, format="text") and prevents callers from
passing format positionally.
In `@pyproject.toml`:
- Line 37: The extras configuration is missing the new "markdown" extra from the
aggregate "all" extra; update the [project.optional-dependencies] section so
that the "all" extra includes "markdown" (which provides the markdownify>=0.14.1
requirement) along with the existing "browser" and "dev" entries, ensuring pip
install notebooklm-py[all] pulls in markdownify and the new markdown fulltext
path works at runtime.
---
Outside diff comments:
In `@src/notebooklm/_sources.py`:
- Around line 669-698: The get_fulltext method currently treats any format other
than "markdown" as "text"; update get_fulltext to validate the format parameter
(accept only "text" and "markdown") and fail fast by raising a clear exception
(e.g., ValueError) when an unsupported value is passed. Locate the format
parameter check in get_fulltext (where params is set) and add a guard before
computing params that verifies format is one of the allowed strings, raising an
error with a message referencing the invalid format.
In `@src/notebooklm/cli/source.py`:
- Around line 628-661: The markdown output is being corrupted because Rich's
Console defaults to markup=True when printing user-generated markdown; in
source_fulltext, when content_format == "markdown" ensure the two
console.print(...) calls that print fulltext.content set markup=False so Rich
does not interpret brackets as markup (locate the prints that use
fulltext.content in the source_fulltext function and add markup=False to both
calls).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f75e0166-2504-4fbe-8582-7791787dcc93
📒 Files selected for processing (6)
CHANGELOG.mddocs/cli-reference.mddocs/python-api.mdpyproject.tomlsrc/notebooklm/_sources.pysrc/notebooklm/cli/source.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/notebooklm/_sources.py`:
- Around line 669-671: Rename the get_fulltext parameter named format to avoid
shadowing the built-in (e.g., output_format) and update its type to a Literal
union ("text" | "markdown") for stricter typing; inside get_fulltext add an
early validation that raises ValueError if output_format is not "text" or
"markdown" so invalid values (e.g., "json") don't silently return empty content.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e5b722a6-7117-43e2-8963-fda63d2c9f18
📒 Files selected for processing (1)
src/notebooklm/_sources.py
|
Thanks for this PR, @jaytxng! The markdown format support is a great addition — clean API design, proper optional dependency handling, and thorough documentation updates. Nice work. I ran a multi-model review (Claude, Gemini, Codex) and here are the consolidated findings: Recommendations1. Add tests (all 3 reviewers flagged this)The new RPC/parse branch has no coverage. At minimum:
2. Rich markup eats
|
Adds a
formatparameter toget_fulltext()that lets users retrieve sourcecontent as structured Markdown instead of flattened plaintext.
What
formatkwarg onclient.sources.get_fulltext()—"text"(default) or"markdown"-f/--formatCLI option onsource fulltext[3],[3]) converted viamarkdownifyWhy
The existing plaintext output strips all structure — headings, tables, links,
emphasis are lost. For downstream use cases (LLM context, note-taking, research
pipelines), preserving that structure matters. resolves #222
Details
"text"is the default for both CLI and Python APImarkdownifyis an optional dependency (pip install notebooklm-py[markdown])markdownextra topyproject.tomlSummary by CodeRabbit
New Features
Documentation