
feat: add markdown output format for source fulltext #223

Open
jaytxng wants to merge 9 commits into teng-lin:main from jaytxng:feature/fulltext-markdown-format

@jaytxng jaytxng commented Mar 25, 2026

Adds a format parameter to get_fulltext() that lets users retrieve source
content as structured Markdown instead of flattened plaintext.

What

  • New format kwarg on client.sources.get_fulltext(): "text" (default) or "markdown"
  • New -f/--format CLI option on source fulltext
  • Markdown format preserves headings, tables, links, and emphasis from the source
  • Uses the API's HTML rendition (params [3],[3]) converted via markdownify

Why

The existing plaintext output strips all structure: headings, tables, links,
and emphasis are lost. For downstream use cases (LLM context, note-taking,
research pipelines), preserving that structure matters. Resolves #222.

Details

  • Default behavior unchanged — "text" is the default for both CLI and Python API
  • markdownify is an optional dependency (pip install notebooklm-py[markdown])
  • Added markdown extra to pyproject.toml
  • Updated cli-reference, python-api docs, and changelog

Summary by CodeRabbit

  • New Features

    • Fulltext output now supports markdown in addition to plain text via a new format option.
    • CLI command adds a -f/--format flag to choose text or markdown output.
    • Optional markdown support available via an extra dependency.
  • Documentation

    • CLI reference and Python API docs updated to document the new format option and example usage.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the notebooklm-py library and CLI by introducing the capability to retrieve source content in a structured Markdown format. Previously, only flattened plaintext was available, which stripped away valuable formatting. The new functionality allows users to access richer content, preserving elements like headings, tables, and links, which is crucial for downstream applications like LLM context generation, note-taking, and research pipelines. This change provides greater flexibility and utility for interacting with source data.

Highlights

  • Structured Markdown Output: Introduced a new format parameter to client.sources.get_fulltext() in the Python API, allowing users to retrieve source content as structured Markdown instead of flattened plaintext.
  • CLI Integration: Added a new -f/--format CLI option to the source fulltext command, enabling command-line users to specify the desired output format.
  • Content Preservation: The new Markdown format preserves structural elements such as headings, tables, links, and emphasis, which are lost in the default plaintext output.
  • Optional Dependency: Integrated markdownify as an optional dependency, requiring pip install notebooklm-py[markdown] for Markdown functionality.


coderabbitai bot commented Mar 25, 2026

📝 Walkthrough


Adds a format parameter to SourcesAPI.get_fulltext() (defaults to "text", supports "markdown") and a --format/-f CLI option for source fulltext. Markdown output converts HTML to Markdown via an optional markdownify extra (markdownify>=0.14.1).

Changes

Cohort / File(s) | Summary
Documentation: CHANGELOG.md, docs/cli-reference.md, docs/python-api.md | Documented new format parameter for get_fulltext() with values "text" or "markdown", added CLI -f option and updated examples.
Dependencies: pyproject.toml | Added optional extra markdown that installs markdownify>=0.14.1.
API Implementation: src/notebooklm/_sources.py | Added SourcesAPI.get_fulltext(..., *, format: str = "text"). Validates format; for "text" extracts and joins plaintext blocks, for "markdown" requests HTML and converts it to Markdown using markdownify (raises ImportError with install hint if missing).
CLI Integration: src/notebooklm/cli/source.py | source fulltext command gains --format/-f option (text

Sequence Diagram(s)

sequenceDiagram
participant CLI as "CLI (user)"
participant Client as "NotebookLM Client\nSourcesAPI.get_fulltext"
participant RPC as "Backend RPC\nGET_SOURCE"
participant Markdownify as "markdownify (optional)"

CLI->>Client: request fulltext(source_id, format)
Client->>RPC: GET_SOURCE(params depending on format)
RPC-->>Client: response (plaintext blocks OR HTML)
alt format == "text"
Client->>Client: extract and join text blocks
else format == "markdown"
Client->>Markdownify: convert HTML -> Markdown
Markdownify-->>Client: markdown content
end
Client-->>CLI: SourceFulltext { content, char_count, ... }

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through tags and lines today,
Turned HTML crumbs into Markdown play,
Text or markdown, pick your way,
A tiny extra in the fray,
I nibble bytes and bound away ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main change: adding markdown output format support for the source fulltext feature.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the capability to retrieve source fulltext content in Markdown format, in addition to the existing plain text format. This feature is exposed via a new format parameter in the get_fulltext method of the Python API and a corresponding -f/--format option in the CLI. The Markdown conversion leverages the markdownify library, which is now an optional dependency. The review comments suggest improving error handling for the markdownify dependency, clarifying installation instructions in the docstrings, and adding input validation for the format parameter to ensure robust behavior.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/notebooklm/_sources.py (1)

669-698: ⚠️ Potential issue | 🟡 Minor

Reject unsupported format values instead of silently returning text.

Line 698 currently treats every value other than "markdown" as "text", so typos like "md" quietly produce the wrong output. This is an edge-case correctness gap worth failing fast on.

🩹 Suggested fix
+        if format not in {"text", "markdown"}:
+            raise ValidationError("format must be 'text' or 'markdown'")
         # [3],[3] returns HTML at result[4][1]; [2],[2] returns plaintext at result[3][0]
         params = [[source_id], [3], [3]] if format == "markdown" else [[source_id], [2], [2]]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/notebooklm/_sources.py` around lines 669 - 698, The get_fulltext method
currently treats any format other than "markdown" as "text"; update get_fulltext
to validate the format parameter (accept only "text" and "markdown") and fail
fast by raising a clear exception (e.g., ValueError) when an unsupported value
is passed. Locate the format parameter check in get_fulltext (where params is
set) and add a guard before computing params that verifies format is one of the
allowed strings, raising an error with a message referencing the invalid format.
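
A minimal sketch of the guard described above, assuming only the two parameter shapes quoted in the suggested fix ([3],[3] for HTML, [2],[2] for plaintext). The helper name build_get_source_params and the _FORMAT_CODES table are hypothetical and not part of the PR; the real method may structure this differently.

```python
# Hypothetical helper; the parameter shapes come from the review comment above.
_FORMAT_CODES = {"text": 2, "markdown": 3}


def build_get_source_params(source_id: str, format: str = "text") -> list:
    """Return GET_SOURCE params, rejecting unsupported formats up front."""
    if format not in _FORMAT_CODES:
        allowed = ", ".join(repr(name) for name in sorted(_FORMAT_CODES))
        raise ValueError(f"format must be one of {allowed}; got {format!r}")
    code = _FORMAT_CODES[format]
    return [[source_id], [code], [code]]
```

With this shape a typo such as "md" raises immediately instead of falling through to the plaintext branch.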
src/notebooklm/cli/source.py (1)

628-661: ⚠️ Potential issue | 🟠 Major

Disable markup parsing for user-generated content in markdown output.

When using --format markdown, the fulltext.content passed to console.print() on lines 683 and 688 contains markdown syntax with brackets (links, tables, etc.). Since Rich's Console defaults to markup=True, it parses [] as markup tags, corrupting the markdown output. Set markup=False on both calls to display the content faithfully.

Fix
             # Show first 2000 chars with truncation notice
             if len(fulltext.content) > 2000:
-                console.print(fulltext.content[:2000])
+                console.print(fulltext.content[:2000], markup=False)
                 console.print(
                     f"\n[dim]... ({fulltext.char_count - 2000:,} more chars, use -o to save full content)[/dim]"
                 )
             else:
-                console.print(fulltext.content)
+                console.print(fulltext.content, markup=False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/notebooklm/cli/source.py` around lines 628 - 661, The markdown output is
being corrupted because Rich's Console defaults to markup=True when printing
user-generated markdown; in source_fulltext, when content_format == "markdown"
ensure the two console.print(...) calls that print fulltext.content set
markup=False so Rich does not interpret brackets as markup (locate the prints
that use fulltext.content in the source_fulltext function and add markup=False
to both calls).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/cli-reference.md`:
- Line 100: The CLI docs row for "fulltext <id>" shows the example using `-f
markdown` but doesn't mention the optional dependency; update the table entry or
add a short parenthetical note stating that the `-f markdown` formatter requires
the optional "markdown" extra (from pyproject.toml) to be installed, e.g.,
instruct users to install the package with the `markdown` extra before using `-f
markdown` to avoid ModuleNotFoundError; reference the `fulltext <id>` command
and the `-f markdown` flag so readers can easily find the doc line to edit.

In `@docs/python-api.md`:
- Line 236: Update the docs row for get_fulltext to indicate the third parameter
is keyword-only: change the parameters column that currently reads "str, str,
str" to reflect "str, str, *, format: str" (or similar keyword-only notation) so
it matches the function signature and implementation of
get_fulltext(notebook_id, source_id, *, format="text") and prevents callers from
passing format positionally.

In `@pyproject.toml`:
- Line 37: The extras configuration is missing the new "markdown" extra from the
aggregate "all" extra; update the [project.optional-dependencies] section so
that the "all" extra includes "markdown" (which provides the markdownify>=0.14.1
requirement) along with the existing "browser" and "dev" entries, ensuring pip
install notebooklm-py[all] pulls in markdownify and the new markdown fulltext
path works at runtime.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f75e0166-2504-4fbe-8582-7791787dcc93

📥 Commits

Reviewing files that changed from the base of the PR and between ddb4dee and 6753bc2.

📒 Files selected for processing (6)
  • CHANGELOG.md
  • docs/cli-reference.md
  • docs/python-api.md
  • pyproject.toml
  • src/notebooklm/_sources.py
  • src/notebooklm/cli/source.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/notebooklm/_sources.py`:
- Around line 669-671: Rename the get_fulltext parameter named format to avoid
shadowing the built-in (e.g., output_format) and update its type to a Literal
union ("text" | "markdown") for stricter typing; inside get_fulltext add an
early validation that raises ValueError if output_format is not "text" or
"markdown" so invalid values (e.g., "json") don't silently return empty content.
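
One way the rename and stricter typing could look, as a sketch only. get_fulltext_stub is a hypothetical stand-in that echoes the chosen format rather than calling the API; the OutputFormat alias is likewise an assumption. Literal is enforced only by type checkers, so the runtime check is still needed.

```python
from typing import Literal, get_args

# Hypothetical alias for the two supported values.
OutputFormat = Literal["text", "markdown"]


def get_fulltext_stub(
    notebook_id: str,
    source_id: str,
    *,
    output_format: OutputFormat = "text",
) -> str:
    """Stand-in for the real method: validates and echoes the chosen format."""
    # Literal is only checked statically, so validate at runtime as well.
    if output_format not in get_args(OutputFormat):
        raise ValueError(
            f"output_format must be 'text' or 'markdown'; got {output_format!r}"
        )
    return output_format
```

Because of the bare `*`, passing the format positionally raises TypeError, which matches the keyword-only signature the docs comment asks for.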

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e5b722a6-7117-43e2-8963-fda63d2c9f18

📥 Commits

Reviewing files that changed from the base of the PR and between 6753bc2 and 2f2c5c8.

📒 Files selected for processing (1)
  • src/notebooklm/_sources.py

@teng-lin
Owner

Thanks for this PR, @jaytxng! The markdown format support is a great addition — clean API design, proper optional dependency handling, and thorough documentation updates. Nice work.

I ran a multi-model review (Claude, Gemini, Codex) and here are the consolidated findings:


Recommendations

1. Add tests (all 3 reviewers flagged this)

The new RPC/parse branch has no coverage. At minimum:

  • Verify [[source_id], [3], [3]] params are sent for output_format="markdown"
  • Verify HTML at result[4][1] is parsed and converted
  • Verify ImportError with helpful message when markdownify is missing
  • Verify ValueError on invalid output_format
  • Verify CLI -f markdown passes through correctly
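
As a starting point for the ImportError case, a sketch of how a test can simulate a missing markdownify regardless of whether it is installed. html_to_markdown is a hypothetical converter written for this example (the PR's real code lives inside get_fulltext), and the error wording is assumed.

```python
import sys
from unittest import mock


def html_to_markdown(html: str) -> str:
    """Hypothetical converter with a lazy import and an install hint."""
    try:
        from markdownify import markdownify as md
    except ImportError:
        raise ImportError(
            "Markdown output requires markdownify: "
            "pip install 'notebooklm-py[markdown]'"
        ) from None
    return md(html)


def test_missing_markdownify_gives_install_hint() -> None:
    # A None entry in sys.modules makes `import markdownify` raise ImportError,
    # whether or not the package is actually installed.
    with mock.patch.dict(sys.modules, {"markdownify": None}):
        try:
            html_to_markdown("<h1>Title</h1>")
        except ImportError as exc:
            assert "notebooklm-py[markdown]" in str(exc)
        else:
            raise AssertionError("expected ImportError")
```

mock.patch.dict restores sys.modules on exit, so the test does not affect other tests that need the real package.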

2. Rich markup eats [brackets] in terminal output

console.print(fulltext.content) passes raw markdown through Rich, which interprets [text] as markup tags — brackets silently disappear. Consider using console.print(Text(fulltext.content)) or rich.markup.escape() for the content display path. File output (-o) is unaffected.

The same issue affects the ImportError message — pip install 'notebooklm-py[markdown]' will render as pip install 'notebooklm-py' when Rich prints it, losing the crucial [markdown] extra name.

3. Fail fast on missing markdownify

Currently the markdownify import check happens after the (potentially slow) RPC call. Moving it before the network request gives users immediate feedback:

if output_format == "markdown":
    try:
        from markdownify import markdownify as md
    except ImportError:
        raise ImportError(...) from None

# then proceed with RPC call...
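
A slightly fuller sketch of the same ordering, assuming the dependency check can be done with importlib.util.find_spec. Both function names, the rpc_call callable, and the markdown_module parameter are hypothetical; they exist here only so the ordering can be exercised without a network call.

```python
import importlib.util
from typing import Callable


def ensure_module_available(module_name: str, install_hint: str) -> None:
    """Raise ImportError if a top-level module cannot be found."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(f"{module_name} is required: {install_hint}")


def fetch_fulltext(
    rpc_call: Callable[[str, str], str],
    source_id: str,
    output_format: str = "text",
    *,
    markdown_module: str = "markdownify",
) -> str:
    """Check the optional dependency first, then make the (slow) RPC call."""
    if output_format == "markdown":
        ensure_module_available(
            markdown_module, "pip install 'notebooklm-py[markdown]'"
        )
    return rpc_call(source_id, output_format)
```

If the module is missing, rpc_call is never invoked, so the user sees the install hint without waiting on the request.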

4. CHANGELOG.md parameter name mismatch

Line 12 says format parameter but the actual kwarg is output_format. Copy-pasting the changelog example would produce a TypeError:

- - New `format` parameter on `client.sources.get_fulltext()`
+ - New `output_format` parameter on `client.sources.get_fulltext()`

5. (Minor) More specific warning when markdown HTML is missing

When result[4][1] isn't present in markdown mode, a targeted log message like "Markdown HTML not found at result[4][1] — API response structure may have changed" would help debug future breakage (this is an undocumented RPC after all).
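
A sketch of what that targeted warning might look like, assuming the HTML sits at result[4][1] as stated in this thread. extract_markdown_html is a hypothetical helper, and the log wording is adapted from the suggestion above.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def extract_markdown_html(result: Any) -> Optional[str]:
    """Return the HTML string at result[4][1], or None with a targeted warning."""
    try:
        html = result[4][1]
    except (IndexError, TypeError, KeyError):
        html = None
    if not isinstance(html, str) or not html:
        logger.warning(
            "Markdown HTML not found at result[4][1]; "
            "API response structure may have changed"
        )
        return None
    return html
```

Returning None rather than raising keeps the caller free to decide whether an empty result is an error; that choice is an assumption of this sketch.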


None of these are blocking on their own, but #1 (tests) and #2 (Rich markup) are the most impactful to address before merge. Everything else about the PR looks solid — backward-compatible default, proper Literal typing, clean CLI integration. 👍


Development

Successfully merging this pull request may close these issues.

Markdown support for deep research report

2 participants