Adds a product docs toolset to the dbt MCP#620

Merged
b-per merged 5 commits into main from add/public-docs
Mar 10, 2026

@mirnawong1 (Contributor) commented Mar 2, 2026

Summary

This PR adds a product_docs toolset to the dbt MCP server, giving AI agents real-time access to the public dbt documentation at docs.getdbt.com. The toolset exposes two MCP tools with a clean separation of concerns: search for pages, then fetch their content.

Product docs tools are included by default when you run dbt-mcp; use DISABLE_PRODUCT_DOCS=true to turn them off, or DBT_MCP_ENABLE_PRODUCT_DOCS=true when using an allowlist (other DBT_MCP_ENABLE_* vars).

Flow

User asks a question about dbt
            │
            ▼
┌─────────────────────────┐
│ 1. search_product_docs  │  Agent sends keywords.
│                         │  Tool searches the llms.txt index (title/description matching).
│                         │  If fewer than 3 results, runs expand_keywords(query) to get
│                         │  synonyms/abbreviations (e.g. "udf" → "user-defined function"),
│                         │  then makes a second search_index call with the expanded query.
│                         │  Merges and deduplicates results by URL.
│                         │  Never touches llms-full.txt.
│                         │  Returns: titles, URLs, descriptions (NO content).
└───────────┬─────────────┘
            │
            │  Agent now has URLs for relevant pages
            │
            ▼
┌─────────────────────────────────────────────────┐
│ 2. get_product_doc_pages([paths])               │
│ Fetches 1–10 pages in parallel as Markdown      │
└─────────────────────────────────────────────────┘
            │
            ▼
Agent synthesizes answer from full page content
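
For reviewers, the search step in the flow above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `IndexEntry` type, the contents of `EXPANSIONS`, and the naive substring matching are assumptions made for the example; only the names `expand_keywords`, `search_index`, and the "fewer than 3 results" threshold come from the description.

```python
from dataclasses import dataclass

MIN_RESULTS = 3  # threshold named in the flow above

# Hypothetical expansion table, for illustration only.
EXPANSIONS = {
    "udf": "user-defined function",
    "ci": "continuous integration",
    "sl": "semantic layer",
}


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical shape of one llms.txt index entry."""

    title: str
    url: str
    description: str


def expand_keywords(query: str) -> str:
    """Append the long form of every known abbreviation in the query."""
    words = query.lower().split()
    extra = [EXPANSIONS[w] for w in words if w in EXPANSIONS]
    return " ".join(words + extra)


def search_index(index: list[IndexEntry], query: str) -> list[IndexEntry]:
    """Naive substring match of any keyword against title + description."""
    keywords = query.lower().split()
    return [
        entry
        for entry in index
        if any(k in f"{entry.title} {entry.description}".lower() for k in keywords)
    ]


def search_product_docs(index: list[IndexEntry], query: str) -> list[IndexEntry]:
    """Search once, retry with expanded keywords if needed, dedupe by URL."""
    results = search_index(index, query)
    if len(results) < MIN_RESULTS:
        results += search_index(index, expand_keywords(query))
    unique: dict[str, IndexEntry] = {}
    for entry in results:
        unique.setdefault(entry.url, entry)  # keep the first hit per URL
    return list(unique.values())
```

The second pass only ever reads the same in-memory index, which is the point of the design: recall improves for abbreviated queries without downloading the full corpus.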

What changed

Two-tool surface (src/dbt_mcp/product_docs/tools.py)

Designed with minimal tool count to reduce LLM confusion and MCP context overhead:

  • search_product_docs — keyword search over the llms.txt index (returns metadata: titles, URLs, descriptions). Includes automatic full-text fallback via llms-full.txt when keyword search finds fewer than 3 results.
  • get_product_doc_pages — fetch one or more pages by path or URL as Markdown (up to 10 in parallel). A single tool handles both single-page and multi-page fetching — pass a list of one path or many.
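
A rough sketch of how a single list-based tool can cover both the one-page and many-page case. The `fetch_pages` helper and its `fetch_one` parameter are hypothetical names for this example, and silently truncating to the first 10 paths is an assumption; the PR may reject oversized requests instead.

```python
import asyncio
from typing import Awaitable, Callable

MAX_PAGES = 10  # cap stated in the description


async def fetch_pages(
    paths: list[str],
    fetch_one: Callable[[str], Awaitable[str]],
) -> dict[str, str]:
    """Fetch up to MAX_PAGES pages concurrently and map path -> Markdown."""
    if not paths:
        raise ValueError("at least one path is required")
    selected = paths[:MAX_PAGES]  # assumption: extra paths are dropped
    # gather() runs the coroutines concurrently and preserves input order.
    bodies = await asyncio.gather(*(fetch_one(p) for p in selected))
    return dict(zip(selected, bodies))
```

Passing a one-element list goes through exactly the same path as a batch, so no separate single-page tool is needed.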

Typed responses (src/dbt_mcp/product_docs/types.py)

All tools return typed @dataclass responses (following the existing semantic_layer/types.py pattern) instead of raw JSON strings:

  • DocSearchResult, SearchProductDocsResponse, ProductDocPageResponse, GetProductDocPagesResponse

FastMCP natively serializes dataclasses, so no manual json.dumps() is needed.
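
To illustrate the shape of such responses, here is a minimal sketch. The class names match the list above, but every field name is an assumption made for this example, and `dataclasses.asdict` plus `json.dumps` is used only to show that plain dataclasses convert to JSON without custom code; it is not a claim about how FastMCP serializes them internally.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class DocSearchResult:
    # Field names are hypothetical.
    title: str
    url: str
    description: str


@dataclass
class SearchProductDocsResponse:
    # Field names are hypothetical.
    query: str
    results: list[DocSearchResult] = field(default_factory=list)
    error: Optional[str] = None


response = SearchProductDocsResponse(
    query="incremental models",
    results=[
        DocSearchResult(
            title="Example page",
            url="https://docs.getdbt.com/example",  # placeholder URL
            description="Placeholder description",
        )
    ],
)

# Nested dataclasses are converted recursively into plain dicts.
payload = json.dumps(asdict(response))
```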

Client with caching and search ranking (src/dbt_mcp/product_docs/client.py)

  • ProductDocsClient handles HTTP fetching with TTL-based in-memory caching (1h index, 24h full-text, 30m pages).
  • Relevance-ranked search with documented scoring weights (extracted into score_index_entry() with named constants like SCORE_KEYWORD_IN_TITLE, SCORE_EXACT_TITLE_MATCH, etc.).
  • Abbreviation expansion (e.g. "CI" → "continuous integration", "SL" → "semantic layer").
  • Content truncation (28k chars) to keep responses manageable for LLM context windows.
  • client.get_page() raises httpx.HTTPStatusError / httpx.RequestError — tool layer catches exceptions and returns typed error responses.
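
As a sketch of what "named constants instead of magic numbers" and the truncation step can look like: the weight values below, the `SCORE_KEYWORD_IN_DESCRIPTION` constant, and the `truncate_content` helper are assumptions for illustration. Only `score_index_entry`, `SCORE_KEYWORD_IN_TITLE`, `SCORE_EXACT_TITLE_MATCH`, and the 28k limit are named in this PR.

```python
MAX_CONTENT_CHARS = 28_000  # truncation limit from the description

# Hypothetical weights; the real values live in client.py.
SCORE_EXACT_TITLE_MATCH = 100
SCORE_KEYWORD_IN_TITLE = 10
SCORE_KEYWORD_IN_DESCRIPTION = 3  # hypothetical constant name


def score_index_entry(title: str, description: str, query: str) -> int:
    """Score one index entry against a query; higher means more relevant."""
    query_norm = query.strip().lower()
    title_norm = title.lower()
    description_norm = description.lower()

    score = 0
    if title_norm == query_norm:
        score += SCORE_EXACT_TITLE_MATCH
    for keyword in query_norm.split():
        if keyword in title_norm:
            score += SCORE_KEYWORD_IN_TITLE
        if keyword in description_norm:
            score += SCORE_KEYWORD_IN_DESCRIPTION
    return score


def truncate_content(markdown: str, limit: int = MAX_CONTENT_CHARS) -> str:
    """Cut page content at `limit` characters and mark the cut."""
    if len(markdown) <= limit:
        return markdown
    return markdown[:limit] + "\n\n[content truncated]"
```

Sorting candidates by `score_index_entry` in descending order would then give the ranked result list.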

Supporting changes

  • tool_names.py / toolsets.py — SEARCH_PRODUCT_DOCS and GET_PRODUCT_DOC_PAGES enum members added.
  • human_descriptions.py — descriptions for both tools.
  • Prompt .md files — search_product_docs.md and get_product_doc_pages.md with guidance on how to present docs content to users.
  • config.py / settings.py / server.py — product docs toolset registration with DISABLE_PRODUCT_DOCS support.
  • README.md / diagram.d2 — auto-updated by pre-commit hook.
  • .changes/unreleased/ — changie entry added.

Tests

  • Unit tests (tests/unit/tools/test_product_docs.py) — 42 tests covering parsing, URL normalization, search ranking, full-text fallback, page fetching (success, 404, network error, partial failures, 10-page cap), and toolset registration.
  • Integration tests (tests/integration/product_docs/test_product_docs.py) — live tests against docs.getdbt.com for client methods and both MCP tools.

Why

  1. Minimal tool surface — Two tools instead of three. A single get_product_doc_pages tool accepts a list of paths (1 to 10), eliminating the need for separate single/batch tools. Fewer tools = less LLM confusion about which to pick, less MCP context overhead.
  2. Typed returns — Returning dataclasses instead of dicts/JSON strings aligns with the rest of the codebase (e.g. semantic layer tools) and gives downstream consumers type safety.
  3. Documented scoring — Search relevance weights are named constants with a docstring explaining the heuristic, not magic numbers.
  4. Robust error handling — Raising exceptions from the HTTP layer instead of returning error strings, with typed error responses at the tool level.
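
Point 4 can be sketched as a thin wrapper at the tool layer. To keep the example free of third-party imports, `PageFetchError` stands in for `httpx.HTTPStatusError` / `httpx.RequestError`; the fields on `ProductDocPageResponse` and the `fetch_page_safely` helper are hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


class PageFetchError(Exception):
    """Stand-in for httpx.HTTPStatusError / httpx.RequestError."""


@dataclass
class ProductDocPageResponse:
    # Field names are hypothetical.
    url: str
    content: Optional[str] = None
    error: Optional[str] = None


async def fetch_page_safely(
    url: str,
    fetch_one: Callable[[str], Awaitable[str]],
) -> ProductDocPageResponse:
    """Let the HTTP layer raise; convert the failure into a typed response."""
    try:
        content = await fetch_one(url)
    except PageFetchError as exc:
        return ProductDocPageResponse(url=url, error=str(exc))
    return ProductDocPageResponse(url=url, content=content)


async def demo() -> list[ProductDocPageResponse]:
    async def fake_fetch(url: str) -> str:
        if url.endswith("/missing"):
            raise PageFetchError("404 Not Found")
        return "# Page body"

    return [
        await fetch_page_safely("https://docs.getdbt.com/ok", fake_fetch),
        await fetch_page_safely("https://docs.getdbt.com/missing", fake_fetch),
    ]


if __name__ == "__main__":
    for page in asyncio.run(demo()):
        print(page)
```

Because each page gets its own response object, one failing page does not hide the content of the pages that loaded.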

Checklist

  • I have performed a self-review of my code
  • I have made corresponding changes to the documentation (in https://github.com/dbt-labs/docs.getdbt.com) if required — WILL ADD THIS WHEN PR MERGED
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional notes

Product docs tools are on by default. Set DISABLE_PRODUCT_DOCS=true to disable them. If you use an allowlist (e.g. DBT_MCP_ENABLE_SEMANTIC_LAYER=true), add DBT_MCP_ENABLE_PRODUCT_DOCS=true to include product docs. The read_only_hint=True annotation is correct per the MCP spec — the in-memory cache is internal process state, not an externally observable side effect.
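
The enablement rules above could be expressed roughly like this. It is a sketch under assumptions, not the logic in config.py: the `product_docs_enabled` helper is hypothetical, only the literal string "true" (case-insensitive) is treated as truthy, and DISABLE_PRODUCT_DOCS is assumed to win if both variables are set.

```python
from typing import Mapping, Optional

ENABLE_PREFIX = "DBT_MCP_ENABLE_"


def _is_true(value: Optional[str]) -> bool:
    # Assumption: only "true" (any casing) counts as enabled.
    return value is not None and value.strip().lower() == "true"


def product_docs_enabled(env: Mapping[str, str]) -> bool:
    """Decide whether the product docs toolset should be registered."""
    # Assumption: an explicit disable takes precedence over everything else.
    if _is_true(env.get("DISABLE_PRODUCT_DOCS")):
        return False
    # Any DBT_MCP_ENABLE_* variable set to true switches to allowlist mode.
    allowlist_active = any(
        key.startswith(ENABLE_PREFIX) and _is_true(value)
        for key, value in env.items()
    )
    if allowlist_active:
        return _is_true(env.get("DBT_MCP_ENABLE_PRODUCT_DOCS"))
    return True  # on by default
```

In practice the function would be called with `os.environ`; taking a mapping keeps it easy to test.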

@mirnawong1 mirnawong1 requested review from a team, b-per, jairus-m and jasnonaz as code owners March 2, 2026 11:41
@mirnawong1 mirnawong1 changed the title from "Add/public-docs" to "Adds a product docs toolset to the dbt MCP" Mar 2, 2026
@mirnawong1 (Contributor, Author) commented:

i'm not sure where the failure is coming from :(

@mirnawong1 mirnawong1 marked this pull request as draft March 2, 2026 12:23
@b-per (Collaborator) left a comment:

Thanks Mirna. I added a first set of comments but there might be another round once this is addressed.

The current CI failures are because

  • we need a changie entry (created with changie new)
  • we need to run a task check to format the code

@b-per (Collaborator) left a comment:

Thanks Mirna for addressing the first set of comments 🙌
I added a few extra.

Also, if possible, could you ask your LLM to clean the commit history? It might be even easier to review if the PR is moved to a few logical commits.

@mirnawong1 (Contributor, Author) commented:

Great, thanks so much @b-per and @DevonFulcher! I've addressed all the comments and it's ready for your re-review whenever you have a second.

One thing I'm not sure about is the full-text fallback mechanism. I've noticed that when a search_product_docs query returns fewer than 3 results from the lightweight llms.txt index, it automatically falls back to fetching llms-full.txt, the entire dbt documentation corpus, and this freezes my Cursor IDE. I'm not sure how often this would happen, but is there a clever way to circumvent it?

@DevonFulcher (Collaborator) commented:

> falls back to fetching llms-full.txt

@mirnawong1 is it possible to remove this fallback? I don't think we want to feed our entire docs site into the LLM.

Copilot AI (Contributor) left a comment:

Pull request overview

Adds a new “Product Docs” toolset to the dbt MCP server so agents can search and fetch public documentation from docs.getdbt.com via two dedicated MCP tools.

Changes:

  • Introduces search_product_docs (metadata search with full-text fallback) and get_product_doc_pages (parallel Markdown fetch for up to 10 pages).
  • Adds a cached ProductDocsClient plus typed dataclass responses for both tools.
  • Wires the toolset into config/registration and adds unit + integration test coverage.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/unit/tools/test_product_docs.py | Unit tests for URL normalization, parsing, ranking, tool behavior, and registration toggles. |
| tests/integration/product_docs/test_product_docs.py | Live integration tests against docs.getdbt.com for client + MCP tools. |
| src/dbt_mcp/tools/toolsets.py | Adds PRODUCT_DOCS toolset and maps tools into it. |
| src/dbt_mcp/tools/tool_names.py | Adds SEARCH_PRODUCT_DOCS and GET_PRODUCT_DOC_PAGES tool names. |
| src/dbt_mcp/tools/human_descriptions.py | Adds human-readable descriptions for the two new tools. |
| src/dbt_mcp/prompts/product_docs/search_product_docs.md | Prompt guidance for using the search tool and fallback behavior. |
| src/dbt_mcp/prompts/product_docs/get_product_doc_pages.md | Prompt guidance for presenting fetched docs content to users. |
| src/dbt_mcp/product_docs/types.py | Defines typed dataclass response models for product docs tools. |
| src/dbt_mcp/product_docs/tools.py | Implements and registers the two MCP tools, including parallel fetch and error handling. |
| src/dbt_mcp/product_docs/client.py | Adds HTTP fetch + TTL caching + search ranking + full-text search support. |
| src/dbt_mcp/product_docs/__init__.py | Introduces the product_docs package. |
| src/dbt_mcp/mcp/server.py | Registers the product docs toolset in server creation. |
| src/dbt_mcp/config/settings.py | Adds DISABLE_PRODUCT_DOCS and DBT_MCP_ENABLE_PRODUCT_DOCS settings. |
| src/dbt_mcp/config/config.py | Wires product docs into enable/disable toolset mapping. |
| docs/diagram.d2 | Updates architecture diagram to include Product Docs toolset. |
| README.md | Documents the new Product Docs tools in the public tool list. |

Introduce a new `product_docs` toolset that lets AI agents query the
public dbt documentation in real time via two MCP tools:

- `search_product_docs` — keyword search against llms.txt with
  automatic full-text fallback via llms-full.txt
- `get_product_doc_pages` — fetch one or more pages as Markdown
  (up to 10 in parallel)

Includes TTL-based in-memory caching, relevance-ranked search with
documented scoring weights, abbreviation expansion, and unit tests.

Made-with: Cursor
Instead of fetching the entire llms-full.txt corpus when the llms.txt
index returns few results, re-run the metadata search with expanded
keywords (abbreviations and synonyms). This avoids loading the full
docs site while still improving recall for short/abbreviated queries.

Made-with: Cursor
…URLs

- Validate that normalize_doc_url only produces docs.getdbt.com URLs,
  raising ValueError for external hosts (prevents SSRF).
- Share a single ProductDocsClient across tool calls so TTL-based
  caches (llms.txt index, page cache) actually persist.
- Normalize URLs in error responses to match the format used in
  success responses (display_url instead of raw input path).

Made-with: Cursor
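
A minimal sketch of the host validation described in the commit above, using only the standard library. The function name `normalize_doc_url` comes from the commit message; the body is an assumption, including the choices to reject non-https schemes and to drop query strings and fragments.

```python
from urllib.parse import urlparse

DOCS_HOST = "docs.getdbt.com"
DOCS_BASE_URL = f"https://{DOCS_HOST}"


def normalize_doc_url(path_or_url: str) -> str:
    """Turn a path or URL into a docs.getdbt.com URL, or raise ValueError."""
    candidate = path_or_url.strip()
    if "://" not in candidate:
        # Treat the input as a site-relative path.
        candidate = f"{DOCS_BASE_URL}/{candidate.lstrip('/')}"
    parsed = urlparse(candidate)
    # Compare the parsed hostname, not a string prefix, so inputs such as
    # "https://docs.getdbt.com@evil.example/" are rejected.
    if parsed.scheme != "https" or parsed.hostname != DOCS_HOST:
        raise ValueError(f"Only {DOCS_BASE_URL} URLs are allowed, got {path_or_url!r}")
    # Assumption: query string and fragment are dropped.
    return f"{DOCS_BASE_URL}{parsed.path}"
```

Validating before any request is made means the tool cannot be steered into fetching arbitrary hosts on the agent's behalf.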
@b-per b-per self-requested a review March 9, 2026 12:16
@b-per (Collaborator) left a comment:

Adding a couple more comments now that I saw the current caching approach

Replace TTL-based caching (timestamps, locks, eviction) with a simple
dict that lives for the lifetime of the MCP server process. Restart the
server to refresh. Removes INDEX_CACHE_TTL_SECONDS, PAGE_CACHE_TTL_SECONDS,
FULL_TEXT_CACHE_TTL_SECONDS, and associated asyncio locks.

Made-with: Cursor
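
The simplified caching from the commit above amounts to something like the following sketch. The `PageCache` class and its `fetch_one` parameter are hypothetical names; the only behavior taken from the commit message is a plain dict that lives as long as the server process, with no timestamps, locks, or eviction.

```python
import asyncio
from typing import Awaitable, Callable


class PageCache:
    """Process-lifetime cache: entries stay until the server restarts."""

    def __init__(self, fetch_one: Callable[[str], Awaitable[str]]) -> None:
        self._fetch_one = fetch_one
        self._pages: dict[str, str] = {}

    async def get(self, url: str) -> str:
        # Two concurrent first requests for the same URL may both fetch;
        # without locks that is accepted, and the last write wins.
        if url not in self._pages:
            self._pages[url] = await self._fetch_one(url)
        return self._pages[url]


async def demo() -> int:
    calls = 0

    async def fake_fetch(url: str) -> str:
        nonlocal calls
        calls += 1
        return f"# {url}"

    cache = PageCache(fake_fetch)
    await cache.get("https://docs.getdbt.com/example")
    await cache.get("https://docs.getdbt.com/example")  # served from the dict
    return calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is staleness: content changes on the docs site are only picked up after a restart, which this PR accepts in exchange for much simpler code.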
@mirnawong1 (Contributor, Author) commented:

thanks @b-per ! ready for re-review when you can

@b-per (Collaborator) left a comment:

Thanks Mirna. It's good to merge from me. Let's see if we get feedback from the MCP users about it and iterate from there.

@b-per (Collaborator) commented Mar 10, 2026

Your commits are not signed though. Before any future contribution, could you check how to set git to sign your commits? https://github.com/dbt-labs/dbt-mcp/blob/main/CONTRIBUTING.md

@b-per b-per merged commit 17b9bf8 into main Mar 10, 2026
11 checks passed
@b-per b-per deleted the add/public-docs branch March 10, 2026 14:13
5 participants