feat(llmobs): capture pydantic-ai external/MCP tools, fix run_stream name#18528

Draft
PROFeNoM wants to merge 8 commits into main from alex/pydantic-ai-tool-mcp-capture

@PROFeNoM PROFeNoM commented Jun 9, 2026

Contributor

JIRA: MLOB-7609

Description

Extends the Pydantic AI LLM Observability integration so the agent manifest reflects the tools and MCP servers an agent actually uses:

  • External toolset tools (ExternalToolset.tool_defs) now appear in the manifest.
  • MCP servers are recorded (id, URL or command, tool prefix) without connecting to them. Covers the deprecated MCPServer* classes, the newer MCPToolset, and MCP toolsets behind wrappers (.prefixed(), load_mcp_toolsets()). Secret-looking launch args (e.g. --api-key <token>) and credentials in MCP URLs are scrubbed before reaching the manifest.
  • MCP tools called during a run are added to the manifest as observed tools.
  • run_stream agent name inference is fixed (the tracing proxy frame previously broke pydantic-ai's own frame-walk).
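The scrubbing described in the MCP bullet above could look roughly like the sketch below. This is an illustration only, not the code in this PR: `scrub_url`, `scrub_args`, the `<redacted>` marker, and the flag-name pattern are all hypothetical, and dropping the whole query string is an assumption about what "credentials in MCP URLs" covers.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical pattern: flag names that look like they carry a secret.
_SECRET_FLAG = re.compile(r"(api[-_]?key|token|secret|password)", re.IGNORECASE)
REDACTED = "<redacted>"


def scrub_url(url: str) -> str:
    """Drop userinfo (user:password@) and the query string from a URL.

    Simplification: IPv6 literal hosts are not re-bracketed here.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


def scrub_args(args: list[str]) -> list[str]:
    """Redact values given as `--flag value` or `--flag=value` when the flag looks secret."""
    out: list[str] = []
    redact_next = False
    for arg in args:
        if redact_next:
            # Previous item was a secret-looking flag; this item is its value.
            out.append(REDACTED)
            redact_next = False
            continue
        flag, sep, _value = arg.partition("=")
        if flag.startswith("-") and _SECRET_FLAG.search(flag):
            if sep:
                out.append(f"{flag}={REDACTED}")
            else:
                out.append(flag)
                redact_next = True
            continue
        out.append(arg)
    return out
```

Both helpers work on plain strings, which matches the "without connecting to them" constraint: nothing here needs a live MCP session.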

Observed tool calls are attributed to their agent by walking the span's parent chain to the nearest agent span, so concurrent runs and nested agent-as-tool delegations don't cross-contaminate manifests. This replaces an unused process-global attribution registry (_running_agents/_latest_agent, populated but never read and never cleaned up) with span-scoped state, fixing latent cross-contamination before any feature relied on it.
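A minimal sketch of the parent-chain attribution, assuming a simplified stand-in `Span` type rather than the real ddtrace span (the `kind` field, `observed_tools` dict, and both function names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified stand-in for a tracer span; not the ddtrace Span class."""

    name: str
    kind: str  # "agent", "tool", or anything else
    parent: Optional["Span"] = None
    observed_tools: dict = field(default_factory=dict)


def nearest_agent(span: Span) -> Optional[Span]:
    """Walk up the parent chain and return the closest agent ancestor, if any."""
    node = span.parent
    while node is not None:
        if node.kind == "agent":
            return node
        node = node.parent
    return None


def record_tool_call(tool_span: Span) -> None:
    """Record the tool on the nearest agent span only, never on outer agents."""
    agent = nearest_agent(tool_span)
    if agent is not None:
        agent.observed_tools.setdefault(tool_span.name, {"name": tool_span.name})
```

Because the state lives on the agent span itself, a nested agent-as-tool delegation records the inner tool on the inner agent, and the outer agent only sees the delegating tool. Nothing is shared across concurrent runs.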

What changes in Datadog

Snippets below are from the demo app. In each screenshot, released ddtrace is on the left, this branch on the right.

👉 Link to spans

1. External toolset tools in the manifest

external = ExternalToolset([
    ToolDefinition(
        name="lookup_order",
        description="Look up the status of a customer order by id.",
        parameters_json_schema={
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    )
])
agent = Agent(model="openai:gpt-4o", name="support_agent", toolsets=[external])
await agent.run("Hello, can you help me?")

Before: the manifest's tools list is empty. After: it lists lookup_order.

image

2. MCP server metadata + called MCP tool

mcp_server = MCPServerStdio(command=sys.executable, args=["-m", "mcp_server_time"], id="time-mcp")
agent = Agent(model="openai:gpt-4o", name="time_agent", toolsets=[mcp_server])
async with agent:
    await agent.run("What time is it right now in Tokyo?")

Before: no MCP info. After: manifest records the time-mcp server (id + command/args) and the called get_current_time tool. (See the time_agent span, not the separate MCP Client Session span.)

image

3. run_stream agent name inference

# No name= passed: inferred from the bound variable.
streamed_agent = Agent(model="openai:gpt-4o")
async with streamed_agent.run_stream("Name three capitals in Europe.") as result:
    async for chunk in result.stream_text(delta=True):
        ...

Before: span falls back to PydanticAI Agent. After: span name is streamed_agent.
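To illustrate why an extra proxy frame breaks frame-based name inference, here is a toy sketch. It is not pydantic-ai's implementation or the patch in this PR: the `Agent` class, `traced_run`, and the `skip_frames` parameter are hypothetical stand-ins, and the toy returns the proxy's local variable name where the real integration fell back to the default name.

```python
import inspect


def _name_in_frame(obj, frame):
    """Return the first local variable name in `frame` that is bound to `obj`."""
    if frame is None:
        return None
    for name, value in frame.f_locals.items():
        if value is obj:
            return name
    return None


class Agent:
    """Toy agent that infers its name from the variable it is bound to."""

    def __init__(self):
        self.name = None

    def run(self, skip_frames: int = 0):
        # The immediate caller is one frame up; skip_frames hops over wrappers.
        frame = inspect.currentframe().f_back
        for _ in range(skip_frames):
            if frame is None:
                break
            frame = frame.f_back
        self.name = _name_in_frame(self, frame)
        return self.name


def traced_run(instance, skip_frames: int = 0):
    """Stand-in for a tracing proxy that sits between the user and the agent."""
    return instance.run(skip_frames=skip_frames)


def demo():
    streamed_agent = Agent()
    direct = streamed_agent.run()                      # looks in demo()'s frame
    via_proxy = traced_run(streamed_agent)             # looks in the proxy's frame
    fixed = traced_run(streamed_agent, skip_frames=1)  # hops over the proxy
    return direct, via_proxy, fixed
```

In `demo()`, the direct call and the call that skips the proxy frame both find `streamed_agent`, while the unadjusted proxied call only sees the proxy's own local (`instance`).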

image

Testing

New tests in tests/contrib/pydantic_ai/test_pydantic_ai_llmobs.py: external toolset capture, end-to-end MCP server (real stdio FastMCP subprocess), MCPToolset capture, wrapper-toolset MCP capture, dynamic MCP toolset capture, credential scrubbing, run_stream name inference, and concurrent/nested agent tool attribution.

The full llmobs::pydantic_ai suite passes locally (pydantic-ai 0.8.1 / 1.0.0 / 1.106.0). MCP-server tests are gated on pydantic-ai >= 1.0.0 (they need MCPServer.id); the MCPToolset test is gated on >= 1.97.0.

Risks

Low. Integration-only. All toolset/MCP attribute reads are duck-typed and guarded, so a missing or renamed attribute never raises in a customer run. No public API change.
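The "duck-typed and guarded" reads can be pictured with a sketch like the following. The helper names and the attribute list are illustrative assumptions, not the exact code in the integration:

```python
from typing import Any


def safe_getattr(obj: Any, name: str, default: Any = None) -> Any:
    """getattr that also swallows errors raised by properties or __getattr__."""
    try:
        return getattr(obj, name, default)
    except Exception:
        return default


def describe_mcp_toolset(toolset: Any) -> dict:
    """Collect whichever of the expected attributes exist; skip the rest."""
    info = {}
    # Attribute names here are assumptions chosen for illustration.
    for key in ("id", "url", "command", "tool_prefix"):
        value = safe_getattr(toolset, key)
        if value is not None:
            info[key] = str(value)
    return info
```

With this shape, an object that lacks an attribute, or whose property raises, yields a smaller manifest entry instead of an exception inside the customer's agent run.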

Performance

Feature overhead is within run-to-run noise (~0.2%). The per-tool-call cost is a short walk up the span parent chain plus an accumulator on the agent span; MCP scrubbing/formatting happens once per run at manifest assembly, not per tool call.

Benchmarked before vs after in the same venv (pydantic-ai 1.106.0, LLMObs enabled): 3 function tools + a real MCP stdio toolset, TestModel (no network) calling every tool each run.

| | median ms/run | per-rep range |
|---|---|---|
| Before (no feature) | 9.35 | 9.12 – 10.05 |
| After (feature on) | 9.37 | 9.02 – 9.87 |
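As a quick check of the ~0.2% figure, the relative overhead can be recomputed from the medians above. The variable names are just for this sketch:

```python
# Medians and per-rep ranges copied from the benchmark results (ms/run).
before_ms, after_ms = 9.35, 9.37
before_range = (9.12, 10.05)
after_range = (9.02, 9.87)

# Relative change of the median, in percent.
overhead_pct = (after_ms - before_ms) / before_ms * 100

# Two intervals overlap when each one starts before the other ends.
ranges_overlap = before_range[0] <= after_range[1] and after_range[0] <= before_range[1]

print(f"overhead: {overhead_pct:.2f}%  ranges overlap: {ranges_overlap}")
```

The difference between medians works out to about 0.21%, and the per-rep ranges overlap, which is consistent with describing the overhead as within run-to-run noise.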
Benchmark script and method

Dropped into tests/contrib/pydantic_ai/, run via the harness:
scripts/run-tests --venv <py3.12 pydantic-ai-slim 1.106.0> -- -s -- -s -k test_bench.
It raises AssertionError at the end so the harness surfaces the numbers (stdout is hidden on pass).
"Before" = the three integration files at merge-base (c641709); "After" = this branch.

import os
import statistics
import sys
import time

from pydantic_ai.mcp import MCPServerStdio
from pydantic_ai.models.test import TestModel


def tool_a() -> str:
    """Tool A"""
    return "a"


def tool_b(x: int) -> int:
    """Tool B"""
    return x


def tool_c(name: str) -> str:
    """Tool C"""
    return name


async def test_bench(pydantic_ai, pydantic_ai_llmobs):
    N = 300
    server_path = os.path.join(os.path.dirname(__file__), "mcp_server.py")
    mcp = MCPServerStdio(command=sys.executable, args=[server_path], id="square-mcp", env=os.environ.copy())
    agent = pydantic_ai.Agent(
        model=TestModel(), name="bench_agent", tools=[tool_a, tool_b, tool_c], toolsets=[mcp]
    )
    async with agent:
        for _ in range(40):  # warmup
            await agent.run("go")
        times = []
        for _ in range(5):
            t0 = time.perf_counter()
            for _ in range(N):
                await agent.run("go")
            times.append(time.perf_counter() - t0)
    med = statistics.median(times)
    raise AssertionError(
        f"BENCHRESULT median={med * 1e3 / N:.4f}ms/run reps={[round(t * 1e3 / N, 4) for t in times]}"
    )

The MCP connection is opened once (async with agent) and reused across all runs, so the measurement
isolates per-run instrumentation cost, not MCP setup. TestModel removes LLM network latency so
overhead is not buried under request time.

Additional Notes

Full resolved tool catalog (via ToolManager.for_run_step) is deferred until pydantic-ai 2.0 stable.

@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da Bot commented Jun 9, 2026

Codeowners resolved as

ddtrace/llmobs/_integrations/pydantic_ai.py                             @DataDog/ml-observability

@datadog-datadog-prod-us1

datadog-datadog-prod-us1 Bot commented Jun 9, 2026

Contributor

⚠️ Warnings

🚦 18 Pipeline jobs failed

System Tests | tracer-release / End-to-end #1 / uds-flask 1   View in Datadog   GitHub Actions

🧪 9 Tests failed

tests.appsec.test_asm_standalone.Test_AppSecStandalone_NotEnabled.test_client_computed_stats_header_is_not_present[uds-flask] from system_tests_suite   View in Datadog (Fix with Cursor)
assert 2 == 1

self = <tests.appsec.test_asm_standalone.Test_AppSecStandalone_NotEnabled object at 0x7fd54cd0f620>

    def test_client_computed_stats_header_is_not_present(self):
        spans_checked = 0
        for data, trace, _ in interfaces.library.get_spans(request=self.r):
            assert trace.trace_id_equals(1212121212121212122)
            assert "datadog-client-computed-stats" not in [x.lower() for x, y in data["request"]["headers"]]
            spans_checked += 1
...
tests.test_data_integrity.Test_TraceUniqueness.test_trace_ids[uds-flask] from system_tests_suite   View in Datadog (Fix with Cursor)
ValueError: Found duplicated trace id 5013447155428939925 in ./logs/interfaces/library/00105__v0.4_traces.json and ./logs/interfaces/library/00102__v0.4_traces.json

self = <tests.test_data_integrity.Test_TraceUniqueness object at 0x7fd54cfa1040>

    def test_trace_ids(self):
>       interfaces.library.assert_trace_id_uniqueness()

tests/test_data_integrity.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

...
View all 9 test failures

System Tests | tracer-release / End-to-end #5 / flask-poc 5   View in Datadog   GitHub Actions

🧪 1 Test failed

tests.test_data_integrity.Test_TraceUniqueness.test_trace_ids[flask-poc] from system_tests_suite   View in Datadog (Fix with Cursor)
ValueError: Found duplicated trace id 14233711099861840250 in ./logs/interfaces/library/00062__v0.4_traces.json and ./logs/interfaces/library/00061__v0.4_traces.json

self = <tests.test_data_integrity.Test_TraceUniqueness object at 0x7f8fa4581070>

    def test_trace_ids(self):
>       interfaces.library.assert_trace_id_uniqueness()

tests/test_data_integrity.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

...

System Tests | tracer-release / End-to-end #6 / django-py3.13 6   View in Datadog   GitHub Actions

🧪 1 Test failed

tests.test_data_integrity.Test_TraceUniqueness.test_trace_ids[django-py3.13] from system_tests_suite   View in Datadog (Fix with Cursor)
ValueError: Found duplicated trace id 10406335910600928621 in ./logs/interfaces/library/00129__v0.4_traces.json and ./logs/interfaces/library/00125__v0.4_traces.json

self = <tests.test_data_integrity.Test_TraceUniqueness object at 0x7f73e7f22ae0>

    def test_trace_ids(self):
>       interfaces.library.assert_trace_id_uniqueness()

tests/test_data_integrity.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

...

View all 18 failed jobs.

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

🔄 Datadog auto-retried 2 jobs - 2 passed on retry View in Datadog

🔗 Commit SHA: 4b615a3

@PROFeNoM PROFeNoM force-pushed the alex/pydantic-ai-tool-mcp-capture branch from c77dfcb to b07c3c9 Compare June 9, 2026 07:54
@pr-commenter

pr-commenter Bot commented Jun 9, 2026

Benchmarks

Benchmark execution time: 2026-06-10 13:41:35

Comparing candidate commit 4b615a3 in PR branch alex/pydantic-ai-tool-mcp-capture with baseline commit c641709 in branch main.

Found 0 performance improvements and 4 performance regressions! Performance is the same for 616 metrics, 10 unstable metrics.

scenario:iastaspects-index_aspect

  • 🟥 execution_time [+14.692µs; +18.382µs] or [+11.901%; +14.890%]

scenario:iastaspects-title_aspect

  • 🟥 execution_time [+39.556µs; +51.415µs] or [+12.039%; +15.648%]

scenario:iastaspectsospath-ospathbasename_aspect

  • 🟥 execution_time [+97.861µs; +108.059µs] or [+22.569%; +24.921%]

scenario:span-start

  • 🟥 execution_time [+1.460ms; +1.620ms] or [+9.576%; +10.626%]

@PROFeNoM

PROFeNoM commented Jun 9, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b07c3c9a5f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
@PROFeNoM

PROFeNoM commented Jun 9, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 2825cd277d

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
@PROFeNoM

PROFeNoM commented Jun 9, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 25a8d78761

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
@PROFeNoM

PROFeNoM commented Jun 9, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 5688d79434

Comment thread ddtrace/contrib/internal/pydantic_ai/utils.py Outdated
Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
@PROFeNoM

PROFeNoM commented Jun 9, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: a526d8f503

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
@PROFeNoM

PROFeNoM commented Jun 9, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: eeb53bb4e3

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
@PROFeNoM

PROFeNoM commented Jun 9, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 8b0c3149ba

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
@PROFeNoM

PROFeNoM commented Jun 9, 2026

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 2360f6ba7d

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated
PROFeNoM added 2 commits June 10, 2026 09:43
…am agent name

Extend the pydantic-ai LLMObs integration to record the full agent
manifest: statically declared tools, externally/MCP-provided tools
discovered during a run, and MCP server connection details
(url/command/args, with credentials scrubbed from URLs and launch args).

Observed tools are attributed per agent run via the ddtrace span parent
chain: the agent span seeds an observed-tools dict in its ctx item and
each tool span walks up to its nearest agent ancestor to record there.
This keeps attribution correct under concurrency and nested
agent-as-tool delegation without any context-local token state.

Also honor pydantic-ai's `infer_name=False` on the run_stream path and
re-infer the agent name through our proxy frame when it is left default.

…ming

Add coverage for external/MCP tool capture in the agent manifest,
credential scrubbing of MCP urls and launch args, per-run and override
toolsets, concurrent and nested-delegation tool attribution, agent entry
failure, and run_stream name inference (including infer_name=False).
Add the mcp test server and the pydantic-ai mcp test venv.
@PROFeNoM PROFeNoM force-pushed the alex/pydantic-ai-tool-mcp-capture branch from 2360f6b to 39e939d Compare June 10, 2026 07:43
@PROFeNoM

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 39e939d8e8

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py
…sets

Dynamic, combined, and capability toolsets resolve to their MCP toolset
only at run time, so they aren't reachable from the agent's static
toolset list and were absent from the manifest's mcp_servers (orphaning
the tool's mcp_server_id). Capture the realized MCP toolset from the
observed tool call and merge it into mcp_servers.

The observed tool path stashes only the toolset object; scrubbing and
formatting run once per run at manifest assembly, not per tool call.
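The "stash now, format once" approach in this commit message might be sketched as below. `AgentRunState`, `on_tool_call`, `build_mcp_servers`, and the `format_server` callback are hypothetical names for illustration, not identifiers from the integration:

```python
class AgentRunState:
    """Per-run accumulator that would live on the agent span (hypothetical)."""

    def __init__(self):
        # Keyed by object identity so repeated calls through the same toolset
        # are stored once. Holding the reference keeps the id stable.
        self._toolsets = {}

    def on_tool_call(self, toolset) -> None:
        """Hot path: only remember the toolset object, do no formatting."""
        self._toolsets.setdefault(id(toolset), toolset)

    def build_mcp_servers(self, static_servers, format_server):
        """Cold path, once per run: format observed toolsets and merge by server id."""
        merged = {server["id"]: server for server in static_servers}
        for toolset in self._toolsets.values():
            entry = format_server(toolset)  # scrubbing would happen in here
            merged.setdefault(entry["id"], entry)
        return list(merged.values())
```

Statically declared servers win on an id collision in this sketch, and a toolset that was called many times is still formatted only once.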
@PROFeNoM

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: c7c5fe1599

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py
PROFeNoM added 2 commits June 10, 2026 14:10
The __aenter__ span-finish guard fixed a pre-existing span leak (and stale
_run_stream_active flag) that exists on main, unrelated to tool/MCP capture.
Removing it keeps this PR scoped to the feature; the entry-failure leak can be
addressed in a separate fix PR.

The 3.13 cap wasn't required by any package metadata (pydantic-ai-slim
1.106.0 and mcp are both >=3.10 with no upper bound). Dropping it; the full
pydantic_ai suite passes on 3.14 (63 passed).
PROFeNoM added 3 commits June 10, 2026 14:34
Collapse the MCP test cluster from 8 functions to 4 without losing coverage:
the 3 end-to-end runs become one parametrized test (static vs dynamic toolset;
the redundant MCPToolset live-run is dropped, its only unique path is covered
by the wrapper unit test), and the 5 _get_mcp_servers unit tests become 3
cohesive ones (credential scrubbing, source resolution, wrapper unwrapping).
MCP/fastmcp imports stay function-local because the py3.9 venv can't install
the mcp package and would fail collection otherwise.

Keep comments only where the behavior is non-obvious (toolset unwrapping, MCP
detection, server resolution, credential scrubbing); drop the rest.