feat(llmobs): capture pydantic-ai external/MCP tools, fix run_stream name by PROFeNoM · Pull Request #18528 · DataDog/dd-trace-py

PROFeNoM · 2026-06-09T07:47:06Z

Description

Extends the Pydantic AI LLM Observability integration so the agent manifest reflects the tools and MCP servers an agent actually uses:

- External toolset tools (ExternalToolset.tool_defs) now appear in the manifest.
- MCP servers are recorded (id, URL or command, tool prefix) without connecting to them. Covers the deprecated MCPServer* classes, the newer MCPToolset, and MCP toolsets behind wrappers (.prefixed(), load_mcp_toolsets()). Secret-looking launch args (e.g. --api-key <token>) and credentials in MCP URLs are scrubbed before reaching the manifest.
- MCP tools called during a run are added to the manifest as observed tools.
- run_stream agent name inference is fixed (the tracing proxy frame previously broke pydantic-ai's own frame-walk).
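
To make the scrubbing behaviour concrete, here is a minimal sketch of how credentials might be removed from an MCP URL and from launch args. The helper names (`scrub_url`, `scrub_args`), the "secret-looking" pattern, and the `<redacted>` placeholder are assumptions for illustration and are not taken from the integration's code.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical pattern for "secret-looking" flag and parameter names.
_SECRET_KEY = re.compile(r"(api[-_]?key|token|secret|password|passwd|auth)", re.IGNORECASE)
_REDACTED = "<redacted>"


def scrub_url(url: str) -> str:
    """Drop user:password@ userinfo and redact secret-looking query values."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(
        [(k, _REDACTED if _SECRET_KEY.search(k) else v) for k, v in pairs],
        safe="<>",  # keep the placeholder readable instead of percent-encoding it
    )
    return urlunsplit((parts.scheme, host, parts.path, query, parts.fragment))


def scrub_args(args: list[str]) -> list[str]:
    """Redact values of secret-looking flags in `--flag value` and `--flag=value` forms."""
    out: list[str] = []
    redact_next = False
    for arg in args:
        if redact_next:
            out.append(_REDACTED)
            redact_next = False
            continue
        flag, sep, _value = arg.partition("=")
        if flag.startswith("-") and _SECRET_KEY.search(flag):
            if sep:
                out.append(f"{flag}={_REDACTED}")
            else:
                out.append(arg)
                redact_next = True  # the next token is the secret value
        else:
            out.append(arg)
    return out


clean_url = scrub_url("https://user:pw@mcp.example.com:8443/sse?api_key=abc&region=eu")
clean_args = scrub_args(["-m", "server", "--api-key", "s3cr3t", "--token=abc", "--port", "8080"])
```

A name-based pattern like this one errs on the side of over-redaction (for example, it would also match a flag such as `--author`), which is usually the safer failure mode for telemetry.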

Observed tool calls are attributed to their agent by walking the span's parent chain to the nearest agent span, so concurrent runs and nested agent-as-tool delegations don't cross-contaminate manifests. This replaces an unused process-global attribution registry (_running_agents/_latest_agent, populated but never read and never cleaned up) with span-scoped state, fixing latent cross-contamination before any feature relied on it.
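
The parent-chain attribution can be sketched with a toy span type. The `Span` class, its `kind` field, and the `observed_tools` accumulator below are stand-ins invented for this example; the real integration works on ddtrace spans and their context items.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Stand-in for a tracer span; only the fields this sketch needs."""
    name: str
    kind: str  # "agent", "tool", or anything else
    parent: Optional["Span"] = None
    # Only meaningful on agent spans: tool name -> call count.
    observed_tools: dict = field(default_factory=dict)


def nearest_agent(span: Span) -> Optional[Span]:
    """Walk up the parent chain and return the closest enclosing agent span."""
    node = span.parent
    while node is not None:
        if node.kind == "agent":
            return node
        node = node.parent
    return None


def record_tool_call(tool_span: Span) -> None:
    """Attribute a tool call to its nearest agent ancestor, if there is one."""
    agent = nearest_agent(tool_span)
    if agent is not None:
        agent.observed_tools[tool_span.name] = agent.observed_tools.get(tool_span.name, 0) + 1


# Nested agent-as-tool delegation: outer agent -> delegate tool -> inner agent -> search tool.
outer = Span("outer_agent", "agent")
delegate = Span("delegate", "tool", parent=outer)
inner = Span("inner_agent", "agent", parent=delegate)
search = Span("search", "tool", parent=inner)

record_tool_call(delegate)
record_tool_call(search)
```

Because each tool span resolves its own ancestor, the inner agent's `search` call never lands on the outer agent, and two concurrent runs cannot overwrite each other the way a process-global "latest agent" pointer could.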

What changes in Datadog

Snippets below are from the demo app. In each screenshot, released ddtrace is on the left, this branch on the right.

1. External toolset tools in the manifest

from pydantic_ai import Agent
from pydantic_ai.tools import ToolDefinition
from pydantic_ai.toolsets import ExternalToolset

external = ExternalToolset([
    ToolDefinition(
        name="lookup_order",
        description="Look up the status of a customer order by id.",
        parameters_json_schema={
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    )
])
agent = Agent(model="openai:gpt-4o", name="support_agent", toolsets=[external])
await agent.run("Hello, can you help me?")  # inside an async function

Before: manifest tools is empty. After: it lists lookup_order.

2. MCP server metadata + called MCP tool

import sys
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

mcp_server = MCPServerStdio(command=sys.executable, args=["-m", "mcp_server_time"], id="time-mcp")
agent = Agent(model="openai:gpt-4o", name="time_agent", toolsets=[mcp_server])
async with agent:  # inside an async function
    await agent.run("What time is it right now in Tokyo?")

Before: no MCP info. After: manifest records the time-mcp server (id + command/args) and the called get_current_time tool. (See the time_agent span, not the separate MCP Client Session span.)

3. `run_stream` agent name inference

from pydantic_ai import Agent

# No name= passed: inferred from the bound variable.
streamed_agent = Agent(model="openai:gpt-4o")
async with streamed_agent.run_stream("Name three capitals in Europe.") as result:
    async for chunk in result.stream_text(delta=True):
        ...

Before: span falls back to PydanticAI Agent. After: span name is streamed_agent.
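
Why an extra proxy frame matters can be shown with a toy model of frame-based name inference. Everything below (`infer_name`, the toy `Agent`, both proxies) is a simplified illustration and not pydantic-ai's or ddtrace's actual implementation. In this toy the naive proxy yields a wrong name, whereas in the real integration the symptom was a fallback to the default span name.

```python
import inspect


def infer_name(obj, frame):
    """Return the first variable name in `frame` that is bound to `obj`, if any."""
    if frame is None:
        return None
    for scope in (frame.f_locals, frame.f_globals):
        for var, value in scope.items():
            if value is obj:
                return var
    return None


class Agent:
    """Toy agent that infers its name from the frame that called run()."""

    def __init__(self, name=None):
        self.name = name

    def run(self):
        if self.name is None:
            self.name = infer_name(self, inspect.currentframe().f_back)
        return self.name


def naive_proxy(agent):
    # The agent now sees this frame as its caller, where it is bound to "agent".
    return agent.run()


def fixed_proxy(agent):
    # Infer from the proxy's own caller before delegating to the wrapped method.
    if agent.name is None:
        agent.name = infer_name(agent, inspect.currentframe().f_back)
    return agent.run()


def demo(proxy):
    streamed_agent = Agent()
    return proxy(streamed_agent)


naive_result = demo(naive_proxy)
fixed_result = demo(fixed_proxy)
```

Here `naive_result` is the proxy's parameter name, while `fixed_result` is the variable name the user actually chose.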

Testing

New tests in tests/contrib/pydantic_ai/test_pydantic_ai_llmobs.py: external toolset capture, end-to-end MCP server (real stdio FastMCP subprocess), MCPToolset capture, wrapper-toolset MCP capture, dynamic MCP toolset capture, credential scrubbing, run_stream name inference, and concurrent/nested agent tool attribution.

The full llmobs::pydantic_ai suite passes locally against pydantic-ai 0.8.1, 1.0.0, and 1.106.0. The MCP-server tests are gated on pydantic-ai >= 1.0.0 (they need MCPServer.id); the MCPToolset test is gated on >= 1.97.0.

Risks

Low. Integration-only. All toolset/MCP attribute reads are duck-typed and guarded, so a missing or renamed attribute never raises in a customer run. No public API change.
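
A guarded, duck-typed read of the kind described above might look like the sketch below. The helper names and the attribute list (`id`, `url`, `command`, `tool_prefix`) are assumptions chosen to mirror the fields named in the description, not the integration's actual code.

```python
from typing import Any


def safe_getattr(obj: Any, name: str, default: Any = None) -> Any:
    """getattr that also swallows errors raised by properties or descriptors."""
    try:
        return getattr(obj, name, default)
    except Exception:
        return default


def describe_mcp_server(toolset: Any) -> dict:
    """Build a manifest entry from whichever attributes happen to exist."""
    entry = {}
    for key in ("id", "url", "command", "tool_prefix"):
        value = safe_getattr(toolset, key)
        if value is not None:
            entry[key] = value
    return entry


class FlakyServer:
    """Test double: one attribute is missing and one property raises."""
    id = "time-mcp"
    command = "python"

    @property
    def url(self):
        raise RuntimeError("not connected")


entry = describe_mcp_server(FlakyServer())
```

The plain three-argument `getattr` only covers `AttributeError`, so the extra `try`/`except` is what keeps a raising property from propagating into the traced application.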

Performance

Feature overhead is within run-to-run noise (~0.2%). The per-tool-call cost is a short walk up the span parent chain plus an accumulator on the agent span; MCP scrubbing/formatting happens once per run at manifest assembly, not per tool call.

Benchmarked before vs after in the same venv (pydantic-ai 1.106.0, LLMObs enabled): 3 function tools + a real MCP stdio toolset, TestModel (no network) calling every tool each run.

| Variant | Median (ms/run) | Per-rep range (ms/run) |
| --- | --- | --- |
| Before (no feature) | 9.35 | 9.12 – 10.05 |
| After (feature on) | 9.37 | 9.02 – 9.87 |
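
As a quick check of the "~0.2%" figure, the relative overhead can be recomputed from the medians above, along with whether the two per-rep ranges overlap. The variable names are illustrative only.

```python
before_ms, before_range = 9.35, (9.12, 10.05)
after_ms, after_range = 9.37, (9.02, 9.87)

# Relative change of the median per-run time, in percent.
overhead_pct = (after_ms - before_ms) / before_ms * 100

# Two closed intervals overlap when the larger lower bound does not exceed the smaller upper bound.
ranges_overlap = max(before_range[0], after_range[0]) <= min(before_range[1], after_range[1])

print(f"overhead: {overhead_pct:.2f}%  ranges overlap: {ranges_overlap}")
```

The difference between medians (0.02 ms) is much smaller than the spread within either run, which is what "within run-to-run noise" refers to.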

Benchmark script and method

Dropped into tests/contrib/pydantic_ai/, run via the harness:
scripts/run-tests --venv <py3.12 pydantic-ai-slim 1.106.0> -- -s -- -s -k test_bench.
It raises AssertionError at the end so the harness surfaces the numbers (stdout is hidden on pass).
"Before" = the three integration files at merge-base (c641709); "After" = this branch.

import os
import statistics
import sys
import time

from pydantic_ai.mcp import MCPServerStdio
from pydantic_ai.models.test import TestModel


def tool_a() -> str:
    """Tool A"""
    return "a"


def tool_b(x: int) -> int:
    """Tool B"""
    return x


def tool_c(name: str) -> str:
    """Tool C"""
    return name


async def test_bench(pydantic_ai, pydantic_ai_llmobs):
    N = 300
    server_path = os.path.join(os.path.dirname(__file__), "mcp_server.py")
    mcp = MCPServerStdio(command=sys.executable, args=[server_path], id="square-mcp", env=os.environ.copy())
    agent = pydantic_ai.Agent(
        model=TestModel(), name="bench_agent", tools=[tool_a, tool_b, tool_c], toolsets=[mcp]
    )
    async with agent:
        for _ in range(40):  # warmup
            await agent.run("go")
        times = []
        for _ in range(5):
            t0 = time.perf_counter()
            for _ in range(N):
                await agent.run("go")
            times.append(time.perf_counter() - t0)
    med = statistics.median(times)
    raise AssertionError(
        f"BENCHRESULT median={med * 1e3 / N:.4f}ms/run reps={[round(t * 1e3 / N, 4) for t in times]}"
    )

The MCP connection is opened once (async with agent) and reused across all runs, so the measurement
isolates per-run instrumentation cost, not MCP setup. TestModel removes LLM network latency so
overhead is not buried under request time.

Additional Notes

Full resolved tool catalog (via ToolManager.for_run_step) is deferred until pydantic-ai 2.0 stable.

cit-pr-commenter-54b7da · 2026-06-09T07:47:38Z

Codeowners resolved as

ddtrace/llmobs/_integrations/pydantic_ai.py                             @DataDog/ml-observability

datadog-datadog-prod-us1 · 2026-06-09T07:50:57Z

Tests

⚠️ Warnings

🚦 18 Pipeline jobs failed

System Tests | tracer-release / End-to-end #1 / uds-flask 1

🧪 9 Tests failed

tests.appsec.test_asm_standalone.Test_AppSecStandalone_NotEnabled.test_client_computed_stats_header_is_not_present[uds-flask]

from system_tests_suite

assert 2 == 1

self = <tests.appsec.test_asm_standalone.Test_AppSecStandalone_NotEnabled object at 0x7fd54cd0f620>

    def test_client_computed_stats_header_is_not_present(self):
        spans_checked = 0
        for data, trace, _ in interfaces.library.get_spans(request=self.r):
            assert trace.trace_id_equals(1212121212121212122)
            assert "datadog-client-computed-stats" not in [x.lower() for x, y in data["request"]["headers"]]
            spans_checked += 1
...

tests.test_data_integrity.Test_TraceUniqueness.test_trace_ids[uds-flask] from system_tests_suite

ValueError: Found duplicated trace id 5013447155428939925 in ./logs/interfaces/library/00105__v0.4_traces.json and ./logs/interfaces/library/00102__v0.4_traces.json

self = <tests.test_data_integrity.Test_TraceUniqueness object at 0x7fd54cfa1040>

    def test_trace_ids(self):
>       interfaces.library.assert_trace_id_uniqueness()

tests/test_data_integrity.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

...

System Tests | tracer-release / End-to-end #5 / flask-poc 5

🧪 1 Test failed

tests.test_data_integrity.Test_TraceUniqueness.test_trace_ids[flask-poc] from system_tests_suite

ValueError: Found duplicated trace id 14233711099861840250 in ./logs/interfaces/library/00062__v0.4_traces.json and ./logs/interfaces/library/00061__v0.4_traces.json

self = <tests.test_data_integrity.Test_TraceUniqueness object at 0x7f8fa4581070>

    def test_trace_ids(self):
>       interfaces.library.assert_trace_id_uniqueness()

tests/test_data_integrity.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

...

System Tests | tracer-release / End-to-end #6 / django-py3.13 6

🧪 1 Test failed

tests.test_data_integrity.Test_TraceUniqueness.test_trace_ids[django-py3.13] from system_tests_suite

ValueError: Found duplicated trace id 10406335910600928621 in ./logs/interfaces/library/00129__v0.4_traces.json and ./logs/interfaces/library/00125__v0.4_traces.json

self = <tests.test_data_integrity.Test_TraceUniqueness object at 0x7f73e7f22ae0>

    def test_trace_ids(self):
>       interfaces.library.assert_trace_id_uniqueness()

tests/test_data_integrity.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

...

ℹ️ Info

No other issues found

❄️ No new flaky tests detected

🔄 Datadog auto-retried 2 jobs - 2 passed on retry

This comment will be updated automatically if new data arrives.

Commit SHA: 4b615a3

pr-commenter · 2026-06-09T08:37:19Z

Benchmarks

Benchmark execution time: 2026-06-10 13:41:35

Comparing candidate commit 4b615a3 in PR branch alex/pydantic-ai-tool-mcp-capture with baseline commit c641709 in branch main.

Found 0 performance improvements and 4 performance regressions! Performance is the same for 616 metrics, 10 unstable metrics.

scenario:iastaspects-index_aspect

🟥 execution_time [+14.692µs; +18.382µs] or [+11.901%; +14.890%]

scenario:iastaspects-title_aspect

🟥 execution_time [+39.556µs; +51.415µs] or [+12.039%; +15.648%]

scenario:iastaspectsospath-ospathbasename_aspect

🟥 execution_time [+97.861µs; +108.059µs] or [+22.569%; +24.921%]

scenario:span-start

🟥 execution_time [+1.460ms; +1.620ms] or [+9.576%; +10.626%]

PROFeNoM · 2026-06-09T12:02:48Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b07c3c9a5f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

PROFeNoM · 2026-06-09T14:04:39Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2825cd277d

PROFeNoM · 2026-06-09T14:18:06Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25a8d78761

PROFeNoM · 2026-06-09T15:47:29Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5688d79434

PROFeNoM · 2026-06-09T16:09:06Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a526d8f503

PROFeNoM · 2026-06-09T16:23:37Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eeb53bb4e3

PROFeNoM · 2026-06-09T16:34:54Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b0c3149ba

PROFeNoM · 2026-06-09T16:46:44Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2360f6ba7d

…am agent name

Extend the pydantic-ai LLMObs integration to record the full agent manifest: statically declared tools, externally/MCP-provided tools discovered during a run, and MCP server connection details (url/command/args, with credentials scrubbed from URLs and launch args).

Observed tools are attributed per agent run via the ddtrace span parent chain: the agent span seeds an observed-tools dict in its ctx item and each tool span walks up to its nearest agent ancestor to record there. This keeps attribution correct under concurrency and nested agent-as-tool delegation without any context-local token state.

Also honor pydantic-ai's `infer_name=False` on the run_stream path and re-infer the agent name through our proxy frame when it is left default.

…ming

Add coverage for external/MCP tool capture in the agent manifest, credential scrubbing of MCP urls and launch args, per-run and override toolsets, concurrent and nested-delegation tool attribution, agent entry failure, and run_stream name inference (including infer_name=False).

Add the mcp test server and the pydantic-ai mcp test venv.

PROFeNoM · 2026-06-10T07:54:19Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39e939d8e8

…sets

Dynamic, combined, and capability toolsets resolve to their MCP toolset only at run time, so they aren't reachable from the agent's static toolset list and were absent from the manifest's mcp_servers (orphaning the tool's mcp_server_id). Capture the realized MCP toolset from the observed tool call and merge it into mcp_servers.

The observed tool path stashes only the toolset object; scrubbing and formatting run once per run at manifest assembly, not per tool call.
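
The "stash now, format once" split described in this commit can be sketched as follows. `AgentRunState`, its method names, and the `format_server` callback are hypothetical names for this example; the sketch only shows the shape of a cheap per-call path and a single formatting pass at manifest assembly.

```python
from types import SimpleNamespace
from typing import Any, Callable


class AgentRunState:
    """Hypothetical per-agent-span accumulator for MCP toolsets."""

    def __init__(self, static_toolsets: list):
        self.static_toolsets = list(static_toolsets)
        self.observed_toolsets: dict = {}  # id(toolset) -> toolset

    def on_tool_call(self, toolset: Any) -> None:
        # Hot path: keep a reference only; no scrubbing or formatting here.
        self.observed_toolsets.setdefault(id(toolset), toolset)

    def build_mcp_servers(self, format_server: Callable[[Any], dict]) -> list:
        # Cold path, once per run: dedupe by object identity, then format each toolset once.
        unique = {id(t): t for t in [*self.static_toolsets, *self.observed_toolsets.values()]}
        return [format_server(t) for t in unique.values()]


static = SimpleNamespace(id="time-mcp")
dynamic = SimpleNamespace(id="square-mcp")  # only reachable at run time
state = AgentRunState([static])

for _ in range(3):
    state.on_tool_call(dynamic)
state.on_tool_call(static)

format_calls = []


def format_server(toolset):
    format_calls.append(toolset.id)
    return {"id": toolset.id}


mcp_servers = state.build_mcp_servers(format_server)
```

Four tool calls result in two formatting calls, and the dynamically resolved server appears in `mcp_servers` alongside the statically declared one.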

PROFeNoM · 2026-06-10T09:26:11Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7c5fe1599

The __aenter__ span-finish guard fixed a pre-existing span leak (and stale _run_stream_active flag) that exists on main, unrelated to tool/MCP capture. Removing it keeps this PR scoped to the feature; the entry-failure leak can be addressed in a separate fix PR.

The 3.13 cap wasn't required by any package metadata (pydantic-ai-slim 1.106.0 and mcp are both >=3.10 with no upper bound). Dropping it; the full pydantic_ai suite passes on 3.14 (63 passed).

Collapse the MCP test cluster from 8 functions to 4 without losing coverage: the 3 end-to-end runs become one parametrized test (static vs dynamic toolset; the redundant MCPToolset live-run is dropped, its only unique path is covered by the wrapper unit test), and the 5 _get_mcp_servers unit tests become 3 cohesive ones (credential scrubbing, source resolution, wrapper unwrapping). MCP/fastmcp imports stay function-local because the py3.9 venv can't install the mcp package and would fail collection otherwise.

Keep comments only where the behavior is non-obvious (toolset unwrapping, MCP detection, server resolution, credential scrubbing); drop the rest.

PROFeNoM force-pushed the alex/pydantic-ai-tool-mcp-capture branch from c77dfcb to b07c3c9 Compare June 9, 2026 07:54

chatgpt-codex-connector Bot reviewed Jun 9, 2026

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated

chatgpt-codex-connector Bot reviewed Jun 9, 2026

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated

chatgpt-codex-connector Bot reviewed Jun 9, 2026

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated

chatgpt-codex-connector Bot reviewed Jun 9, 2026

Comment thread ddtrace/contrib/internal/pydantic_ai/utils.py Outdated

Comment thread ddtrace/llmobs/_integrations/pydantic_ai.py Outdated