
feat(llmobs): add support for OpenAI Agents#7808

Open
crysmags wants to merge 27 commits into master from crysmags/openai-agents-test2

Conversation

@crysmags
Collaborator

What does this PR do?

Adds APM tracing and LLMObs instrumentation for the OpenAI Agents SDK (@openai/agents).

Tracing — instruments the following operations via Orchestrion (rewriter-based):

  • run() — agent execution span (openai-agents.run)
  • getResponse() / getStreamedResponse() — model request spans (openai-agents.getResponse, openai-agents.getStreamedResponse)
  • invokeFunctionTool() — tool call spans (openai-agents.invokeFunctionTool)
  • onInvokeHandoff() — agent handoff spans (openai-agents.onInvokeHandoff)
  • runInputGuardrails() / runOutputGuardrails() — guardrail spans

LLMObs — adds an LLMObs plugin (packages/dd-trace/src/llmobs/plugins/openai-agents/) that enriches spans with LLM observability tags (model, provider, token usage, input/output).

Motivation

The OpenAI Agents SDK is a first-party framework from OpenAI for building multi-agent systems in Node.js. It reached a stable API in >=0.7.0. Instrumenting it gives Datadog users distributed tracing and LLM observability for agent workflows without any code changes.

Additional Notes

  • Instrumentation targets @openai/agents-core and @openai/agents-openai (the sub-packages that contain the actual implementation); @openai/agents is the umbrella re-export.
  • Uses the Orchestrion rewriter pattern (same as ai, langchain) rather than traditional addHook — patches are applied at the function level via versionRange: '>=0.7.0'.
  • 14 tracing tests and 6 LLMObs tests added; both suites run with PLUGINS=openai-agents.

@crysmags crysmags changed the title from Crysmags/OpenAI agents test2 to feat(llmobs): add support for OpenAI Agents Mar 17, 2026
@crysmags crysmags force-pushed the crysmags/openai-agents-test2 branch from ab4ce04 to 7e401ba Compare March 17, 2026 16:31
@github-actions
Contributor

github-actions bot commented Mar 17, 2026

Overall package size

Self size: 5.48 MB
Deduped: 6.33 MB
No deduping: 6.33 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| import-in-the-middle | 3.0.1 | 82.56 kB | 817.39 kB |
| dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

Comment on lines +93 to +103
const usage = result.usage
if (usage) {
if (usage.inputTokens !== undefined) {
span.setTag('openai.response.usage.prompt_tokens', usage.inputTokens)
}
if (usage.outputTokens !== undefined) {
span.setTag('openai.response.usage.completion_tokens', usage.outputTokens)
}
if (usage.totalTokens !== undefined) {
span.setTag('openai.response.usage.total_tokens', usage.totalTokens)
}
Contributor

AFAIK we only tag token usage on the actual llm event spans, and not the apm tracing spans

Collaborator

yeah, we don't need to add really any tags on the APM spans here

Comment on lines +32 to +33
service: ANY_STRING,
resource: ANY_STRING,
Contributor

we should assert the actual values here.

Contributor

as well as the rest of the file for values that are constant

"default": null
}
],
"DD_TRACE_OPENAI_AGENTS_ENABLED": ["A"],
Contributor

seems like our code generator is generating the wrong format for this, should match the below structure.

Comment on lines +81 to +100
/**
* For streaming, the span finishes before stream iteration begins.
* Output data is not available, so we only tag inputs and metadata.
*
* @param {{ currentStore?: { span: object }, arguments?: Array<*> }} ctx
*/
setLLMObsTags (ctx) {
const span = ctx.currentStore?.span
if (!span) return

const request = ctx.arguments?.[0]
const inputMessages = extractInputMessages(request)

// Streaming spans finish before iteration; output is not available
this._tagger.tagLLMIO(span, inputMessages, [{ content: '', role: '' }])

const metadata = extractMetadata(request)
metadata.stream = true
this._tagger.tagMetadata(span, metadata)
}
Contributor

it seems to have skipped tagging streaming output, which we should collect by wrapping the returned async iterator.
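A hedged sketch of what wrapping the returned async iterator could look like; `wrapStreamedResponse` and `onComplete` are illustrative names, not the plugin's actual API:

```javascript
'use strict'

// Sketch only: wrap an async iterable so the accumulated chunks can be tagged
// on the LLMObs span once the stream has been consumed. `onComplete` stands in
// for whatever tagger call the plugin would actually make.
function wrapStreamedResponse (stream, onComplete) {
  return (async function * () {
    const chunks = []
    try {
      for await (const chunk of stream) {
        chunks.push(chunk)
        yield chunk
      }
    } finally {
      // Fires whether the consumer finishes the stream or breaks out early
      onComplete(chunks)
    }
  })()
}

module.exports = { wrapStreamedResponse }
```

The consumer iterates the wrapper exactly as it would the original stream, so instrumentation stays transparent to user code.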

Comment on lines +110 to +112
if (baseURL.includes('azure')) return 'azure_openai'
if (baseURL.includes('deepseek')) return 'deepseek'
return 'openai'
Contributor

can we parse the url instead of hardcoding?
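A minimal sketch of the suggested change, using the WHATWG URL API so that a path or query segment containing "azure" cannot cause a false match; the function name and hostname suffixes are illustrative:

```javascript
'use strict'

// Sketch: derive the provider from the parsed hostname instead of
// substring-matching the raw baseURL string.
function getModelProviderFromBaseURL (baseURL) {
  let hostname
  try {
    hostname = new URL(baseURL).hostname
  } catch {
    return 'openai' // unparseable URL: fall back to the default provider
  }
  if (hostname.endsWith('.azure.com')) return 'azure_openai'
  if (hostname === 'deepseek.com' || hostname.endsWith('.deepseek.com')) return 'deepseek'
  return 'openai'
}

module.exports = { getModelProviderFromBaseURL }
```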

Comment on lines +132 to +158
for (const item of input) {
if (item.type === 'message') {
const role = item.role
if (!role) continue

let content = ''
if (Array.isArray(item.content)) {
const textParts = item.content
.filter(c => c.type === 'input_text' || c.type === 'text')
.map(c => c.text)
content = textParts.join('')
} else if (typeof item.content === 'string') {
content = item.content
}

if (content) {
messages.push({ role, content })
}
} else if (item.type === 'function_call') {
let args = item.arguments
if (typeof args === 'string') {
try {
args = JSON.parse(args)
} catch {
args = {}
}
}
Contributor

can we break this function into some helpers? Would help to improve readability.
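One possible decomposition, sketched with hypothetical helper names rather than the PR's actual code:

```javascript
'use strict'

// Illustrative refactor of the loop above into small, testable helpers.
// extractTextContent and parseFunctionArguments are hypothetical names.

// Flatten a message item's content (string or array of text parts) to a string.
function extractTextContent (content) {
  if (typeof content === 'string') return content
  if (Array.isArray(content)) {
    return content
      .filter(c => c.type === 'input_text' || c.type === 'text')
      .map(c => c.text)
      .join('')
  }
  return ''
}

// Parse function-call arguments, tolerating malformed JSON.
function parseFunctionArguments (args) {
  if (typeof args !== 'string') return args
  try {
    return JSON.parse(args)
  } catch {
    return {}
  }
}

// The message branch of the original loop, now a one-screen function.
function extractMessages (input) {
  const messages = []
  for (const item of input) {
    if (item.type === 'message' && item.role) {
      const content = extractTextContent(item.content)
      if (content) messages.push({ role: item.role, content })
    }
  }
  return messages
}

module.exports = { extractTextContent, parseFunctionArguments, extractMessages }
```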

@PROFeNoM
Contributor

PROFeNoM commented Mar 20, 2026

At first glance, looking at packages/dd-trace/test/llmobs/plugins/openai-agents/index.spec.js, I can see that only llm LLMObs span kinds are being generated. Looking at tests/contrib/openai_agents/test_openai_agents_llmobs.py in dd-trace-py, the Python integration generates workflow, agent, llm, tool, and task span kinds. Are we missing observability for some operations?

More importantly, the Python integration achieves this by hooking into the SDK's TracingProcessor interface via add_trace_processor(), which gives it the full semantic span tree. The JS SDK has the same infrastructure: addTraceProcessor() and a TracingProcessor interface with onTraceStart/onTraceEnd/onSpanStart/onSpanEnd.
⚠️ I strongly believe this is the approach to use. ⚠️

@wconti27
Contributor

At first glance, looking at packages/dd-trace/test/llmobs/plugins/openai-agents/index.spec.js, I can see that only llm LLMObs span kinds are being generated. Looking at tests/contrib/openai_agents/test_openai_agents_llmobs.py in dd-trace-py, the Python integration generates workflow, agent, llm, tool, and task span kinds. Are we missing observability for some operations?

More importantly, the Python integration achieves this by hooking into the SDK's TracingProcessor interface via add_trace_processor(), which gives it the full semantic span tree. The JS SDK has the same infrastructure: addTraceProcessor() and a TracingProcessor interface with onTraceStart/onTraceEnd/onSpanStart/onSpanEnd. ⚠️ I strongly believe this is the approach to use. ⚠️

We actually had a few meetings about how to handle these cases: one with @sabrenner, another with our larger IDM team, and another with the Node core engineers. Basically, we decided to use these types of trace processors only when the processor is OTel compatible. Given that the tracing provided by the package is not OTel but some form of internal tracing, which can change on a whim and leave our instrumentation broken, we decided not to go that route for these types of cases.

@PROFeNoM
Contributor

which can change on a whim leaving our instrumentation broken

Hmm... I'm not totally convinced by that, tbh. By that I mean that tracing specific internal methods is inherently more brittle than relying on an interface that should follow semver. It's not guaranteed, I agree, but the odds are at least better.

Regardless of the approach chosen, it still seems to me that there is quite some difference between the current Python integration and the proposed Node.js one (only llm LLMObs span kinds are being generated). Is there a plan to align both integrations?

@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 bot commented Mar 25, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 27.36%
Overall Coverage: 68.15% (-0.37%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 34073c2 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

@codecov

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 26.21723% with 197 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.90%. Comparing base (e68f386) to head (7fa17a3).
⚠️ Report is 80 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...dd-trace/src/llmobs/plugins/openai-agents/utils.js | 1.85% | 106 Missing ⚠️ |
| ...dd-trace/src/llmobs/plugins/openai-agents/index.js | 20.17% | 91 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7808      +/-   ##
==========================================
- Coverage   80.45%   73.90%   -6.56%     
==========================================
  Files         748      776      +28     
  Lines       32411    36335    +3924     
==========================================
+ Hits        26076    26853     +777     
- Misses       6335     9482    +3147     
Flag Coverage Δ
aiguard-macos 39.44% <33.33%> (+0.24%) ⬆️
aiguard-ubuntu 39.56% <33.33%> (+0.24%) ⬆️
aiguard-windows 39.23% <33.33%> (+0.17%) ⬆️
apm-capabilities-tracing-macos 49.26% <20.07%> (+0.35%) ⬆️
apm-capabilities-tracing-ubuntu 49.18% <20.07%> (+0.23%) ⬆️
apm-capabilities-tracing-windows 49.03% <20.07%> (+0.35%) ⬆️
apm-integrations-child-process 38.74% <33.33%> (+0.23%) ⬆️
apm-integrations-couchbase-18 37.52% <33.33%> (+0.09%) ⬆️
apm-integrations-couchbase-eol 38.04% <33.33%> (+0.15%) ⬆️
apm-integrations-oracledb 37.87% <33.33%> (+0.14%) ⬆️
appsec-express 55.38% <33.33%> (+0.12%) ⬆️
appsec-fastify 51.71% <33.33%> (+0.10%) ⬆️
appsec-graphql 51.86% <33.33%> (+0.06%) ⬆️
appsec-kafka 44.48% <33.33%> (+0.11%) ⬆️
appsec-ldapjs 44.11% <33.33%> (+0.10%) ⬆️
appsec-lodash 43.70% <33.33%> (+0.07%) ⬆️
appsec-macos 58.08% <33.33%> (-0.11%) ⬇️
appsec-mongodb-core 48.89% <33.33%> (+0.10%) ⬆️
appsec-mongoose 49.55% <33.33%> (+0.10%) ⬆️
appsec-mysql 51.07% <33.33%> (+0.21%) ⬆️
appsec-node-serialize 43.29% <33.33%> (+0.10%) ⬆️
appsec-passport 47.75% <33.33%> (+0.11%) ⬆️
appsec-postgres 50.70% <33.33%> (+0.11%) ⬆️
appsec-sourcing 42.54% <33.33%> (-0.07%) ⬇️
appsec-stripe 44.73% <33.33%> (?)
appsec-template 43.45% <33.33%> (+0.10%) ⬆️
appsec-ubuntu 58.17% <33.33%> (-0.09%) ⬇️
appsec-windows 57.91% <33.33%> (-0.14%) ⬇️
instrumentations-instrumentation-bluebird 32.33% <33.33%> (-0.02%) ⬇️
instrumentations-instrumentation-body-parser 40.63% <33.33%> (+0.11%) ⬆️
instrumentations-instrumentation-child_process 38.07% <33.33%> (+0.24%) ⬆️
instrumentations-instrumentation-cookie-parser 34.36% <33.33%> (+0.04%) ⬆️
instrumentations-instrumentation-express 34.67% <33.33%> (+0.04%) ⬆️
instrumentations-instrumentation-express-mongo-sanitize 34.49% <33.33%> (+0.04%) ⬆️
instrumentations-instrumentation-express-session 40.27% <33.33%> (+0.11%) ⬆️
instrumentations-instrumentation-fs 32.01% <33.33%> (+0.05%) ⬆️
instrumentations-instrumentation-generic-pool 29.41% <50.00%> (-0.11%) ⬇️
instrumentations-instrumentation-http 39.99% <33.33%> (+0.19%) ⬆️
instrumentations-instrumentation-knex 32.39% <33.33%> (+0.05%) ⬆️
instrumentations-instrumentation-mongoose 33.51% <33.33%> (+0.04%) ⬆️
instrumentations-instrumentation-multer 40.38% <33.33%> (+0.11%) ⬆️
instrumentations-instrumentation-mysql2 38.40% <33.33%> (+0.12%) ⬆️
instrumentations-instrumentation-passport 44.16% <33.33%> (+0.11%) ⬆️
instrumentations-instrumentation-passport-http 43.84% <33.33%> (+0.11%) ⬆️
instrumentations-instrumentation-passport-local 44.37% <33.33%> (+0.11%) ⬆️
instrumentations-instrumentation-pg 37.84% <33.33%> (+0.12%) ⬆️
instrumentations-instrumentation-promise 32.26% <33.33%> (-0.02%) ⬇️
instrumentations-instrumentation-promise-js 32.26% <33.33%> (-0.02%) ⬇️
instrumentations-instrumentation-q 32.31% <33.33%> (-0.02%) ⬇️
instrumentations-instrumentation-url 32.23% <33.33%> (-0.02%) ⬇️
instrumentations-instrumentation-when 32.28% <33.33%> (-0.02%) ⬇️
llmobs-ai 41.37% <33.33%> (-0.89%) ⬇️
llmobs-anthropic 40.84% <33.33%> (+0.55%) ⬆️
llmobs-bedrock 39.32% <33.33%> (+0.08%) ⬆️
llmobs-google-genai 39.87% <33.33%> (-0.04%) ⬇️
llmobs-langchain 39.45% <33.33%> (-0.58%) ⬇️
llmobs-openai 44.12% <33.33%> (+0.15%) ⬆️
llmobs-vertex-ai 40.13% <33.33%> (+0.09%) ⬆️
platform-core 31.47% <ø> (ø)
platform-esbuild 34.42% <ø> (ø)
platform-instrumentations-misc 34.19% <100.00%> (-14.22%) ⬇️
platform-shimmer 37.56% <ø> (ø)
platform-unit-guardrails 32.89% <ø> (ø)
platform-webpack 19.88% <83.33%> (?)
plugins-azure-durable-functions 25.86% <100.00%> (+0.11%) ⬆️
plugins-azure-event-hubs 26.02% <100.00%> (+0.11%) ⬆️
plugins-azure-service-bus 25.38% <100.00%> (+0.11%) ⬆️
plugins-bullmq 43.60% <33.33%> (-0.60%) ⬇️
plugins-cassandra 38.02% <33.33%> (+0.25%) ⬆️
plugins-cookie 27.08% <100.00%> (+0.11%) ⬆️
plugins-cookie-parser 26.86% <100.00%> (+0.11%) ⬆️
plugins-crypto 26.73% <ø> (ø)
plugins-dd-trace-api 38.43% <33.33%> (+0.11%) ⬆️
plugins-express-mongo-sanitize 27.01% <100.00%> (+0.11%) ⬆️
plugins-express-session 26.82% <100.00%> (+0.11%) ⬆️
plugins-fastify 42.36% <33.33%> (+0.12%) ⬆️
plugins-fetch 38.51% <33.33%> (+0.18%) ⬆️
plugins-fs 38.75% <33.33%> (+0.14%) ⬆️
plugins-generic-pool 26.06% <100.00%> (+0.11%) ⬆️
plugins-google-cloud-pubsub 45.68% <33.33%> (+0.25%) ⬆️
plugins-grpc 41.01% <33.33%> (+0.10%) ⬆️
plugins-handlebars 27.05% <100.00%> (+0.11%) ⬆️
plugins-hapi 40.27% <33.33%> (+0.12%) ⬆️
plugins-hono 40.60% <33.33%> (+0.19%) ⬆️
plugins-ioredis 38.60% <33.33%> (+0.18%) ⬆️
plugins-knex 26.68% <100.00%> (+0.11%) ⬆️
plugins-langgraph 37.99% <33.33%> (-0.47%) ⬇️
plugins-ldapjs 24.55% <100.00%> (+0.11%) ⬆️
plugins-light-my-request 26.42% <100.00%> (+0.11%) ⬆️
plugins-limitd-client 32.61% <33.33%> (-0.01%) ⬇️
plugins-lodash 26.15% <100.00%> (+0.11%) ⬆️
plugins-mariadb 39.61% <33.33%> (+0.15%) ⬆️
plugins-memcached 38.34% <33.33%> (+0.20%) ⬆️
plugins-microgateway-core 39.41% <33.33%> (+0.19%) ⬆️
plugins-moleculer 40.63% <33.33%> (+0.12%) ⬆️
plugins-mongodb 39.27% <33.33%> (+0.11%) ⬆️
plugins-mongodb-core 39.11% <33.33%> (+0.12%) ⬆️
plugins-mongoose 38.92% <33.33%> (+0.08%) ⬆️
plugins-multer 26.82% <100.00%> (+0.11%) ⬆️
plugins-mysql 39.45% <33.33%> (+0.29%) ⬆️
plugins-mysql2 39.40% <33.33%> (+0.15%) ⬆️
plugins-node-serialize 27.12% <100.00%> (+0.11%) ⬆️
plugins-openai-agents 34.96% <26.21%> (?)
plugins-opensearch 37.74% <33.33%> (+0.15%) ⬆️
plugins-passport-http 26.87% <100.00%> (+0.11%) ⬆️
plugins-postgres 35.54% <33.33%> (-0.03%) ⬇️
plugins-process 26.73% <ø> (ø)
plugins-pug 27.08% <100.00%> (+0.11%) ⬆️
plugins-redis 39.04% <33.33%> (+0.16%) ⬆️
plugins-router 43.22% <33.33%> (+0.12%) ⬆️
plugins-sequelize 25.66% <100.00%> (+0.11%) ⬆️
plugins-test-and-upstream-amqp10 38.61% <33.33%> (+0.12%) ⬆️
plugins-test-and-upstream-amqplib 44.36% <33.33%> (+0.50%) ⬆️
plugins-test-and-upstream-apollo 39.23% <33.33%> (+0.13%) ⬆️
plugins-test-and-upstream-avsc 38.69% <33.33%> (+0.07%) ⬆️
plugins-test-and-upstream-bunyan 33.94% <33.33%> (+0.06%) ⬆️
plugins-test-and-upstream-connect 40.93% <33.33%> (+0.12%) ⬆️
plugins-test-and-upstream-graphql 40.27% <33.33%> (+0.17%) ⬆️
plugins-test-and-upstream-koa 40.52% <33.33%> (+0.12%) ⬆️
plugins-test-and-upstream-protobufjs 38.92% <33.33%> (+0.07%) ⬆️
plugins-test-and-upstream-rhea 44.39% <33.33%> (+0.35%) ⬆️
plugins-undici 39.36% <33.33%> (+0.27%) ⬆️
plugins-url 26.73% <ø> (ø)
plugins-valkey 38.31% <33.33%> (+0.22%) ⬆️
plugins-vm 26.73% <ø> (ø)
plugins-winston 34.26% <33.33%> (+0.19%) ⬆️
plugins-ws 42.12% <33.33%> (+0.26%) ⬆️
profiling-macos 40.65% <33.33%> (+0.09%) ⬆️
profiling-ubuntu 40.77% <33.33%> (-0.32%) ⬇️
profiling-windows 42.29% <33.33%> (+0.45%) ⬆️
serverless-azure-functions-client 25.74% <100.00%> (+0.11%) ⬆️
serverless-azure-functions-eventhubs 25.74% <100.00%> (+0.11%) ⬆️
serverless-azure-functions-servicebus 25.74% <100.00%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@pr-commenter

pr-commenter bot commented Mar 25, 2026

Benchmarks

Benchmark execution time: 2026-04-08 18:09:48

Comparing candidate commit 34073c2 in PR branch crysmags/openai-agents-test2 with baseline commit 2bac203 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 227 metrics, 33 unstable metrics.

@crysmags crysmags marked this pull request as ready for review March 25, 2026 23:05
@crysmags crysmags requested review from a team as code owners March 25, 2026 23:05
@crysmags crysmags requested review from ida613 and removed request for a team March 25, 2026 23:05
@crysmags
Collaborator Author

which can change on a whim leaving our instrumentation broken

Hmm... I'm not totally convinced by that, tbh. By that I mean that tracing specific internal methods is inherently more brittle than relying on an interface that should follow semver. It's not guaranteed, I agree, but the odds are at least better.

Regardless of the approach chosen, it still seems to me that there is quite some difference between the current Python integration and the proposed Node.js one (only llm LLMObs span kinds are being generated). Is there a plan to align both integrations?

I took a look at what the internal traces from OpenAI Agents were covering to see what the span kinds actually represented, then compared that to what we are covering. For tracing we had the same instrumentation points; for LLM Observability, however, we were only creating plugins that covered span_kind:llm. I added the additional plugins to match tracing, which gives us the same level of coverage that you see in Python.

@PROFeNoM
Contributor

PROFeNoM commented Apr 1, 2026

Hey 👋 I was testing the integration with a minimal demo app and ran into two issues. I spent quite some time debugging so I wanted to share what I found.

Setup

Simple app with a single agent, run against @openai/agents@^0.7.0, with LLMObs enabled and dd-trace loaded via --import dd-trace/initialize.mjs.

I tested three configurations:

| Test | App file | package.json | Module system | dd-trace init |
|------|----------|--------------|---------------|---------------|
| 1 | app.mjs | "type": "module" | ESM | --import dd-trace/initialize.mjs |
| 2 | app.js | (no type field) | CJS | --require dd-trace/init |
| 3 | app.js | "type": "module" | ESM | --import dd-trace/initialize.mjs |

Tests 1 and 3 are equivalent (both ESM). Test 3 is the modern recommended approach (.js + "type": "module").

The app itself is minimal:

import { Agent, run } from '@openai/agents'

const agent = new Agent({
  name: 'Simple Agent',
  instructions: 'You are a helpful assistant.',
  model: 'gpt-4o'
})

const result = await run(agent, 'What is the capital of France?')

Traces from the test runs:

[screenshot of the resulting traces]

Issue 1: ESM apps don't generate any openai-agents spans

In tests 1 and 3 (ESM), I only see the vanilla openai LLM span OpenAI.createResponse. No workflow, no agent, no tool spans. The openai-agents integration doesn't fire at all.

I added some logging to the rewriter to see what files it was processing:

[RW] checking: @openai/agents-core dist/run.mjs        -> no transformer
[RW] checking: @openai/agents-openai dist/openaiResponsesModel.mjs  -> no transformer
[RW] checking: @openai/agents-core dist/tool.mjs        -> no transformer
[RW] checking: @openai/agents-core dist/handoff.mjs     -> no transformer

The rewriter sees .mjs files, but the instrumentation config targets .js:

{
  module: {
    name: '@openai/agents-core',
    filePath: 'dist/run.js',       // <-- .js
  },
}

The matcher does strict equality (filePath === file_path), so dist/run.mjs !== dist/run.js: no match, no transformation.

It appears the @openai/agents packages ship dual format. The exports field in @openai/agents-core maps "require" to .js and "import" to .mjs. Any ESM app using import triggers the .mjs path. Only CJS apps using require() would get .js.
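For reference, a conditional exports map of roughly this shape (abridged and illustrative, not the package's exact file) is what produces the split:

```json
{
  "exports": {
    ".": {
      "require": "./dist/index.js",
      "import": "./dist/index.mjs"
    }
  }
}
```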

Issue 2: CJS works, but missing agent span compared to Python

In test 2 (CJS), the rewriter matches .js files and the integration fires. But the trace structure differs from what Python produces for the same scenario (simple agent, "What is the capital of France?"):

Python (4 LLMObs spans):

Workflow "Agent workflow"
  -> Agent "Simple Agent"
       -> LLM "Simple Agent (LLM)"
            -> LLM "OpenAI.createResponse"

NodeJS CJS (3 LLMObs spans):

Workflow "openai-agents.run"
  -> LLM "openai-agents.getResponse"
       -> LLM "OpenAI.createResponse"

The JS integration is missing the agent span. In Python, the agent span sits between the workflow and the LLM call and carries the agent name ("Simple Agent"). In the JS integration, the workflow span directly parents the LLM span with no agent in between.

This also means in multi-agent handoff scenarios, there's no clear boundary between which agent is running; the handoff tool span fires but there's no parent agent span to anchor it to.

A few other differences I noticed on the CJS spans:

  • Span names use internal function names (openai-agents.run, openai-agents.getResponse) rather than user-facing names (Agent workflow, Simple Agent)
  • The openai-agents.getResponse LLM span is missing metadata that Python's equivalent "Simple Agent (LLM)" span includes (text, tool_choice, truncation)

Questions

  1. Was the integration tested against ESM apps using the published npm package?
Wondering if maybe I'm doing something wrong (but then, any customer could do something wrong, so it doesn't really matter)

  2. Is the missing agent span intentional? I saw that the rewriter targets run, getResponse, invokeFunctionTool, onInvokeHandoff, and guardrails...but there's no hook for agent invocation itself (Python wraps _run_single_turn for this). Curious if this is planned or if I'm missing something.

For issue 1, I see we're already handling this in the anthropic integration, which loops over both extensions:

const extensions = ['js', 'mjs']
for (const extension of extensions) {
  addHook({
    name: '@anthropic-ai/sdk',
    file: `resources/messages.${extension}`,
    versions: ['>=0.14.0 <0.33.0'],
  }, exports => { ... })
}

The rewriter instrumentation config would need something similar - registering both .js and .mjs variants for each target file. Not sure if the rewriter supports that pattern directly or if the matcher would need to be adjusted.
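As a rough sketch (not the rewriter's actual API), assuming each config entry carries a module.filePath ending in .js, the twin registration could be generated like this:

```javascript
'use strict'

// Sketch: expand each .js instrumentation entry into a .js/.mjs pair so the
// strict-equality matcher hits both build outputs. The entry shape mirrors the
// config snippet above but is otherwise illustrative.
function withMjsTwins (entries) {
  return entries.flatMap(entry => ['js', 'mjs'].map(ext => ({
    ...entry,
    module: {
      ...entry.module,
      filePath: entry.module.filePath.replace(/\.js$/, `.${ext}`)
    }
  })))
}

module.exports = { withMjsTwins }
```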

Collaborator

@sabrenner sabrenner left a comment

did a first pass - as @PROFeNoM referenced in his comment, we can try and guide the toolkit to follow the existing tracing results (producing the same spans) that the Python integration has.

Comment on lines +3 to +7
// TODO: Add agent-level LLMObs span (kind: 'agent') wrapping per-agent async execution.
// Python achieves this via add_trace_processor(LLMObsTraceProcessor) which hooks
// Span.start() / Span.end() on the SDK's internal Span class (dist/tracing/spans.js).
// The equivalent here would be hooking Span.prototype.start / Span.prototype.end via
// orchestrion. Requires team sign-off before implementation.
Collaborator

does this comment still apply? looks like we patch the run method below, this should allow us to capture the agent-level spans i think (but, correct me if i'm wrong, and i'll also run through this locally after giving a first review)

Collaborator Author

For the simple single-agent case, run() already has everything we need to emit an agent span — we know the agent name, input, and output, and the hook wraps the full execution. The gap is multi-agent handoff scenarios: run() only gives us the starting agent, so we can't derive per-agent execution boundaries mid-run without hooking something lower-level like prepareAgentArtifacts.

That said, onInvokeHandoff is already instrumented separately, so the combination of run() + onInvokeHandoff covers most handoff observability. The missing piece is strictly the parent relationship — having the agent span wrap its own LLM calls in a handoff chain.

For the simple case, would it make sense to emit an agent span from the existing run() hook and update the TODO to note the handoff limitation? Or are you looking for full Python parity on the parent hierarchy, which would require a new hook point?

if (baseURL) {
const host = this.getHostFromBaseURL(baseURL)
if (host) {
tags['out.host'] = host
Collaborator

we typically haven't done stuff like this for the APM side of llm-type or agentic integrations, any reason we're including it here? maybe we're good to just tag model name and provider

const tags = {
component: 'openai-agents',
'span.kind': 'client',
'ai.request.model_provider': 'openai',
Collaborator

i actually don't think we set any APM tags for the openai-agents package in the Python integration, we're probably good to just not set any tags


getTags (ctx) {
const tags = super.getTags(ctx)
tags['openai.request.stream'] = 'true'
Collaborator

i think this is maybe the one tag we might wanna keep, but we could also just get rid of it too. all tagging/metadata can just be done on the LLMObs spans


const TracingPlugin = require('../../dd-trace/src/plugins/tracing')

class BaseOpenaiAgentsInternalPlugin extends TracingPlugin {
Collaborator

i think we should not give the static prefixes below in this base class, and instead let all implementers define them (for example, the RunPlugin below would define these fields, as the other implementing Plugins here do).

* @param {string} baseURL - The base URL of the OpenAI client
* @returns {string} The model provider name
*/
function getModelProvider (baseURL) {
Collaborator

we just landed a change which updates this logic elsewhere: a7de9c0

i wonder if we can refactor both here and that instance into a shared getModelProviderFromOpenAIBaseUrl function, or something like that, so that any logic updates are shared.

Collaborator

i think for all of the inlined-functions here, we can move them to a util.js file in this folder

if (savedAgentUrl !== undefined) process.env.DD_TRACE_AGENT_URL = savedAgentUrl
if (savedAgentPort !== undefined) process.env.DD_TRACE_AGENT_PORT = savedAgentPort
})

Collaborator

i believe this setup should not be needed; it works fine with all other llmobs tests without this change. are we able to remove these blocks?

Comment on lines +42 to +46
for (const key of Object.keys(require.cache)) {
if (key.includes('@openai/agents')) {
delete require.cache[key]
}
}
Collaborator

this approach also isn't needed for the langchain or langgraph suites, which also use orchestrion. can we try getting rid of this and follow the same patterns we use in those test suites?

crysmags and others added 23 commits April 8, 2026 13:29
Instruments the OpenAI Agents SDK with Datadog APM tracing. Adds span
coverage for agent runs, model calls (getResponse, getStreamedResponse),
tool invocations, and handoffs with full semantic tag support.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Align LLMObs plugin and test directory names with the integration/plugin
ID (openai-agents) rather than the npm sub-package name (openai-agents-core).

Both test suites now run with the same PLUGINS=openai-agents value:
  tracing: PLUGINS=openai-agents yarn test:plugins
  llmobs:  PLUGINS=openai-agents yarn test:llmobs:plugins

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Accidentally committed during workflow test run; not a source file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Registers @openai/agents-core and @openai/agents-openai with their
version ranges so yarn services correctly handles them and withVersions
picks them up for the test matrix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The versions/package.json latests file is read-only by install_plugin_modules.js
and does not need its deps resolved in the root yarn.lock.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ming

- Rewriter: generate .mjs twins via flatMap so ESM apps get instrumented
- LLM spans: name as `{modelName} (LLM)` instead of internal method name
- Workflow spans: use run() options.workflowName, default 'Agent workflow'
- Handoff spans: name as `transfer_to_{agentName}` (Python parity)
- Metadata: map camelCase modelSettings to snake_case keys (top_p, max_tokens, etc.)
- Metadata: include request.tools list
- Metrics: capture reasoning_tokens from outputTokensDetails
- Workflow: extract agent manifest into metadata._dd.agent_manifest
- TODO: agent-level span (requires Span.start/end hook, needs team approval)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
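The camelCase-to-snake_case metadata mapping mentioned in the commit above can be sketched roughly as follows; the function names and the exact key set are illustrative assumptions, not the plugin's actual code.

```javascript
// Hypothetical sketch: convert camelCase modelSettings keys to the
// snake_case metadata keys (top_p, max_tokens, ...) described above.
function toSnakeCase (key) {
  return key.replace(/([a-z0-9])([A-Z])/g, '$1_$2').toLowerCase()
}

function mapModelSettings (modelSettings = {}) {
  const metadata = {}
  for (const [key, value] of Object.entries(modelSettings)) {
    if (value !== undefined) metadata[toSnakeCase(key)] = value
  }
  return metadata
}

// mapModelSettings({ topP: 0.9, maxTokens: 256 })
// → { top_p: 0.9, max_tokens: 256 }
```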
… them

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…plugins/index.js

The facade package has no hooks or rewriter entries — only @openai/agents-core
and @openai/agents-openai are actually instrumented.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ns/index.js

@openai/agents-openai depends on @openai/agents-core, so the plugin is always
registered when @openai/agents-core loads first. The second entry is a no-op.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…om hooks.js

@openai/agents-openai depends on @openai/agents-core, so the instrumentation
file is already loaded (and the shims for both packages registered) before
@openai/agents-openai ever loads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Refactor APM plugins: switch from ClientPlugin/CompositePlugin object
  spread to TracingPlugin with individual static prefix/spanName per class;
  export as arrays
- Remove all model/provider/host/usage tags from APM spans (LLMObs-only)
- Extract LLMObs helpers into utils.js (getModelProvider, extractAgentManifest,
  extractInputMessages, extractOutputMessages, etc.) for testability
- Fix getModelProvider to fall back to 'unknown' instead of empty string
- Fix TypeScript definition comment for openai-agents integration
- Restore accidentally-dropped supported-configurations.json entries
- Add DD_TRACE_OPENAI_AGENTS_ENABLED to supported-configurations.json
- Fix test-setup.js: use versioned absolute paths for @openai/agents-openai
  and openai resolution; fix module.Agent → clientModule.Agent references
- Fix LLMObs spec: use withVersions() wrapper, fix openai require path,
  add metadata: MOCK_NOT_NULLISH assertions for run() workflow spans

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rovider utility

- Extract getOpenAIModelProvider() into a shared plugins/utils.js, used by
  both the openai and openai-agents LLMObs plugins (eliminates duplicate logic
  and incorporates the 'unknown' fallback for unrecognised base URLs)
- Convert index.js plugin registration to object-keyed accumulation pattern,
  consistent with the langgraph plugin
- Add unique static id to each tracing plugin subclass (required for the
  object-keyed pattern)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
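A minimal sketch of what a shared provider lookup with an 'unknown' fallback might look like; the hostnames checked and the function shape are assumptions for illustration, not the actual utils.js implementation.

```javascript
// Illustrative provider detection from a client base URL. Unrecognised
// hosts fall back to 'unknown' rather than an empty string, per the
// commit above.
function getModelProvider (baseURL) {
  if (!baseURL) return 'openai' // default client targets the OpenAI API
  try {
    const host = new URL(baseURL).hostname
    if (host.endsWith('openai.azure.com')) return 'azure_openai'
    if (host.endsWith('openai.com')) return 'openai'
  } catch (err) {
    // unparseable URL: fall through to the fallback below
  }
  return 'unknown'
}
```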
…onse

Implement full streaming support using AsyncIterator orchestrion pattern:
- Switch getStreamedResponse instrumentation to kind: 'AsyncIterator'
- Add GetStreamedResponseNextPlugin (APM) to keep span open until iterator
  exhaustion, fixing premature span close via traceSync end() side-effect
- Add GetStreamedResponseNextLLMObsPlugin (LLMObs) to accumulate
  response_done event and tag span with full I/O, metrics, and metadata
  once the stream completes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
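The AsyncIterator pattern described above can be sketched generically: wrap the returned iterable so a completion callback (e.g. finishing the span) fires only once the consumer exhausts the stream, not when the instrumented method returns. This is a simplified illustration, not the plugin's code; `onIteratorDone` is a hypothetical helper name.

```javascript
// Wrap an async iterable so onDone runs exactly when iteration finishes
// (normally or via early exit/error), keeping e.g. a span open until the
// last chunk is consumed.
function onIteratorDone (iterable, onDone) {
  return {
    async * [Symbol.asyncIterator] () {
      try {
        yield * iterable
      } finally {
        onDone() // e.g. span.finish() after stream exhaustion
      }
    }
  }
}
```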
@crysmags crysmags force-pushed the crysmags/openai-agents-test2 branch from 934e578 to 45b42c3 on April 8, 2026 17:32
crysmags and others added 4 commits April 8, 2026 13:34
…n in llmobs workflow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test action

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…date

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wconti27 wconti27 added the apm-integration-toolkit PR Generated by APM AI Integration Toolkit label Apr 9, 2026
Labels: apm-integration-toolkit (PR Generated by APM AI Integration Toolkit), semver-minor