fix(eval): Add agent tool trajectory eval #854

nina-kollman · 2026-01-08T15:51:51Z

Fixes: TLP-1615

Important

Add agent tool trajectory evaluation feature using Traceloop SDK with new script and SDK updates.

New Features:
- Add run:agent_tool_trajectory script in package.json for running agent tool trajectory evaluation.
- New file sample_agent_tool_trajectory.ts demonstrates agent tool trajectory evaluation using Traceloop SDK.

SDK Updates:
- Add AgentToolTrajectoryConfig type in mbt-evaluators.ts.
- Add agentToolTrajectory() method in EvaluatorMadeByTraceloop class in mbt-evaluators.ts.
- Update EVALUATOR_SLUGS and EVALUATOR_SCHEMAS in registry.ts to include agent-tool-trajectory.
- Add AgentToolTrajectoryRequest schema in types.ts.


Summary by CodeRabbit

New Features
- Added an example demonstrating agent tool trajectory evaluation with the Traceloop SDK, including initialization, evaluator configuration, experiment execution, tool-call tracking, result summaries, and error handling.
- Added an npm script to run the example conveniently from the project.


coderabbitai · 2026-01-08T15:52:02Z

📝 Walkthrough

Walkthrough

Added a new npm script and a new TypeScript sample demonstrating Traceloop SDK initialization, agent-tool-trajectory evaluator configuration, running an experiment with a dataset, and printing a results summary with error handling.

Changes

Cohort / File(s)	Summary
Agent Tool Trajectory Example `packages/sample-app/package.json`, `packages/sample-app/src/sample_agent_tool_trajectory.ts`	Added npm script `run:agent_tool_trajectory`; added new example `sample_agent_tool_trajectory.ts` that initializes the SDK, obtains a client, defines agent evaluator task (executed & expected tool calls), configures evaluators, runs an experiment with dataset and evaluator metadata, and logs results and errors.

Sequence Diagram

sequenceDiagram
    participant SampleApp as Sample App
    participant Traceloop as Traceloop SDK
    participant Client as Traceloop Client
    participant Experiment as Experiment Runner

    SampleApp->>Traceloop: initialize()
    Traceloop-->>SampleApp: init complete
    SampleApp->>Traceloop: getClient()
    Traceloop-->>SampleApp: Client
    SampleApp->>SampleApp: define agentEvaluatorsTask (tool calls + expected)
    SampleApp->>SampleApp: configure evaluators (agent-tool-trajectory)
    SampleApp->>Client: runExperiment(dataset, evaluators)
    Client->>Experiment: execute evaluations
    Experiment-->>Client: results
    Client-->>SampleApp: results summary (rows, errors, id)
    SampleApp->>SampleApp: log output / handle errors
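
The flow above hinges on comparing the tool calls an agent executed against the ones it was expected to make. The sketch below is a toy, local illustration of that idea only. It is not the Traceloop evaluator, and the `ToolCall` shape and `sameTrajectory` helper are hypothetical names introduced here (the `name`/`input` keys mirror the sample's expected tool calls).

```typescript
// Hypothetical shape, mirroring the name/input keys used in the sample's expected tool calls.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

// Toy order-sensitive comparison: same length, same tool names, same serialized inputs.
// JSON.stringify depends on key order, so this is only suitable for illustration.
function sameTrajectory(executed: ToolCall[], expected: ToolCall[]): boolean {
  if (executed.length !== expected.length) return false;
  return executed.every((call, i) => {
    const other = expected[i];
    return (
      other !== undefined &&
      call.name === other.name &&
      JSON.stringify(call.input) === JSON.stringify(other.input)
    );
  });
}

const expectedCalls: ToolCall[] = [
  { name: "search", input: { query: "weather" } },
  { name: "book_flight", input: { flight: "NYC to Paris" } },
];
const executedCalls: ToolCall[] = [
  { name: "search", input: { query: "weather" } },
  { name: "book_flight", input: { flight: "NYC to Paris" } },
];

console.log(sameTrajectory(executedCalls, expectedCalls));
```

A hosted evaluator may well be more lenient or more configurable than this strict comparison; the sketch only shows what "executed versus expected" means in practice.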

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I hopped in code to trace the trail,
Tools and agents leave a tale,
I stitched the flow, I ran the test,
Results arrive — a tidy nest,
Cheers for one more sample on the trail!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Linked Issues check	❓ Inconclusive	The PR objectives align with adding agent tool trajectory evaluation (TLP-1615), but the linked issue lacks sufficient detail to fully validate all coding requirements.	Review the detailed requirements in TLP-1615 to ensure all implementation details (AgentToolTrajectoryConfig, registry updates, type definitions) fully satisfy the issue requirements.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(eval): Add agent tool trajectory eval' directly matches the main objective: adding agent tool trajectory evaluation support, though 'eval' could be more explicit.
Out of Scope Changes check	✅ Passed	The raw summary shows only changes to package.json and a new sample_agent_tool_trajectory.ts file, both directly related to adding agent tool trajectory evaluation support.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to 014cebb in 2 minutes and 53 seconds.

Reviewed 402 lines of code in 5 files
Skipped 0 files when reviewing.
Skipped posting 12 draft comments. View those below.

1. packages/sample-app/package.json:44

Draft comment:
New 'run:agent_tool_trajectory' script added. Verify that the build output and file path (dist/src/sample_agent_tool_trajectory.js) are correct.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to verify the build output and file path, which falls under the category of asking the author to double-check things. This violates the rule against asking the author to confirm or ensure things. Therefore, this comment should not be approved.

2. packages/sample-app/src/sample_agent_tool_trajectory.ts:52

Draft comment:
The defaultExpected string uses single quotes inside the JSON-like structure. Consider using valid JSON formatting (double quotes) to ensure proper parsing.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50% The comment suggests using double quotes for "proper parsing", but this is speculative. The code shows a string literal being created - it's not being parsed as JSON in the visible code. The string is just assigned and returned. Whether this causes an issue depends entirely on how the Traceloop SDK processes the expected_tool_calls field, which I cannot see. The comment says "consider" which makes it sound like a suggestion rather than a definite bug. The structure looks like Python dictionary syntax (single quotes), which might actually be intentional if the SDK expects that format. Without evidence that this will actually cause a parsing error, this is speculative. The string might be intentionally formatted this way if the Traceloop SDK expects Python-like syntax rather than JSON. The comment is speculative ("consider") and doesn't show evidence of an actual bug - it's just a suggestion based on an assumption about what format is needed. The comment lacks strong evidence that this is actually a problem. It's a speculative suggestion about formatting without demonstrating that the current code will fail. The rules state I should only keep comments with STRONG EVIDENCE of correctness, and speculative comments should be removed. This comment should be deleted. It's speculative and suggests a change without strong evidence that the current code is incorrect. The format might be intentional based on what the SDK expects, and there's no visible parsing happening in this code that would fail.

3. packages/sample-app/src/sample_agent_tool_trajectory.ts:97

Draft comment:
Ensure the datasetSlug 'agent-tool-trajectory' exists and is properly configured on the Traceloop platform.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is asking the PR author to ensure that a datasetSlug exists and is properly configured, which is similar to asking them to double-check or verify something. This violates the rule against asking the author to ensure something is set up correctly.

4. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:181

Draft comment:
New agentToolTrajectory evaluator method added. It correctly casts config using AgentToolTrajectoryConfig.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative, as it only states that a new method was added and that it correctly casts a config. It doesn't provide any actionable feedback or suggestions for improvement.

5. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:267

Draft comment:
The conversationQuality evaluator no longer accepts a config parameter (model config removed). Confirm that this breaking change is intentional and update documentation as needed.
Reason this comment was not posted:
Comment looked like it was already resolved.

6. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:317

Draft comment:
The intentChange evaluator no longer accepts a config parameter. Ensure that any client code relying on model configuration is updated accordingly.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is asking the PR author to ensure that client code relying on model configuration is updated, which violates the rule against asking the author to ensure behavior is intended or tested. It doesn't provide a specific suggestion or ask for a specific test to be written.

7. packages/traceloop-sdk/src/lib/generated/evaluators/registry.ts:86

Draft comment:
New 'agent-tool-trajectory' evaluator schema added with required fields and optional config options. The schema appears to align with the new evaluator.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative, as it only describes what has been done without providing any actionable feedback or suggestions. It doesn't ask for confirmation or suggest improvements, nor does it identify any potential issues.

8. packages/traceloop-sdk/src/lib/generated/evaluators/registry.ts:137

Draft comment:
New 'html-comparison' evaluator schema has been added. Verify that the example inputs match the intended use case.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to verify that the example inputs match the intended use case. This falls under the rule of not asking the author to confirm their intention or to ensure the behavior is intended. Therefore, this comment should be removed.

9. packages/traceloop-sdk/src/lib/generated/evaluators/types.ts:17

Draft comment:
AgentFlowQualityRequest now requires the 'config' field (no longer optional). Confirm that this change is intentional and that dependent code is updated.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment asks the author to "confirm that this change is intentional and that dependent code is updated." This violates the rule that says "Do NOT ask the PR author to confirm their intention, to explain, to double-check things, to ensure the behavior is intended." Additionally, since this is an auto-generated file (as stated in the header), the change is likely intentional as it reflects changes in the OpenAPI specification. The comment is asking for confirmation rather than pointing out a clear issue or suggesting a specific code change. It's also speculative about whether dependent code needs updating without showing evidence of a problem. However, making a required field from an optional one is a breaking change that could cause runtime errors in dependent code. This could be a legitimate concern that warrants attention, even if the file is auto-generated. While it's true this is a breaking change, the comment doesn't provide actionable guidance - it just asks for confirmation. The rules explicitly state not to ask authors to confirm intentions or ensure things are tested. If there were specific code shown that would break, that would be different, but this is just a general "make sure" comment which violates the guidelines. This comment should be deleted because it asks the author to confirm their intention and ensure dependent code is updated, which directly violates the review rules. It doesn't point to a specific bug or provide actionable code changes.

10. packages/traceloop-sdk/src/lib/generated/evaluators/types.ts:157

Draft comment:
Removed ConversationQualityConfigRequest (and similarly IntentChangeConfigRequest) from the types. This is a breaking change; ensure all clients use the updated request formats.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is purely informative and asks the PR author to ensure that all clients use the updated request formats. This violates the rule against asking the author to ensure behavior is intended or tested.

11. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:287

Draft comment:
Typo: The JSDoc comment line is missing a leading asterisk. Consider adding '* ' before Request Body: for consistency.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment is pointing to line 287 which shows **Request Body:**. Looking at the diff, the htmlComparison method was added in this PR (lines 284-294 in the new file). However, I need to check if this is actually a formatting issue. Looking at the full file, I can see that ALL the JSDoc comments follow the exact same pattern - they have a blank line after the description, then **Request Body:** without a leading asterisk. This is consistent throughout the entire file. The comment appears to be incorrect because this is the established pattern in this auto-generated file. The file header even says "Auto-generated - DO NOT EDIT". The comment might be technically correct about JSDoc formatting standards, but I need to consider: 1) This is an auto-generated file (line 1 says "Auto-generated - DO NOT EDIT"), so manual formatting changes would be overwritten, and 2) The pattern is consistent throughout the entire file, suggesting this is intentional formatting by the generator. Given that this is explicitly an auto-generated file with a "DO NOT EDIT" warning, and the formatting is consistent across all methods, this comment is not actionable. Any manual fix would be overwritten on the next generation. The comment should be deleted. This comment should be deleted because it's about an auto-generated file that explicitly says "DO NOT EDIT" at the top. The formatting pattern is consistent throughout the file, indicating it's intentional. Manual changes would be overwritten when the file is regenerated.

12. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:299

Draft comment:
Typo: The JSDoc comment line is missing a leading asterisk. Consider adding '* ' before Request Body: to maintain proper formatting.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_cetZMXXv2n189bkL


ellipsis-dev

Important

Looks good to me! 👍

Reviewed 66360df in 32 seconds.

Reviewed 13 lines of code in 1 file
Skipped 0 files when reviewing.
Skipped posting 1 draft comment. View it below.

1. packages/sample-app/src/sample_agent_tool_trajectory.ts:77

Draft comment:
Consistent string quoting: switching from single to double quotes aligns with the rest of the file.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_uWtqWvCh4joVOw6N


coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In @packages/sample-app/src/sample_agent_tool_trajectory.ts:
- Around line 39-43: The error path after calling traceloop.getClient()
currently logs and returns, creating inconsistency with other critical failures
that call process.exit(1); change the return to process.exit(1) so the app
terminates on client initialization failure—update the block that checks const
client = traceloop.getClient() to call process.exit(1) after logging the error
(keep the console.error message) to match the failure handling used elsewhere.
- Around line 52-56: The defaultExpected string uses Python-style single quotes;
replace it with a JSON-formatted string using double quotes so the evaluator
receives a valid JSON array (update the value of defaultExpected referenced
where expectedToolCalls is assigned). Ensure the JSON uses double quotes for
object keys and string values (e.g.,
[{"name":"search","input":{"query":"weather"}}, ...]) so expectedToolCalls
matches the agent-tool-trajectory evaluator schema.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a6bd4be and 014cebb.

⛔ Files ignored due to path filters (3)

packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts is excluded by !**/generated/**
packages/traceloop-sdk/src/lib/generated/evaluators/registry.ts is excluded by !**/generated/**
packages/traceloop-sdk/src/lib/generated/evaluators/types.ts is excluded by !**/generated/**

📒 Files selected for processing (2)

packages/sample-app/package.json
packages/sample-app/src/sample_agent_tool_trajectory.ts

🧰 Additional context used

📓 Path-based instructions (1)

packages/*/package.json

📄 CodeRabbit inference engine (CLAUDE.md)

Use workspace:* for intra-repo package dependencies in package.json

Files:

packages/sample-app/package.json

🧠 Learnings (5)

📓 Common learnings

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : For manual LLM operations, use trace.withLLMSpan from traceloop/node-server-sdk

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : Use the provided decorators (workflow, task, agent) for workflow/task/agent spans instead of re-implementing them

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/src/lib/node-server-sdk.ts : Register new instrumentation packages during SDK initialization

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/package.json : When adding a new instrumentation package, add it to the main SDK dependencies

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings

Learnt from: nina-kollman
Repo: traceloop/openllmetry-js PR: 672
File: packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:212-231
Timestamp: 2025-08-22T12:44:43.893Z
Learning: In the Traceloop SDK experiment system, JSONL datasets use a custom format where the first line contains column definitions and subsequent lines contain the actual data rows. The parseJsonlToRows method in packages/traceloop-sdk/src/lib/client/experiment/experiment.ts intentionally skips the first line (index 0) to avoid parsing the column definition as data.

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : For manual LLM operations, use trace.withLLMSpan from traceloop/node-server-sdk

Applied to files:

packages/sample-app/src/sample_agent_tool_trajectory.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings

Applied to files:

packages/sample-app/src/sample_agent_tool_trajectory.ts

📚 Learning: 2025-08-24T22:08:07.023Z

Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : Use the provided decorators (workflow, task, agent) for workflow/task/agent spans instead of re-implementing them

Applied to files:

packages/sample-app/src/sample_agent_tool_trajectory.ts

📚 Learning: 2025-12-30T22:05:07.422Z

Learnt from: galzilber
Repo: traceloop/openllmetry-js PR: 853
File: packages/sample-app/src/sample_security_evaluators.ts:37-46
Timestamp: 2025-12-30T22:05:07.422Z
Learning: In packages/sample-app/src/**/*.ts, allow demonstration code to omit defensive error handling and bounds checking for API responses. This pattern applies to sample/demo code only and should not be interpreted as a best practice for production code. Reviewers should still require basic guards for critical paths (security, integrity) and clearly document that the code is a demonstration with comments indicating assumptions.

Applied to files:

packages/sample-app/src/sample_agent_tool_trajectory.ts

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build and test

🔇 Additional comments (5)

packages/sample-app/package.json (1)

44-44: LGTM!

The new script follows the established pattern and correctly references the compiled output file.

packages/sample-app/src/sample_agent_tool_trajectory.ts (4)

10-17: LGTM!

The imports are appropriate for this experiment sample.

75-85: LGTM!

The evaluator configuration is clear and follows a reasonable structure for the demo.

94-131: LGTM!

The experiment execution includes proper error handling and helpful logging. The comment on line 98 appropriately guides users to create the required dataset.

134-138: LGTM!

Proper top-level error handling ensures the process exits correctly on failure.

packages/sample-app/src/sample_agent_tool_trajectory.ts

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @packages/sample-app/src/sample_agent_tool_trajectory.ts:
- Around line 52-55: The defaultExpected string uses Python-style single quotes
and must be valid JSON; update the constant defaultExpected to use double quotes
for keys and string values (e.g., '[{"name": "search", "input": {"query":
"weather"}}, {"name": "book_flight", "input": {"flight": "NYC to Paris"}},
{"name": "get_confirmation", "input": {"confirmation": "flight booked"}}]') so
the agent-tool-trajectory evaluator receives proper JSON.

🧹 Nitpick comments (3)

packages/sample-app/src/sample_agent_tool_trajectory.ts (3)
39-43: Consider consistent error handling.

Line 42 returns from the function while line 36 calls process.exit(1). This inconsistency means initialization failures exit the process, but client retrieval failures allow the function to complete silently. For demonstration purposes this may be intentional, but consider using process.exit(1) here as well for consistency.
♻️ Proposed fix for consistent error handling
   const client = traceloop.getClient();
   if (!client) {
     console.error("Failed to initialize Traceloop client");
-    return;
+    process.exit(1);
   }
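
Another way to keep the failure paths consistent is to throw from the client check and let a single top-level handler decide how to exit. The sketch below assumes that approach rather than the `process.exit(1)` change proposed above. `requireClient` and `getClientStub` are hypothetical helpers, not part of the Traceloop SDK.

```typescript
// Hypothetical helper: turn a missing client into an exception instead of a silent return.
function requireClient<T>(client: T | null | undefined): T {
  if (client === null || client === undefined) {
    throw new Error("Failed to initialize Traceloop client");
  }
  return client;
}

// Stand-in for whatever the SDK's client getter returned in the sample.
function getClientStub(): { id: string } | undefined {
  return undefined;
}

try {
  const client = requireClient(getClientStub());
  console.log("client ready:", client.id);
} catch (error) {
  // A single top-level catch can then log and exit with a non-zero code.
  console.error(error instanceof Error ? error.message : String(error));
}
```

With this shape, every fatal condition flows through one handler, so the exit code cannot drift between call sites.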
51-51: Consider improving type safety with runtime validation.

The type assertions as string bypass compile-time type checking. If row.actual or row.expected are not strings at runtime, the fallback to empty string or default may mask issues.
♻️ Proposed refactor for safer type handling
-  const executedToolCalls = (row.actual as string) || "";
+  const executedToolCalls = typeof row.actual === 'string' ? row.actual : '';
   const defaultExpected =
     "[{'name': 'search', 'input': {'query': 'weather'}}, " +
     "{'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}, " +
     "{'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}]";
-  const expectedToolCalls = (row.expected as string) || defaultExpected;
+  const expectedToolCalls = typeof row.expected === 'string' ? row.expected : defaultExpected;
Also applies to: 56-56
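
The guard proposed above can be folded into a small task function. This is a sketch only: `Row`, `asString`, and `buildTrajectoryInput` are hypothetical names, the `expected_tool_calls` key comes from the review discussion, and `executed_tool_calls` is assumed by analogy.

```typescript
// Hypothetical dataset row: values arrive untyped, so they are validated at runtime.
type Row = Record<string, unknown>;

// Shortened default for illustration; built with JSON.stringify so it is valid JSON.
const DEFAULT_EXPECTED = JSON.stringify([
  { name: "search", input: { query: "weather" } },
]);

// Return the value when it really is a string, otherwise the fallback.
function asString(value: unknown, fallback: string): string {
  return typeof value === "string" ? value : fallback;
}

function buildTrajectoryInput(row: Row): {
  executed_tool_calls: string; // assumed field name
  expected_tool_calls: string;
} {
  return {
    executed_tool_calls: asString(row["actual"], ""),
    expected_tool_calls: asString(row["expected"], DEFAULT_EXPECTED),
  };
}

console.log(buildTrajectoryInput({ actual: 42 }));
```

Unlike the `||` fallback in the sample, this keeps an empty `expected` string rather than replacing it with the default, which may or may not be the desired behavior.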

135-138: Consider defensive error handling in top-level catch.

Line 136 accesses error.message without verifying the error is an Error instance. While typically safe in catch blocks, being defensive (as done on lines 126 and 134) improves robustness.
♻️ Proposed fix for defensive error handling
 // Error handling for the main function
 main().catch((error) => {
-  console.error("💥 Application failed:", error.message);
+  console.error(
+    "💥 Application failed:",
+    error instanceof Error ? error.message : String(error)
+  );
   process.exit(1);
 });
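
The same narrowing can live in one small helper so every catch site formats errors identically. A minimal sketch, assuming a hypothetical `toErrorMessage` helper and a synchronous `run` stand-in for the sample's `main()`:

```typescript
// Narrow an unknown thrown value to a printable message.
function toErrorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

// Hypothetical stand-in that throws a non-Error value.
function run(): void {
  throw "plain string failure";
}

try {
  run();
} catch (error) {
  console.error("Application failed:", toErrorMessage(error));
}
```

The helper works the same way inside a promise `.catch` callback, since rejection reasons are also untyped.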

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 014cebb and 66360df.

📒 Files selected for processing (1)

packages/sample-app/src/sample_agent_tool_trajectory.ts

🔇 Additional comments (2)

packages/sample-app/src/sample_agent_tool_trajectory.ts (2)

75-85: LGTM!

The evaluator configuration is well-structured with clear parameter names and reasonable default values.

94-104: LGTM!

The experiment configuration is appropriate with helpful comments indicating the dataset requirement. The parameters are well-chosen for a demonstration.

coderabbitai · 2026-01-08T16:00:31Z

packages/sample-app/src/sample_agent_tool_trajectory.ts

+    const defaultExpected =
+      "[{'name': 'search', 'input': {'query': 'weather'}}, " +
+      "{'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}, " +
+      "{'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}]";


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, examine the file to understand the context
cd packages/sample-app/src && wc -l sample_agent_tool_trajectory.ts

Repository: traceloop/openllmetry-js

Length of output: 103

🏁 Script executed:

# Read the relevant section with context
sed -n '40,65p' packages/sample-app/src/sample_agent_tool_trajectory.ts

Repository: traceloop/openllmetry-js

Length of output: 931

🌐 Web query:

Traceloop agent-tool-trajectory evaluator expected format tool calls

💡 Result:

Short answer: format the trajectory as OpenAI-style chat messages where assistant messages that invoke tools include a tool_calls array; each tool_call has a function object with name and arguments (arguments as a JSON string). Tool outputs appear as role: "tool" messages. Example:

[
  {"role":"user","content":"What is the weather in SF?"},
  {
    "role":"assistant",
    "content":"",
    "tool_calls":[
      {
        "function":{
          "name":"get_weather",
          "arguments":"{\"city\":\"San Francisco\"}"
        }
      }
    ]
  },
  {"role":"tool","content":"It's 80°F and sunny in San Francisco."},
  {"role":"assistant","content":"It's 80°F and sunny in San Francisco."}
]

Sources: Agent-trajectory format and examples (agent-evals / OpenAI-style messages). [1] Traceloop evaluators docs for evaluator usage. [2]

References
[1] AgentEvals — agent trajectory examples and tool_calls format (examples using function.name and JSON-stringified arguments).
[2] Traceloop — custom evaluators / evaluator execution docs.
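
To make the message format described in this result concrete, the sketch below walks an OpenAI-style message list and collects each tool call's name and parsed arguments. The `Message`, `ToolCallEntry`, and `extractToolCalls` names are hypothetical, and the source does not confirm that Traceloop's evaluator accepts this format.

```typescript
interface ToolCallEntry {
  // arguments is a JSON-encoded string, as in the example above
  function: { name: string; arguments: string };
}

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
  tool_calls?: ToolCallEntry[];
}

interface ExtractedCall {
  name: string;
  args: unknown;
}

// Collect every tool call, in order, decoding the stringified arguments.
function extractToolCalls(messages: Message[]): ExtractedCall[] {
  const calls: ExtractedCall[] = [];
  for (const message of messages) {
    for (const call of message.tool_calls ?? []) {
      calls.push({
        name: call.function.name,
        args: JSON.parse(call.function.arguments) as unknown,
      });
    }
  }
  return calls;
}

const trajectory: Message[] = [
  { role: "user", content: "What is the weather in SF?" },
  {
    role: "assistant",
    content: "",
    tool_calls: [
      {
        function: {
          name: "get_weather",
          arguments: JSON.stringify({ city: "San Francisco" }),
        },
      },
    ],
  },
  { role: "tool", content: "It's sunny in San Francisco." },
];

console.log(extractToolCalls(trajectory));
```

Building `arguments` with `JSON.stringify` sidesteps the manual escaping that a hand-written nested JSON string would need.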

🏁 Script executed:

# Search for AgentToolTrajectoryEvaluator or related evaluator implementation
rg -i "AgentToolTrajectoryEvaluator|agent.*tool.*trajectory" --type ts --type js -A 5

Repository: traceloop/openllmetry-js

Length of output: 12212

🏁 Script executed:

# Search for how expected_tool_calls is processed rg "expected_tool_calls" --type ts --type js -B 3 -A 3

Repository: traceloop/openllmetry-js

Length of output: 4717

🏁 Script executed:

# Search for any format conversion or transformation logic rg "tool_calls|tool.calls" --type ts --type js -B 2 -A 2

Repository: traceloop/openllmetry-js

Length of output: 33106

🌐 Web query:

Traceloop "agent-tool-trajectory" evaluator documentation format specification

💡 Result:

I couldn't find any Traceloop evaluator explicitly named "agent-tool-trajectory". The closest built-in evaluators are the Agent Flow / Agent Tool evaluators (Agent Flow Quality, Agent Tool Error Detector, Agent Efficiency) and the general Evaluator docs — see Traceloop's evaluator library and agent evaluator pages: [1][2].

If you want, I can:

- fetch the exact input/output schema (format specification) for a specific Traceloop evaluator (tell me which one), or
- search the Traceloop docs/GitHub for any evaluator file or JSON schema that uses the name "agent-tool-trajectory" and return the exact format spec.

References:
[1] Traceloop — Made by Traceloop (Evaluators overview). (traceloop.com)
[2] Traceloop — Custom Evaluators (how to create/test evaluators). (traceloop.com)

Citations:

1: https://www.traceloop.com/docs/evaluators/made-by-traceloop

2: https://www.traceloop.com/docs/evaluators/custom-evaluator?utm_source=openai

Fix tool call format to use JSON instead of Python dict syntax.

The defaultExpected variable uses Python dict syntax with single quotes ({'name': 'search'}), but the agent-tool-trajectory evaluator requires JSON format with double quotes ({"name": "search"}). Update the string to use valid JSON:

Suggested fix

const defaultExpected =
  '[{"name": "search", "input": {"query": "weather"}}, ' +
  '{"name": "book_flight", "input": {"flight": "NYC to Paris"}}, ' +
  '{"name": "get_confirmation", "input": {"confirmation": "flight booked"}}]';
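
A quick way to see why the quoting matters is to run both variants through `JSON.parse`. The sketch below is illustrative only: the `ToolCall` shape and the `parseToolCalls` helper are hypothetical names, not part of the Traceloop SDK. It also assumes the evaluator parses the string as strict JSON, which this review asserts but does not show from the evaluator's source.

```typescript
// Hypothetical shape for one expected tool call; not an SDK type.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

// Returns the parsed tool calls, or null when the string is not valid JSON
// or is not an array of { name, input } objects.
function parseToolCalls(raw: string): ToolCall[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // e.g. Python-style single quotes
  }
  if (!Array.isArray(parsed)) return null;
  const wellFormed = parsed.every((item) => {
    if (typeof item !== "object" || item === null) return false;
    const candidate = item as { name?: unknown; input?: unknown };
    return (
      typeof candidate.name === "string" &&
      typeof candidate.input === "object" &&
      candidate.input !== null
    );
  });
  return wellFormed ? (parsed as ToolCall[]) : null;
}

const pythonStyle = "[{'name': 'search', 'input': {'query': 'weather'}}]";
const jsonStyle = '[{"name": "search", "input": {"query": "weather"}}]';

console.log(parseToolCalls(pythonStyle)); // null
console.log(parseToolCalls(jsonStyle)?.map((call) => call.name));

// Building the string from data avoids hand-written quoting entirely.
const generated = JSON.stringify([
  { name: "search", input: { query: "weather" } },
  { name: "book_flight", input: { flight: "NYC to Paris" } },
]);
console.log(parseToolCalls(generated)?.length);
```

Generating the expected string with `JSON.stringify` from an array literal, as in the last lines, keeps the sample readable and rules out this class of quoting mistake.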

🤖 Prompt for AI Agents

In @packages/sample-app/src/sample_agent_tool_trajectory.ts around lines 52-55, the defaultExpected string uses Python-style single quotes and must be valid JSON. Update the constant defaultExpected to use double quotes for keys and string values (e.g., '[{"name": "search", "input": {"query": "weather"}}, {"name": "book_flight", "input": {"flight": "NYC to Paris"}}, {"name": "get_confirmation", "input": {"confirmation": "flight booked"}}]') so the agent-tool-trajectory evaluator receives proper JSON.

nina-kollman added 2 commits January 8, 2026 17:36

ran script

c11c60e

added sample

014cebb

pretty

66360df

ellipsis-dev bot reviewed Jan 8, 2026

coderabbitai bot reviewed Jan 8, 2026

doronkopit5 approved these changes Jan 8, 2026

nina-kollman merged commit 1d1f642 into main Jan 8, 2026
9 of 10 checks passed

coderabbitai bot reviewed Jan 8, 2026
