fix(eval): Add agent tool trajectory eval #854
📝 Walkthrough
Added a new npm script and a new TypeScript sample demonstrating Traceloop SDK initialization, agent-tool-trajectory evaluator configuration, running an experiment with a dataset, and printing a results summary with error handling.
Sequence Diagram
sequenceDiagram
participant SampleApp as Sample App
participant Traceloop as Traceloop SDK
participant Client as Traceloop Client
participant Experiment as Experiment Runner
SampleApp->>Traceloop: initialize()
Traceloop-->>SampleApp: init complete
SampleApp->>Traceloop: getClient()
Traceloop-->>SampleApp: Client
SampleApp->>SampleApp: define agentEvaluatorsTask (tool calls + expected)
SampleApp->>SampleApp: configure evaluators (agent-tool-trajectory)
SampleApp->>Client: runExperiment(dataset, evaluators)
Client->>Experiment: execute evaluations
Experiment-->>Client: results
Client-->>SampleApp: results summary (rows, errors, id)
SampleApp->>SampleApp: log output / handle errors
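The call order in the diagram can be mimicked with an in-memory stand-in, which makes the control flow easy to follow and test, including the early exit when no client is returned. The ExperimentClient interface, the runExperiment(datasetSlug, evaluators) signature and the summary fields below are hypothetical; they follow the diagram's labels, not the Traceloop SDK's actual types.

```typescript
// Hypothetical stand-ins for the diagram's participants. These names are NOT
// the Traceloop SDK API; they only mirror the order of calls:
// getClient -> configure evaluators -> runExperiment -> log summary.

interface EvaluatorSpec {
  slug: string; // e.g. "agent-tool-trajectory"
}

interface ExperimentSummary {
  rows: number;
  errors: string[];
  experimentId: string;
}

interface ExperimentClient {
  runExperiment(datasetSlug: string, evaluators: EvaluatorSpec[]): Promise<ExperimentSummary>;
}

// Runs the diagram's flow and returns log lines instead of printing,
// so the behavior can be asserted on.
async function runSampleFlow(
  getClient: () => ExperimentClient | undefined,
): Promise<string[]> {
  const log: string[] = [];
  const client = getClient();
  if (!client) {
    log.push("Failed to initialize client");
    return log;
  }
  const evaluators: EvaluatorSpec[] = [{ slug: "agent-tool-trajectory" }];
  try {
    const summary = await client.runExperiment("agent-tool-trajectory", evaluators);
    log.push(
      `Experiment ${summary.experimentId}: ${summary.rows} rows, ${summary.errors.length} errors`,
    );
  } catch (err) {
    log.push(`Experiment failed: ${err instanceof Error ? err.message : String(err)}`);
  }
  return log;
}

// An in-memory fake client used in place of a real SDK client.
const fakeClient: ExperimentClient = {
  async runExperiment(datasetSlug, evaluators) {
    return { rows: 3, errors: [], experimentId: `${datasetSlug}-${evaluators.length}` };
  },
};
```

With the fake client, runSampleFlow(() => fakeClient) resolves to a single summary line; passing () => undefined exercises the missing-client branch.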
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
Important
Looks good to me! 👍
Reviewed everything up to 014cebb in 2 minutes and 53 seconds. Click for details.
- Reviewed 402 lines of code in 5 files
- Skipped 0 files when reviewing
- Skipped posting 12 draft comments. View those below.
1. packages/sample-app/package.json:44
- Draft comment:
New 'run:agent_tool_trajectory' script added. Verify that the build output and file path (dist/src/sample_agent_tool_trajectory.js) are correct. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to verify the build output and file path, which falls under the category of asking the author to double-check things. This violates the rule against asking the author to confirm or ensure things. Therefore, this comment should not be approved.
2. packages/sample-app/src/sample_agent_tool_trajectory.ts:52
- Draft comment:
The defaultExpected string uses single quotes inside the JSON-like structure. Consider using valid JSON formatting (double quotes) to ensure proper parsing. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50%. The comment suggests using double quotes for "proper parsing", but this is speculative. The code shows a string literal being created - it's not being parsed as JSON in the visible code. The string is just assigned and returned. Whether this causes an issue depends entirely on how the Traceloop SDK processes the expected_tool_calls field, which I cannot see. The comment says "consider" which makes it sound like a suggestion rather than a definite bug. The structure looks like Python dictionary syntax (single quotes), which might actually be intentional if the SDK expects that format. Without evidence that this will actually cause a parsing error, this is speculative. The string might be intentionally formatted this way if the Traceloop SDK expects Python-like syntax rather than JSON. The comment is speculative ("consider") and doesn't show evidence of an actual bug - it's just a suggestion based on an assumption about what format is needed. The comment lacks strong evidence that this is actually a problem. It's a speculative suggestion about formatting without demonstrating that the current code will fail. The rules state I should only keep comments with STRONG EVIDENCE of correctness, and speculative comments should be removed. This comment should be deleted. It's speculative and suggests a change without strong evidence that the current code is incorrect. The format might be intentional based on what the SDK expects, and there's no visible parsing happening in this code that would fail.
3. packages/sample-app/src/sample_agent_tool_trajectory.ts:97
- Draft comment:
Ensure the datasetSlug 'agent-tool-trajectory' exists and is properly configured on the Traceloop platform. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is asking the PR author to ensure that a datasetSlug exists and is properly configured, which is similar to asking them to double-check or verify something. This violates the rule against asking the author to ensure something is set up correctly.
4. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:181
- Draft comment:
New agentToolTrajectory evaluator method added. It correctly casts config using AgentToolTrajectoryConfig. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative, as it only states that a new method was added and that it correctly casts a config. It doesn't provide any actionable feedback or suggestions for improvement.
5. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:267
- Draft comment:
The conversationQuality evaluator no longer accepts a config parameter (model config removed). Confirm that this breaking change is intentional and update documentation as needed. - Reason this comment was not posted:
Comment looked like it was already resolved.
6. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:317
- Draft comment:
The intentChange evaluator no longer accepts a config parameter. Ensure that any client code relying on model configuration is updated accordingly. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is asking the PR author to ensure that client code relying on model configuration is updated, which violates the rule against asking the author to ensure behavior is intended or tested. It doesn't provide a specific suggestion or ask for a specific test to be written.
7. packages/traceloop-sdk/src/lib/generated/evaluators/registry.ts:86
- Draft comment:
New 'agent-tool-trajectory' evaluator schema added with required fields and optional config options. The schema appears to align with the new evaluator. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. This comment is purely informative, as it only describes what has been done without providing any actionable feedback or suggestions. It doesn't ask for confirmation or suggest improvements, nor does it identify any potential issues.
8. packages/traceloop-sdk/src/lib/generated/evaluators/registry.ts:137
- Draft comment:
New 'html-comparison' evaluator schema has been added. Verify that the example inputs match the intended use case. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is asking the PR author to verify that the example inputs match the intended use case. This falls under the rule of not asking the author to confirm their intention or to ensure the behavior is intended. Therefore, this comment should be removed.
9. packages/traceloop-sdk/src/lib/generated/evaluators/types.ts:17
- Draft comment:
AgentFlowQualityRequest now requires the 'config' field (no longer optional). Confirm that this change is intentional and that dependent code is updated. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%. The comment asks the author to "confirm that this change is intentional and that dependent code is updated." This violates the rule that says "Do NOT ask the PR author to confirm their intention, to explain, to double-check things, to ensure the behavior is intended." Additionally, since this is an auto-generated file (as stated in the header), the change is likely intentional as it reflects changes in the OpenAPI specification. The comment is asking for confirmation rather than pointing out a clear issue or suggesting a specific code change. It's also speculative about whether dependent code needs updating without showing evidence of a problem. However, making a previously optional field required is a breaking change that could cause runtime errors in dependent code. This could be a legitimate concern that warrants attention, even if the file is auto-generated. While it's true this is a breaking change, the comment doesn't provide actionable guidance - it just asks for confirmation. The rules explicitly state not to ask authors to confirm intentions or ensure things are tested. If there were specific code shown that would break, that would be different, but this is just a general "make sure" comment which violates the guidelines. This comment should be deleted because it asks the author to confirm their intention and ensure dependent code is updated, which directly violates the review rules. It doesn't point to a specific bug or provide actionable code changes.
10. packages/traceloop-sdk/src/lib/generated/evaluators/types.ts:157
- Draft comment:
Removed ConversationQualityConfigRequest (and similarly IntentChangeConfigRequest) from the types. This is a breaking change; ensure all clients use the updated request formats. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%. The comment is purely informative and asks the PR author to ensure that all clients use the updated request formats. This violates the rule against asking the author to ensure behavior is intended or tested.
11. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:287
- Draft comment:
Typo: The JSDoc comment line is missing a leading asterisk. Consider adding '* ' before Request Body: for consistency. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%. The comment is pointing to line 287, which shows **Request Body:**. Looking at the diff, the htmlComparison method was added in this PR (lines 284-294 in the new file). However, I need to check if this is actually a formatting issue. Looking at the full file, I can see that ALL the JSDoc comments follow the exact same pattern - they have a blank line after the description, then **Request Body:** without a leading asterisk. This is consistent throughout the entire file. The comment appears to be incorrect because this is the established pattern in this auto-generated file. The file header even says "Auto-generated - DO NOT EDIT". The comment might be technically correct about JSDoc formatting standards, but I need to consider: 1) This is an auto-generated file (line 1 says "Auto-generated - DO NOT EDIT"), so manual formatting changes would be overwritten, and 2) The pattern is consistent throughout the entire file, suggesting this is intentional formatting by the generator. Given that this is explicitly an auto-generated file with a "DO NOT EDIT" warning, and the formatting is consistent across all methods, this comment is not actionable. Any manual fix would be overwritten on the next generation. The comment should be deleted. This comment should be deleted because it's about an auto-generated file that explicitly says "DO NOT EDIT" at the top. The formatting pattern is consistent throughout the file, indicating it's intentional. Manual changes would be overwritten when the file is regenerated.
12. packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts:299
- Draft comment:
Typo: The JSDoc comment line is missing a leading asterisk. Consider adding '* ' before Request Body: to maintain proper formatting. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_cetZMXXv2n189bkL
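Draft comment 2 above was dropped as speculative because the reviewer could not see how the SDK reads expected_tool_calls. What can be checked locally is narrower: a strict JSON parser such as JavaScript's built-in JSON.parse rejects the single-quoted form. The sketch shows only that; nothing in this thread establishes whether the evaluator parses the field as strict JSON. The variable and function names are hypothetical.

```typescript
// Python-dict-style string, shortened from the sample's original defaultExpected.
const pythonStyleCalls =
  "[{'name': 'search', 'input': {'query': 'weather'}}, " +
  "{'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}]";

// The same data written with JSON double quotes.
const jsonStyleCalls =
  '[{"name": "search", "input": {"query": "weather"}}, ' +
  '{"name": "book_flight", "input": {"flight": "NYC to Paris"}}]';

// Returns true only if the text is strict JSON; JSON.parse throws a SyntaxError otherwise.
function parsesAsJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

console.log(parsesAsJson(pythonStyleCalls)); // false
console.log(parsesAsJson(jsonStyleCalls)); // true
```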
Important
Looks good to me! 👍
Reviewed 66360df in 32 seconds. Click for details.
- Reviewed 13 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 1 draft comment. View those below.
1. packages/sample-app/src/sample_agent_tool_trajectory.ts:77
- Draft comment:
Consistent string quoting: switching from single to double quotes aligns with the rest of the file. - Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50%. None
Workflow ID: wflow_uWtqWvCh4joVOw6N
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @packages/sample-app/src/sample_agent_tool_trajectory.ts:
- Around line 39-43: The error path after calling traceloop.getClient()
currently logs and returns, creating inconsistency with other critical failures
that call process.exit(1); change the return to process.exit(1) so the app
terminates on client initialization failure—update the block that checks const
client = traceloop.getClient() to call process.exit(1) after logging the error
(keep the console.error message) to match the failure handling used elsewhere.
- Around line 52-56: The defaultExpected string uses Python-style single quotes;
replace it with a JSON-formatted string using double quotes so the evaluator
receives a valid JSON array (update the value of defaultExpected referenced
where expectedToolCalls is assigned). Ensure the JSON uses double quotes for
object keys and string values (e.g.,
[{"name":"search","input":{"query":"weather"}}, ...]) so expectedToolCalls
matches the agent-tool-trajectory evaluator schema.
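One way to act on the JSON suggestion without hand-quoting is to keep the default trajectory as data and serialize it. The ToolCall type and defaultToolCalls name below are hypothetical and simply mirror the name/input keys used in the sample; JSON.stringify always emits double-quoted keys and strings, so the result is valid JSON by construction.

```typescript
// Hypothetical shape of one expected tool call, mirroring the sample's keys.
interface ToolCall {
  name: string;
  input: Record<string, string>;
}

// The sample's default trajectory, written as data rather than a quoted string.
const defaultToolCalls: ToolCall[] = [
  { name: "search", input: { query: "weather" } },
  { name: "book_flight", input: { flight: "NYC to Paris" } },
  { name: "get_confirmation", input: { confirmation: "flight booked" } },
];

// Serialize once; every key and string value comes out double-quoted.
const defaultExpected: string = JSON.stringify(defaultToolCalls);

console.log(defaultExpected);
```

JSON.stringify produces compact output with no spaces after colons or commas, unlike the hand-written string in the review. If the evaluator compares these strings textually rather than parsing them, that difference could matter; the thread does not say which it does.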
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (3)
packages/traceloop-sdk/src/lib/generated/evaluators/mbt-evaluators.ts is excluded by !**/generated/**
packages/traceloop-sdk/src/lib/generated/evaluators/registry.ts is excluded by !**/generated/**
packages/traceloop-sdk/src/lib/generated/evaluators/types.ts is excluded by !**/generated/**
📒 Files selected for processing (2)
packages/sample-app/package.json
packages/sample-app/src/sample_agent_tool_trajectory.ts
🧰 Additional context used
📓 Path-based instructions (1)
packages/*/package.json
📄 CodeRabbit inference engine (CLAUDE.md)
Use workspace:* for intra-repo package dependencies in package.json
Files:
packages/sample-app/package.json
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : For manual LLM operations, use trace.withLLMSpan from traceloop/node-server-sdk
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : Use the provided decorators (workflow, task, agent) for workflow/task/agent spans instead of re-implementing them
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/src/lib/node-server-sdk.ts : Register new instrumentation packages during SDK initialization
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/package.json : When adding a new instrumentation package, add it to the main SDK dependencies
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings
Learnt from: nina-kollman
Repo: traceloop/openllmetry-js PR: 672
File: packages/traceloop-sdk/src/lib/client/experiment/experiment.ts:212-231
Timestamp: 2025-08-22T12:44:43.893Z
Learning: In the Traceloop SDK experiment system, JSONL datasets use a custom format where the first line contains column definitions and subsequent lines contain the actual data rows. The parseJsonlToRows method in packages/traceloop-sdk/src/lib/client/experiment/experiment.ts intentionally skips the first line (index 0) to avoid parsing the column definition as data.
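The JSONL learning above can be illustrated with a small parser that treats line 0 as column definitions and every later non-empty line as a data row. parseJsonlRows, the JsonlRow type and the sample column line are hypothetical; this is not the SDK's parseJsonlToRows, only a sketch of the layout that learning describes.

```typescript
// A data row: one JSON object keyed by column name (hypothetical shape).
type JsonlRow = Record<string, unknown>;

// Skips line 0 (column definitions), then parses each remaining non-empty line.
function parseJsonlRows(jsonl: string): JsonlRow[] {
  return jsonl
    .split("\n")
    .slice(1)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as JsonlRow);
}

const datasetText = [
  '{"columns": ["actual", "expected"]}', // line 0: column definitions, not data
  '{"actual": "a1", "expected": "e1"}',
  '{"actual": "a2", "expected": "e2"}',
  "", // trailing newline produces an empty last line
].join("\n");

const datasetRows = parseJsonlRows(datasetText);
console.log(datasetRows.length); // 2
```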
📚 Learning: 2025-08-24T22:08:07.023Z
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : For manual LLM operations, use trace.withLLMSpan from traceloop/node-server-sdk
Applied to files:
packages/sample-app/src/sample_agent_tool_trajectory.ts
📚 Learning: 2025-08-24T22:08:07.023Z
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/{instrumentation-*,traceloop-sdk}/**/*.{ts,tsx} : Import AI/LLM semantic attribute constants from traceloop/ai-semantic-conventions rather than hardcoding strings
Applied to files:
packages/sample-app/src/sample_agent_tool_trajectory.ts
📚 Learning: 2025-08-24T22:08:07.023Z
Learnt from: CR
Repo: traceloop/openllmetry-js PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-24T22:08:07.023Z
Learning: Applies to packages/traceloop-sdk/**/*.{ts,tsx} : Use the provided decorators (workflow, task, agent) for workflow/task/agent spans instead of re-implementing them
Applied to files:
packages/sample-app/src/sample_agent_tool_trajectory.ts
📚 Learning: 2025-12-30T22:05:07.422Z
Learnt from: galzilber
Repo: traceloop/openllmetry-js PR: 853
File: packages/sample-app/src/sample_security_evaluators.ts:37-46
Timestamp: 2025-12-30T22:05:07.422Z
Learning: In packages/sample-app/src/**/*.ts, allow demonstration code to omit defensive error handling and bounds checking for API responses. This pattern applies to sample/demo code only and should not be interpreted as a best practice for production code. Reviewers should still require basic guards for critical paths (security, integrity) and clearly document that the code is a demonstration with comments indicating assumptions.
Applied to files:
packages/sample-app/src/sample_agent_tool_trajectory.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and test
🔇 Additional comments (5)
packages/sample-app/package.json (1)
44-44: LGTM! The new script follows the established pattern and correctly references the compiled output file.
packages/sample-app/src/sample_agent_tool_trajectory.ts (4)
10-17: LGTM! The imports are appropriate for this experiment sample.
75-85: LGTM! The evaluator configuration is clear and follows a reasonable structure for the demo.
94-131: LGTM! The experiment execution includes proper error handling and helpful logging. The comment on line 98 appropriately guides users to create the required dataset.
134-138: LGTM! Proper top-level error handling ensures the process exits correctly on failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @packages/sample-app/src/sample_agent_tool_trajectory.ts:
- Around line 52-55: The defaultExpected string uses Python-style single quotes
and must be valid JSON; update the constant defaultExpected to use double quotes
for keys and string values (e.g., '[{"name": "search", "input": {"query":
"weather"}}, {"name": "book_flight", "input": {"flight": "NYC to Paris"}},
{"name": "get_confirmation", "input": {"confirmation": "flight booked"}}]') so
the agent-tool-trajectory evaluator receives proper JSON.
🧹 Nitpick comments (3)
packages/sample-app/src/sample_agent_tool_trajectory.ts (3)
39-43: Consider consistent error handling. Line 42 returns from the function while line 36 calls process.exit(1). This inconsistency means initialization failures exit the process, but client retrieval failures allow the function to complete silently. For demonstration purposes this may be intentional, but consider using process.exit(1) here as well for consistency.
♻️ Proposed fix for consistent error handling
 const client = traceloop.getClient();
 if (!client) {
   console.error("Failed to initialize Traceloop client");
-  return;
+  process.exit(1);
 }
51-51: Consider improving type safety with runtime validation. The type assertions as string bypass compile-time type checking. If row.actual or row.expected are not strings at runtime, the fallback to empty string or default may mask issues.
♻️ Proposed refactor for safer type handling
- const executedToolCalls = (row.actual as string) || "";
+ const executedToolCalls = typeof row.actual === 'string' ? row.actual : '';
  const defaultExpected =
    "[{'name': 'search', 'input': {'query': 'weather'}}, " +
    "{'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}, " +
    "{'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}]";
- const expectedToolCalls = (row.expected as string) || defaultExpected;
+ const expectedToolCalls = typeof row.expected === 'string' ? row.expected : defaultExpected;
Also applies to: 56-56
135-138: Consider defensive error handling in top-level catch. Line 136 accesses error.message without verifying the error is an Error instance. While typically safe in catch blocks, being defensive (as done on lines 126 and 134) improves robustness.
♻️ Proposed fix for defensive error handling
 // Error handling for the main function
 main().catch((error) => {
-  console.error("💥 Application failed:", error.message);
+  console.error(
+    "💥 Application failed:",
+    error instanceof Error ? error.message : String(error)
+  );
   process.exit(1);
 });
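The two nitpicks above both replace a cast with a runtime check. A minimal sketch of helpers with that behavior follows; stringOr and describeError are hypothetical names, not functions from the sample or the SDK.

```typescript
// Returns value when it is a string (including ""), otherwise fallback.
// Unlike `(value as string) || fallback`, a non-string is never passed through
// under a string type.
function stringOr(value: unknown, fallback: string): string {
  return typeof value === "string" ? value : fallback;
}

// Produces a printable message from any thrown value, Error or not.
function describeError(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

const fallbackTrajectory = "[]";
console.log(stringOr(42, fallbackTrajectory)); // prints []: a number is rejected
console.log(stringOr("", fallbackTrajectory)); // prints an empty line: "" is kept
console.log(("" as string) || fallbackTrajectory); // prints []: the `||` form falls back
console.log(describeError(new Error("boom"))); // prints boom
console.log(describeError("plain string")); // prints plain string
```

One behavioral difference is worth knowing before adopting the proposed refactor: with (row.expected as string) || defaultExpected an empty string falls back to the default, whereas the typeof version keeps the empty string.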
📒 Files selected for processing (1)
packages/sample-app/src/sample_agent_tool_trajectory.ts
🔇 Additional comments (2)
packages/sample-app/src/sample_agent_tool_trajectory.ts (2)
75-85: LGTM! The evaluator configuration is well-structured with clear parameter names and reasonable default values.
94-104: LGTM! The experiment configuration is appropriate with helpful comments indicating the dataset requirement. The parameters are well-chosen for a demonstration.
const defaultExpected =
  "[{'name': 'search', 'input': {'query': 'weather'}}, " +
  "{'name': 'book_flight', 'input': {'flight': 'NYC to Paris'}}, " +
  "{'name': 'get_confirmation', 'input': {'confirmation': 'flight booked'}}]";
🧩 Analysis chain
🏁 Script executed:
# First, examine the file to understand the context
cd packages/sample-app/src && wc -l sample_agent_tool_trajectory.ts
Repository: traceloop/openllmetry-js
Length of output: 103
🏁 Script executed:
# Read the relevant section with context
sed -n '40,65p' packages/sample-app/src/sample_agent_tool_trajectory.ts
Repository: traceloop/openllmetry-js
Length of output: 931
🌐 Web query:
Traceloop agent-tool-trajectory evaluator expected format tool calls
💡 Result:
Short answer: format the trajectory as OpenAI-style chat messages where assistant messages that invoke tools include a tool_calls array; each tool_call has a function object with name and arguments (arguments as a JSON string). Tool outputs appear as role: "tool" messages. Example:
[
  {"role":"user","content":"What is the weather in SF?"},
  {
    "role":"assistant",
    "content":"",
    "tool_calls":[
      {
        "function":{
          "name":"get_weather",
          "arguments":"{\"city\":\"San Francisco\"}"
        }
      }
    ]
  },
  {"role":"tool","content":"It's 80°F and sunny in San Francisco."},
  {"role":"assistant","content":"It's 80°F and sunny in San Francisco."}
]
Sources: Agent-trajectory format and examples (agent-evals / OpenAI-style messages). [1] Traceloop evaluators docs for evaluator usage. [2]
References
[1] AgentEvals — agent trajectory examples and tool_calls format (examples using function.name and JSON-stringified arguments).
[2] Traceloop — custom evaluators / evaluator execution docs.
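The query result describes a different shape from the sample's {name, input} objects: OpenAI-style messages in which each tool call carries function.name and a JSON-encoded arguments string. The sketch converts that shape into name/input pairs. The ChatMessage, FunctionCall and ExtractedCall types are written for this illustration only, and nothing in this thread confirms that agent-tool-trajectory accepts either format (the follow-up query found no published spec).

```typescript
// Shapes for the OpenAI-style messages in the query result (sketch-only types).
interface FunctionCall {
  name: string;
  arguments: string; // JSON-encoded argument object
}

interface ChatMessage {
  role: "user" | "assistant" | "tool";
  content: string;
  tool_calls?: { function: FunctionCall }[];
}

interface ExtractedCall {
  name: string;
  input: unknown;
}

// Collects assistant tool calls in order and decodes each arguments string.
function extractToolCalls(messages: ChatMessage[]): ExtractedCall[] {
  const calls: ExtractedCall[] = [];
  for (const message of messages) {
    if (message.role !== "assistant" || !message.tool_calls) continue;
    for (const call of message.tool_calls) {
      calls.push({ name: call.function.name, input: JSON.parse(call.function.arguments) });
    }
  }
  return calls;
}

const exampleTrajectory: ChatMessage[] = [
  { role: "user", content: "What is the weather in SF?" },
  {
    role: "assistant",
    content: "",
    tool_calls: [{ function: { name: "get_weather", arguments: '{"city":"San Francisco"}' } }],
  },
  { role: "tool", content: "It's 80°F and sunny in San Francisco." },
  { role: "assistant", content: "It's 80°F and sunny in San Francisco." },
];

console.log(JSON.stringify(extractToolCalls(exampleTrajectory)));
// prints [{"name":"get_weather","input":{"city":"San Francisco"}}]
```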
🏁 Script executed:
# Search for AgentToolTrajectoryEvaluator or related evaluator implementation
rg -i "AgentToolTrajectoryEvaluator|agent.*tool.*trajectory" --type ts --type js -A 5
Repository: traceloop/openllmetry-js
Length of output: 12212
🏁 Script executed:
# Search for how expected_tool_calls is processed
rg "expected_tool_calls" --type ts --type js -B 3 -A 3
Repository: traceloop/openllmetry-js
Length of output: 4717
🏁 Script executed:
# Search for any format conversion or transformation logic
rg "tool_calls|tool.calls" --type ts --type js -B 2 -A 2
Repository: traceloop/openllmetry-js
Length of output: 33106
🌐 Web query:
Traceloop "agent-tool-trajectory" evaluator documentation format specification
💡 Result:
I couldn't find any Traceloop evaluator explicitly named "agent-tool-trajectory". The closest built-in evaluators are the Agent Flow / Agent Tool evaluators (Agent Flow Quality, Agent Tool Error Detector, Agent Efficiency) and the general Evaluator docs — see Traceloop's evaluator library and agent evaluator pages: [1][2].
If you want, I can:
- fetch the exact input/output schema (format specification) for a specific Traceloop evaluator (tell me which one), or
- search the Traceloop docs/GitHub for any evaluator file or JSON schema that uses the name "agent-tool-trajectory" and return the exact format spec.
References:
[1] Traceloop — Made by Traceloop (Evaluators overview). (traceloop.com)
[2] Traceloop — Custom Evaluators (how to create/test evaluators). (traceloop.com)
Citations:
- 1: https://www.traceloop.com/docs/evaluators/made-by-traceloop
- 2: https://www.traceloop.com/docs/evaluators/custom-evaluator?utm_source=openai
Fix tool call format to use JSON instead of Python dict syntax.
The defaultExpected variable uses Python dict syntax with single quotes ({'name': 'search'}), but the agent-tool-trajectory evaluator requires JSON format with double quotes ({"name": "search"}). Update the string to use valid JSON:
Suggested fix
const defaultExpected =
'[{"name": "search", "input": {"query": "weather"}}, ' +
'{"name": "book_flight", "input": {"flight": "NYC to Paris"}}, ' +
'{"name": "get_confirmation", "input": {"confirmation": "flight booked"}}]';
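The thread does not say how agent-tool-trajectory actually scores a run. Purely as a toy illustration of what comparing tool trajectories can mean (not Traceloop's algorithm), the sketch parses the suggested JSON and checks a run two ways: exact order of tool names, and expected names appearing in order with extra calls allowed between them. TrajectoryStep, exactOrderMatch, inOrderSubsequence and the check_calendar call are hypothetical.

```typescript
// One tool call in a trajectory (hypothetical shape mirroring the sample's keys).
interface TrajectoryStep {
  name: string;
  input: Record<string, string>;
}

// True when both runs call the same tools in exactly the same order.
function exactOrderMatch(actual: TrajectoryStep[], expected: TrajectoryStep[]): boolean {
  return (
    actual.length === expected.length &&
    actual.every((step, i) => step.name === expected[i].name)
  );
}

// True when every expected tool appears in order; extra actual calls are ignored.
function inOrderSubsequence(actual: TrajectoryStep[], expected: TrajectoryStep[]): boolean {
  let matched = 0;
  for (const step of actual) {
    if (matched < expected.length && step.name === expected[matched].name) {
      matched++;
    }
  }
  return matched === expected.length;
}

// The suggested defaultExpected value, parsed from JSON.
const expectedSteps: TrajectoryStep[] = JSON.parse(
  '[{"name": "search", "input": {"query": "weather"}}, ' +
    '{"name": "book_flight", "input": {"flight": "NYC to Paris"}}, ' +
    '{"name": "get_confirmation", "input": {"confirmation": "flight booked"}}]',
);

// A hypothetical run with one extra tool call between the expected ones.
const actualSteps: TrajectoryStep[] = [
  { name: "search", input: { query: "weather" } },
  { name: "check_calendar", input: { date: "tomorrow" } },
  { name: "book_flight", input: { flight: "NYC to Paris" } },
  { name: "get_confirmation", input: { confirmation: "flight booked" } },
];

console.log(exactOrderMatch(actualSteps, expectedSteps)); // false
console.log(inOrderSubsequence(actualSteps, expectedSteps)); // true
```

The strict check penalizes any detour, while the subsequence check only asks that the required tools ran in the right order; which behavior the real evaluator's config options select is not described in this thread.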
🤖 Prompt for AI Agents
In @packages/sample-app/src/sample_agent_tool_trajectory.ts around lines 52 -
55, The defaultExpected string uses Python-style single quotes and must be valid
JSON; update the constant defaultExpected to use double quotes for keys and
string values (e.g., '[{"name": "search", "input": {"query": "weather"}},
{"name": "book_flight", "input": {"flight": "NYC to Paris"}}, {"name":
"get_confirmation", "input": {"confirmation": "flight booked"}}]') so the
agent-tool-trajectory evaluator receives proper JSON.
Fixes: TLP-1615
Important
Add agent tool trajectory evaluation feature using Traceloop SDK with new script and SDK updates.
- run:agent_tool_trajectory script in package.json for running agent tool trajectory evaluation.
- sample_agent_tool_trajectory.ts demonstrates agent tool trajectory evaluation using Traceloop SDK.
- AgentToolTrajectoryConfig type in mbt-evaluators.ts.
- agentToolTrajectory() method in EvaluatorMadeByTraceloop class in mbt-evaluators.ts.
- EVALUATOR_SLUGS and EVALUATOR_SCHEMAS in registry.ts to include agent-tool-trajectory.
- AgentToolTrajectoryRequest schema in types.ts.
This description was created by
for 66360df. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit