Are the official leaderboard results derived from this tool?

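1. Suspected serious bug: when evaluate.py:load_trajectory loads the full trajectory file results/merged_trajectories.jsonl to match data for each case, it does not use session_id/instance_id as a unique identifier. The trajectory returned by load_trajectory is therefore ambiguous and can be matched against the wrong checklist.
2. The trajectory-processing logic does not limit the number of conversation turns.
3. Does the evaluation tool lack concurrency support, so that it can only run serially? Also, if other requests are sent to litellm at the same time, will they contaminate the trajectory data of the current task?
4. Issues related to agents and images:
* Some claude code task images contain proxy configuration, which may cause API connections to fail;
* Environment variables are misconfigured, and in some scenarios the error "claude: command not found" appears;
* Droid Agent Custom Model cannot run tasks because it lacks an official API-KEY;
* The agent in the kilo-dev environment cannot run tasks because of environment misconfiguration.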
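
To make points 1 and 2 concrete, here is a minimal sketch of the kind of lookup I would expect. It is not the actual code in evaluate.py. The function name `select_trajectory`, the record fields `instance_id` and `messages`, and the `max_turns` parameter are all assumptions on my side, since I do not know the real schema of results/merged_trajectories.jsonl.

```python
import json
from typing import Iterable, Optional


def select_trajectory(lines: Iterable[str], instance_id: str,
                      max_turns: Optional[int] = None) -> dict:
    """Return the one record whose instance_id matches, or raise.

    Assumed (hypothetical) schema: one JSON object per line, with an
    "instance_id" string and a "messages" list.
    """
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL file
        record = json.loads(line)
        if record.get("instance_id") == instance_id:
            matches.append(record)

    # Fail loudly on zero or duplicate matches instead of silently
    # picking some record and scoring it against the wrong checklist.
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly 1 trajectory for {instance_id!r}, "
            f"found {len(matches)}"
        )

    record = dict(matches[0])  # shallow copy so the input is not mutated
    if max_turns is not None:
        # Optional cap on conversation turns (point 2).
        record["messages"] = record.get("messages", [])[:max_turns]
    return record
```

With something like this, each case would open results/merged_trajectories.jsonl and pass the file object plus its own identifier, so a missing or duplicated identifier shows up as an error rather than as a wrong score.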
