Milestones

  • Add a regex_match evaluation method: a configurable regex extracts the full answer string, then scoring is by exact string match. BBH (BIG-Bench Hard) is the first user, but the method is not tied to any particular benchmark; any future benchmark with a fixed answer format plus exact-match scoring can use it.

    No due date
    6/6 issues closed
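A minimal sketch of the idea, assuming a standalone scoring function (the name regex_match_score and the default BBH-style pattern are illustrative, not the actual twinkle-eval API):

```python
import re

def regex_match_score(output: str, gold: str,
                      pattern: str = r"the answer is (.+?)\.?$") -> bool:
    """Extract the answer span with a configurable regex, then compare by exact string match."""
    m = re.search(pattern, output.strip(), flags=re.IGNORECASE | re.MULTILINE)
    if m is None:
        return False  # no extractable answer counts as incorrect
    return m.group(1).strip() == gold.strip()

# BBH-style usage: the model is expected to end with "the answer is X."
print(regex_match_score("Let's think step by step... the answer is (B).", "(B)"))  # True
```

Making the pattern a parameter is what decouples the method from BBH: each benchmark config supplies its own extraction regex.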
  • Make non-core heavyweight dependencies optional to shrink the base install. The two largest required deps today:
    - pyarrow (121 MB): pulled in by datasets, used only for HuggingFace downloads
    - google-api-python-client (93 MB): used only for the Google Drive/Sheets integration
    Goal: a core install (pip install twinkle-eval) of only ~150 MB, with full functionality available via extras.

    No due date
    0/1 issues closed
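One common way to wire this up is with pip extras in pyproject.toml; the extra names below (hf, google, all) are hypothetical, not the project's actual groups:

```toml
# pyproject.toml (sketch; extra names and version pins are illustrative)
[project.optional-dependencies]
hf = ["datasets", "pyarrow"]           # HuggingFace dataset downloads
google = ["google-api-python-client"]  # Google Drive/Sheets integration
all = ["twinkle-eval[hf,google]"]      # convenience: everything at once
```

Users would then run, e.g., pip install "twinkle-eval[hf]" to opt into the heavier feature sets, while the bare pip install twinkle-eval stays small.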
  • Support BFCL v3 evaluation: multi-turn function calling, measuring a model's ability to correctly call tools, consume tool results, and continue reasoning across consecutive dialogue turns.

    No due date
    0/2 issues closed
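As a rough illustration of what "multi-turn function calling" means here, a single test case has the model issue a tool call, receive the tool's result, and keep reasoning; the message shapes below follow the common OpenAI-style chat format, not necessarily BFCL's exact schema:

```python
# Sketch of one multi-turn function-calling exchange (schema is illustrative).
conversation = [
    {"role": "user",
     "content": "What's the weather in Taipei, and should I bring an umbrella?"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1",
         "function": {"name": "get_weather", "arguments": '{"city": "Taipei"}'}},
    ]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"condition": "rain", "temp_c": 22}'},
    # The model must read the tool result and continue reasoning in its next turn:
    {"role": "assistant",
     "content": "It's raining in Taipei (22°C), so yes, bring an umbrella."},
]

# An evaluator checks each assistant turn: correct function name, valid
# arguments, and whether the follow-up answer actually uses the tool output.
assert conversation[1]["tool_calls"][0]["function"]["name"] == "get_weather"
```

The v3 difficulty is that errors compound: a wrong call in turn 2 poisons every later turn, so scoring has to track state across the whole conversation rather than grade turns in isolation.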
  • Support BFCL v2 evaluation: introduces real-world API datasets (enterprise and OSS contributions), evaluating function-calling ability in more realistic deployment scenarios.

    No due date
  • Support BFCL v4 evaluation: covers three agentic subtasks (Web Search, Memory Management, and Format Sensitivity), evaluating a model's tool use in autonomous-agent settings.

    No due date
  • **Milestone Description**
    This milestone aims to introduce an LLM-as-Judge evaluation framework that leverages a large language model to score or rank system outputs based on predefined criteria. This approach enables more flexible and human-aligned evaluation for tasks where simple string matching or numeric scoring is insufficient. Key objectives include:
    - Designing a promptable evaluation API that uses an LLM to judge output quality
    - Supporting customizable scoring rubrics and evaluation dimensions
    - Allowing multiple judging strategies (e.g., numeric score, pairwise comparison, categorical labels)
    - Ensuring reproducibility through temperature control, system prompts, and deterministic settings
    - Providing baseline judge prompts for common tasks (e.g., helpfulness, correctness, style, safety)
    - Adding utilities for batching, retries, and cost tracking during judge evaluations
    This milestone will enable more human-like, instruction-aligned evaluation workflows across diverse tasks.

    No due date
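A minimal sketch of the prompt-and-parse core of such a judge, assuming a numeric 1–5 rubric; the template wording and function names are illustrative, not the planned API:

```python
import json

# Baseline judge prompt for a single numeric-score strategy (wording is illustrative).
JUDGE_TEMPLATE = """You are an impartial evaluator. Score the RESPONSE on {dimension} \
from 1 (poor) to 5 (excellent) against the CRITERIA. \
Reply with JSON only: {{"score": <int>, "reason": "<one sentence>"}}

CRITERIA: {criteria}
RESPONSE: {response}"""

def build_judge_prompt(response: str, dimension: str = "helpfulness",
                       criteria: str = "Directly and accurately answers the question.") -> str:
    """Fill the rubric template; dimension/criteria are the customizable knobs."""
    return JUDGE_TEMPLATE.format(dimension=dimension, criteria=criteria, response=response)

def parse_judgment(raw: str) -> dict:
    """Parse the judge model's JSON reply; treat malformed output as a failed judgment."""
    try:
        verdict = json.loads(raw)
        return {"score": int(verdict["score"]), "reason": str(verdict.get("reason", ""))}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": None, "reason": "unparseable judge output"}
```

The prompt would be sent to the judge model with temperature 0 for reproducibility; pairwise comparison and categorical labels are alternative templates plus parsers over the same skeleton, and retries/batching/cost tracking wrap the model call itself.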