Milestones

  • Add a regex_match evaluation method: a configurable regex extracts the full answer string, then scoring is by exact string match. BBH (BIG-Bench Hard) is the first user, but the method is not tied to any particular benchmark; any future benchmark with a fixed answer format plus exact-match scoring can use it.

    No due date
    6/6 issues closed
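A minimal sketch of the idea, assuming a standalone scoring function (the name regex_match_score and the default BBH-style pattern are illustrative, not the actual twinkle-eval API):

```python
import re

def regex_match_score(output: str, gold: str,
                      pattern: str = r"the answer is (.+?)\.?$") -> bool:
    """Extract the answer span with a configurable regex, then compare by exact string match."""
    m = re.search(pattern, output.strip(), flags=re.IGNORECASE | re.MULTILINE)
    if m is None:
        return False  # no extractable answer counts as incorrect
    return m.group(1).strip() == gold.strip()

# BBH-style usage: the model is expected to end with "the answer is X."
print(regex_match_score("Let's think step by step... the answer is (B).", "(B)"))  # True
```

Making the pattern a parameter is what decouples the method from BBH: each benchmark config supplies its own extraction regex.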
  • Make non-core heavyweight dependencies optional to shrink the base install. The two largest required deps today:
    - pyarrow (121 MB): pulled in by datasets, used only for HuggingFace downloads
    - google-api-python-client (93 MB): used only for the Google Drive/Sheets integration
    Goal: a core install (pip install twinkle-eval) of only ~150 MB, with full functionality available via extras.

    No due date
    0/1 issues closed
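One common way to wire this up is with pip extras in pyproject.toml; the extra names below (hf, google, all) are hypothetical, not the project's actual groups:

```toml
# pyproject.toml (sketch; extra names and version pins are illustrative)
[project.optional-dependencies]
hf = ["datasets", "pyarrow"]           # HuggingFace dataset downloads
google = ["google-api-python-client"]  # Google Drive/Sheets integration
all = ["twinkle-eval[hf,google]"]      # convenience: everything at once
```

Users would then run, e.g., pip install "twinkle-eval[hf]" to opt into the heavier feature sets, while the bare pip install twinkle-eval stays small.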
  • Support BFCL v3 evaluation: multi-turn function calling, measuring a model's ability to correctly call tools, consume tool results, and continue reasoning across consecutive dialogue turns.

    No due date
    0/2 issues closed
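As a rough illustration of what "multi-turn function calling" means here, a single test case has the model issue a tool call, receive the tool's result, and keep reasoning; the message shapes below follow the common OpenAI-style chat format, not necessarily BFCL's exact schema:

```python
# Sketch of one multi-turn function-calling exchange (schema is illustrative).
conversation = [
    {"role": "user",
     "content": "What's the weather in Taipei, and should I bring an umbrella?"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1",
         "function": {"name": "get_weather", "arguments": '{"city": "Taipei"}'}},
    ]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"condition": "rain", "temp_c": 22}'},
    # The model must read the tool result and continue reasoning in its next turn:
    {"role": "assistant",
     "content": "It's raining in Taipei (22°C), so yes, bring an umbrella."},
]

# An evaluator checks each assistant turn: correct function name, valid
# arguments, and whether the follow-up answer actually uses the tool output.
assert conversation[1]["tool_calls"][0]["function"]["name"] == "get_weather"
```

The v3 difficulty is that errors compound: a wrong call in turn 2 poisons every later turn, so scoring has to track state across the whole conversation rather than grade turns in isolation.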
  • Support BFCL v2 evaluation: introduces real-world API datasets (enterprise and OSS contributions), evaluating function-calling ability in more realistic deployment scenarios.

    No due date
  • Support BFCL v4 evaluation: covers three agentic subtasks (Web Search, Memory Management, and Format Sensitivity), evaluating a model's tool use in autonomous-agent settings.

    No due date
  • **Milestone Description**
    This milestone aims to introduce an LLM-as-Judge evaluation framework that leverages a large language model to score or rank system outputs based on predefined criteria. This approach enables more flexible and human-aligned evaluation for tasks where simple string matching or numeric scoring is insufficient. Key objectives include:
    - Designing a promptable evaluation API that uses an LLM to judge output quality
    - Supporting customizable scoring rubrics and evaluation dimensions
    - Allowing multiple judging strategies (e.g., numeric score, pairwise comparison, categorical labels)
    - Ensuring reproducibility through temperature control, system prompts, and deterministic settings
    - Providing baseline judge prompts for common tasks (e.g., helpfulness, correctness, style, safety)
    - Adding utilities for batching, retries, and cost tracking during judge evaluations
    This milestone will enable more human-like, instruction-aligned evaluation workflows across diverse tasks.

    No due date
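A minimal sketch of the prompt-and-parse core of such a judge, assuming a numeric 1–5 rubric; the template wording and function names are illustrative, not the planned API:

```python
import json

# Baseline judge prompt for a single numeric-score strategy (wording is illustrative).
JUDGE_TEMPLATE = """You are an impartial evaluator. Score the RESPONSE on {dimension} \
from 1 (poor) to 5 (excellent) against the CRITERIA. \
Reply with JSON only: {{"score": <int>, "reason": "<one sentence>"}}

CRITERIA: {criteria}
RESPONSE: {response}"""

def build_judge_prompt(response: str, dimension: str = "helpfulness",
                       criteria: str = "Directly and accurately answers the question.") -> str:
    """Fill the rubric template; dimension/criteria are the customizable knobs."""
    return JUDGE_TEMPLATE.format(dimension=dimension, criteria=criteria, response=response)

def parse_judgment(raw: str) -> dict:
    """Parse the judge model's JSON reply; treat malformed output as a failed judgment."""
    try:
        verdict = json.loads(raw)
        return {"score": int(verdict["score"]), "reason": str(verdict.get("reason", ""))}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": None, "reason": "unparseable judge output"}
```

The prompt would be sent to the judge model with temperature 0 for reproducibility; pairwise comparison and categorical labels are alternative templates plus parsers over the same skeleton, and retries/batching/cost tracking wrap the model call itself.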