Releases: ai-twinkle/Eval
v2.8.0 — VLM Phase 1: Vision MCQ Evaluation
The first Vision Language Model evaluation method has landed. Twinkle Eval can now evaluate VLM performance on multimodal multiple-choice benchmarks (MMBench / MMStar / MMMU / POPE).
Added
- `vision_mcq` evaluation method (Milestone #22, PR #134)
- `VisionMCQExtractor`: supports letter answers (A–Z) and Yes/No binary answers (for hallucination-detection benchmarks such as POPE)
  - Parses `\boxed{}` / `\box{}` first (the standard output format of reasoning VLMs)
  - Uses a findall + take-the-last-match strategy, correctly handling the common case where a VLM echoes the option list before giving its final answer
  - Pattern order is carefully tuned: parenthesized / bare-letter-at-end take precedence over line-start, so "A) cat / B) dog / C) bird / D" is not mis-captured as C
- Evaluator vision routing
  - `_encode_image_to_data_uri()`: magic-byte MIME detection (PNG / JPEG / GIF / WebP / BMP), symlink resolution, 50 MB size cap
  - Supports local files (base64 data URI) and HTTP/HTTPS URLs (passed through directly to OpenAI Chat Completions)
  - A `uses_vision = True` flag routes requests automatically
- 4 vision benchmarks: MMBench, MMStar, MMMU, POPE
- Example dataset: `datasets/example/vision_mcq/` (10 MMStar samples with jpg images)
- `docs/evals/vision_mcq.md`: includes score and speed comparisons against VLMEvalKit
- 61 vision_mcq tests (`tests/test_vision_mcq.py`), including regression cases for real VLM output formats
- Optional dependency `vision = ["Pillow>=10.0.0"]`
Changed
- `datasets/file.py`: multimodal companion assets (images, audio, video) are now tallied and logged once via `log_info`, instead of emitting a warning per file
- CLAUDE.md §13 adds a mandatory Reviewer Agent rule: before any PR push, every coding agent must spawn an independent reviewer agent to inspect the diff, and blockers must be resolved before pushing
Usage
```yaml
llm_api:
  type: "openai"
  base_url: "http://localhost:8000/v1"
  api_key: "your-api-key"
model:
  name: "your-vlm-model"
  max_tokens: 1024
evaluation:
  dataset_paths:
    - "datasets/example/vision_mcq/"
  evaluation_method: vision_mcq
  strategy_config:
    image_field: "image_path"
    max_image_size: null
    image_detail: "auto"
```

Requires a vision-capable OpenAI-compatible API endpoint (vLLM + Qwen2-VL, OpenAI GPT-4o, NVIDIA Build VLM, etc.).
v2.7.1 — Fix vLLM 0.18+ reasoning field compatibility
Fixed
- vLLM 0.18+ reasoning field compatibility (PR #127 by @cyc00518): vLLM 0.18+ renamed `reasoning_content` to `reasoning`. A new `_get_reasoning_text()` helper reads `reasoning` first and falls back to `reasoning_content` when it is None, keeping compatibility across vLLM <0.13, 0.13.x, and >=0.18.
Changed
- CLAUDE.md now requires that every bug fix immediately ship with a PATCH version bump
Full Changelog: v2.7.0...v2.7.1
v2.7.0 — ASR Evaluation (WhisperModel + WER/CER)
Milestone #21: ASR — Automatic Speech Recognition Evaluation
New Features
- WhisperModel: new Whisper API (`/v1/audio/transcriptions`) LLM backend, compatible with OpenAI, Groq, and faster-whisper-server
- ASRExtractor + ASRScorer: automatically selects WER (English) or CER (Chinese/Japanese/Korean) by language, with a text-normalization pipeline
- Chat Completions multimodal support: `audio_url` content through the existing OpenAIModel, enabling multimodal models such as Qwen2-Audio
- 4 ASR benchmarks: LibriSpeech, Aishell-1, Fleurs, Common Voice (23 downloadable benchmarks in total)
- Optional dependency: `pip install twinkle-eval[asr]` (jiwer)
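The WER-vs-CER choice works roughly as sketched below, in pure Python for illustration. The shipped ASRScorer uses jiwer and a richer normalization pipeline, so treat the tokenization and language list here as assumptions.

```python
def _edit_distance(ref: list, hyp: list) -> int:
    # Classic Levenshtein distance over tokens (rolling-row DP).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def asr_error_rate(ref: str, hyp: str, lang: str) -> float:
    if lang in {"zh", "ja", "ko"}:
        # Character-based languages: CER over characters, spaces dropped.
        r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
    else:
        # Word-based languages: WER over whitespace-split tokens.
        r, h = ref.split(), hyp.split()
    return _edit_distance(r, h) / max(len(r), 1)
```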
Benchmark Results
Tested with Breeze-ASR-25 on 50 Common Voice TW samples:
- CER: 3.80%
- Parallel speedup: 7.5× (vs sequential)
Full Changelog
v2.6.0 — Benchmark Download, CLI Enhancement, Regex Match
What's New
Benchmark Download Registry (Milestone #20)
- `--download-dataset` supports 19 benchmark short names (mmlu, gsm8k, bbh, etc.)
- `--download-dataset all` downloads everything in one command
- `--download-dataset list` lists all available datasets
- GitHub-based downloads supported (BIRD, Spider 2.0-lite, LongBench)
- Interactive HF token prompt for gated datasets (GPQA)
CLI Enhancement (Milestone #19)
- `--init` reworked: `--init` (list templates), `--init <name>` (single template), `--init all` (all templates)
- `--dry-run`: preview the evaluation plan without calling the API
- `--validate`: validate the config file and dataset format
- `--resume TIMESTAMP`: resume an interrupted evaluation from its checkpoint
- Config templates moved into `twinkle_eval/templates/`
Regex Match Evaluation Method (Milestone #18)
- `RegexMatchExtractor` + `StringMatchScorer`
- BBH (BIG-Bench Hard) as the first use case
- 66 unit tests
Full Changelog: v2.5.0...v2.6.0
v2.5.0 — Text-to-SQL Evaluation (Spider 1.0 / BIRD / Spider 2.0-lite)
Text-to-SQL Evaluation
Unified text2sql evaluation method supporting three major text-to-SQL benchmarks:
New Features
- SQL Extractor: extracts SQL from LLM responses (```sql blocks, plain SELECT, mixed text)
- SQL Scorer: two scoring modes
- Execution Accuracy (EX): executes predicted + gold SQL against SQLite, compares result sets (default)
- Exact Match (EM): normalized SQL string comparison
- Read-only SQLite execution: `mode=ro` + `PRAGMA query_only = ON`
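Execution accuracy under a read-only connection can be sketched as below. The order-insensitive multiset comparison is an assumption for illustration; the shipped scorer may normalize result sets differently.

```python
import sqlite3

def open_readonly(db_path: str) -> sqlite3.Connection:
    # SQLite URI with mode=ro plus PRAGMA query_only blocks all writes.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    conn.execute("PRAGMA query_only = ON")
    return conn

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    conn = open_readonly(db_path)
    try:
        gold = sorted(map(repr, conn.execute(gold_sql).fetchall()))
        try:
            pred = sorted(map(repr, conn.execute(pred_sql).fetchall()))
        except sqlite3.Error:
            return False  # invalid predicted SQL scores 0
        # Compare result sets as sorted multisets, ignoring row order.
        return pred == gold
    finally:
        conn.close()
```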
Supported Benchmarks
| Benchmark | Example Size | Databases | Notes |
|---|---|---|---|
| Spider 1.0 | 10 rows | concert_singer, pets_1 | Cross-domain text-to-SQL |
| BIRD | 10 rows | california_schools, financial | With external knowledge (evidence) |
| Spider 2.0-lite | 10 rows | book_store | SQLite-only subset (85 questions in full) |
Spider 2.0: Only lite version supported — full version requires BigQuery/Snowflake cloud credentials.
Config Example
```yaml
evaluation:
  evaluation_method: "text2sql"
  strategy_config:
    text2sql_scoring_mode: "exec"
    text2sql_db_base_path: "datasets/example/spider/databases"
```

Test Results (Devstral-Small-2-24B-Instruct-2512, EX mode)
| Dataset | Accuracy |
|---|---|
| Spider 1.0 | 90% |
| BIRD | 60% |
| Spider 2.0-lite | 80% |
Full Changelog: v2.4.0...v2.5.0
v2.4.0 — RAGAS Evaluation (LLM-as-Judge for RAG Pipelines)
New Benchmark: RAGAS (Retrieval-Augmented Generation Assessment)
Evaluate RAG pipeline quality using LLM-as-judge, measuring faithfulness, answer relevancy, context precision, and context recall.
Key Design
- LLM-as-Judge: The model in config.yaml acts as the judge (not the model being evaluated)
- Consolidated prompt: 1 LLM call per sample (vs official RAGAS's 6-8 multi-step calls), ~1/6 API cost
- No extra dependencies: Self-implemented scoring logic, no langchain/ragas dependency
- 36 tests: full coverage of extractor, scorer, presets, example dataset
Usage
```yaml
evaluation:
  dataset_paths:
    - "datasets/example/ragas/"
  evaluation_method: "ragas"
```

4 Metrics (each 0.0–1.0)
| Metric | What it measures |
|---|---|
| `faithfulness` | Are response claims supported by retrieved context? |
| `answer_relevancy` | How relevant is the response to the question? |
| `context_precision` | Is the retrieved context relevant? |
| `context_recall` | Does context cover the reference answer? |
Example Dataset
10 rows from explodinggradients/WikiEval (5 good + 3 ungrounded + 2 poor answers) for sanity checking.
Full Changelog: v2.3.0...v2.4.0
v2.3.0 — NIAH (Needle in a Haystack) Benchmark
New Benchmark: NIAH (Needle in a Haystack)
Tests LLM long-context retrieval by inserting a "needle" fact at varying depths in a long "haystack" document.
Features
- 3 data sources: Kamradt Original (EN), NeedleBench (ZH+EN), LongBench (ZH)
- 3 scoring modes: substring match (default), exact match, token-level F1
- Custom dataset generator: `--generate-niah` CLI tool to create NIAH tests from your own text
- Config template: `config.niah.template.yaml`
- 49 tests: full coverage of extractor, scorer, generator, presets, example datasets
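Depth-based needle insertion amounts to something like the sketch below. The sentence-boundary snapping is an assumption for readability; the real generator also trims the haystack to each target context length.

```python
def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Insert `needle` at `depth_percent` (0-100) of the way through the text."""
    pos = int(len(haystack) * depth_percent / 100)
    # Snap forward to the end of the current sentence, if one follows,
    # so the needle is not spliced into the middle of a sentence.
    boundary = haystack.find(". ", pos)
    if boundary != -1:
        pos = boundary + 2
    return haystack[:pos] + needle + " " + haystack[pos:]
```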
Usage
```yaml
evaluation:
  dataset_paths:
    - "datasets/example/niah/kamradt/"
  evaluation_method: "niah"
```

Generator
```shell
twinkle-eval --generate-niah \
  --haystack my_docs.txt \
  --needle "The secret code is 42." \
  --question "What is the secret code?" \
  --answer "42" \
  --context-lengths 1024,4096,16384 \
  --needle-depths 0,25,50,75,100
```

Other Changes
- CLAUDE.md: require a full `pytest tests/` run before PR submission (§6.6, §13)
Full Changelog: v2.2.0...v2.3.0
v2.2.0 — BFCL v1 Function-Calling Evaluation
What's New
BFCL v1 — Berkeley Function-Calling Leaderboard
Adds BFCL v1 evaluation with two modes for assessing a model's function-calling ability:
- FC mode (`bfcl_fc`): uses the OpenAI tool_calls API to evaluate structured function-calling output
- Prompting mode (`bfcl_prompt`): injects the function schema into the system prompt and parses the function call from text output (supports reasoning models' `<think>` block)
- Supports the simple / multiple / parallel function-calling subtypes
- AST-based structural scoring (function name + argument matching)
- Dataset converter transforms raw BFCL data into the evaluation format
- All 70 pytest tests passing
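AST-style matching can be sketched as below. Function names and the keyword-only comparison are illustrative assumptions; the real scorer also handles type coercion and the multiple/parallel subtypes.

```python
import ast

def parse_call(call_text: str) -> tuple[str, dict]:
    """Parse 'f(x="1")' into (name, kwargs); only keyword args are compared."""
    node = ast.parse(call_text, mode="eval").body
    assert isinstance(node, ast.Call)
    name = ast.unparse(node.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def calls_match(pred_text: str, gold_text: str) -> bool:
    try:
        # Dict equality makes argument order irrelevant.
        return parse_call(pred_text) == parse_call(gold_text)
    except (SyntaxError, ValueError, AssertionError):
        return False  # unparsable prediction scores 0
```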
Installation

```shell
pip install twinkle-eval[tool]
```

Config Example
```yaml
evaluation:
  dataset_paths:
    - "datasets/bfcl_v1/simple/"
  evaluation_method: "bfcl_fc"  # or "bfcl_prompt"
```

Full Changelog: v2.1.0...v2.2.0
v2.1.0 — IFEval & IFBench Instruction-Following Evaluation
What's New
Two new instruction-following benchmarks give Twinkle Eval full coverage of instruction-following evaluation.
IFEval — Google Instruction Following Evaluation
- 25 verifiable instruction types (`change_case`, `keywords`, `length_constraints`, etc.)
- Ported from the official Google Research implementation (Apache 2.0)
- 4 metrics: strict/loose × prompt/instruction
- Enable with `evaluation_method: "ifeval"`; install with `pip install twinkle-eval[ifeval]`
IFBench — AllenAI Instruction Following Benchmark (OOD)
- 58 out-of-distribution instruction types across 7 categories (count, ratio, words, sentence, format, custom, repeat)
- Ported from AllenAI IFBench (Apache 2.0, NeurIPS 2025)
- Shares the strict/loose scoring framework with IFEval
- Enable with `evaluation_method: "ifbench"`; install with `pip install twinkle-eval[ifbench]`
Score Parity (vs. official tools)
| Benchmark | Metric | Twinkle Eval | Official | Diff |
|---|---|---|---|---|
| IFEval (541 rows) | prompt_strict | 89.65% | 89.65% | +0.00% ✅ |
| IFEval (541 rows) | instruction_strict | 92.93% | 92.93% | +0.00% ✅ |
| IFBench (294 rows) | prompt_strict | 44.22% | 44.22% | +0.00% ✅ |
| IFBench (294 rows) | instruction_strict | 47.20% | 47.16% | +0.04% ✅ |
Other Changes
- Evaluator supports both IFEval (JSON string) and IFBench (native list/dict) dataset formats
- CLAUDE.md adds §6.6: every new benchmark must ship with `tests/test_{name}.py`
- Adds 43 IFBench pytest tests + 31 IFEval pytest tests
Full Changelog: v2.0.0...v2.1.0
v2.0.0 — Modular Architecture Refactor: Extractor/Scorer Split
⚠️ Breaking Change — Major Version
This release is a complete architectural refactor and is not backward compatible with v1.x import paths. If you have custom code using the old paths, update it following the migration guide below.
Key Changes
Extractor / Scorer Split (#37)
The original EvaluationStrategy has been split into two independent interfaces:
| Interface | Responsibility | Implementations |
|---|---|---|
| `Extractor` | Extracts the answer string from LLM output | `PatternExtractor`, `BoxExtractor`, `LogitExtractor`, `MathExtractor`, `CustomRegexExtractor` |
| `Scorer` | Normalizes the answer and judges correctness | `ExactMatchScorer`, `MathRulerScorer` |
The two are composed into named evaluation_methods through the PRESETS registry; users can also combine any Extractor + Scorer pair and pass it to the Evaluator.
Package Layout Cleanup
Modules previously scattered in the package root have been moved into their proper subpackages:
| Old path (removed) | New path |
|---|---|
| `twinkle_eval/config.py` | `twinkle_eval/core/config.py` |
| `twinkle_eval/logger.py` | `twinkle_eval/core/logger.py` |
| `twinkle_eval/validators.py` | `twinkle_eval/core/validators.py` |
| `twinkle_eval/evaluators.py` | `twinkle_eval/runners/evaluator.py` |
| `twinkle_eval/benchmark.py` | `twinkle_eval/runners/benchmark.py` |
| `twinkle_eval/finalize.py` | `twinkle_eval/runners/finalize.py` |
| `twinkle_eval/hf_uploader.py` | `twinkle_eval/integrations/huggingface.py` |
CLI --init Revamp
Instead of generating a single config.yaml, --init now creates a configs/ directory containing two templates:
- `configs/config.multiple_choice.template.yaml` (for pattern / box / logit)
- `configs/config.math.template.yaml` (for math evaluation)
New Notebooks
The notebooks/ directory provides two tutorials:
- `notebooks/01_multiple_choice.ipynb`: complete multiple-choice evaluation walkthrough
- `notebooks/02_math.ipynb`: math evaluation tutorial
Migration Guide (v1.x → v2.0)
```python
# Old (v1.x)
from twinkle_eval.evaluation_strategies import PatternMatchingStrategy
from twinkle_eval.evaluators import Evaluator

evaluator = Evaluator(evaluation_strategy=PatternMatchingStrategy())

# New (v2.0)
from twinkle_eval.metrics.extractors.pattern import PatternExtractor
from twinkle_eval.metrics.scorers.exact import ExactMatchScorer
from twinkle_eval.runners.evaluator import Evaluator

evaluator = Evaluator(extractor=PatternExtractor(), scorer=ExactMatchScorer())
```

The config.yaml format is unaffected; `evaluation_method` strings remain fully compatible.
Fixed
- box evaluation: insufficient `max_tokens` could truncate reasoning (4096 recommended for math/box scenarios)
- mathruler's transitive dependency was undocumented (#35); users are now directed to `pip install twinkle-eval[math]`