
Releases: ai-twinkle/Eval

v2.8.0 — VLM Phase 1: Vision MCQ Evaluation

11 Apr 06:12


VLM Phase 1 — Vision MCQ Evaluation

The first Vision Language Model evaluation method is live. Twinkle Eval can now evaluate VLM performance on multimodal multiple-choice benchmarks (MMBench / MMStar / MMMU / POPE).

Added

  • vision_mcq evaluation method (Milestone #22, PR #134)
    • VisionMCQExtractor: supports letter answers (A–Z) and Yes/No binary answers (for hallucination-detection benchmarks such as POPE)
    • Parses \boxed{} / \box{} first (the standard output format of reasoning VLMs)
    • Uses a findall-then-take-last-match strategy, correctly handling the common case where a VLM echoes the option list before giving its final answer
    • Pattern order carefully tuned: parenthesized / bare-letter-at-end take priority over line-start, so "A) cat / B) dog / C) bird / D" is not misparsed as C
  • Evaluator vision routing
    • _encode_image_to_data_uri(): magic-byte MIME detection (PNG / JPEG / GIF / WebP / BMP), symlink resolution, 50MB size-limit guard
    • Supports local files (base64 data URI) and HTTP/HTTPS URLs (passed directly to OpenAI Chat Completions)
    • uses_vision = True flag routes requests automatically
  • 4 vision benchmarks: MMBench, MMStar, MMMU, POPE
  • Example dataset: datasets/example/vision_mcq/ (10 MMStar samples with jpg images)
  • docs/evals/vision_mcq.md: includes score and speed comparisons against VLMEvalKit
  • 61 vision_mcq tests (tests/test_vision_mcq.py), with many regression cases drawn from real VLM output formats
  • Optional dependency: vision = ["Pillow>=10.0.0"]

Changed

  • datasets/file.py: multimodal auxiliary assets (images, audio, video) are now counted and reported in a single log_info instead of per-file warnings, reducing log noise
  • CLAUDE.md §13 adds a mandatory Reviewer Agent rule: before any PR push, every coding agent must spawn an independent reviewer agent to inspect the diff, and blockers must be resolved before pushing

Usage

llm_api:
  type: "openai"
  base_url: "http://localhost:8000/v1"
  api_key: "your-api-key"

model:
  name: "your-vlm-model"
  max_tokens: 1024

evaluation:
  dataset_paths:
    - "datasets/example/vision_mcq/"
  evaluation_method: vision_mcq
  strategy_config:
    image_field: "image_path"
    max_image_size: null
    image_detail: "auto"

Requires a vision-capable OpenAI-compatible API endpoint (vLLM + Qwen2-VL, OpenAI GPT-4o, NVIDIA Build VLM, etc.).
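The local-file path described under Added (magic-byte MIME detection plus base64 data URI) can be sketched as below. This is illustrative rather than the package's actual code: the 50MB cap, supported formats, and symlink resolution follow the notes, but the function body is an assumption.

```python
import base64
from pathlib import Path

_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"BM": "image/bmp",
}
MAX_BYTES = 50 * 1024 * 1024  # 50MB size-limit guard

def encode_image_to_data_uri(path: str) -> str:
    data = Path(path).resolve().read_bytes()  # resolve() follows symlinks
    if len(data) > MAX_BYTES:
        raise ValueError(f"image exceeds {MAX_BYTES} bytes")
    mime = "application/octet-stream"
    for magic, m in _MAGIC.items():
        if data.startswith(magic):
            mime = m
            break
    # WebP is a RIFF container with "WEBP" at byte offset 8
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        mime = "image/webp"
    return f"data:{mime};base64,{base64.b64encode(data).decode()}"
```

Sniffing magic bytes instead of trusting the file extension keeps mislabeled images from producing an invalid data URI.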

v2.7.1 — Fix vLLM 0.18+ reasoning field compatibility

09 Apr 15:01


Fixed

  • vLLM 0.18+ reasoning field compatibility (PR #127 by @cyc00518): vLLM 0.18+ renamed reasoning_content to reasoning. A new _get_reasoning_text() helper reads reasoning first and falls back to reasoning_content only when it is None, keeping compatibility across vLLM <0.13 / 0.13.x / >=0.18.

Changed

  • CLAUDE.md now requires that every bug fix immediately ships with a PATCH version bump

Full Changelog: v2.7.0...v2.7.1

v2.7.0 — ASR Evaluation (WhisperModel + WER/CER)

07 Apr 16:11


Milestone #21: ASR — Automatic Speech Recognition Evaluation

New Features

  • WhisperModel: new Whisper API (/v1/audio/transcriptions) LLM backend, compatible with OpenAI, Groq, and faster-whisper-server
  • ASRExtractor + ASRScorer: automatically selects WER (English) or CER (Chinese/Japanese/Korean) by language, with a text-normalization pipeline
  • Chat Completions multimodal support: audio_url content via the existing OpenAIModel, supporting multimodal models such as Qwen2-Audio
  • 4 ASR benchmarks: LibriSpeech, Aishell-1, Fleurs, Common Voice (23 downloadable benchmarks in total)
  • Optional dependency: pip install twinkle-eval[asr] (jiwer)
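The language-based metric choice can be sketched as follows. The package relies on jiwer for scoring; the self-contained CER below (plain character-level Levenshtein) is only for illustration, and the zh/ja/ko set mirrors the languages named above.

```python
def pick_metric(language: str) -> str:
    # Character-based scripts get CER; word-segmented languages get WER
    return "cer" if language in {"zh", "ja", "ko"} else "wer"

def cer(ref: str, hyp: str) -> float:
    """Character error rate = edit distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # one-row Levenshtein DP
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)
```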

Benchmark Results

Tested with Breeze-ASR-25 on 50 Common Voice TW samples:

  • CER: 3.80%
  • Parallel speedup: 7.5x (vs sequential)

Full Changelog: v2.6.0...v2.7.0

v2.6.0 — Benchmark Download, CLI Enhancement, Regex Match

07 Apr 08:58


What's New

Benchmark Download Registry (Milestone #20)

  • --download-dataset supports 19 benchmark short names (mmlu, gsm8k, bbh, etc.)
  • --download-dataset all downloads everything in one command
  • --download-dataset list lists all available datasets
  • GitHub-based downloads supported (BIRD, Spider 2.0-lite, LongBench)
  • Interactive HF token prompt for gated datasets (GPQA)

CLI Enhancement (Milestone #19)

  • --init reworked: --init (list templates), --init <name> (single template), --init all (all templates)
  • --dry-run: preview the evaluation plan without calling the API
  • --validate: validate the config file and dataset format
  • --resume TIMESTAMP: resume an interrupted evaluation
  • Config templates moved into twinkle_eval/templates/

Regex Match Evaluation Method (Milestone #18)

  • RegexMatchExtractor + StringMatchScorer
  • BBH (BIG-Bench Hard) as the first use case
  • 66 unit tests

Full Changelog: v2.5.0...v2.6.0

v2.5.0 — Text-to-SQL Evaluation (Spider 1.0 / BIRD / Spider 2.0-lite)

27 Mar 03:33


Text-to-SQL Evaluation

Unified text2sql evaluation method supporting three major text-to-SQL benchmarks:

New Features

  • SQL Extractor: extracts SQL from LLM responses (```sql blocks, plain SELECT, mixed text)
  • SQL Scorer: two scoring modes
    • Execution Accuracy (EX): executes predicted + gold SQL against SQLite, compares result sets (default)
    • Exact Match (EM): normalized SQL string comparison
  • Read-only SQLite execution: mode=ro + PRAGMA query_only = ON
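The double read-only guard (URI mode=ro plus PRAGMA query_only) can be sketched as below, assuming a hypothetical execute_readonly helper:

```python
import sqlite3

def execute_readonly(db_path: str, sql: str) -> list:
    # mode=ro rejects writes at the connection level ...
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        # ... and query_only rejects them at the statement level
        conn.execute("PRAGMA query_only = ON")
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

Any predicted SQL that tries to mutate the benchmark database fails with an OperationalError instead of corrupting the gold data.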

Supported Benchmarks

| Benchmark | Example Size | Databases | Notes |
|---|---|---|---|
| Spider 1.0 | 10 rows | concert_singer, pets_1 | Cross-domain text-to-SQL |
| BIRD | 10 rows | california_schools, financial | With external knowledge (evidence) |
| Spider 2.0-lite | 10 rows | book_store | SQLite-only subset (85 questions in full) |

Spider 2.0: Only lite version supported — full version requires BigQuery/Snowflake cloud credentials.

Config Example

evaluation:
  evaluation_method: "text2sql"
  strategy_config:
    text2sql_scoring_mode: "exec"
    text2sql_db_base_path: "datasets/example/spider/databases"

Test Results (Devstral-Small-2-24B-Instruct-2512, EX mode)

| Dataset | Accuracy |
|---|---|
| Spider 1.0 | 90% |
| BIRD | 60% |
| Spider 2.0-lite | 80% |

Full Changelog: v2.4.0...v2.5.0

v2.4.0 — RAGAS Evaluation (LLM-as-Judge for RAG Pipelines)

26 Mar 14:40


New Benchmark: RAGAS (Retrieval-Augmented Generation Assessment)

Evaluate RAG pipeline quality using LLM-as-judge, measuring faithfulness, answer relevancy, context precision, and context recall.

Key Design

  • LLM-as-Judge: The model in config.yaml acts as the judge (not the model being evaluated)
  • Consolidated prompt: 1 LLM call per sample (vs the official RAGAS pipeline's 6–8 multi-step calls), roughly 1/6 the API cost
  • No extra dependencies: Self-implemented scoring logic, no langchain/ragas dependency
  • 36 tests: full coverage of extractor, scorer, presets, example dataset

Usage

evaluation:
  dataset_paths:
    - "datasets/example/ragas/"
  evaluation_method: "ragas"

4 Metrics (each 0.0–1.0)

| Metric | What it measures |
|---|---|
| faithfulness | Are response claims supported by retrieved context? |
| answer_relevancy | How relevant is the response to the question? |
| context_precision | Is the retrieved context relevant? |
| context_recall | Does context cover the reference answer? |

Example Dataset

10 rows from explodinggradients/WikiEval (5 good + 3 ungrounded + 2 poor answers) for sanity checking.

Full Changelog: v2.3.0...v2.4.0

v2.3.0 — NIAH (Needle in a Haystack) Benchmark

26 Mar 04:38


New Benchmark: NIAH (Needle in a Haystack)

Tests LLM long-context retrieval by inserting a "needle" fact at varying depths in a long "haystack" document.

Features

  • 3 data sources: Kamradt Original (EN), NeedleBench (ZH+EN), LongBench (ZH)
  • 3 scoring modes: substring match (default), exact match, token-level F1
  • Custom dataset generator: --generate-niah CLI tool to create NIAH tests from your own text
  • Config template: config.niah.template.yaml
  • 49 tests: full coverage of extractor, scorer, generator, presets, example datasets
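Token-level F1, the third scoring mode listed above, can be illustrated SQuAD-style (whitespace tokenization here is a simplification; the notes do not specify the actual tokenizer):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike substring or exact match, this gives partial credit when the model retrieves most but not all of the needle.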

Usage

evaluation:
  dataset_paths:
    - "datasets/example/niah/kamradt/"
  evaluation_method: "niah"

Generator

twinkle-eval --generate-niah \
    --haystack my_docs.txt \
    --needle "The secret code is 42." \
    --question "What is the secret code?" \
    --answer "42" \
    --context-lengths 1024,4096,16384 \
    --needle-depths 0,25,50,75,100

Other Changes

  • CLAUDE.md: require full pytest tests/ run before PR submission (§6.6, §13)

Full Changelog: v2.2.0...v2.3.0

v2.2.0 — BFCL v1 Function-Calling Evaluation

25 Mar 16:43


What's New

BFCL v1 — Berkeley Function-Calling Leaderboard

Adds BFCL v1 evaluation, with two modes for assessing a model's function-calling ability:

  • FC mode (bfcl_fc): uses the OpenAI tool_calls API to evaluate structured function-calling output
  • Prompting mode (bfcl_prompt): injects the function schema into the system prompt and parses function calls from text output (supports reasoning models' <think> blocks)
  • Supports the simple / multiple / parallel function-calling subtypes
  • AST-based structural scoring (function name + argument matching)
  • Dataset converter to transform raw BFCL data into the evaluation format
  • All 70 pytest tests pass
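The AST-based comparison can be sketched as below, assuming a hypothetical match_call helper that handles keyword arguments only (the actual scorer also covers positional arguments and the parallel/multiple subtypes):

```python
import ast

def match_call(predicted: str, expected_name: str, expected_args: dict) -> bool:
    # Parse the predicted string as a single Python expression
    try:
        node = ast.parse(predicted, mode="eval").body
    except SyntaxError:
        return False
    # It must be a plain call on a bare function name
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected_name:
        return False
    # Compare keyword-argument values structurally, not as strings
    try:
        got = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except ValueError:
        return False
    return got == expected_args
```

Comparing ASTs instead of raw strings means cosmetic differences (spacing, argument order) don't count as errors.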

Installation

pip install twinkle-eval[tool]

Config Example

evaluation:
  dataset_paths:
    - "datasets/bfcl_v1/simple/"
  evaluation_method: "bfcl_fc"   # or "bfcl_prompt"

Full Changelog: v2.1.0...v2.2.0

v2.1.0 — IFEval & IFBench Instruction-Following Evaluation

25 Mar 16:19


What's New

Adds two instruction-following benchmarks, giving Twinkle Eval full coverage of instruction-following evaluation.

IFEval — Google Instruction Following Evaluation

  • 25 verifiable instruction types (change_case, keywords, length_constraints, etc.)
  • Ported from the official Google Research implementation (Apache 2.0)
  • 4 metrics: strict/loose × prompt/instruction
  • evaluation_method: "ifeval"
  • pip install twinkle-eval[ifeval]

IFBench — AllenAI Instruction Following Benchmark (OOD)

  • 58 out-of-distribution instruction types in 7 categories (count, ratio, words, sentence, format, custom, repeat)
  • Ported from AllenAI IFBench (Apache 2.0, NeurIPS 2025)
  • Shares the strict/loose scoring framework with IFEval
  • evaluation_method: "ifbench"
  • pip install twinkle-eval[ifbench]

Score Parity (vs official tooling)

| Benchmark | Metric | Twinkle Eval | Official | Diff |
|---|---|---|---|---|
| IFEval (541 rows) | prompt_strict | 89.65% | 89.65% | +0.00% ✅ |
| IFEval (541 rows) | instruction_strict | 92.93% | 92.93% | +0.00% ✅ |
| IFBench (294 rows) | prompt_strict | 44.22% | 44.22% | +0.00% ✅ |
| IFBench (294 rows) | instruction_strict | 47.20% | 47.16% | +0.04% ✅ |

Other Changes

  • Evaluator supports both dataset formats: IFEval (JSON string) and IFBench (native list/dict)
  • CLAUDE.md adds §6.6: every new benchmark must ship with tests/test_{name}.py
  • 43 new IFBench pytest tests + 31 new IFEval pytest tests

Full Changelog: v2.0.0...v2.1.0

v2.0.0 — Modular Architecture Refactor: Extractor/Scorer Split

20 Mar 07:15


⚠️ Breaking Change — Major Version

This release is a full architectural refactor and is not backward compatible with v1.x import paths. If custom code uses the old paths, follow the migration guide below.


Major Changes

Extractor / Scorer Split (#37)

The original EvaluationStrategy has been split into two independent interfaces:

| Interface | Responsibility | Implementations |
|---|---|---|
| Extractor | Extracts the answer string from LLM output | PatternExtractor, BoxExtractor, LogitExtractor, MathExtractor, CustomRegexExtractor |
| Scorer | Normalizes the answer and judges correctness | ExactMatchScorer, MathRulerScorer |

The two are composed into named evaluation_methods via the PRESETS registry; users can also combine any Extractor + Scorer and pass them to Evaluator.
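A minimal sketch of this composition model, assuming method names extract()/score() (the notes name the classes and the PRESETS registry, but not the exact signatures):

```python
class Extractor:
    def extract(self, llm_output: str) -> str:
        raise NotImplementedError

class Scorer:
    def score(self, extracted: str, gold: str) -> bool:
        raise NotImplementedError

class PatternExtractor(Extractor):
    def extract(self, llm_output: str) -> str:
        # Trivial stand-in: take the last uppercase letter in the output
        letters = [c for c in llm_output if c.isupper()]
        return letters[-1] if letters else ""

class ExactMatchScorer(Scorer):
    def score(self, extracted: str, gold: str) -> bool:
        return extracted.strip() == gold.strip()

# A named evaluation_method is just an (Extractor, Scorer) pair
PRESETS = {"pattern": (PatternExtractor, ExactMatchScorer)}

extractor_cls, scorer_cls = PRESETS["pattern"]
ok = scorer_cls().score(extractor_cls().extract("The answer is C"), "C")
```

The split lets a new benchmark reuse an existing Scorer with a custom Extractor (or vice versa) instead of writing a monolithic strategy.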

Package Layout Cleanup

Modules previously scattered in the package root have been moved into their proper subpackages:

| Old path (removed) | New path |
|---|---|
| twinkle_eval/config.py | twinkle_eval/core/config.py |
| twinkle_eval/logger.py | twinkle_eval/core/logger.py |
| twinkle_eval/validators.py | twinkle_eval/core/validators.py |
| twinkle_eval/evaluators.py | twinkle_eval/runners/evaluator.py |
| twinkle_eval/benchmark.py | twinkle_eval/runners/benchmark.py |
| twinkle_eval/finalize.py | twinkle_eval/runners/finalize.py |
| twinkle_eval/hf_uploader.py | twinkle_eval/integrations/huggingface.py |

CLI --init Revamp

Instead of generating a single config.yaml, --init now creates a configs/ directory containing two templates:

  • configs/config.multiple_choice.template.yaml (for pattern / box / logit)
  • configs/config.math.template.yaml (for math evaluation)

New Notebooks

The notebooks/ directory provides two tutorials:

  • notebooks/01_multiple_choice.ipynb: complete multiple-choice evaluation tutorial
  • notebooks/02_math.ipynb: math evaluation tutorial

Migration Guide (v1.x → v2.0)

# Old (v1.x)
from twinkle_eval.evaluation_strategies import PatternMatchingStrategy
from twinkle_eval.evaluators import Evaluator

evaluator = Evaluator(evaluation_strategy=PatternMatchingStrategy())

# New (v2.0)
from twinkle_eval.metrics.extractors.pattern import PatternExtractor
from twinkle_eval.metrics.scorers.exact import ExactMatchScorer
from twinkle_eval.runners.evaluator import Evaluator

evaluator = Evaluator(extractor=PatternExtractor(), scorer=ExactMatchScorer())

The config.yaml format is unaffected; evaluation_method strings remain fully compatible.


Fixed

  • box evaluation: insufficient max_tokens could truncate reasoning (4096 recommended for math/box scenarios)
  • mathruler transitive dependency was undocumented (#35); users are now directed to pip install twinkle-eval[math]