
Releases: ai-twinkle/Eval

v2.8.0 — VLM Phase 1: Vision MCQ Evaluation

11 Apr 06:12


VLM Phase 1 — Vision MCQ Evaluation

The first Vision Language Model evaluation method is live. Twinkle Eval can now evaluate VLM performance on multimodal multiple-choice benchmarks (MMBench / MMStar / MMMU / POPE).

Added

  • vision_mcq evaluation method (Milestone #22, PR #134)
    • VisionMCQExtractor: supports letter answers (A–Z) and Yes/No binary answers (for hallucination-detection benchmarks such as POPE)
    • Parses \boxed{} / \box{} first (the standard output format of reasoning VLMs)
    • Uses a findall-then-take-last-match strategy, correctly handling the common case where a VLM echoes the option list before giving its final answer
    • Pattern order carefully tuned: parenthesized / bare-letter-at-end take priority over line-start, so "A) cat / B) dog / C) bird / D" is not misparsed as C
  • Evaluator vision routing
    • _encode_image_to_data_uri(): magic-byte MIME detection (PNG / JPEG / GIF / WebP / BMP), symlink resolution, 50MB size-limit guard
    • Supports local files (base64 data URI) and HTTP/HTTPS URLs (passed directly to OpenAI Chat Completions)
    • uses_vision = True flag routes requests automatically
  • 4 vision benchmarks: MMBench, MMStar, MMMU, POPE
  • Example dataset: datasets/example/vision_mcq/ (10 MMStar samples with jpg images)
  • docs/evals/vision_mcq.md: includes score and speed comparisons against VLMEvalKit
  • 61 vision_mcq tests (tests/test_vision_mcq.py), with many regression cases drawn from real VLM output formats
  • Optional dependency: vision = ["Pillow>=10.0.0"]

Changed

  • datasets/file.py: multimodal auxiliary assets (images, audio, video) are now counted and reported in a single log_info instead of per-file warnings, reducing log noise
  • CLAUDE.md §13 adds a mandatory Reviewer Agent rule: before any PR push, every coding agent must spawn an independent reviewer agent to inspect the diff, and blockers must be resolved before pushing

Usage

llm_api:
  type: "openai"
  base_url: "http://localhost:8000/v1"
  api_key: "your-api-key"

model:
  name: "your-vlm-model"
  max_tokens: 1024

evaluation:
  dataset_paths:
    - "datasets/example/vision_mcq/"
  evaluation_method: vision_mcq
  strategy_config:
    image_field: "image_path"
    max_image_size: null
    image_detail: "auto"

Requires a vision-capable OpenAI-compatible API endpoint (vLLM + Qwen2-VL, OpenAI GPT-4o, NVIDIA Build VLM, etc.).
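The local-file path described under Added (magic-byte MIME detection plus base64 data URI) can be sketched as below. This is illustrative rather than the package's actual code: the 50MB cap, supported formats, and symlink resolution follow the notes, but the function body is an assumption.

```python
import base64
from pathlib import Path

_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"BM": "image/bmp",
}
MAX_BYTES = 50 * 1024 * 1024  # 50MB size-limit guard

def encode_image_to_data_uri(path: str) -> str:
    data = Path(path).resolve().read_bytes()  # resolve() follows symlinks
    if len(data) > MAX_BYTES:
        raise ValueError(f"image exceeds {MAX_BYTES} bytes")
    mime = "application/octet-stream"
    for magic, m in _MAGIC.items():
        if data.startswith(magic):
            mime = m
            break
    # WebP is a RIFF container with "WEBP" at byte offset 8
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        mime = "image/webp"
    return f"data:{mime};base64,{base64.b64encode(data).decode()}"
```

Sniffing magic bytes instead of trusting the file extension keeps mislabeled images from producing an invalid data URI.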

v2.7.1 — Fix vLLM 0.18+ reasoning field compatibility

09 Apr 15:01


Fixed

  • vLLM 0.18+ reasoning field compatibility (PR #127 by @cyc00518): vLLM 0.18+ renamed reasoning_content to reasoning. A new _get_reasoning_text() helper reads reasoning first and falls back to reasoning_content only when it is None, keeping compatibility across vLLM <0.13 / 0.13.x / >=0.18.

Changed

  • CLAUDE.md now requires that every bug fix immediately ships with a PATCH version bump

Full Changelog: v2.7.0...v2.7.1

v2.7.0 — ASR Evaluation (WhisperModel + WER/CER)

07 Apr 16:11


Milestone #21: ASR — Automatic Speech Recognition Evaluation

New Features

  • WhisperModel: new Whisper API (/v1/audio/transcriptions) LLM backend, compatible with OpenAI, Groq, and faster-whisper-server
  • ASRExtractor + ASRScorer: automatically selects WER (English) or CER (Chinese/Japanese/Korean) by language, with a text-normalization pipeline
  • Chat Completions multimodal support: audio_url content via the existing OpenAIModel, supporting multimodal models such as Qwen2-Audio
  • 4 ASR benchmarks: LibriSpeech, Aishell-1, Fleurs, Common Voice (23 downloadable benchmarks in total)
  • Optional dependency: pip install twinkle-eval[asr] (jiwer)
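The language-based metric choice can be sketched as follows. The package relies on jiwer for scoring; the self-contained CER below (plain character-level Levenshtein) is only for illustration, and the zh/ja/ko set mirrors the languages named above.

```python
def pick_metric(language: str) -> str:
    # Character-based scripts get CER; word-segmented languages get WER
    return "cer" if language in {"zh", "ja", "ko"} else "wer"

def cer(ref: str, hyp: str) -> float:
    """Character error rate = edit distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # one-row Levenshtein DP
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)
```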

Benchmark Results

Tested with Breeze-ASR-25 on 50 Common Voice TW samples:

  • CER: 3.80%
  • Parallel speedup: 7.5x (vs sequential)

Full Changelog: v2.6.0...v2.7.0

v2.6.0 — Benchmark Download, CLI Enhancement, Regex Match

07 Apr 08:58


What's New

Benchmark Download Registry (Milestone #20)

  • --download-dataset supports 19 benchmark short names (mmlu, gsm8k, bbh, etc.)
  • --download-dataset all downloads everything in one command
  • --download-dataset list lists all available datasets
  • GitHub-based downloads supported (BIRD, Spider 2.0-lite, LongBench)
  • Interactive HF token prompt for gated datasets (GPQA)

CLI Enhancement (Milestone #19)

  • --init reworked: --init (list templates), --init <name> (single template), --init all (all templates)
  • --dry-run: preview the evaluation plan without calling the API
  • --validate: validate the config file and dataset format
  • --resume TIMESTAMP: resume an interrupted evaluation
  • Config templates moved into twinkle_eval/templates/

Regex Match Evaluation Method (Milestone #18)

  • RegexMatchExtractor + StringMatchScorer
  • BBH (BIG-Bench Hard) as the first use case
  • 66 unit tests

Full Changelog: v2.5.0...v2.6.0

v2.5.0 — Text-to-SQL Evaluation (Spider 1.0 / BIRD / Spider 2.0-lite)

27 Mar 03:33


Text-to-SQL Evaluation

Unified text2sql evaluation method supporting three major text-to-SQL benchmarks:

New Features

  • SQL Extractor: extracts SQL from LLM responses (```sql blocks, plain SELECT, mixed text)
  • SQL Scorer: two scoring modes
    • Execution Accuracy (EX): executes predicted + gold SQL against SQLite, compares result sets (default)
    • Exact Match (EM): normalized SQL string comparison
  • Read-only SQLite execution: mode=ro + PRAGMA query_only = ON
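The double read-only guard (URI mode=ro plus PRAGMA query_only) can be sketched as below, assuming a hypothetical execute_readonly helper:

```python
import sqlite3

def execute_readonly(db_path: str, sql: str) -> list:
    # mode=ro rejects writes at the connection level ...
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        # ... and query_only rejects them at the statement level
        conn.execute("PRAGMA query_only = ON")
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

Any predicted SQL that tries to mutate the benchmark database fails with an OperationalError instead of corrupting the gold data.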

Supported Benchmarks

| Benchmark | Example Size | Databases | Notes |
|---|---|---|---|
| Spider 1.0 | 10 rows | concert_singer, pets_1 | Cross-domain text-to-SQL |
| BIRD | 10 rows | california_schools, financial | With external knowledge (evidence) |
| Spider 2.0-lite | 10 rows | book_store | SQLite-only subset (85 questions in full) |

Spider 2.0: Only lite version supported — full version requires BigQuery/Snowflake cloud credentials.

Config Example

evaluation:
  evaluation_method: "text2sql"
  strategy_config:
    text2sql_scoring_mode: "exec"
    text2sql_db_base_path: "datasets/example/spider/databases"

Test Results (Devstral-Small-2-24B-Instruct-2512, EX mode)

| Dataset | Accuracy |
|---|---|
| Spider 1.0 | 90% |
| BIRD | 60% |
| Spider 2.0-lite | 80% |

Full Changelog: v2.4.0...v2.5.0

v2.4.0 — RAGAS Evaluation (LLM-as-Judge for RAG Pipelines)

26 Mar 14:40


New Benchmark: RAGAS (Retrieval-Augmented Generation Assessment)

Evaluate RAG pipeline quality using LLM-as-judge, measuring faithfulness, answer relevancy, context precision, and context recall.

Key Design

  • LLM-as-Judge: The model in config.yaml acts as the judge (not the model being evaluated)
  • Consolidated prompt: 1 LLM call per sample (vs the official RAGAS pipeline's 6–8 multi-step calls), roughly 1/6 the API cost
  • No extra dependencies: Self-implemented scoring logic, no langchain/ragas dependency
  • 36 tests: full coverage of extractor, scorer, presets, example dataset

Usage

evaluation:
  dataset_paths:
    - "datasets/example/ragas/"
  evaluation_method: "ragas"

4 Metrics (each 0.0–1.0)

| Metric | What it measures |
|---|---|
| faithfulness | Are response claims supported by retrieved context? |
| answer_relevancy | How relevant is the response to the question? |
| context_precision | Is the retrieved context relevant? |
| context_recall | Does context cover the reference answer? |

Example Dataset

10 rows from explodinggradients/WikiEval (5 good + 3 ungrounded + 2 poor answers) for sanity checking.

Full Changelog: v2.3.0...v2.4.0

v2.3.0 — NIAH (Needle in a Haystack) Benchmark

26 Mar 04:38


New Benchmark: NIAH (Needle in a Haystack)

Tests LLM long-context retrieval by inserting a "needle" fact at varying depths in a long "haystack" document.

Features

  • 3 data sources: Kamradt Original (EN), NeedleBench (ZH+EN), LongBench (ZH)
  • 3 scoring modes: substring match (default), exact match, token-level F1
  • Custom dataset generator: --generate-niah CLI tool to create NIAH tests from your own text
  • Config template: config.niah.template.yaml
  • 49 tests: full coverage of extractor, scorer, generator, presets, example datasets
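Token-level F1, the third scoring mode listed above, can be illustrated SQuAD-style (whitespace tokenization here is a simplification; the notes do not specify the actual tokenizer):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike substring or exact match, this gives partial credit when the model retrieves most but not all of the needle.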

Usage

evaluation:
  dataset_paths:
    - "datasets/example/niah/kamradt/"
  evaluation_method: "niah"

Generator

twinkle-eval --generate-niah \
    --haystack my_docs.txt \
    --needle "The secret code is 42." \
    --question "What is the secret code?" \
    --answer "42" \
    --context-lengths 1024,4096,16384 \
    --needle-depths 0,25,50,75,100

Other Changes

  • CLAUDE.md: require full pytest tests/ run before PR submission (§6.6, §13)

Full Changelog: v2.2.0...v2.3.0

v2.2.0 — BFCL v1 Function-Calling Evaluation

25 Mar 16:43


What's New

BFCL v1 — Berkeley Function-Calling Leaderboard

Adds BFCL v1 evaluation, with two modes for assessing a model's function-calling ability:

  • FC mode (bfcl_fc): uses the OpenAI tool_calls API to evaluate structured function-calling output
  • Prompting mode (bfcl_prompt): injects the function schema into the system prompt and parses function calls from text output (supports reasoning models' <think> blocks)
  • Supports the simple / multiple / parallel function-calling subtypes
  • AST-based structural scoring (function name + argument matching)
  • Dataset converter to transform raw BFCL data into the evaluation format
  • All 70 pytest tests pass
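The AST-based comparison can be sketched as below, assuming a hypothetical match_call helper that handles keyword arguments only (the actual scorer also covers positional arguments and the parallel/multiple subtypes):

```python
import ast

def match_call(predicted: str, expected_name: str, expected_args: dict) -> bool:
    # Parse the predicted string as a single Python expression
    try:
        node = ast.parse(predicted, mode="eval").body
    except SyntaxError:
        return False
    # It must be a plain call on a bare function name
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected_name:
        return False
    # Compare keyword-argument values structurally, not as strings
    try:
        got = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except ValueError:
        return False
    return got == expected_args
```

Comparing ASTs instead of raw strings means cosmetic differences (spacing, argument order) don't count as errors.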

Installation

pip install twinkle-eval[tool]

Config Example

evaluation:
  dataset_paths:
    - "datasets/bfcl_v1/simple/"
  evaluation_method: "bfcl_fc"   # or "bfcl_prompt"

Full Changelog: v2.1.0...v2.2.0

v2.1.0 — IFEval & IFBench Instruction-Following Evaluation

25 Mar 16:19


What's New

Adds two instruction-following benchmarks, giving Twinkle Eval full coverage of instruction-following evaluation.

IFEval — Google Instruction Following Evaluation

  • 25 verifiable instruction types (change_case, keywords, length_constraints, etc.)
  • Ported from the official Google Research implementation (Apache 2.0)
  • 4 metrics: strict/loose × prompt/instruction
  • evaluation_method: "ifeval"
  • pip install twinkle-eval[ifeval]

IFBench — AllenAI Instruction Following Benchmark (OOD)

  • 58 out-of-distribution instruction types in 7 categories (count, ratio, words, sentence, format, custom, repeat)
  • Ported from AllenAI IFBench (Apache 2.0, NeurIPS 2025)
  • Shares the strict/loose scoring framework with IFEval
  • evaluation_method: "ifbench"
  • pip install twinkle-eval[ifbench]

Score Parity (vs official tooling)

| Benchmark | Metric | Twinkle Eval | Official | Diff |
|---|---|---|---|---|
| IFEval (541 rows) | prompt_strict | 89.65% | 89.65% | +0.00% ✅ |
| IFEval (541 rows) | instruction_strict | 92.93% | 92.93% | +0.00% ✅ |
| IFBench (294 rows) | prompt_strict | 44.22% | 44.22% | +0.00% ✅ |
| IFBench (294 rows) | instruction_strict | 47.20% | 47.16% | +0.04% ✅ |

Other Changes

  • Evaluator supports both dataset formats: IFEval (JSON string) and IFBench (native list/dict)
  • CLAUDE.md adds §6.6: every new benchmark must ship with tests/test_{name}.py
  • 43 new IFBench pytest tests + 31 new IFEval pytest tests

Full Changelog: v2.0.0...v2.1.0

v2.0.0 — Modular Architecture Refactor: Extractor/Scorer Split

20 Mar 07:15


⚠️ Breaking Change — Major Version

This release is a full architectural refactor and is not backward compatible with v1.x import paths. If custom code uses the old paths, follow the migration guide below.


Major Changes

Extractor / Scorer Split (#37)

The original EvaluationStrategy has been split into two independent interfaces:

| Interface | Responsibility | Implementations |
|---|---|---|
| Extractor | Extracts the answer string from LLM output | PatternExtractor, BoxExtractor, LogitExtractor, MathExtractor, CustomRegexExtractor |
| Scorer | Normalizes the answer and judges correctness | ExactMatchScorer, MathRulerScorer |

The two are composed into named evaluation_methods via the PRESETS registry; users can also combine any Extractor + Scorer and pass them to Evaluator.
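A minimal sketch of this composition model, assuming method names extract()/score() (the notes name the classes and the PRESETS registry, but not the exact signatures):

```python
class Extractor:
    def extract(self, llm_output: str) -> str:
        raise NotImplementedError

class Scorer:
    def score(self, extracted: str, gold: str) -> bool:
        raise NotImplementedError

class PatternExtractor(Extractor):
    def extract(self, llm_output: str) -> str:
        # Trivial stand-in: take the last uppercase letter in the output
        letters = [c for c in llm_output if c.isupper()]
        return letters[-1] if letters else ""

class ExactMatchScorer(Scorer):
    def score(self, extracted: str, gold: str) -> bool:
        return extracted.strip() == gold.strip()

# A named evaluation_method is just an (Extractor, Scorer) pair
PRESETS = {"pattern": (PatternExtractor, ExactMatchScorer)}

extractor_cls, scorer_cls = PRESETS["pattern"]
ok = scorer_cls().score(extractor_cls().extract("The answer is C"), "C")
```

The split lets a new benchmark reuse an existing Scorer with a custom Extractor (or vice versa) instead of writing a monolithic strategy.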

Package Layout Cleanup

Modules previously scattered in the package root have been moved into their proper subpackages:

| Old path (removed) | New path |
|---|---|
| twinkle_eval/config.py | twinkle_eval/core/config.py |
| twinkle_eval/logger.py | twinkle_eval/core/logger.py |
| twinkle_eval/validators.py | twinkle_eval/core/validators.py |
| twinkle_eval/evaluators.py | twinkle_eval/runners/evaluator.py |
| twinkle_eval/benchmark.py | twinkle_eval/runners/benchmark.py |
| twinkle_eval/finalize.py | twinkle_eval/runners/finalize.py |
| twinkle_eval/hf_uploader.py | twinkle_eval/integrations/huggingface.py |

CLI --init Revamp

Instead of generating a single config.yaml, --init now creates a configs/ directory containing two templates:

  • configs/config.multiple_choice.template.yaml (for pattern / box / logit)
  • configs/config.math.template.yaml (for math evaluation)

New Notebooks

The notebooks/ directory provides two tutorials:

  • notebooks/01_multiple_choice.ipynb: complete multiple-choice evaluation tutorial
  • notebooks/02_math.ipynb: math evaluation tutorial

Migration Guide (v1.x → v2.0)

# Old (v1.x)
from twinkle_eval.evaluation_strategies import PatternMatchingStrategy
from twinkle_eval.evaluators import Evaluator

evaluator = Evaluator(evaluation_strategy=PatternMatchingStrategy())

# New (v2.0)
from twinkle_eval.metrics.extractors.pattern import PatternExtractor
from twinkle_eval.metrics.scorers.exact import ExactMatchScorer
from twinkle_eval.runners.evaluator import Evaluator

evaluator = Evaluator(extractor=PatternExtractor(), scorer=ExactMatchScorer())

The config.yaml format is unaffected; evaluation_method strings remain fully compatible.


Fixed

  • box evaluation: insufficient max_tokens could truncate reasoning (4096 recommended for math/box scenarios)
  • mathruler transitive dependency was undocumented (#35); users are now directed to pip install twinkle-eval[math]