A reproducible evaluation suite for speech-conditioned large language models.
speech-llm-bench is a small, opinionated framework for evaluating end-to-end
speech LLMs — models that take raw audio (and optionally text) as input and
emit text directly, without a frozen ASR front-end.
Most "speech LLM" papers report numbers on slightly different splits, with slightly different metrics, with slightly different prompts. This repo tries to fix that for a focused set of tasks I care about: ASR, spoken QA, audio captioning, prosody-aware understanding, and instruction following on speech.
- `lm-eval-harness` is text-only.
- Speech-task evaluation needs decoding control (no-CoT, force-language tag), audio loading, segment-aware WER, and reference normalisation, all of which vary by task. Doing this in ad-hoc scripts produces non-comparable numbers across papers.
- This repo tries to make the contract between a model wrapper and a task explicit, so adding a new model is ~30 lines and adding a new task is one YAML and a scoring function.
| Task | Metric | Datasets |
|---|---|---|
| ASR | WER / CER | LibriSpeech, AISHELL-1, FLEURS |
| Spoken QA | EM / F1 | Spoken-SQuAD, HeySQuAD |
| Audio captioning | BLEU / METEOR / SPIDEr | AudioCaps, Clotho |
| Speech translation | BLEU | CoVoST-2 (en→zh, zh→en, …) |
| Prosody QA | Accuracy | ProsodyQA (custom) |
| Instruction following | LLM-as-judge | InstructS2T (custom) |
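
For the ASR row, WER and CER are the same edit-distance rate computed over different units. The following is a minimal sketch, assuming whitespace tokenisation for words and plain Levenshtein distance. The function names are illustrative and are not taken from the repo's scorer, and the sketch leaves out segment-aware handling and reference normalisation:

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level WER (unit="word") or CER (unit="char"), in percent."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        if unit == "word":
            r, h = ref.split(), hyp.split()
        else:
            # Assumption: spaces are ignored when counting characters.
            r, h = list(ref.replace(" ", "")), list(hyp.replace(" ", ""))
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * errors / total
```

Errors and reference lengths are summed over the whole corpus before dividing, so long utterances weigh more than short ones. Averaging per-utterance rates instead would give a different number, which is one way results drift apart between papers.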

| Model | Wrapper | Notes |
|---|---|---|
| Qwen2-Audio | `models/qwen2_audio.py` | tested 7B-Instruct |
| Qwen-Audio | `models/qwen_audio.py` | original release |
| SALMONN | `models/salmonn.py` | 7B / 13B |
| Whisper-LLaMA | `models/whisper_llama.py` | local cascade baseline |
| LLaMA-Omni | `models/llama_omni.py` | streaming output |
| VITA | `models/vita.py` | partial (text-only outputs) |

```bash
git clone https://github.com/edmicho/speech-llm-bench
cd speech-llm-bench
pip install -e ".[full]"
# optional: install only ASR-related extras
# pip install -e ".[asr]"
```

`ffmpeg` is required on the system for audio decoding.
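
A missing ffmpeg may only surface once the first audio file is decoded, so it can help to check for it up front. This is a small standard-library sketch; `ffmpeg_available` is a hypothetical helper and not part of `slb`:

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None
```

Calling `ffmpeg_available()` before a long run and aborting when it returns `False` avoids discovering the problem partway through a dataset.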

```bash
slb run \
  --task asr/librispeech-clean \
  --model qwen2-audio-7b \
  --limit 200 \
  --output runs/qwen2-audio-7b__librispeech-clean.json
```

YAML configs let you bundle a model + tasks + decoding settings:

```bash
slb run-config configs/qwen2-audio-7b.yaml
```

Results land in `runs/` as JSON; a small CLI summarises:

```bash
slb summarise runs/*.json
```

To add a task, drop a YAML in `slb/tasks/`:

```yaml
name: asr/my-corpus
type: asr
dataset:
  loader: slb.data.local_jsonl
  args:
    path: data/my_corpus.jsonl
prompt: "Transcribe the speech."
decode:
  max_new_tokens: 256
  temperature: 0.0
metric: wer
norm: english_basic
```

Each task also needs a scoring function; for `wer` that is `scoring/wer.py`, which is already implemented. See `docs/adding_a_task.md`.
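
The config above points its loader at a JSONL file. As a rough sketch of what such a manifest reader could look like: the field names `audio` and `text`, the `Example` tuple, and the function name `read_jsonl_manifest` are all assumptions for illustration, since the actual schema expected by `slb.data.local_jsonl` is not stated here.

```python
import json
from pathlib import Path
from typing import Iterator, NamedTuple


class Example(NamedTuple):
    audio: str  # path to the audio file (assumed field name)
    text: str   # reference transcript (assumed field name)


def read_jsonl_manifest(path: str) -> Iterator[Example]:
    """Yield one Example per non-empty line of a JSONL manifest."""
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            try:
                example = Example(audio=record["audio"], text=record["text"])
            except KeyError as missing:
                # Report the file and line so a broken manifest is easy to fix.
                raise ValueError(
                    f"{path}:{lineno}: missing field {missing}"
                ) from None
            yield example
```

Under these assumptions, a manifest line would look like `{"audio": "clips/0001.wav", "text": "hello world"}`.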
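
To add a model, subclass `BaseSpeechLLM`:

```python
class MySpeechLLM(BaseSpeechLLM):
    name = "my-model"

    def load(self):
        ...  # load weights and any processor here

    def generate(self, audio, prompt, **decode):
        text = ...  # run inference and decode the output to a string
        return text
```

Register the class in `slb/models/__init__.py`. See `models/qwen2_audio.py` for a worked example.
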
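
To make the wrapper contract concrete, here is a self-contained sketch. The `BaseSpeechLLM` stand-in, the `MODEL_REGISTRY` dict, and the `register` decorator are assumptions for illustration; the real base class and the registration mechanism in `slb/models/__init__.py` may differ. The toy model ignores the audio and echoes part of the prompt, so it runs without any weights:

```python
from abc import ABC, abstractmethod

MODEL_REGISTRY = {}  # hypothetical: maps model name -> wrapper class


def register(cls):
    """Class decorator that records a wrapper under its `name`."""
    MODEL_REGISTRY[cls.name] = cls
    return cls


class BaseSpeechLLM(ABC):
    """Stand-in for the repo's base class (assumed interface)."""

    name = "base"

    @abstractmethod
    def load(self):
        """Load weights and any processor state."""

    @abstractmethod
    def generate(self, audio, prompt, **decode):
        """Return the model's text output for one audio input."""


@register
class EchoSpeechLLM(BaseSpeechLLM):
    name = "echo"

    def load(self):
        self.loaded = True

    def generate(self, audio, prompt, **decode):
        # A real wrapper would run inference here; this one only keeps
        # the first max_new_tokens whitespace tokens of the prompt.
        limit = decode.get("max_new_tokens", 256)
        return " ".join(prompt.split()[:limit])


model = MODEL_REGISTRY["echo"]()
model.load()
print(model.generate(audio=None, prompt="Transcribe the speech.", max_new_tokens=2))
# prints: Transcribe the
```

Passing decoding settings as keyword arguments mirrors the `decode:` block of a task config, so a task can hand its settings to any wrapper without knowing which model is behind it.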
Numbers are reproduced locally on a single A100-40G. See LEADERBOARD.md for
the full table; an excerpt:
| Model | LibriSpeech-clean (WER ↓) | AISHELL-1 (CER ↓) | AudioCaps (SPIDEr ↑) |
|---|---|---|---|
| Whisper-large-v3 cascade | 2.0 | 5.4 | — |
| Qwen-Audio | 2.0 | 6.7 | 38.7 |
| Qwen2-Audio-7B | 1.6 | 5.1 | 41.2 |
| SALMONN-7B | 2.5 | — | 39.1 |
| LLaMA-Omni | 2.4 | — | — |

Numbers above are my own runs and may differ from published values by up to ±0.3 points.
- All decoding is deterministic (`temperature=0.0`, `do_sample=False`) unless the task config overrides.
- Audio loading uses a fixed resampling pipeline (`torchaudio.functional.resample` with `kaiser_window`).
- Reference text normalisation is task-specific and versioned (see `slb/scoring/norm.py`).
- A `seeds.json` is written next to each result.
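
One way to read "task-specific and versioned" normalisation is a lookup keyed by name and version, so that older results can be rescored with the rules they were produced under. The sketch below is an assumption throughout: what `english_basic` does in `slb/scoring/norm.py` is not stated here, and the rules, the `NORM_VERSION` tag, and the `NORMALISERS` table are hypothetical.

```python
import re
import string

NORM_VERSION = "1"  # hypothetical version tag stored alongside each result

_PUNCT = str.maketrans("", "", string.punctuation)


def english_basic(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()


# Keyed by (name, version) so a rule change gets a new key instead of
# silently altering how earlier runs were scored.
NORMALISERS = {("english_basic", NORM_VERSION): english_basic}


def normalise(text: str, name: str, version: str = NORM_VERSION) -> str:
    """Apply the normaliser registered under (name, version)."""
    try:
        fn = NORMALISERS[(name, version)]
    except KeyError:
        raise KeyError(
            f"unknown normaliser {name!r} version {version!r}"
        ) from None
    return fn(text)
```

With these rules, `normalise("Hello,  World!", "english_basic")` gives `"hello world"`. Dropping all punctuation also removes apostrophes (`don't` becomes `dont`), which is the kind of choice that shifts WER and is worth pinning to a version.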
- ASR (WER/CER) on LibriSpeech, AISHELL-1, FLEURS
- Audio captioning on AudioCaps / Clotho
- Spoken QA
- Speech translation
- Streaming / latency metrics
- Long-form audio (>30s) with chunking
- Adversarial robustness suite

```bibtex
@misc{speechllmbench,
  author = {Zihao Wei},
  title = {SpeechLLM-Bench: a reproducible evaluation suite for speech LLMs},
  year = {2025},
  url = {https://github.com/edmicho/speech-llm-bench}
}
```

License: Apache-2.0.