SpeechLLM-Bench

A reproducible evaluation suite for speech-conditioned large language models.

Install · Tasks · Run · Leaderboard · Citation

What

speech-llm-bench is a small, opinionated framework for evaluating end-to-end speech LLMs — models that take raw audio (and optionally text) as input and emit text directly, without a frozen ASR front-end.

Most "speech LLM" papers report numbers on slightly different splits, with slightly different metrics, with slightly different prompts. This repo tries to fix that for a focused set of tasks I care about: ASR, spoken QA, audio captioning, prosody-aware understanding, and instruction following on speech.

Why not just use HF `evaluate` / lm-eval-harness?

lm-eval-harness is text-only.
Speech-task evaluation needs decoding control (no-CoT, forced language tag), audio loading, segment-aware WER, and reference normalisation, all of which vary by task. Doing this in ad-hoc scripts produces numbers that are not comparable across papers.
This repo tries to make the contract between a model wrapper and a task explicit, so adding a new model is ~30 lines and adding a new task is one YAML and a scoring function.

Tasks

| Task | Metric | Datasets |
| --- | --- | --- |
| ASR | WER / CER | LibriSpeech, AISHELL-1, FLEURS |
| Spoken QA | EM / F1 | Spoken-SQuAD, HeySQuAD |
| Audio captioning | BLEU / METEOR / SPIDEr | AudioCaps, Clotho |
| Speech translation | BLEU | CoVoST-2 (en→zh, zh→en, …) |
| Prosody QA | Accuracy | ProsodyQA (custom) |
| Instruction following | LLM-as-judge | InstructS2T (custom) |
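
For the Spoken QA row, EM and F1 are commonly computed SQuAD-style on normalised answer strings. The sketch below illustrates that convention only; the function names and the normalisation steps are illustrative assumptions and are not necessarily what slb's scoring code does.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, a prediction of "in Paris France" against the reference "Paris" gets EM 0 but partial F1 credit, which is why the two are reported together.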

Supported models

| Model | Wrapper | Notes |
| --- | --- | --- |
| Qwen2-Audio | `models/qwen2_audio.py` | tested 7B-Instruct |
| Qwen-Audio | `models/qwen_audio.py` | original release |
| SALMONN | `models/salmonn.py` | 7B / 13B |
| Whisper-LLaMA | `models/whisper_llama.py` | local cascade baseline |
| LLaMA-Omni | `models/llama_omni.py` | streaming output |
| VITA | `models/vita.py` | partial (text-only outputs) |

Install

git clone https://github.com/edmicho/speech-llm-bench
cd speech-llm-bench
pip install -e ".[full]"
# optional: install only ASR-related extras
# pip install -e ".[asr]"

ffmpeg is required on the system for audio decoding.

Running an evaluation

slb run \
    --task asr/librispeech-clean \
    --model qwen2-audio-7b \
    --limit 200 \
    --output runs/qwen2-audio-7b__librispeech-clean.json

YAML configs let you bundle a model + tasks + decoding settings:

slb run-config configs/qwen2-audio-7b.yaml

Results land in runs/ as JSON; a small CLI command summarises them:

slb summarise runs/*.json

Adding a new task

Drop a YAML in slb/tasks/:

name: asr/my-corpus
type: asr
dataset:
  loader: slb.data.local_jsonl
  args:
    path: data/my_corpus.jsonl
prompt: "Transcribe the speech."
decode:
  max_new_tokens: 256
  temperature: 0.0
metric: wer
norm: english_basic

The metric named in the YAML needs a corresponding scoring function; for wer that is scoring/wer.py, which is already implemented. See docs/adding_a_task.md.
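
To give a sense of what a WER scoring function involves, here is a minimal sketch: normalise both strings, compute a word-level edit distance, and divide total errors by total reference words. This is not the contents of scoring/wer.py; `normalize_basic` is a hypothetical stand-in for the `english_basic` norm, and corpus-level aggregation is an assumption.

```python
import re


def normalize_basic(text: str) -> list[str]:
    """Lowercase, replace anything but letters, digits, apostrophes and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between token sequences, using one rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = normalize_basic(ref)
        hyp_tokens = normalize_basic(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    if words == 0:
        raise ValueError("no reference words to score against")
    return errors / words
```

Summing errors over the corpus before dividing weights long utterances more heavily than averaging per-utterance WER would; which of the two a paper uses is one of the details that makes reported numbers hard to compare.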

Adding a new model

Subclass BaseSpeechLLM:

class MySpeechLLM(BaseSpeechLLM):
    name = "my-model"

    def load(self):
        ...

    def generate(self, audio, prompt, **decode):
        ...
        return text

Register the class in slb/models/__init__.py. See models/qwen2_audio.py for a worked example.
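
As a toy illustration of the wrapper contract, the sketch below defines a stand-in `BaseSpeechLLM` with only the two methods shown above, plus a wrapper that echoes the prompt. The stand-in base class, `EchoSpeechLLM`, and `MODEL_REGISTRY` are all hypothetical; the real base class and registration mechanism live in slb and may differ.

```python
from abc import ABC, abstractmethod


class BaseSpeechLLM(ABC):
    """Stand-in for slb's base class; only the methods shown in this README."""

    name: str = ""

    @abstractmethod
    def load(self) -> None:
        ...

    @abstractmethod
    def generate(self, audio, prompt: str, **decode) -> str:
        ...


class EchoSpeechLLM(BaseSpeechLLM):
    """Toy wrapper that ignores the audio and echoes the prompt."""

    name = "echo-model"

    def load(self) -> None:
        # A real wrapper would load model weights and a processor here.
        self.loaded = True

    def generate(self, audio, prompt: str, **decode) -> str:
        # Decoding settings from the task config arrive as keyword arguments.
        max_new_tokens = decode.get("max_new_tokens", 256)
        # Whitespace-separated words stand in for tokens in this toy example.
        words = f"[echo] {prompt}".split()
        return " ".join(words[:max_new_tokens])


# Hypothetical registry keyed by model name; the real one may look different.
MODEL_REGISTRY = {cls.name: cls for cls in (EchoSpeechLLM,)}

model = MODEL_REGISTRY["echo-model"]()
model.load()
print(model.generate(audio=None, prompt="Transcribe the speech.",
                     max_new_tokens=256, temperature=0.0))
```

Passing decode settings as keyword arguments lets a wrapper pick up the keys it understands (here only `max_new_tokens`) and ignore the rest.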

Leaderboard

Numbers are reproduced locally on a single A100-40G. See LEADERBOARD.md for the full table; an excerpt:

| Model | LibriSpeech-clean (WER ↓) | AISHELL-1 (CER ↓) | AudioCaps (SPIDEr ↑) |
| --- | --- | --- | --- |
| Whisper-large-v3 cascade | 2.0 | 5.4 | n/a |
| Qwen-Audio | 2.0 | 6.7 | 38.7 |
| Qwen2-Audio-7B | 1.6 | 5.1 | 41.2 |
| SALMONN-7B | 2.5 | n/a | 39.1 |
| LLaMA-Omni | 2.4 | n/a | n/a |

Numbers above are my own runs and may differ from published values by up to ±0.3.

Reproducibility

All decoding is deterministic (temperature=0.0, do_sample=False) unless the task config overrides.
Audio loading uses a fixed resampling pipeline (torchaudio.functional.resample with kaiser_window).
Reference text normalisation is task-specific and versioned (see slb/scoring/norm.py).
A seeds.json is written next to each result.
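
The last point can be pictured roughly as follows. This is only a sketch of the idea of recording seeds beside a result file: the helper name and the JSON schema are assumptions, and a real run would also need to seed NumPy and PyTorch.

```python
import json
import random
from pathlib import Path


def seed_and_record(result_path: Path, seed: int = 0) -> Path:
    """Seed Python's RNG and write a seeds.json beside the result file."""
    random.seed(seed)
    seeds_path = result_path.parent / "seeds.json"
    # Hypothetical schema: one entry per random number generator that was seeded.
    seeds_path.write_text(json.dumps({"python_random": seed}, indent=2))
    return seeds_path
```

Re-running with the seed read back from seeds.json then reproduces the same random draws for any component that still samples.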

Roadmap

ASR (WER/CER) on LibriSpeech, AISHELL-1, FLEURS
Audio captioning on AudioCaps / Clotho
Spoken QA
Speech translation
Streaming / latency metrics
Long-form audio (>30s) with chunking
Adversarial robustness suite

Citation

@misc{speechllmbench,
  author = {Zihao Wei},
  title  = {SpeechLLM-Bench: a reproducible evaluation suite for speech LLMs},
  year   = {2025},
  url    = {https://github.com/edmicho/speech-llm-bench}
}

License

Apache-2.0.
