SpeechLLM-Bench

A reproducible evaluation suite for speech-conditioned large language models.

Install · Tasks · Run · Leaderboard · Citation


What

speech-llm-bench is a small, opinionated framework for evaluating end-to-end speech LLMs — models that take raw audio (and optionally text) as input and emit text directly, without a frozen ASR front-end.

Most "speech LLM" papers report numbers on slightly different splits, with slightly different metrics and slightly different prompts. This repo tries to fix that for a focused set of tasks I care about: ASR, spoken QA, audio captioning, speech translation, prosody-aware understanding, and instruction following on speech.

Why not just use HF evaluate / lm-eval-harness?

  • lm-eval-harness is text-only.
  • Speech-task evaluation needs decoding control (no CoT, forced language tags), audio loading, segment-aware WER, and reference normalisation, all of which vary by task. Doing this in ad-hoc scripts produces numbers that are not comparable across papers.
  • This repo tries to make the contract between a model wrapper and a task explicit, so adding a new model is ~30 lines and adding a new task is one YAML and a scoring function.

Tasks

| Task | Metric | Datasets |
|---|---|---|
| ASR | WER / CER | LibriSpeech, AISHELL-1, FLEURS |
| Spoken QA | EM / F1 | Spoken-SQuAD, Heysquad |
| Audio captioning | BLEU / METEOR / SPIDEr | AudioCaps, Clotho |
| Speech translation | BLEU | CoVoST-2 (en→zh, zh→en, …) |
| Prosody QA | Accuracy | ProsodyQA (custom) |
| Instruction Following | LLM-as-judge | InstructS2T (custom) |
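
For readers unfamiliar with the Spoken QA metrics, here is a minimal sketch of EM and token-level F1 in the usual SQuAD style. It is an illustration only: the normalisation rules (lowercasing, dropping punctuation and English articles) and the function names `normalise_answer`, `exact_match`, and `token_f1` are assumptions, not necessarily what this repo's scoring code does.

```python
import re
import string
from collections import Counter


def normalise_answer(text):
    """Assumed SQuAD-style normalisation: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, reference):
    """1.0 if the normalised token sequences are identical, else 0.0."""
    return float(normalise_answer(prediction) == normalise_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalisation."""
    pred = normalise_answer(prediction)
    ref = normalise_answer(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "in Paris France" against the reference "Paris" gets EM 0 but a non-zero F1, which is why the two metrics are reported together.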

Supported models

| Model | Wrapper | Notes |
|---|---|---|
| Qwen2-Audio | models/qwen2_audio.py | tested 7B-Instruct |
| Qwen-Audio | models/qwen_audio.py | original release |
| SALMONN | models/salmonn.py | 7B / 13B |
| Whisper-LLaMA | models/whisper_llama.py | local cascade baseline |
| LLaMA-Omni | models/llama_omni.py | streaming output |
| VITA | models/vita.py | partial, text-only outputs |

Install

git clone https://github.com/edmicho/speech-llm-bench
cd speech-llm-bench
pip install -e ".[full]"
# optional: install only ASR-related extras
# pip install -e ".[asr]"

ffmpeg is required on the system for audio decoding.

Running an evaluation

slb run \
    --task asr/librispeech-clean \
    --model qwen2-audio-7b \
    --limit 200 \
    --output runs/qwen2-audio-7b__librispeech-clean.json

YAML configs let you bundle a model + tasks + decoding settings:

slb run-config configs/qwen2-audio-7b.yaml

Results land in runs/ as JSON; a small CLI summarises:

slb summarise runs/*.json

Adding a new task

Drop a YAML in slb/tasks/:

name: asr/my-corpus
type: asr
dataset:
  loader: slb.data.local_jsonl
  args:
    path: data/my_corpus.jsonl
prompt: "Transcribe the speech."
decode:
  max_new_tokens: 256
  temperature: 0.0
metric: wer
norm: english_basic

The wer metric is scored by scoring/wer.py, which is already implemented, so this task needs no new scoring code. See docs/adding_a_task.md.
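
To make the `metric: wer` and `norm: english_basic` fields concrete, here is a rough sketch of a corpus-level WER scorer with a basic English normaliser. It is not the code in scoring/wer.py; the function names and the normalisation rules (lowercase, strip ASCII punctuation, split on whitespace) are assumptions made for illustration.

```python
import string


def normalise_english_basic(text):
    """Hypothetical stand-in for an english_basic normaliser."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words, over all utterances."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    words = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref = normalise_english_basic(ref_text)
        hyp = normalise_english_basic(hyp_text)
        errors += edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("references contain no words after normalisation")
    return errors / words
```

Summing errors over the corpus before dividing, rather than averaging per-utterance WER, keeps short utterances from dominating the score; whether a given task wants one or the other is the kind of choice the task config is meant to pin down.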

Adding a new model

Subclass BaseSpeechLLM:

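class MySpeechLLM(BaseSpeechLLM):
    name = "my-model"

    def load(self):
        # load model weights and any processors here
        ...

    def generate(self, audio, prompt, **decode):
        # run model-specific inference and return the generated text
        text = ...
        return text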

Register the class in slb/models/__init__.py. See models/qwen2_audio.py for a worked example.
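
As a self-contained illustration of the load/generate contract, the sketch below defines a stand-in `BaseSpeechLLM` and a toy `EchoSpeechLLM` that needs no weights. Everything here is hypothetical: the real base class lives in this repo and may carry more methods, and the assumption that `audio` arrives as a sequence of samples is mine.

```python
from abc import ABC, abstractmethod


class BaseSpeechLLM(ABC):
    """Stand-in for the repo's base class; models only the two methods shown above."""

    name = "base"

    @abstractmethod
    def load(self):
        ...

    @abstractmethod
    def generate(self, audio, prompt, **decode):
        ...


class EchoSpeechLLM(BaseSpeechLLM):
    """Toy wrapper for smoke tests: echoes the prompt instead of running a model."""

    name = "echo-debug"

    def __init__(self):
        self.loaded = False

    def load(self):
        # A real wrapper would load weights and processors here.
        self.loaded = True

    def generate(self, audio, prompt, **decode):
        if not self.loaded:
            raise RuntimeError("call load() before generate()")
        # Treat max_new_tokens as a word budget, purely for illustration.
        max_new_tokens = decode.get("max_new_tokens", 256)
        words = f"[{len(audio)} samples] {prompt}".split()
        return " ".join(words[:max_new_tokens])


model = EchoSpeechLLM()
model.load()
# One second of silence at an assumed 16 kHz sample rate.
print(model.generate([0.0] * 16000, "Transcribe the speech.",
                     max_new_tokens=256, temperature=0.0))
```

A wrapper like this can be handy for checking that a new task config runs end to end before pointing it at a real model.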

Leaderboard

Numbers are reproduced locally on a single A100-40G. See LEADERBOARD.md for the full table; an excerpt:

| Model | LibriSpeech-clean (WER ↓) | AISHELL-1 (CER ↓) | AudioCaps (SPIDEr ↑) |
|---|---|---|---|
| Whisper-large-v3 cascade | 2.0 | 5.4 | - |
| Qwen-Audio | 2.0 | 6.7 | 38.7 |
| Qwen2-Audio-7B | 1.6 | 5.1 | 41.2 |
| SALMONN-7B | 2.5 | - | 39.1 |
| LLaMA-Omni | 2.4 | - | - |

Numbers above are from my own runs and may differ from published values by up to ±0.3.

Reproducibility

  • All decoding is deterministic (temperature=0.0, do_sample=False) unless the task config overrides.
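  • Audio loading uses a fixed resampling pipeline (torchaudio.functional.resample with Kaiser-windowed sinc interpolation; the method is named "kaiser_window" in older torchaudio releases and "sinc_interp_kaiser" in newer ones).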
  • Reference text normalisation is task-specific and versioned (see slb/scoring/norm.py).
  • A seeds.json is written next to each result.
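
The "deterministic unless the task config overrides" rule can be pictured as a simple merge of defaults with the task's decode block. This is a sketch of that idea only; `DEFAULT_DECODE` and `resolve_decode` are hypothetical names, and the repo's actual merging logic may differ.

```python
# Defaults taken from the reproducibility notes above.
DEFAULT_DECODE = {"temperature": 0.0, "do_sample": False}


def resolve_decode(task_decode=None):
    """Return decoding settings: deterministic defaults, overridden by the task config."""
    merged = dict(DEFAULT_DECODE)      # copy so the defaults are never mutated
    merged.update(task_decode or {})   # task-level keys win
    return merged


# The decode block from the example task YAML keeps decoding deterministic.
print(resolve_decode({"max_new_tokens": 256, "temperature": 0.0}))

# A task that explicitly opts into sampling overrides the defaults.
print(resolve_decode({"temperature": 0.7, "do_sample": True}))
```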

Roadmap

  • ASR (WER/CER) on LibriSpeech, AISHELL-1, FLEURS
  • Audio captioning on AudioCaps / Clotho
  • Spoken QA
  • Speech translation
  • Streaming / latency metrics
  • Long-form audio (>30s) with chunking
  • Adversarial robustness suite

Citation

@misc{speechllmbench,
  author = {Zihao Wei},
  title  = {SpeechLLM-Bench: a reproducible evaluation suite for speech LLMs},
  year   = {2025},
  url    = {https://github.com/edmicho/speech-llm-bench}
}

License

Apache-2.0.
