SpeechQualityLLM: LLM-based Interactive Assessment of Speech Quality

TL;DR. SpeechQualityLLM turns objective speech quality assessment into a question–answering task: given a degraded speech signal (and, optionally, a clean reference) and a natural-language question, a multimodal LLM predicts MOS and dimension-wise scores and explains its reasoning in text.

Overview

This repository contains code and pretrained checkpoints for SpeechQualityLLM, a multimodal system that:

- Takes degraded speech (and optionally a clean reference) as input.
- Uses an audio encoder (AST or Whisper) to extract time–frequency representations.
- Projects audio features into a token sequence and feeds them, along with a textual prompt, into a LLaMA-family language model.
- Answers speech-quality questions in natural language, including:
  - MOS-numeric (overall 1–5 MOS),
  - Dim-numeric (noisiness, coloration, discontinuity, loudness),
  - Dim-categorical (verbal ratings),
  - Multi-dim (joint MOS + four dimensions),
  - Explanatory (MOS with short rationale).

The model is trained on the NISQA dataset using automatically generated question–answer pairs, and is evaluated on both single-ended (no-reference) and full-reference settings.
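
To make the idea of automatically generated question–answer pairs concrete, here is a minimal sketch of how one labeled clip could be expanded into several of the question types listed above. The question templates, the `make_qa_pairs` and `to_verbal` helpers, the field names (`mos`, `noi`, `col`, `dis`, `loud`), and the rounding-based mapping to five verbal labels are all illustrative assumptions, not the repository's actual generation code. The explanatory type is left out because a rationale needs degradation information that this sketch does not model.

```python
# Hypothetical sketch: expand one labeled clip into question-answer pairs.
# Templates, field names, and the verbal binning are assumptions.

ACR_LABELS = {1: "bad", 2: "poor", 3: "fair", 4: "good", 5: "excellent"}

DIMENSIONS = {
    "noi": "noisiness",
    "col": "coloration",
    "dis": "discontinuity",
    "loud": "loudness",
}


def to_verbal(score: float) -> str:
    """Map a 1-5 score to a verbal label by rounding half up to an integer."""
    clipped = min(5.0, max(1.0, score))
    return ACR_LABELS[int(clipped + 0.5)]


def make_qa_pairs(row: dict) -> list[dict]:
    """Build QA pairs for a single clip, covering four question types."""
    pairs = [{
        "type": "mos_numeric",
        "question": "What is the overall MOS of this speech on a 1-5 scale?",
        "answer": f"{row['mos']:.2f}",
    }]
    for key, name in DIMENSIONS.items():
        pairs.append({
            "type": "dim_numeric",
            "question": f"Rate the {name} quality of this speech from 1 to 5.",
            "answer": f"{row[key]:.2f}",
        })
        pairs.append({
            "type": "dim_categorical",
            "question": f"Describe the {name} quality of this speech in one word.",
            "answer": to_verbal(row[key]),
        })
    joint = ", ".join(f"{name}: {row[key]:.2f}" for key, name in DIMENSIONS.items())
    pairs.append({
        "type": "multi_dim",
        "question": "Give the overall MOS and the four dimension scores.",
        "answer": f"MOS: {row['mos']:.2f}, {joint}",
    })
    return pairs


# Made-up scores for illustration only.
example = {"mos": 3.42, "noi": 2.10, "col": 3.80, "dis": 4.05, "loud": 3.50}
qa_pairs = make_qa_pairs(example)
```

Each clip yields one MOS question, a numeric and a categorical question per dimension, and one joint question, so a single labeled row becomes ten training examples in this sketch.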

Main Features

✅ End-to-end multimodal QA for speech quality (audio + text → text).
✅ Support for single-ended (no-reference) and double-ended (full-reference) quality assessment.
✅ Multiple audio backbones:
- AST (Audio Spectrogram Transformer)
- Whisper encoder
✅ LoRA-tuned LLaMA backbone (e.g., LLaMA 3.1 8B) with 4-bit base weights.
✅ Rich textual outputs: scores, rationales, and profile-conditioned prompts (“act like a very noise-sensitive listener”, etc.).
✅ Reproducible evaluation on NISQA:
- MOS / dimension MAE, RMSE
- Pearson / Spearman correlations
- Multi-dimension consistency
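
The error and correlation metrics listed above can be computed from paired ground-truth and predicted scores with the standard library alone. The sketch below is illustrative rather than the repository's evaluation code: the function names and sample numbers are made up for the example, and Spearman correlation is computed as Pearson correlation on average ranks.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for illustration only.
mos_true = [1.5, 2.0, 3.0, 4.0, 4.5]
mos_pred = [1.8, 2.4, 2.9, 3.7, 4.6]
print(f"MAE={mae(mos_true, mos_pred):.3f}  RMSE={rmse(mos_true, mos_pred):.3f}")
print(f"Pearson={pearson(mos_true, mos_pred):.3f}  Spearman={spearman(mos_true, mos_pred):.3f}")
```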

Repository Structure

Speech-Quality-LLM/
├─ README.md
├─ requirements.txt                  # Python dependencies
├─ Dataset/
│   └─ NISQA_Corpus                  # NISQA metadata, audio files, etc. (not included)
├─ AST                               # AST encoder directory
├─ llama-32-8B                       # Llama directory
├─ results/
│   ├─ Reference
│   │   └─ checkpoint-10240          # Checkpoint after 10,000 training steps, downloadable from Google Drive: https://drive.google.com/drive/folders/1vzcmHgOIpqVe6KzQBUfI5lHOd4slUREO?usp=sharing
│   ├─ NoReference
│   │   └─ checkpoint-10240
│   └─ ....
├─ src/
│   ├─ dataloader_noref.py
│   ├─ dataloader_ref.py
│   ├─ dataloader_ref_whisper.py
│   ├─ training_noref.py
│   ├─ training_ref.py
│   └─ training_ref_whisper.py
├─ evaluate_noref.py
├─ evaluate_ref_whisper.py
├─ evaluate_ref.py
├─ evaluate.py
└─ train.py

Installation

Install the Python dependencies:
```
pip install -r requirements.txt
```
- Download the AST encoder fine-tuned on AudioSet and place it in the "AST" directory (https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593).
- Download Llama 3.1 8B Instruct and place it in the "llama-32-8B" directory (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
- Download the trained checkpoints and place them in the "results" directory (https://drive.google.com/drive/folders/1vzcmHgOIpqVe6KzQBUfI5lHOd4slUREO?usp=sharing).
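
Before running evaluation, it can help to confirm that the downloads landed where the scripts expect them. The following sketch checks for the directories named in the repository structure; the `missing_paths` helper is a hypothetical convenience and is not part of the repository.

```python
from pathlib import Path

# Directories taken from the repository structure in this README.
EXPECTED_DIRS = [
    "AST",
    "llama-32-8B",
    "Dataset/NISQA_Corpus",
    "results/Reference/checkpoint-10240",
    "results/NoReference/checkpoint-10240",
]


def missing_paths(repo_root: str = ".") -> list[str]:
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing directories:")
        for d in missing:
            print(f"  {d}")
    else:
        print("All expected directories are present.")
```

Run it from the repository root; an empty result means every listed directory exists, although it does not verify the contents of the downloads.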

Evaluation

To run the full-reference model with the fine-tuned AST encoder, run evaluate.py with the degraded audio path, the reference audio path, a prompt, and, if needed, a checkpoint path. Example:

  python evaluate.py --deg_path Dataset/NISQA_Corpus/NISQA_TEST_FOR/deg/c00001_for_cnv_m_1035_02.wav --ref_path Dataset/NISQA_Corpus/NISQA_TEST_FOR/ref/for_cnv_m_1035_02.wav --prompt "Explain the main causes of quality degradation in the degraded speech compared to the reference, then provide an overall MOS score between 1 and 5."
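
Because the model replies in free text, downstream scripts usually need to pull the numeric score back out of the answer. A minimal sketch of such a parser follows. The `extract_mos` helper, its regular expression, and the sample answers are illustrative assumptions; the actual output format of evaluate.py may differ.

```python
import re
from typing import Optional

# The word "MOS", then up to 30 non-digit characters, then a number.
# The word boundaries keep "mos" inside words such as "almost" from matching.
_MOS_PATTERN = re.compile(r"\bMOS\b\D{0,30}?(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_mos(answer: str) -> Optional[float]:
    """Return the last number in [1, 5] that follows the word 'MOS', else None."""
    scores = [float(m.group(1)) for m in _MOS_PATTERN.finditer(answer)]
    valid = [s for s in scores if 1.0 <= s <= 5.0]
    return valid[-1] if valid else None


# Made-up answer text for illustration only.
sample = "The degraded speech has audible background noise and mild clipping. Overall MOS: 2.7"
print(extract_mos(sample))
```

Anchoring on the keyword rather than taking any number in the text avoids picking up unrelated figures, such as a sample rate or a count of dropouts, that an explanatory answer may mention.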

To evaluate the models on the NISQA test set, run the following scripts:

  python evaluate_ref.py --checkpoint_path results/Reference/checkpoint-10240  # full-reference with AST encoder
  python evaluate_noref.py --checkpoint_path results/NoReference/checkpoint-10240  # no-reference with AST encoder
  python evaluate_ref_whisper.py --checkpoint_path results/Reference_FrozenWhisper/checkpoint-10240  # full-reference with Whisper encoder

Training

To train the model on the NISQA train, validation, and test sets, run the following script:

  python train.py --training_type full_reference_ast --freeze_encoder False

Set --training_type to one of "full_reference_ast", "no_reference_ast", or "full_reference_whisper", and use --freeze_encoder to keep the audio encoder frozen or trainable (by default, only the query projection layers are fine-tuned). Adjust the dataset path and training parameters as needed.
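
To train all three variants in sequence, the command line can be assembled programmatically. The sketch below uses only the `--training_type` and `--freeze_encoder` flags from the example command. The `build_train_command` helper is hypothetical, and passing `True`/`False` as literal strings mirrors that example rather than a documented contract of train.py.

```python
import sys

TRAINING_TYPES = ("full_reference_ast", "no_reference_ast", "full_reference_whisper")


def build_train_command(training_type: str, freeze_encoder: bool = False) -> list[str]:
    """Return the argv list for one train.py run, validating the training type."""
    if training_type not in TRAINING_TYPES:
        raise ValueError(f"unknown training type: {training_type!r}")
    return [
        sys.executable, "train.py",
        "--training_type", training_type,
        "--freeze_encoder", str(freeze_encoder),
    ]


commands = [build_train_command(t) for t in TRAINING_TYPES]

# To launch the runs one after another (assumes train.py is in the working directory):
# import subprocess
# for cmd in commands:
#     subprocess.run(cmd, check=True)
```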

Citation

If you find this repository useful in your research, please consider citing the following work:

@misc{monjur2025speechqualityllmllmbasedmultimodalassessment,
      title={SpeechQualityLLM: LLM-Based Multimodal Assessment of Speech Quality}, 
      author={Mahathir Monjur and Shahriar Nirjon},
      year={2025},
      eprint={2512.08238},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2512.08238}, 
}
