━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MiMo-Audio-Eval Toolkit
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Welcome to the MiMo-Audio-Eval toolkit! MiMo-Audio-Eval is the evaluation framework used in the MiMo-Audio paper. It provides a flexible, extensible framework for evaluating both pre-trained and supervised fine-tuned (SFT) audio language models across a wide range of datasets, tasks, and models, and is aimed at researchers and developers who need to assess model performance in these settings.
The MiMo-Audio-Eval toolkit supports a comprehensive set of datasets, tasks, and models. Some of the key features include:
- Datasets:
  - AISHELL1
  - LibriSpeech
  - SeedTTS
  - Expresso
  - InstructTTSEval
  - SpeechMMLU
  - MMAR
  - MMAU
  - MMAU-Pro
  - MMSU
  - ESD
  - Big Bench Audio
  - MultiChallenge Audio
- Tasks:
  - Pretrain:
    - ICL General Knowledge Evaluation
    - ICL Audio Understanding Evaluation
    - ICL Speech-to-Speech Generation
  - SFT:
    - ASR
    - TTS / InstructTTS
    - Audio Understanding and Reasoning
    - Spoken Dialogue
- Models:
  - MiMo-Audio
  - Step-Audio2
  - Kimi-Audio
  - Baichuan-Audio
  - Qwen-Omni
To get started with the MiMo-Audio-Eval toolkit, follow the instructions below to set up the environment and install the required dependencies.

Requirements:
- Python 3.12
- CUDA >= 12.0
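The Python requirement can be checked programmatically before installing. The snippet below is a small sketch, not part of the toolkit; it only warns on a mismatched interpreter rather than failing:

```python
import sys

def python_version_ok(required=(3, 12)):
    """Return True when the running interpreter matches the required (major, minor)."""
    return sys.version_info[:2] == tuple(required)

if not python_version_ok():
    # The toolkit targets Python 3.12; other versions may still work but are untested.
    print(f"Warning: Python {sys.version_info[0]}.{sys.version_info[1]} detected; 3.12 expected.")
```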
Clone the repository and install the toolkit:

```
git clone --recurse-submodules https://github.com/XiaomiMiMo/MiMo-Audio-Eval
cd MiMo-Audio-Eval
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1
pip install -e .
```

Note: For evaluating Qwen2.5-Omni, please install the following dependencies:

```
pip install transformers==4.52.3 qwen-omni-utils[decord]
```

Note: If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:

```
pip install /path/to/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```

Download the evaluation data:

```
python download_data.py
```

Download the WavLM model and place it in the data/ directory.
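After download_data.py has run and the WavLM checkpoint is in place, a quick layout check can catch a missing or empty data/ directory early. This helper is a sketch under the assumption that everything lives under data/, and is not shipped with the toolkit:

```python
from pathlib import Path

def check_data_dir(root="data"):
    """Return a list of problems found with the expected data layout (empty list = OK)."""
    problems = []
    root_path = Path(root)
    if not root_path.is_dir():
        problems.append(f"missing directory: {root}")
    elif not any(root_path.iterdir()):
        # Both download_data.py and the manual WavLM download should populate this directory.
        problems.append(f"{root} exists but is empty")
    return problems
```

An empty return value means the directory exists and is populated; the exact files inside depend on which benchmarks you download.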
Export your OpenAI API Key:

```
export OPENAI_API_KEY="your_openai_api_key_here"
```

We provide a series of evaluation scripts in the eval_scripts directory, including scripts for evaluating both pre-trained models and SFT models. These scripts can be used to reproduce the results presented in our paper. An example usage is as follows:

```
bash $scripts <model_path> <tokenizer_path> <model_name>
```

If you find this work useful, please cite:

```
@misc{coreteam2025mimoaudio,
  title={MiMo-Audio: Audio Language Models are Few-Shot Learners},
  author={LLM-Core-Team Xiaomi},
  year={2025},
  url={https://github.com/XiaomiMiMo/MiMo-Audio},
}
```

Please contact us at mimo@xiaomi.com or open an issue if you have any questions.
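For scripting many evaluation runs, the positional convention above (`<model_path> <tokenizer_path> <model_name>`) can be wrapped in a small Python launcher. This is an illustrative sketch, not part of the toolkit; the script path is whichever entry in eval_scripts you choose:

```python
import subprocess

def run_eval(script, model_path, tokenizer_path, model_name, dry_run=False):
    """Build (and optionally run) an eval_scripts invocation using the argument order above."""
    cmd = ["bash", script, model_path, tokenizer_path, model_name]
    if dry_run:
        return cmd  # inspect the command without launching anything
    return subprocess.run(cmd, check=True)
```

With dry_run=True the helper just returns the command list, which is handy for logging the exact invocation before a long run.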