- 2025-10-21: Model updated from 0804 (the technical report version) to 1021, adding support for audio-embedded instructions (e.g., "Follow the instruction at the start of an audio."); previously, instructions were text-only.
- 2025-10-09: Uploaded several new quantized model variants for resource-constrained devices.
- 2025-09-24: Released the mdl-toolkit, a user-friendly fine-tuning toolkit for MiDashengLM. ESC-50 example notebook: English | Chinese
- 2025-09-04: vLLM now officially supports MiDashengLM. Deploy dasheng-lm with vLLM. A 4-bit quantized version is in development; please stay tuned.
- 2025-09-01: vLLM integration PR submitted to the official vLLM repository. Preview available in our fork during review. See Issue #17 for details.
State-of-the-Art Performance
- Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on multiple key audio understanding tasks.
High Efficiency
- 3.2× throughput speedup at comparable batch sizes compared to Qwen2.5-Omni-7B.
- 20× throughput speedup by further increasing the batch size. We tested up to batch size 512 for 30s audio input on 80GB GPUs; the baselines only support batch size 8.
- Time-to-first-token (TTFT) speedup of up to 4x compared to Qwen2.5-Omni-7B.
Caption-based Alignment
- Trained with general audio captions (instead of ASR transcripts) to achieve holistic audio understanding.
Full Transparency
- Training data from public sources and a reproducible pipeline.
- Apache License 2.0 for both research and commercial use.
The 0804 version corresponds to the technical report. The newer 1021 version offers improved performance.
Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to Qwen2.5-Omni models, we acknowledge Qwen2.5-Omni as a remarkable and respected foundational work in the field. Our model specifically uses Qwen2.5-Omni-7B Thinker as the initialization for decoder training, building upon its robust architecture and weight initialization.
The audio encoder is built upon Dasheng, an open-source audio encoder for general audio understanding with state-of-the-art performance. Dasheng serves as the core foundation enabling MiDashengLM's exceptional performance.
MiDashengLM integrates the powerful Dasheng audio encoder with the Qwen2.5-Omni-7B Thinker decoder through a unique caption-based alignment strategy. Unlike conventional ASR-driven approaches, our model leverages general audio captions to capture comprehensive audio representations encompassing speech, environmental sounds, and musical elements in a unified textual format. This design enables holistic audio understanding while maintaining exceptional computational efficiency.
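To make this composition concrete, here is a minimal, purely illustrative sketch of the encoder-projector-decoder wiring; the class name, dimensions, and the linear projector are assumptions for illustration, not the actual MiDashengLM implementation:

```python
import torch
import torch.nn as nn

class CaptionAlignedAudioLM(nn.Module):
    """Illustrative sketch only: a Dasheng-style audio encoder feeding a text decoder."""

    def __init__(self, audio_encoder: nn.Module, text_decoder: nn.Module,
                 audio_dim: int = 768, text_dim: int = 3584):
        super().__init__()
        self.audio_encoder = audio_encoder               # general-audio encoder (Dasheng-like)
        self.projector = nn.Linear(audio_dim, text_dim)  # assumed bridge into the decoder embedding space
        self.text_decoder = text_decoder                 # Qwen2.5-Omni-7B Thinker-style LLM

    def forward(self, audio: torch.Tensor, caption_labels: torch.Tensor):
        audio_frames = self.audio_encoder(audio)     # (batch, frames, audio_dim)
        audio_embeds = self.projector(audio_frames)  # (batch, frames, text_dim)
        # Caption-based alignment: the decoder is trained to generate a general audio
        # caption (speech, sound, and music in one description) from the audio prefix.
        return self.text_decoder(inputs_embeds=audio_embeds, labels=caption_labels)
```

The essential point is that the training target is a holistic caption of the whole clip rather than an ASR transcript.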
ASR Limitations:
- Discards a large amount of non-speech audio (music, environmental sounds).
- Misses paralinguistic information (speaker emotion, acoustic properties).
- Monotonic alignment provides only a trivial learning signal.
Caption Advantages:
- Utilizes all audio content.
- Captures global audio context.
- Non-monotonic alignment provides a hard learning signal.
ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source ACAV100M audio repository. While leveraging ACAV100M's extensive raw audio material, we completely re-engineered the annotation process to create a dataset for holistic audio understanding. We divide the dataset into six categories:
| Category | Example Caption |
|---|---|
| Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
| Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
| Pure Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
| Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
| Mixed Speech | "A Russian voice demonstrates a synthesizer’s capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
| Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |
The figure below illustrates our data curation pipeline for ACAVCaps:
Each caption is generated through a three-step process (sketched in code after the list):
- Multi-expert analysis (speech, vocal, music, acoustics)
- LLM reasoning synthesizing metadata with DeepSeek-R1
- Filtering for audio-text consistency with Dasheng-GLAP
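A rough sketch of these three steps is shown below; the expert interfaces, the threshold, and all function names are hypothetical, and only the step order and the roles of DeepSeek-R1 and Dasheng-GLAP come from the pipeline described above:

```python
from typing import Callable, Dict, Optional

def curate_caption(
    audio_clip,
    experts: Dict[str, Callable],                      # e.g. speech / vocal / music / acoustics analyzers
    reasoning_llm: Callable[[Dict], str],              # DeepSeek-R1 plays this role in the pipeline
    audio_text_score: Callable[[object, str], float],  # Dasheng-GLAP-style audio-text consistency score
    threshold: float = 0.5,                            # hypothetical cut-off, not the actual filter criterion
) -> Optional[str]:
    # Step 1: multi-expert analysis produces structured metadata about the clip.
    metadata = {name: expert(audio_clip) for name, expert in experts.items()}
    # Step 2: an LLM reasons over the metadata and writes a single holistic caption.
    caption = reasoning_llm(metadata)
    # Step 3: keep the caption only if it is judged consistent with the audio.
    return caption if audio_text_score(audio_clip, caption) >= threshold else None
```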
We will release the ACAVCaps dataset after the ICASSP 2026 review process.
We provide multiple precision / quantization formats to cover different deployment and fine-tuning scenarios:
| Variant | Format | Hugging Face (1021) | ModelScope (1021) |
|---|---|---|---|
| midashenglm-7b | FP32 | Link | Link |
| midashenglm-7b-bf16 | BF16 | Link | Link |
| midashenglm-7b-fp8 | FP8 | Link | Link |
| midashenglm-7b-w4a16-gptq | GPTQ W4A16 | Link | Link |
Usage Guidance:
- FP32: Use only when numerical precision is critical (e.g., for rigorous reproduction or benchmarking). For general use, it consumes more resources without a corresponding quality gain.
- BF16: Recommended for most general-purpose scenarios, including inference and fine-tuning. It delivers quality comparable to FP32 while being significantly faster on modern GPUs (e.g., A100, H100, RTX 4090).
- FP8: Optimized for Hopper-class (H100 and newer) GPUs, leveraging hardware support for enhanced performance and memory savings. While older GPUs may see limited performance gains, FP8 can still be used to conserve VRAM and storage.
- GPTQ W4A16: An ideal choice for resource-constrained environments. It offers broad GPU compatibility and a smaller memory footprint, making it suitable for deployment where VRAM, memory, or storage is limited, provided that a slight trade-off in quality is acceptable.
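As a hedged example of acting on this guidance, the snippet below loads the BF16 variant with an explicit dtype and automatic device placement (standard transformers arguments); the quantized variants may additionally require the corresponding runtime packages (e.g. a GPTQ kernel library for W4A16):

```python
import torch
from transformers import AutoModelForCausalLM

# Pick the repository that matches your hardware budget (see the table above).
model = AutoModelForCausalLM.from_pretrained(
    "mispeech/midashenglm-7b-1021-bf16",  # or the FP8 / GPTQ W4A16 repositories
    torch_dtype=torch.bfloat16,           # match the checkpoint precision
    device_map="auto",                    # place weights on available GPUs (requires accelerate)
    trust_remote_code=True,
)
```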
The full list of MiDashengLM model variants: Hugging Face / ModelScope
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-7b-1021-bf16"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```

If you are in a region with limited access to Hugging Face resources, you may want to use hf-mirror as a mirror of Hugging Face:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```

```python
user_prompt = "Caption the audio."  # You may try any other prompt

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
                # or "audio": np.random.randn(16000)
            },
        ],
    },
]
```

```python
import torch

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    )
    generation = model.generate(**model_inputs)

output = tokenizer.batch_decode(generation, skip_special_tokens=True)  # ["An engine is idling."]
```

We appreciate the ms-swift implementation contributed by @JimmyMa99 in ms-swift#5325.
We also provide MDL-Toolkit, a user-friendly fine-tuning toolkit for MiDashengLM.
vLLM provides a high-performance, user-friendly library for LLM inference and serving.
Install vLLM with pip or from source:
```bash
# Set up using Python-only build (without compilation)
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .

# Full build (with compilation)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

You can find sample code for offline execution in the vLLM repository (audio_language).
```bash
# Offline inference
python3 examples/offline_inference/audio_language.py -m midashenglm

# Online serving using OpenAI-compatible server
python3 -m vllm.entrypoints.openai.api_server --model mispeech/midashenglm-7b --tensor-parallel-size 1 --served-model-name default --port 8000 --dtype float16 --max_model_len 4096 --trust_remote_code
```
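Once the OpenAI-compatible server above is running, it can be queried over plain HTTP. The sketch below is hedged: the route and served model name follow the command above, while the audio_url content part relies on vLLM's multimodal chat extension and may differ across vLLM versions:

```python
import requests

payload = {
    "model": "default",  # matches --served-model-name in the serving command above
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption the audio."},
                # vLLM's OpenAI-compatible server accepts audio inputs as audio_url content parts.
                {"type": "audio_url", "audio_url": {"url": "https://example.com/example.wav"}},
            ],
        }
    ],
    "max_tokens": 128,
}

response = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```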
The technical report primarily evaluates the midashenglm-7b-0804-fp32 model, which represents our initial release from August 4, 2025. Note that our current best-performing model is midashenglm-7b-1021, though the technical report has not been updated to include its results. For detailed experimental tables and performance metrics, please refer to the Hugging Face model pages or the technical report.

To reproduce our results, we provide:
- Prompts (prompt.csv)
- Evaluation scripts
- Example JSONL files
```bash
pip install -r requirements.txt
```

Generate responses using the model's official framework with prompts from prompt.csv.
Format model outputs using the example JSONL files (a minimal writer sketch follows the table):
| Task | Example File |
|---|---|
| Automatic Speech Recognition | MiDashengLM_LibriSpeech_test-clean.jsonl |
| Single-target Audio Tagging | MiDashengLM_NSynth.jsonl |
| Gender Recognition | MiDashengLM_VoxCeleb-Gender.jsonl |
| Multi-target Audio Tagging | MiDashengLM_FSD50K.jsonl |
| Audio Captioning | MiDashengLM_AutoACD.jsonl |
| Open Audio Question Answering | MiDashengLM_MusicQA.jsonl |
| Audio QA with Options | MiDashengLM_MuChoMusic.jsonl |
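For reference, a minimal sketch of writing such a JSONL file; the field names (lang, text, model_output, etc.) are taken from the evaluation commands below, and the example values are placeholders:

```python
import json

# Placeholder predictions; in practice, one record per test utterance from your inference run.
records = [
    {"lang": "en", "text": "reference transcript", "model_output": "hypothesis transcript"},
]

with open("MiDashengLM_LibriSpeech_test-clean.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSON object per line
```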
Execute the corresponding evaluation scripts:
```bash
# Automatic Speech Recognition (WER)
# Uses: lang, text, model_output
python evaluate/wer/compute_wer.py -i evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl
# Single-target Audio Tagging (ACC)
# Uses: label, model_output
python evaluate/compute_at_acc.py -i evaluate/jsonl/MiDashengLM_NSynth.jsonl
# Gender Recognition (ACC)
# Uses: label, model_output
python evaluate/compute_gender_acc.py -i evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl
# Multi-target Audio Tagging (mAP)
# Uses: dataset_name, label, model_output, model_name
python evaluate/compute_map.py -i evaluate/jsonl/MiDashengLM_FSD50K.jsonl
# Audio Captioning (FENSE)
# Uses: audio, text, model_output
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_AutoACD.jsonl
# Open Audio QA (FENSE)
# Uses: audio, answer, model_output
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_MusicQA.jsonl
# Audio QA with Options (ACC)
# Uses: answer, model_output
python evaluate/compute_qa_acc.py -i evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl
```

Please refer to the official repositories for evaluation on the MECAT and MMAU benchmarks.
MiDashengLM-7B demonstrates superior inference efficiency compared to Qwen2.5-Omni-7B, achieving 3.2× speedup at comparable batch sizes and an overall potential speedup of 20.2× with larger batches.
| Batch Size | MiDashengLM-7B (samples/s) | Qwen2.5-Omni-7B (samples/s) | Speedup |
|---|---|---|---|
| 1 | 0.45 | 0.36 | 1.25x |
| 4 | 1.40 | 0.91 | 1.53x |
| 8 | 2.72 | 1.15 | 2.36x |
| 16 | 5.18 | OOM | - |
| 32 | 9.78 | OOM | - |
| 64 | 17.07 | OOM | - |
| 128 | 22.73 | OOM | - |
| 200 | 25.15 | OOM | - |
Tested on an 80 GB GPU with 30-second audio inputs and 100-token outputs.
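For context, throughput here is samples per second, i.e. batch size divided by the wall-clock time of a fixed-length batched generation. A minimal measurement sketch (not the official benchmark harness; assumes the HF-style generate API from the quickstart above):

```python
import time
import torch

def measure_throughput(model, model_inputs, batch_size: int, max_new_tokens: int = 100) -> float:
    """Samples per second for one batched generate() call (illustrative sketch only)."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    return batch_size / (time.perf_counter() - start)
```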
MiDashengLM is under the Apache License 2.0, and we encourage its use in both research and business applications.
If you find MiDashengLM useful in your research, please consider citing our work:
```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url         = {https://arxiv.org/abs/2508.03983},
  eprint      = {2508.03983},
}
```


