The TMU System for the XACLE Challenge:
Training Large Audio Language Models with CLAP Pseudo-Labels

Audio-Text Alignment Score Prediction

Official implementation of our ICASSP 2026 paper

Installation | Quick Start | Training | Models | Citation

Overview

We present a Large Audio Language Model (LALM) system that predicts the semantic alignment between audio and text pairs. The model is pretrained with CLAP pseudo-labels and then fine-tuned on XACLE, raising test SRCC from 0.334 (official baseline) to 0.625 for our system and 0.632 for the final ensemble.

Key Results

Configuration	Val SRCC	Test SRCC
Official Baseline	0.384	0.334
Our System	0.674	0.625
Ensemble (Final)	0.678	0.632
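
SRCC in the table is Spearman's rank correlation coefficient, here presumably computed between predicted scores and the reference scores of each split. As a minimal sketch of how the metric works (the helper names `average_ranks` and `srcc` are illustrative and not part of this repository, and the official evaluation may differ in details such as tie handling):

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted, reference):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(reference)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy scores only, not challenge data.
predicted = [2.1, 7.4, 5.0, 9.3]
reference = [1.0, 8.0, 6.0, 7.0]
print(f"SRCC: {srcc(predicted, reference):.3f}")
```

Because only the ordering of the scores matters, a model can reach a high SRCC even if its absolute score scale differs from the reference scale.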

Installation

# Clone the repository
git clone https://github.com/shiotalab-tmu/tmu-xacle2026.git
cd tmu-xacle2026

# Install dependencies
uv sync

Download BEATs Checkpoint

Download the BEATs_iter3+ (AS2M) checkpoint from Microsoft UniLM - BEATs and place the file at checkpoints/BEATs_iter3_plus_AS2M.pt.

Pre-trained Models

Model	Description	Val SRCC	Link
Stage 3	XACLE fine-tuned	0.674	🤗Atotti/xacle-tmu-2026

from tmu_xacle.model.xacle_model import XACLEModel

model = XACLEModel.from_pretrained(
    "Atotti/xacle-tmu-2026",
    beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",
    device="cuda",
)

Quick Start

Python API

from tmu_xacle.model.xacle_model import XACLEModel

# Load pre-trained model from Hugging Face
model = XACLEModel.from_pretrained(
    "Atotti/xacle-tmu-2026",
    beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",
    device="cuda",
)

# Predict alignment score
score = model.predict("audio.wav", "A dog barking in the park")
print(f"Alignment Score: {score:.2f}")
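
To compare several captions against the same clip, `predict` can be called in a loop. The sketch below is illustrative only: `score_pairs` is a hypothetical helper, not part of this repository, and it assumes `model.predict(audio_path, text)` returns a value convertible to `float`, as the snippet above suggests. The predictor is passed in as a callable so the example runs without a GPU or checkpoints.

```python
from typing import Callable, Iterable, List, Tuple


def score_pairs(
    predict: Callable[[str, str], float],
    pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, float]]:
    """Score each (audio_path, caption) pair and sort best-aligned first."""
    scored = [(audio, text, float(predict(audio, text))) for audio, text in pairs]
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Stand-in predictor for demonstration; in practice pass `model.predict`.
def dummy_predict(audio_path: str, text: str) -> float:
    return float(len(text) % 10)


candidates = [
    ("audio.wav", "A dog barking in the park"),
    ("audio.wav", "Rain falling on a tin roof"),
]
for audio, text, score in score_pairs(dummy_predict, candidates):
    print(f"{score:.2f}  {audio}  {text}")
```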

Command Line

# Generate predictions for test set
uv run python scripts/inference.py \
    --checkpoint checkpoints/stage3.pt \
    --beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M.pt \
    --csv data/xacle/test.csv \
    --audio-dir data/xacle/wav \
    --output submission.csv

# Evaluate on dev set (with metrics)
uv run python scripts/inference.py \
    --checkpoint checkpoints/stage3.pt \
    --beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M.pt \
    --csv data/xacle/dev.csv \
    --audio-dir data/xacle/wav \
    --mode dev

Training Pipeline

Our training consists of three stages:

Stage	Task	Data	Epochs	LR
1	AAC Pretraining	AudioCaps + VGGSound (273K)	3	1e-5
2	CLAP Pseudo-Label	+ Negative Sampling (~1M)	20	1e-5
3	XACLE Fine-tuning	XACLE Train (7.5K)	150	6.2e-6

Stage 2: CLAP Pseudo-Label Pretraining

uv run python scripts/train.py \
    --stage 2 \
    --train-csv data/clap_pretrain.csv \
    --audio-dir data/audio \
    --beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M.pt \
    --epochs 20 \
    --lr 1e-5 \
    --batch-size 16
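
The table above describes Stage 2 as CLAP pseudo-labels plus negative sampling. One plausible reading, sketched here under explicit assumptions, is that each audio clip is paired with its own caption and with captions drawn at random from other clips, and that the CLAP audio-text cosine similarity of each pair serves as the regression target. The function names, the number of negatives, and the toy embeddings are all illustrative; the actual pipeline and the layout of data/clap_pretrain.csv are not specified in this README.

```python
import math
import random
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def build_pseudo_labels(
    audio_embs: List[Sequence[float]],
    text_embs: List[Sequence[float]],
    negatives_per_clip: int = 2,
    seed: int = 0,
) -> List[Tuple[int, int, float]]:
    """Return (audio_idx, text_idx, similarity) rows.

    Assumes audio_embs[i] and text_embs[i] come from a matched pair and that
    both were produced by the same CLAP model (stand-ins are used below).
    """
    rng = random.Random(seed)
    n = len(audio_embs)
    rows = []
    for i in range(n):
        # Positive pair: the clip with its own caption.
        rows.append((i, i, cosine_similarity(audio_embs[i], text_embs[i])))
        # Negative pairs: the clip with captions sampled from other clips.
        others = [j for j in range(n) if j != i]
        for j in rng.sample(others, min(negatives_per_clip, len(others))):
            rows.append((i, j, cosine_similarity(audio_embs[i], text_embs[j])))
    return rows


# Toy 3-dimensional embeddings standing in for real CLAP outputs.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.8]]
for a_idx, t_idx, sim in build_pseudo_labels(audio, text, negatives_per_clip=1):
    print(f"audio {a_idx} / text {t_idx}: pseudo-label {sim:.3f}")
```

Mismatched pairs give the model low-similarity targets as well as high ones, so the pretraining labels cover the range of scores the model is later asked to predict.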

Stage 3: XACLE Fine-tuning

uv run python scripts/train.py \
    --stage 3 \
    --train-csv data/xacle/train.csv \
    --val-csv data/xacle/dev.csv \
    --audio-dir data/xacle/wav \
    --beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M.pt \
    --checkpoint checkpoints/stage2.pt \
    --epochs 150 \
    --lr 6.2e-6 \
    --freqm 15 --timem 30
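
The `--freqm` and `--timem` flags look like SpecAugment-style masking parameters, where each value is the maximum width of a randomly placed frequency or time mask. That interpretation is an assumption based on common usage of these flag names and is not confirmed by this README. A minimal pure-Python sketch of the technique, using a hypothetical `spec_augment` helper on a `[time][frequency]` spectrogram:

```python
import random
from typing import List


def spec_augment(
    spec: List[List[float]],
    freqm: int,
    timem: int,
    rng: random.Random,
    fill: float = 0.0,
) -> List[List[float]]:
    """Return a copy of `spec` ([time][freq]) with one frequency band and one
    time span overwritten by `fill`. Mask widths are drawn from [0, max]."""
    n_time, n_freq = len(spec), len(spec[0])
    out = [row[:] for row in spec]

    # Frequency mask: blank `f_width` consecutive bins across all frames.
    f_width = rng.randint(0, min(freqm, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for row in out:
        for f in range(f_start, f_start + f_width):
            row[f] = fill

    # Time mask: blank `t_width` consecutive frames across all bins.
    t_width = rng.randint(0, min(timem, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [fill] * n_freq

    return out


# Toy input: 100 frames x 128 mel bins of ones, so masked cells show up as 0.
spec = [[1.0] * 128 for _ in range(100)]
masked = spec_augment(spec, freqm=15, timem=30, rng=random.Random(0))
n_masked = sum(cell == 0.0 for row in masked for cell in row)
print(f"masked {n_masked} of {100 * 128} cells")
```

With only about 7.5K training pairs in Stage 3, this kind of augmentation is a common way to reduce overfitting during long fine-tuning runs.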

TODO

Release training code
Release inference code
Release pre-trained models on Hugging Face Hub
Paper camera-ready

Citation

@INPROCEEDINGS{11461019,
  author={Tsutsumi, Ayuto and Tanaka, Kohei and Shiota, Sayaka},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={The {TMU} system for the {XACLE} challenge: Training large audio language models with {CLAP} pseudo-labels},
  year={2026},
  volume={},
  number={},
  pages={21892-21894},
  keywords={Feeds;Feedback;Circuits;Protocols;HTTP;Large language models;Learning (artificial intelligence);Artificial intelligence;Weak supervision;Recurrent neural networks;text-to-audio generation;audio-caption alignment;audio language model;XACLE challenge},
  doi={10.1109/ICASSP55912.2026.11461019}}

Tokyo Metropolitan University - Shiota Laboratory
