The TMU System for the XACLE Challenge:
Training Large Audio Language Models with CLAP Pseudo-Labels
Audio-Text Alignment Score Prediction
Official implementation of our ICASSP 2026 paper
We present a Large Audio Language Model (LALM) system that predicts the semantic alignment between an audio clip and a text caption. The model is pretrained with CLAP pseudo-labels and raises the test Spearman rank correlation coefficient (SRCC) from 0.334 for the official baseline to 0.625 for our system and 0.632 for the final ensemble.
| Configuration | Val SRCC | Test SRCC |
|---|---|---|
| Official Baseline | 0.384 | 0.334 |
| Our System | 0.674 | 0.625 |
| Ensemble (Final) | 0.678 | 0.632 |
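
The SRCC columns above report Spearman's rank correlation between predicted and reference alignment scores. As a minimal sketch of how such a value can be computed (this is not the challenge's official scorer, and the scores below are made-up toy values), the following ranks both lists with averaged ties and takes the Pearson correlation of the ranks:

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def srcc(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rp, rr = average_ranks(predicted), average_ranks(reference)
    n = len(rp)
    mean_p, mean_r = sum(rp) / n, sum(rr) / n
    cov = sum((a - mean_p) * (b - mean_r) for a, b in zip(rp, rr))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_r = sum((b - mean_r) ** 2 for b in rr)
    if var_p == 0 or var_r == 0:
        raise ValueError("SRCC is undefined when one input is constant")
    return cov / (var_p * var_r) ** 0.5


# Toy example (hypothetical scores, not XACLE data).
toy_predicted = [0.2, 0.9, 0.5, 0.7]
toy_reference = [1.0, 4.0, 3.0, 2.0]
print(f"{srcc(toy_predicted, toy_reference):.3f}")  # prints 0.800
```

Because only the ordering matters, SRCC is unaffected by any monotonic rescaling of the predicted scores.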
# Clone the repository
git clone https://github.com/shiotalab-tmu/tmu-xacle2026.git
cd tmu-xacle2026
# Install dependencies
uv sync
Download the BEATs_iter3+ (AS2M) checkpoint from Microsoft UniLM - BEATs and place the file at checkpoints/BEATs_iter3_plus_AS2M.pt.
| Model | Description | Val SRCC | Link |
|---|---|---|---|
| Stage 3 | XACLE fine-tuned | 0.674 | 🤗Atotti/xacle-tmu-2026 |
from tmu_xacle.model.xacle_model import XACLEModel
model = XACLEModel.from_pretrained(
"Atotti/xacle-tmu-2026",
beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",
device="cuda",
)
from tmu_xacle.model.xacle_model import XACLEModel
# Load pre-trained model from Hugging Face
model = XACLEModel.from_pretrained(
"Atotti/xacle-tmu-2026",
beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",
device="cuda",
)
# Predict alignment score
score = model.predict("audio.wav", "A dog barking in the park")
print(f"Alignment Score: {score:.2f}")
# Generate predictions for test set
uv run python scripts/inference.py \
--checkpoint checkpoints/stage3.pt \
--beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M_finetuned.pt \
--csv data/xacle/test.csv \
--audio-dir data/xacle/wav \
--output submission.csv
# Evaluate on dev set (with metrics)
uv run python scripts/inference.py \
--checkpoint checkpoints/stage3.pt \
--beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M_finetuned.pt \
--csv data/xacle/dev.csv \
--audio-dir data/xacle/wav \
--mode dev
Our training consists of three stages:
| Stage | Task | Data | Epochs | LR |
|---|---|---|---|---|
| 1 | AAC Pretraining | AudioCaps + VGGSound (273K) | 3 | 1e-5 |
| 2 | CLAP Pseudo-Label | + Negative Sampling (~1M) | 20 | 1e-5 |
| 3 | XACLE Fine-tuning | XACLE Train (7.5K) | 150 | 6.2e-6 |
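
The table does not spell out how Stage 2 targets are built, so the following is only a rough sketch under stated assumptions: the pseudo-label is taken to be the cosine similarity between a CLAP audio embedding and a CLAP text embedding, and negatives are formed by pairing each clip with randomly chosen captions from other clips. The function names and toy vectors are hypothetical, and no real CLAP model is loaded.

```python
import math
import random
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def build_pseudo_labeled_pairs(
    audio_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    negatives_per_clip: int,
    seed: int = 0,
) -> List[Tuple[int, int, float]]:
    """Return (audio_index, text_index, pseudo_label) triples.

    Assumption: audio_embs[i] and text_embs[i] come from the same clip, so
    (i, i) is a matched pair and (i, j != i) is a sampled negative.
    """
    if len(audio_embs) != len(text_embs):
        raise ValueError("need one caption embedding per audio embedding")
    rng = random.Random(seed)
    n = len(audio_embs)
    pairs: List[Tuple[int, int, float]] = []
    for i in range(n):
        # Matched pair: label it with its own similarity score.
        pairs.append((i, i, cosine_similarity(audio_embs[i], text_embs[i])))
        # Negative sampling: captions that belong to other clips.
        others = [j for j in range(n) if j != i]
        for j in rng.sample(others, min(negatives_per_clip, len(others))):
            pairs.append((i, j, cosine_similarity(audio_embs[i], text_embs[j])))
    return pairs


# Toy 2-D "embeddings" standing in for real CLAP outputs (hypothetical values).
toy_audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for audio_idx, text_idx, label in build_pseudo_labeled_pairs(toy_audio, toy_text, 1):
    print(audio_idx, text_idx, round(label, 3))
```

Labelling the mismatched pairs with a score, instead of a hard zero, gives the model graded regression targets across the whole similarity range.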
uv run python scripts/train.py \
--stage 2 \
--train-csv data/clap_pretrain.csv \
--audio-dir data/audio \
--beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M_finetuned.pt \
--epochs 20 \
--lr 1e-5 \
--batch-size 16
uv run python scripts/train.py \
--stage 3 \
--train-csv data/xacle/train.csv \
--val-csv data/xacle/dev.csv \
--audio-dir data/xacle/wav \
--beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M_finetuned.pt \
--checkpoint checkpoints/stage2.pt \
--epochs 150 \
--lr 6.2e-6 \
--freqm 15 --timem 30
- Release training code
- Release inference code
- Release pre-trained models on Hugging Face Hub
- Paper camera-ready
@INPROCEEDINGS{11461019,
author={Tsutsumi, Ayuto and Tanaka, Kohei and Shiota, Sayaka},
booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={The tmu system for the XACLE challenge: Training large audio language models with CLAP pseudo-labels},
year={2026},
volume={},
number={},
pages={21892-21894},
keywords={Feeds;Feedback;Circuits;Protocols;HTTP;Large language models;Learning (artificial intelligence);Artificial intelligence;Weak supervision;Recurrent neural networks;text-to-audio generation;audio-caption alignment;audio language model;XACLE challenge},
doi={10.1109/ICASSP55912.2026.11461019}}

Tokyo Metropolitan University - Shiota Laboratory