The TMU System for the XACLE Challenge:
Training Large Audio Language Models with CLAP Pseudo-Labels

Audio-Text Alignment Score Prediction

3rd Place

Official implementation of our ICASSP 2026 paper

Installation | Quick Start | Training | Models | Citation


Overview

We present a Large Audio Language Model (LALM) system that predicts the semantic alignment between an audio clip and a text description. The system is pretrained with CLAP pseudo-labels and raises test SRCC from 0.334 for the official baseline to 0.625 (0.632 with the final ensemble).

Key Results

| Configuration | Val SRCC | Test SRCC |
|---|---|---|
| Official Baseline | 0.384 | 0.334 |
| Our System | 0.674 | 0.625 |
| Ensemble (Final) | 0.678 | 0.632 |
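
The SRCC columns above are read here as Spearman's rank correlation coefficient between predicted and human alignment scores. The following is a minimal, dependency-free sketch of that metric, not the challenge's official scorer. The function names `average_ranks` and `spearman_srcc` are illustrative and do not come from this repository.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_srcc(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(target)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("SRCC is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5
```

Because only the ranks matter, any monotonic rescaling of the predicted scores leaves the SRCC unchanged.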

Installation

# Clone the repository
git clone https://github.com/shiotalab-tmu/tmu-xacle2026.git
cd tmu-xacle2026

# Install dependencies
uv sync

Download BEATs Checkpoint

Download the BEATs_iter3+ (AS2M) checkpoint from Microsoft UniLM - BEATs and place the file at checkpoints/BEATs_iter3_plus_AS2M.pt.


Pre-trained Models

| Model | Description | Val SRCC | Link |
|---|---|---|---|
| Stage 3 | XACLE fine-tuned | 0.674 | 🤗Atotti/xacle-tmu-2026 |

from tmu_xacle.model.xacle_model import XACLEModel

model = XACLEModel.from_pretrained(
    "Atotti/xacle-tmu-2026",
    beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",
    device="cuda",
)

Quick Start

Python API

from tmu_xacle.model.xacle_model import XACLEModel

# Load pre-trained model from Hugging Face
model = XACLEModel.from_pretrained(
    "Atotti/xacle-tmu-2026",
    beats_checkpoint="checkpoints/BEATs_iter3_plus_AS2M.pt",
    device="cuda",
)

# Predict alignment score
score = model.predict("audio.wav", "A dog barking in the park")
print(f"Alignment Score: {score:.2f}")
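
To score many audio-text pairs with the Python API, the single `predict` call above can be wrapped in a small loop. The sketch below assumes only that `predict(audio_path, text)` returns a value convertible to `float`, as the example above suggests. The helper names `score_pairs` and `score_csv` and the column names `audio_path` and `text` are hypothetical, not part of this repository.

```python
import csv
from typing import Callable, Dict, Iterable, List


def score_pairs(
    predict: Callable[[str, str], float],
    rows: Iterable[Dict[str, str]],
    audio_key: str = "audio_path",  # hypothetical column name
    text_key: str = "text",         # hypothetical column name
) -> List[Dict[str, object]]:
    """Apply predict(audio_path, text) to each row and collect the scores."""
    results = []
    for row in rows:
        score = float(predict(row[audio_key], row[text_key]))
        results.append({audio_key: row[audio_key], text_key: row[text_key], "score": score})
    return results


def score_csv(predict: Callable[[str, str], float], csv_path: str, **keys: str) -> List[Dict[str, object]]:
    """Read (audio, text) pairs from a CSV file with a header row and score them."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return score_pairs(predict, csv.DictReader(handle), **keys)
```

With a loaded model, this could be called as `score_csv(model.predict, "pairs.csv")`, where `pairs.csv` is a placeholder file name.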

Command Line

# Generate predictions for test set
uv run python scripts/inference.py \
    --checkpoint checkpoints/stage3.pt \
    --beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M_finetuned.pt \
    --csv data/xacle/test.csv \
    --audio-dir data/xacle/wav \
    --output submission.csv

# Evaluate on dev set (with metrics)
uv run python scripts/inference.py \
    --checkpoint checkpoints/stage3.pt \
    --beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M_finetuned.pt \
    --csv data/xacle/dev.csv \
    --audio-dir data/xacle/wav \
    --mode dev

Training Pipeline

Our training consists of three stages:

| Stage | Task | Data | Epochs | LR |
|---|---|---|---|---|
| 1 | AAC Pretraining | AudioCaps + VGGSound (273K) | 3 | 1e-5 |
| 2 | CLAP Pseudo-Label | + Negative Sampling (~1M) | 20 | 1e-5 |
| 3 | XACLE Fine-tuning | XACLE Train (7.5K) | 150 | 6.2e-6 |

Stage 2: CLAP Pseudo-Label Pretraining

uv run python scripts/train.py \
    --stage 2 \
    --train-csv data/clap_pretrain.csv \
    --audio-dir data/audio \
    --beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M_finetuned.pt \
    --epochs 20 \
    --lr 1e-5 \
    --batch-size 16
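
This README does not describe how the Stage 2 training data is built. As a rough illustration of the idea of CLAP pseudo-labels with negative sampling, the sketch below pairs each audio clip with its own caption and with captions drawn from other clips, then labels every pair with a similarity score. It is an assumption about the general technique, not the authors' pipeline. `build_pseudo_labeled_rows` and the `clap_similarity` callable are hypothetical names, and the scorer must be supplied by the caller.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (audio_path, caption)


def build_pseudo_labeled_rows(
    pairs: Sequence[Pair],
    clap_similarity: Callable[[str, str], float],  # hypothetical scorer, e.g. a CLAP wrapper
    negatives_per_pair: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, float]]:
    """Return (audio_path, caption, pseudo_label) rows for matched and mismatched pairs."""
    rng = random.Random(seed)
    rows = []
    for index, (audio, caption) in enumerate(pairs):
        # Matched pair: the caption that belongs to this clip.
        rows.append((audio, caption, clap_similarity(audio, caption)))
        # Mismatched pairs: captions drawn from other clips.
        # Rebuilding this list per clip is quadratic; fine for a sketch only.
        others = [i for i in range(len(pairs)) if i != index]
        for other in rng.sample(others, min(negatives_per_pair, len(others))):
            negative_caption = pairs[other][1]
            rows.append((audio, negative_caption, clap_similarity(audio, negative_caption)))
    return rows
```

Scoring the mismatched pairs with the same scorer, rather than assigning them a fixed low label, keeps the targets continuous across matched and mismatched examples.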

Stage 3: XACLE Fine-tuning

uv run python scripts/train.py \
    --stage 3 \
    --train-csv data/xacle/train.csv \
    --val-csv data/xacle/dev.csv \
    --audio-dir data/xacle/wav \
    --beats-checkpoint checkpoints/BEATs_iter3_plus_AS2M_finetuned.pt \
    --checkpoint checkpoints/stage2.pt \
    --epochs 150 \
    --lr 6.2e-6 \
    --freqm 15 --timem 30
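
The `--freqm` and `--timem` flags are not documented in this README. A common convention is that they set the maximum widths of a frequency mask and a time mask applied to the input spectrogram during training (SpecAugment-style masking). The sketch below illustrates that convention in plain Python and is an assumption about the flags' meaning, not the repository's implementation. `mask_spectrogram` is a hypothetical name.

```python
import random
from typing import List, Optional

Spectrogram = List[List[float]]  # indexed as [time_frame][frequency_bin]


def mask_spectrogram(
    spec: Spectrogram,
    freqm: int,
    timem: int,
    rng: Optional[random.Random] = None,
    fill: float = 0.0,
) -> Spectrogram:
    """Return a copy with one random frequency band and one random time span masked."""
    rng = rng or random.Random()
    n_frames = len(spec)
    n_bins = len(spec[0]) if spec else 0
    masked = [row[:] for row in spec]

    # Frequency mask: width drawn from [0, freqm], clipped to the number of bins.
    f_width = rng.randint(0, min(freqm, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    # Time mask: width drawn from [0, timem], clipped to the number of frames.
    t_width = rng.randint(0, min(timem, n_frames))
    t_start = rng.randint(0, n_frames - t_width)

    for t in range(n_frames):
        for f in range(n_bins):
            if f_start <= f < f_start + f_width or t_start <= t < t_start + t_width:
                masked[t][f] = fill
    return masked
```

Under this reading, `--freqm 15 --timem 30` would hide at most 15 frequency bins and 30 time frames per example.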

TODO

  • Release training code
  • Release inference code
  • Release pre-trained models on Hugging Face Hub
  • Paper camera-ready

Citation

@INPROCEEDINGS{11461019,
  author={Tsutsumi, Ayuto and Tanaka, Kohei and Shiota, Sayaka},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={The {TMU} System for the {XACLE} Challenge: Training Large Audio Language Models with {CLAP} Pseudo-Labels},
  year={2026},
  volume={},
  number={},
  pages={21892-21894},
  keywords={Feeds;Feedback;Circuits;Protocols;HTTP;Large language models;Learning (artificial intelligence);Artificial intelligence;Weak supervision;Recurrent neural networks;text-to-audio generation;audio-caption alignment;audio language model;XACLE challenge},
  doi={10.1109/ICASSP55912.2026.11461019}}

Tokyo Metropolitan University - Shiota Laboratory
