Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning
Institute of Automation, Chinese Academy of Sciences, Beijing
*Equal contribution
†Corresponding author
This is the PyTorch implementation of our paper. If you find this work useful for your research, please kindly cite our paper and star this repo.
We propose a semantic-alignment cross-modal residual learning framework (S-CMRL) for multimodal SNNs. The framework provides an efficient cross-modal feature fusion strategy and achieves state-of-the-art performance on three public datasets, with superior accuracy and robustness compared to existing methods.
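For intuition, the snippet below is a minimal sketch of cross-modal residual fusion via cross-attention. The class name, dimensions, and layer choices are illustrative assumptions, not the exact modules used in this repository.

```python
# Hedged sketch: a generic cross-modal residual fusion block in PyTorch.
# Names and hyperparameters are illustrative; see this repo for the real modules.
import torch
import torch.nn as nn

class CrossModalResidualFusion(nn.Module):
    """Fuse auxiliary-modality features into a primary modality via
    cross-attention, then add them back as a residual."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, primary: torch.Tensor, auxiliary: torch.Tensor) -> torch.Tensor:
        # primary:   (B, N_p, dim), e.g. audio tokens
        # auxiliary: (B, N_a, dim), e.g. visual tokens
        attended, _ = self.cross_attn(query=primary, key=auxiliary, value=auxiliary)
        # Residual connection keeps the primary pathway intact.
        return self.norm(primary + attended)

if __name__ == "__main__":
    fusion = CrossModalResidualFusion(dim=256)
    audio = torch.randn(2, 49, 256)    # dummy audio tokens
    visual = torch.randn(2, 196, 256)  # dummy visual tokens
    print(fusion(audio, visual).shape)  # torch.Size([2, 49, 256])
```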
Comparison of S-CMRL with state-of-the-art methods on three datasets:
All experimental scripts can be found in run_classification.sh.
A sample script for our method on the CREMA-D dataset is as follows:
CUDA_VISIBLE_DEVICES=0 python train_snn.py --model AVspikformer --dataset CREMAD --epoch 100 --batch-size 128 --num-classes 6 --step 4 --modality audio-visual --cross-attn --attn-method SpatialTemporal --alpha 1.5 --contrastive --temperature 0.07
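For reference, the snippet below is a minimal sketch of a temperature-scaled (InfoNCE-style) audio-visual alignment loss in the spirit of the --contrastive and --temperature flags above. The function name and exact formulation are illustrative assumptions, not the loss implemented in train_snn.py.

```python
# Hedged sketch: a symmetric, temperature-scaled contrastive loss that pulls
# matching audio/visual embeddings together and pushes mismatched pairs apart.
import torch
import torch.nn.functional as F

def semantic_alignment_loss(audio_emb: torch.Tensor,
                            visual_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    # audio_emb, visual_emb: (B, D) pooled per-sample embeddings
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # i-th audio matches i-th video
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```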
The trained model weights and training logs are available here to reproduce the results reported in the paper:
CREMA-D dataset: CREMA-D
UrbanSound8K-AV dataset: UrbanSound8K-AV
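As a rough guide to using the released weights, the sketch below shows a generic way to inspect and load a checkpoint. The checkpoint file name and the commented-out model construction are assumptions; please follow the training and evaluation scripts in this repository for the exact procedure.

```python
# Hedged sketch: loading released weights for evaluation.
import torch

checkpoint = torch.load("s_cmrl_cremad_best.pth", map_location="cpu")  # hypothetical file name
state_dict = checkpoint.get("state_dict", checkpoint)  # handle raw or wrapped checkpoints
print(sorted(state_dict.keys())[:5])                    # inspect parameter names

# model = ...                        # construct AVspikformer as in train_snn.py
# model.load_state_dict(state_dict)
# model.eval()
```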
If our paper is useful for your research, please consider citing it:
@misc{he2025enhancingaudiovisualspikingneural,
title={Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning},
author={Xiang He and Dongcheng Zhao and Yiting Dong and Guobin Shen and Xin Yang and Yi Zeng},
year={2025},
eprint={2502.12488},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.12488},
}
The UrbanSound8K-AV dataset used here comes from SMMT; thanks for their excellent work! The SNN implementation is based on Brain-Cog.
If you have any questions about using the code, or other feedback and comments, please feel free to contact us at [email protected]. Have a good day!