Weichao Zhao, Hezhen Hu, Wengang Zhou, Yunyao Mao, Min Wang, and Houqiang Li
This repository contains the Python (PyTorch) implementation of this paper.
Accepted by TCSVT 2024
python==3.8.13
torch==1.8.1+cu111
torchvision==0.9.1+cu111
tensorboard==2.9.0
scikit-learn==1.1.1
tqdm==4.64.0
numpy==1.22.4
Please refer to the bash scripts.
- Download the original datasets: SLR500, NMFs_CSL, WLASL, and MSASL.
- Use the off-the-shelf pose estimator MMPose, with the Topdown Heatmap + HRNet + Dark configuration trained on COCO-WholeBody, to extract 2D keypoints for the sign language videos.
- The final data is organized as follows:

Data
├── NMFs_CSL
├── SLR500
├── WLASL
└── MSASL
    ├── Video
    ├── Pose
    └── Annotations
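COCO-WholeBody pose estimators output 133 keypoints per frame (17 body, 6 feet, 68 face, and 21 per hand). Sign language models typically keep only a subset, such as the body and hand points. The helper below is an illustrative sketch, not part of the released code; only the index ranges follow the standard COCO-WholeBody layout.

```python
# Keypoint index ranges in the 133-point COCO-WholeBody layout:
# body 0-16, feet 17-22, face 23-90, left hand 91-111, right hand 112-132.
COCO_WHOLEBODY_GROUPS = {
    "body": range(0, 17),
    "feet": range(17, 23),
    "face": range(23, 91),
    "left_hand": range(91, 112),
    "right_hand": range(112, 133),
}

def select_keypoints(frame_keypoints, groups=("body", "left_hand", "right_hand")):
    """Keep only the (x, y, score) triples belonging to the requested groups.

    frame_keypoints: sequence of 133 (x, y, score) tuples for one frame,
    in COCO-WholeBody order. Returns the selected triples, concatenated
    in the order the groups are listed.
    """
    indices = [i for g in groups for i in COCO_WHOLEBODY_GROUPS[g]]
    return [frame_keypoints[i] for i in indices]
```

For example, the default selection of body plus both hands keeps 17 + 21 + 21 = 59 of the 133 points; which subset the paper actually uses should be checked against the released preprocessing scripts.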
You can download the pretrained model from this link: pretrained model on four ISLR datasets
If you find this work useful for your research, please consider citing our work:
@article{zhao2024masa,
title={MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition},
author={Zhao, Weichao and Hu, Hezhen and Zhou, Wengang and Mao, Yunyao and Wang, Min and Li, Houqiang},
journal={arXiv},
year={2024}
}
