
Transformer with Controlled Attention for Synchronous Motion Captioning


rd20karim/Synch-Transformer


Description

Official implementation of Synch-Transformer for synchronous motion captioning.

This work presents a Transformer-based design for the task of motion-to-text synchronization, previously addressed in the m2t-segmentation project.

Synchronous captioning aims to generate text aligned with the temporal evolution of 3D human motion. Implicitly, this mapping provides fine-grained action recognition and unsupervised event localization, with temporal phrase grounding obtained through unsupervised motion-language segmentation.

Bibtex

If you find this work useful in your research, please cite:

@article{radouane2024ControlledTransformer,
  title={Transformer with Controlled Attention for Synchronous Motion Captioning},
  author={Karim Radouane and Sylvie Ranwez and Julien Lagarde and Andon Tchechmedjiev},
  journal={arXiv},
  year={2024}
}

Quick start

conda env create -f environment.yaml
conda activate wbpy310
python -m spacy download en-core-web-sm

You also need to install wandb for hyperparameter tuning: pip install wandb

Preprocess datasets

For both HumanML3D and KIT-MLD (augmented versions), you can follow the steps described here: project link

Training and evaluation

Training: The train.py script trains the model given a config file and the dataset path. You can also use wandb for hyperparameter tuning.
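
The exact argument names depend on the script interface; a hypothetical invocation (the flag names and placeholder paths below are illustrative assumptions, not the confirmed interface) could look like:

python train.py --config <path/to/config.yaml> --dataset_dir <path/to/dataset>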

Evaluation: The evaluate_transformer.py script evaluates the trained model on the test set given a config file and a model checkpoint.
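
Similarly, an illustrative invocation (flag names are assumptions; check the script for the actual arguments) could be:

python evaluate_transformer.py --config <path/to/config.yaml> --checkpoint <path/to/model.ckpt>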

Note: More details and model checkpoints will be available soon.

Demonstration

In the following animations, we present synchronized output results for several motions, mainly compositional samples containing two or more actions:

Architecture

Our method introduces mechanisms to control the self- and cross-attention distributions of the Transformer, enabling interpretability and time-aligned text generation. We achieve this through masking strategies and structuring losses that push the model to concentrate attention on the most important frames contributing to the generation of each motion word. These constraints prevent undesired mixing of information in the attention maps and enforce a monotonic attention distribution across tokens. The token cross-attentions are then used for progressive text generation in synchronization with the human motion sequence.
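
As a rough illustration (not the paper's exact formulation), the following PyTorch-style sketch shows one way such structuring losses could be expressed: an entropy term that sharpens each word's attention over frames, and a penalty when the expected attended frame moves backwards across consecutive tokens. All tensor shapes, function names, and weights are assumptions for illustration.

import torch
import torch.nn.functional as F

def concentration_loss(attn, eps=1e-8):
    # attn: (batch, num_words, num_frames) cross-attention weights, rows sum to 1.
    # Low entropy means attention is concentrated on a few important frames.
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)
    return entropy.mean()

def monotonicity_loss(attn):
    # Expected attended frame index ("center") for each generated word.
    frames = torch.arange(attn.size(-1), dtype=attn.dtype, device=attn.device)
    centers = (attn * frames).sum(dim=-1)
    # Penalize centers that move backwards in time from one token to the next.
    return F.relu(centers[:, :-1] - centers[:, 1:]).mean()

# Toy usage with random attention maps (2 samples, 5 words, 60 frames).
attn = torch.softmax(torch.randn(2, 5, 60), dim=-1)
loss = concentration_loss(attn) + 0.5 * monotonicity_loss(attn)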

Motion Frozen in Time

  • Phrase-level
  • Word-level