
Transformer with Controlled Attention for Synchronous Motion Captioning


rd20karim/Synch-Transformer


Description

Official implementation of Synch-Transformer for synchronous motion captioning.

This work presents a Transformer-based design for the task of motion-to-text synchronization, previously addressed in the m2t-segmentation project.

Synchronous captioning aims to generate text aligned with the temporal evolution of 3D human motion. Implicitly, this mapping provides fine-grained action recognition and unsupervised event localization, with temporal phrase grounding obtained through unsupervised motion-language segmentation.

Bibtex

If you find this work useful in your research, please cite:

@article{radouane2024ControlledTransformer,
  title={Transformer with Controlled Attention for Synchronous Motion Captioning},
  author={Karim Radouane and Sylvie Ranwez and Julien Lagarde and Andon Tchechmedjiev},
  journal={arXiv},
  year={2024}
}

Quick start

conda env create -f environment.yaml
conda activate wbpy310
python -m spacy download en-core-web-sm

You also need to install wandb for hyperparameter tuning: pip install wandb

Preprocess datasets

For both HumanML3D and KIT-MLD (augmented versions), you can follow the steps described here: project link

Training and evaluation

Training: The train.py script trains the model given a config file and the dataset path. You can also use wandb for hyperparameter tuning.
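
The exact argument names depend on the script interface; a hypothetical invocation (the flag names and placeholder paths below are illustrative assumptions, not the confirmed interface) could look like:

python train.py --config <path/to/config.yaml> --dataset_dir <path/to/dataset>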

Evaluation: The evaluate_transformer.py script evaluates the trained model on the test set given a config file and a model checkpoint.
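
Similarly, an illustrative invocation (flag names are assumptions; check the script for the actual arguments) could be:

python evaluate_transformer.py --config <path/to/config.yaml> --checkpoint <path/to/model.ckpt>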

Note: More details and model checkpoints will be available soon.

Demonstration

In the following animations, we present synchronized output results for several motions, mainly compositional samples containing two or more actions:

Architecture

Our method introduces mechanisms to control the self- and cross-attention distributions of the Transformer, enabling interpretability and time-aligned text generation. We achieve this through masking strategies and structuring losses that push the model to concentrate attention on the most important frames contributing to the generation of each motion word. These constraints prevent undesired mixing of information in the attention maps and enforce a monotonic attention distribution across tokens. The token cross-attentions are then used for progressive text generation in synchronization with the human motion sequence.
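
As a rough illustration (not the paper's exact formulation), the following PyTorch-style sketch shows one way such structuring losses could be expressed: an entropy term that sharpens each word's attention over frames, and a penalty when the expected attended frame moves backwards across consecutive tokens. All tensor shapes, function names, and weights are assumptions for illustration.

import torch
import torch.nn.functional as F

def concentration_loss(attn, eps=1e-8):
    # attn: (batch, num_words, num_frames) cross-attention weights, rows sum to 1.
    # Low entropy means attention is concentrated on a few important frames.
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)
    return entropy.mean()

def monotonicity_loss(attn):
    # Expected attended frame index ("center") for each generated word.
    frames = torch.arange(attn.size(-1), dtype=attn.dtype, device=attn.device)
    centers = (attn * frames).sum(dim=-1)
    # Penalize centers that move backwards in time from one token to the next.
    return F.relu(centers[:, :-1] - centers[:, 1:]).mean()

# Toy usage with random attention maps (2 samples, 5 words, 60 frames).
attn = torch.softmax(torch.randn(2, 5, 60), dim=-1)
loss = concentration_loss(attn) + 0.5 * monotonicity_loss(attn)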

Motion Frozen in Time

  • Phrase-level
  • Word-level