COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

1 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
2 ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S)


🌟 Overview

COMODO is an open-source framework for cross-modal video-to-IMU distillation for efficient egocentric human activity recognition (HAR).

🔑 The key features of COMODO:

  • Self-supervised Cross-modal Knowledge Transfer: We propose COMODO, a cross-modal self-supervised distillation framework that leverages pretrained video and time-series models, enabling label-free knowledge transfer from a stronger modality with richer training data (video) to a weaker modality with limited data (IMU).
  • A Self-supervised and Effective Cross-modal Queuing Mechanism: We introduce a cross-modal FIFO queue that maintains video embeddings as a stable and diverse reference distribution for IMU feature distillation, extending the instance-queue distribution-learning approach from single-modality to cross-modality settings (see the sketch after this list).
  • Teacher-Student Model Agnostic: COMODO supports diverse video and time-series pretrained models, enabling flexible teacher-student configurations and future integration with stronger foundation models.
  • Cross-dataset Generalization: We demonstrate that COMODO maintains superior performance even when evaluated on unseen datasets, even outperforming fully supervised models, highlighting its robustness and generalizability for egocentric HAR tasks.
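
To make the queuing mechanism concrete, here is a minimal sketch of the idea, not the repository's actual implementation: the IMU student's similarity distribution over a FIFO queue of video embeddings is trained to match the video teacher's. The queue size, embedding dimension, and temperatures below are illustrative assumptions, not the defaults used by train.py.

import torch
import torch.nn.functional as F

queue_size, dim, batch = 4096, 256, 32
queue = F.normalize(torch.randn(queue_size, dim), dim=1)  # FIFO queue of video embeddings
queue_ptr = 0

def distill_loss(imu_emb, video_emb, student_temp=0.1, teacher_temp=0.05):
    # Both modalities score against the same queued video embeddings; the
    # student (IMU) distribution is trained to match the teacher's (video).
    imu_emb = F.normalize(imu_emb, dim=1)
    video_emb = F.normalize(video_emb, dim=1)
    student_logits = imu_emb @ queue.t() / student_temp
    teacher_probs = (video_emb @ queue.t() / teacher_temp).softmax(dim=1).detach()
    return -(teacher_probs * student_logits.log_softmax(dim=1)).sum(dim=1).mean()

@torch.no_grad()
def enqueue(video_emb):
    # FIFO update: the newest video embeddings overwrite the oldest slots.
    global queue_ptr
    b = video_emb.size(0)
    queue[queue_ptr:queue_ptr + b] = F.normalize(video_emb, dim=1)
    queue_ptr = (queue_ptr + b) % queue_size

imu_emb, video_emb = torch.randn(batch, dim), torch.randn(batch, dim)
loss = distill_loss(imu_emb, video_emb)  # differentiable w.r.t. the IMU side only; teacher is detached
enqueue(video_emb)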

📂 Data & Results

All experimental results and ablation study findings can be found in the /results folder.

The /dataset folder contains the train, val, and test splits for each dataset, along with our preprocessing scripts. Specifically, ego4d_subset_ids.txt lists the subset of IMU-containing video IDs obtained by applying the official Ego4D filter on their website; this is the complete subset of data we could access.

🚀 Getting started

Cross-modal Self-supervised Distillation

To run self-supervised video-to-IMU distillation, use the following command:

Note: [ ] denotes optional parameters.

Currently supported pretrained models:

  • Time-series models: MOMENT, Mantis
  • Video models: VideoMAE, TimeSformer

Other pretrained models can be used with minor modifications to the code.

python train.py \
    --video_ckpt "facebook/timesformer-base-finetuned-k400" \
    --imu_ckpt "paris-noah/Mantis-8M" \
    --dataset_path "DATASET_PATH" \
    --encoded_video_path "ENCODED_VIDEO_PATH" \
    --anchor_video_path "ANCHOR_VIDEO_PATH" \
    [--queue_size QUEUE_SIZE] \
    [--student_temp STUDENT_TEMP] \
    [--teacher_temp TEACHER_TEMP] \
    [--learning_rate LR] \
    [--num_epochs EPOCH] \
    [--batch_size BS] \
    [--num_clips 0] \
    [--seed SEED] \
    [--mlp_hidden_dim MLP_HIDDEN_DIM] \
    [--mlp_output_dim MLP_OUTPUT_DIM] \
    [--reduction "concat"] \
    [--is_raw true]
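
Because COMODO is teacher-student model agnostic, switching teachers is mostly a matter of pointing --video_ckpt at a different Hugging Face checkpoint. As a hedged illustration (the repository's actual loading code may differ), both supported video teachers load through the same transformers interface:

from transformers import AutoModel

# Both supported video teachers expose the same loading interface,
# which is what makes the teacher side of COMODO swappable.
timesformer = AutoModel.from_pretrained("facebook/timesformer-base-finetuned-k400")
videomae = AutoModel.from_pretrained("MCG-NJU/videomae-base")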

Unsupervised Representation Learning Evaluation

We evaluate the learned IMU representations in an unsupervised manner (see Section 3.2 of our paper): we train a Support Vector Machine (SVM) on the extracted IMU features and report classification accuracy on the test set. Run the following command to start the evaluation:

python unsupervised_rep_test.py \
    --imu_ckpt "AutonLab/MOMENT-1-small" \
    --model_path "MODEL_WEIGHT_PATH" \
    --dataset_path "DATASET_PATH"
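
Conceptually, the evaluation is a standard frozen-feature probe. A minimal sketch with scikit-learn (the random arrays stand in for features from the frozen IMU encoder, and the SVM settings are scikit-learn defaults rather than the script's exact configuration):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(800, 256))    # placeholders for frozen IMU-encoder features
train_labels = rng.integers(0, 10, size=800)
test_feats = rng.normal(size=(200, 256))
test_labels = rng.integers(0, 10, size=200)

clf = SVC()  # defaults shown; unsupervised_rep_test.py may configure this differently
clf.fit(train_feats, train_labels)
print("test accuracy:", accuracy_score(test_labels, clf.predict(test_feats)))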

🌍 Related Works & Baselines

There's a lot of outstanding work on time-series modeling and human activity recognition! Here's an incomplete list. Check out Table 1 in our paper for IMU-based Human Activity Recognition comparisons with these studies:

  • MOMENT: A Family of Open Time-series Foundation Models [Paper, Code, Hugging Face]
  • Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification [Paper, Code, Hugging Face]
  • TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis [Paper, Code]
  • DLinear: Are Transformers Effective for Time Series Forecasting? [Paper, Code]
  • Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [Paper, Code]
  • IMU2CLIP: Language-grounded Motion Sensor Translation with Multimodal Contrastive Learning [Paper, Code]

📩 Contact

If you have any questions or suggestions, feel free to contact Baiyu (Breeze) at breeze.chen(at)student(dot)unsw(dot)edu(dot)au.
