Baiyu Chen1,2, Wilson Wongso1,2, Zechen Li1, Yonchanok Khaokaew1,2, Hao Xue1,2, and Flora Salim1,2
1 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
2 ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S)
COMODO is an open-source framework for Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition.
- Self-supervised Cross-modal Knowledge Transfer: We propose COMODO, a cross-modal self-supervised distillation framework that leverages pretrained video and time-series models, enabling label-free knowledge transfer from a stronger modality (video) with richer training data to a weaker modality (IMU) with limited data.
- A Self-supervised and Effective Cross-modal Queuing Mechanism: We introduce a cross-modal FIFO queue that maintains video embeddings as a stable and diverse reference distribution for IMU feature distillation, extending the instance queue distribution learning approach from single-modality to cross-modality (see the sketch after this list).
- Teacher-Student Model Agnostic: COMODO supports diverse video and time-series pretrained models, enabling flexible teacher-student configurations and future integration with stronger foundation models.
- Cross-dataset Generalization: We demonstrate that COMODO maintains superior performance even when evaluated on unseen datasets, outperforming even fully supervised models, highlighting its robustness and generalizability for egocentric HAR tasks.
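For intuition, the queue-based distillation objective can be sketched as follows. This is a minimal illustration only, not the exact implementation: the class and function names, queue size, and temperature defaults are our own simplifications.

```python
import torch
import torch.nn.functional as F

class CrossModalQueue:
    """FIFO queue of teacher (video) embeddings, used as a reference distribution."""

    def __init__(self, dim: int, size: int = 4096):
        self.embeds = F.normalize(torch.randn(size, dim), dim=1)  # random init
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, video_emb: torch.Tensor) -> None:
        """Insert the newest batch of video embeddings, overwriting the oldest."""
        n = video_emb.size(0)
        idx = (self.ptr + torch.arange(n)) % self.embeds.size(0)
        self.embeds[idx] = F.normalize(video_emb, dim=1)
        self.ptr = (self.ptr + n) % self.embeds.size(0)

def distillation_loss(imu_emb, video_emb, queue, t_student=0.1, t_teacher=0.05):
    """Cross-entropy between teacher and student similarity distributions over the queue."""
    imu_emb = F.normalize(imu_emb, dim=1)      # student (IMU) embeddings
    video_emb = F.normalize(video_emb, dim=1)  # teacher (video) embeddings
    ref = queue.embeds.detach()                # (K, d) reference embeddings
    s_logits = imu_emb @ ref.T / t_student     # student similarities to the queue
    t_logits = video_emb @ ref.T / t_teacher   # teacher similarities to the queue
    t_probs = F.softmax(t_logits, dim=1)
    loss = -(t_probs * F.log_softmax(s_logits, dim=1)).sum(dim=1).mean()
    queue.enqueue(video_emb)                   # refresh the reference distribution
    return loss
```

Because the queue decouples the reference distribution from the batch size, the student sees a stable, diverse set of video anchors at every step.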
All experimental results and ablation study findings can be found in the `/results` folder.
The `/dataset` folder contains the train, val, and test splits for each dataset, along with our preprocessing scripts. In particular, `ego4d_subset_ids.txt` lists the subset of all available IMU-containing video IDs, obtained by applying the official Ego4D filter from their website; this is the complete subset of data we can access.
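For example, loading that subset file might look like the snippet below (a hypothetical sketch; the path is relative to the repository root, and the actual filtering is done by the preprocessing scripts):

```python
# Load the video IDs that passed the official Ego4D IMU filter (see /dataset).
with open("dataset/ego4d_subset_ids.txt") as f:
    subset_ids = {line.strip() for line in f if line.strip()}

print(f"{len(subset_ids)} IMU-containing Ego4D video IDs in the subset")
```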
To run self-supervised video-to-IMU distillation, use the following command:
Note: `[ ]` denotes optional parameters.
Currently supported pretrained models:
- Time-series models: MOMENT, Mantis
- Video models: VideoMAE, TimeSformer
Other pretrained models can be used with minor modifications to the code.
```bash
python train.py \
    --video_ckpt "facebook/timesformer-base-finetuned-k400" \
    --imu_ckpt "paris-noah/Mantis-8M" \
    --dataset_path "DATASET_PATH" \
    --encoded_video_path "ENCODED_VIDEO_PATH" \
    --anchor_video_path "ANCHOR_VIDEO_PATH" \
    [--queue_size QUEUE_SIZE] \
    [--student_temp STUDENT_TEMP] \
    [--teacher_temp TEACHER_TEMP] \
    [--learning_rate LR] \
    [--num_epochs EPOCH] \
    [--batch_size BS] \
    [--num_clips 0] \
    [--seed SEED] \
    [--mlp_hidden_dim MLP_HIDDEN_DIM] \
    [--mlp_output_dim MLP_OUTPUT_DIM] \
    [--reduction "concat"] \
    [--is_raw true]
```
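For reference, the `--video_ckpt` and `--imu_ckpt` arguments name standard Hugging Face checkpoints. Loading them directly looks roughly like the sketch below, assuming the `transformers` and `momentfm` packages; the framework wraps this internally, so the exact calls may differ:

```python
from momentfm import MOMENTPipeline
from transformers import TimesformerModel

# Video teacher: TimeSformer fine-tuned on Kinetics-400 (VideoMAE loads analogously).
teacher = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

# Time-series student: MOMENT in embedding mode.
student = MOMENTPipeline.from_pretrained(
    "AutonLab/MOMENT-1-small",
    model_kwargs={"task_name": "embedding"},
)
student.init()
```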
We evaluate the learned IMU representations in an unsupervised manner (see Section 3.2 of our paper): we train a Support Vector Machine (SVM) on the extracted IMU features and report classification accuracy on the test set. Run the following command to start the evaluation:
```bash
python unsupervised_rep_test.py \
    --imu_ckpt "AutonLab/MOMENT-1-small" \
    --model_path "MODEL_WEIGHT_PATH" \
    --dataset_path "DATASET_PATH"
```
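Conceptually, the evaluation reduces to fitting an SVM on frozen embeddings, as in the scikit-learn sketch below. The arrays here are random stand-ins; in practice they are the distilled IMU encoder's features for the train and test splits:

```python
import numpy as np
from sklearn.svm import SVC

# Random stand-ins for pre-extracted IMU embeddings (num_samples, feature_dim) and labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(512, 256)), rng.integers(0, 8, size=512)
X_test, y_test = rng.normal(size=(128, 256)), rng.integers(0, 8, size=128)

clf = SVC()  # SVM classifier on the frozen IMU features
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```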
There is a lot of outstanding work on time series and human activity recognition! Here is an incomplete list; check out Table 1 in our paper for IMU-based human activity recognition comparisons with these studies:
- MOMENT: A Family of Open Time-series Foundation Models [Paper, Code, Hugging Face]
- Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification [Paper, Code, Hugging Face]
- TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis [Paper, Code]
- DLinear: Are Transformers Effective for Time Series Forecasting? [Paper, Code]
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [Paper, Code]
- IMU2CLIP: Language-grounded Motion Sensor Translation with Multimodal Contrastive Learning [Paper, Code]
If you have any questions or suggestions, feel free to contact Baiyu (Breeze) at breeze.chen(at)student(dot)unsw(dot)edu(dot)au.