Baiyu Chen1,2, Wilson Wongso1,2, Zechen Li1, Yonchanok Khaokaew1,2, Hao Xue1,2, and Flora Salim1,2
1 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
2 ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S)
COMODO is an open-source framework for Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition.
- Self-supervised Cross-modal Knowledge Transfer: We propose COMODO, a cross-modal self-supervised distillation framework that leverages pretrained video and time-series models, enabling label-free knowledge transfer from a stronger modality (video) with richer training data to a weaker modality (IMU) with limited data.
- A Self-supervised and Effective Cross-modal Queuing Mechanism: We introduce a cross-modal FIFO queue that maintains video embeddings as a stable and diverse reference distribution for IMU feature distillation, extending the instance queue distribution learning approach from single-modality to cross-modality (see the sketch after this list).
- Teacher-Student Model Agnostic: COMODO supports diverse video and time-series pretrained models, enabling flexible teacher-student configurations and future integration with stronger foundation models.
- Cross-dataset Generalization: We demonstrate that COMODO maintains superior performance even when evaluated on unseen datasets, outperforming even fully supervised models, highlighting its robustness and generalizability for egocentric HAR tasks.
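For intuition, the queue-based distillation objective can be sketched as follows. This is a minimal illustration only, not the exact implementation: the class and function names, queue size, and temperature defaults are our own simplifications.

```python
import torch
import torch.nn.functional as F

class CrossModalQueue:
    """FIFO queue of teacher (video) embeddings, used as a reference distribution."""

    def __init__(self, dim: int, size: int = 4096):
        self.embeds = F.normalize(torch.randn(size, dim), dim=1)  # random init
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, video_emb: torch.Tensor) -> None:
        """Insert the newest batch of video embeddings, overwriting the oldest."""
        n = video_emb.size(0)
        idx = (self.ptr + torch.arange(n)) % self.embeds.size(0)
        self.embeds[idx] = F.normalize(video_emb, dim=1)
        self.ptr = (self.ptr + n) % self.embeds.size(0)

def distillation_loss(imu_emb, video_emb, queue, t_student=0.1, t_teacher=0.05):
    """Cross-entropy between teacher and student similarity distributions over the queue."""
    imu_emb = F.normalize(imu_emb, dim=1)      # student (IMU) embeddings
    video_emb = F.normalize(video_emb, dim=1)  # teacher (video) embeddings
    ref = queue.embeds.detach()                # (K, d) reference embeddings
    s_logits = imu_emb @ ref.T / t_student     # student similarities to the queue
    t_logits = video_emb @ ref.T / t_teacher   # teacher similarities to the queue
    t_probs = F.softmax(t_logits, dim=1)
    loss = -(t_probs * F.log_softmax(s_logits, dim=1)).sum(dim=1).mean()
    queue.enqueue(video_emb)                   # refresh the reference distribution
    return loss
```

Because the queue decouples the reference distribution from the batch size, the student sees a stable, diverse set of video anchors at every step.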
All experimental results and ablation study findings can be found in the `/results` folder.
The `/dataset` folder contains the train, val, and test splits for each dataset, along with our preprocessing scripts. In particular, `ego4d_subset_ids.txt` lists the subset of all available IMU-containing video IDs, obtained by applying the official Ego4D filter from their website; this is the complete subset of data we can access.
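For example, loading that subset file might look like the snippet below (a hypothetical sketch; the path is relative to the repository root, and the actual filtering is done by the preprocessing scripts):

```python
# Load the video IDs that passed the official Ego4D IMU filter (see /dataset).
with open("dataset/ego4d_subset_ids.txt") as f:
    subset_ids = {line.strip() for line in f if line.strip()}

print(f"{len(subset_ids)} IMU-containing Ego4D video IDs in the subset")
```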
To run self-supervised video-to-IMU distillation, use the following command:
Note: `[ ]` denotes optional parameters.
Currently supported pretrained models:
- Time-series models: MOMENT, Mantis
- Video models: VideoMAE, TimeSformer
Other pretrained models can be used with minor modifications to the code.
```bash
python train.py \
    --video_ckpt "facebook/timesformer-base-finetuned-k400" \
    --imu_ckpt "paris-noah/Mantis-8M" \
    --dataset_path "DATASET_PATH" \
    --encoded_video_path "ENCODED_VIDEO_PATH" \
    --anchor_video_path "ANCHOR_VIDEO_PATH" \
    [--queue_size QUEUE_SIZE] \
    [--student_temp STUDENT_TEMP] \
    [--teacher_temp TEACHER_TEMP] \
    [--learning_rate LR] \
    [--num_epochs EPOCH] \
    [--batch_size BS] \
    [--num_clips 0] \
    [--seed SEED] \
    [--mlp_hidden_dim MLP_HIDDEN_DIM] \
    [--mlp_output_dim MLP_OUTPUT_DIM] \
    [--reduction "concat"] \
    [--is_raw true]
```
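For reference, the `--video_ckpt` and `--imu_ckpt` arguments name standard Hugging Face checkpoints. Loading them directly looks roughly like the sketch below, assuming the `transformers` and `momentfm` packages; the framework wraps this internally, so the exact calls may differ:

```python
from momentfm import MOMENTPipeline
from transformers import TimesformerModel

# Video teacher: TimeSformer fine-tuned on Kinetics-400 (VideoMAE loads analogously).
teacher = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

# Time-series student: MOMENT in embedding mode.
student = MOMENTPipeline.from_pretrained(
    "AutonLab/MOMENT-1-small",
    model_kwargs={"task_name": "embedding"},
)
student.init()
```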
We evaluate the learned IMU representations in an unsupervised manner (see Section 3.2 of our paper): we train a Support Vector Machine (SVM) on the extracted IMU features and report classification accuracy on the test set. Run the following command to start the evaluation:
```bash
python unsupervised_rep_test.py \
    --imu_ckpt "AutonLab/MOMENT-1-small" \
    --model_path "MODEL_WEIGHT_PATH" \
    --dataset_path "DATASET_PATH"
```
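Conceptually, the evaluation reduces to fitting an SVM on frozen embeddings, as in the scikit-learn sketch below. The arrays here are random stand-ins; in practice they are the distilled IMU encoder's features for the train and test splits:

```python
import numpy as np
from sklearn.svm import SVC

# Random stand-ins for pre-extracted IMU embeddings (num_samples, feature_dim) and labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(512, 256)), rng.integers(0, 8, size=512)
X_test, y_test = rng.normal(size=(128, 256)), rng.integers(0, 8, size=128)

clf = SVC()  # SVM classifier on the frozen IMU features
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```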
There is a lot of outstanding work on time series and human activity recognition! Here is an incomplete list; check out Table 1 in our paper for IMU-based human activity recognition comparisons with these studies:
- MOMENT: A Family of Open Time-series Foundation Models [Paper, Code, Hugging Face]
- Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification [Paper, Code, Hugging Face]
- TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis [Paper, Code]
- DLinear: Are Transformers Effective for Time Series Forecasting? [Paper, Code]
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [Paper, Code]
- IMU2CLIP: Language-grounded Motion Sensor Translation with Multimodal Contrastive Learning [Paper, Code]
If you have any questions or suggestions, feel free to contact Baiyu (Breeze) at breeze.chen(at)student(dot)unsw(dot)edu(dot)au.