An unsupervised behavioral phenotyping pipeline for studying aging in mice using Keypoint-MoSeq (KPMS). The pipeline processes pose-estimation data from video recordings of Diversity Outbred (DO) and C57BL/6 (B6) mice, learns behavioral syllables via an autoregressive hidden Markov model, and extracts feature matrices for downstream regression analysis of aging and frailty.
See KumarLabJax/Unsupervised-vFI for code for downstream prediction tasks and anshu957/kpms_kumarlab for various KPMS utility functions.
unsupervised-aging/
├── src/
│ ├── preprocessing/ # Data preparation and stratified sampling
│ ├── kpms_training/ # Keypoint-MoSeq model training
│ ├── kpms_inference/ # Batch inference on new datasets
│ ├── feature_extraction/ # Behavioral feature computation
│ └── model_evaluation/ # Nested cross-validation regression
├── scripts/templates/ # SLURM job submission templates
├── figures/ # Output visualizations
└── pyproject.toml # Project metadata and dependencies
Recommended Data Format:
unsupervised-aging/
├── data/
│ ├── datasets/ # Datasets
│ │ ├── dataset1/
│ │ │ ├── inference_groups/ # Batched videos for KPMS inference (kpms_inference/batch_videos.py)
│ │ │ ├── poses/ # .h5 pose files
│ │ │ ├── poses_csv/ # .csv pose files (preprocessing/convert_h5_to_csv.py)
│ │ │ ├── training_set/ # KPMS training set (preprocessing/sample_kpms_training_set.py)
│ │ │ └── videos/ # .mp4 video files
│ │ └── ...
│ └── kpms_projects/ # KPMS Projects (see KPMS documentation for details)
│ ├── project1/
│ │ ├── model1/ # A project may contain multiple models (kpms_training/kpms_model_comparison.py)
│ │ │ ├── results/ # per-batch .h5 result files (kpms_inference/kpms_inference.py)
│ │ │ ├── checkpoint.h5 # trained model weights
│ │ │ ├── results.h5 # merged .h5 result file (kpms_inference/merge_batch_results.py)
│ │ │ └── ...
│ │ ├── config.yml # KPMS config file (kpms_training/create_kpms_project.py)
│ │ └── index.csv # KPMS index file required by internal methods (kpms_inference/create_index_csv.py)
│ └── ...
└ ...
| Script | Description |
|---|---|
convert_h5_to_csv.py |
Convert pose-estimation H5 files to CSV |
sample_kpms_training_set.py |
Stratified sampling preserving joint distribution of categorical/numeric variables |
sample_combined_kpms_training_set.py |
Strain-balanced sampling across DO and B6 |
sample_young_old_kpms_training_set.py |
Age-filtered sampling with young/old splits based on quantile thresholds |
transfer_dataset.py |
Symlink trimmed videos and pose files into a dataset directory |
transfer_dataset_from_file.py |
Symlink pose H5 files from a text file listing |
| Script | Description |
|---|---|
create_kpms_project.py |
Initialize a KPMS project: set bodyparts/skeleton, run PCA |
kpms_training.py |
Train a KPMS model (AR-HMM then full model), reindex syllables, export results |
kpms_model_comparison.py |
Compute Expected Marginal Likelihood scores across model replicates |
kpms_training_utils.py |
Shared training utilities (fit_and_save_model) |
| Script | Description |
|---|---|
kpms_inference.py |
Apply a trained checkpoint to new pose data, write HDF5 results |
batch_videos.py |
Distribute files into sub-folders for memory-constrained batching |
create_index_csv.py |
Create placeholder index.csv mapping each pose file to its own group |
merge_batch_results.py |
Merge per-batch HDF5 result files into a single results.h5 |
merge_batch_results_no_dups.py |
Merge variant that skips duplicate groups |
| Script | Description |
|---|---|
feature_extraction.py |
Compute latent-embedding statistics, syllable frequencies, and optional meta-syllable transitions; merge with metadata and supervised features |
kpms_generate_trajectories.py |
Generate dendrograms, per-syllable trajectory plots, and grid movies |
create_dendrogram.py |
Compute and save a syllable similarity dendrogram |
kpms_syllable_merging.py |
Hierarchical syllable merging via agglomerative clustering (marimo notebook) |
| Script | Description |
|---|---|
model_evaluation.py |
Nested group-aware cross-validation with Elastic Net, Random Forest, XGBoost, and MLP regressors |
model_evaluation_utils.py |
TwoStageSearchCV for sequential hyperparameter tuning; compute_nested_kfold_validation for nested CV |
stack_evaluation_results.py |
Concatenate multiple evaluation CSVs into a single output |
- Python >= 3.10
- Keypoint-MoSeq and JAX (for training and inference)
- SLURM-managed HPC cluster (for GPU-accelerated model training)
Python dependencies are managed with uv and defined in pyproject.toml:
h5py >= 3.14.0
numpy >= 2.2.6
pandas >= 2.3.2
scikit-learn >= 1.7.2
xgboost >= 3.0.5
matplotlib >= 3.10.6
marimo >= 0.16.2
tqdm >= 4.67.1
-
Clone the repository:
git clone <repository-url> cd unsupervised-aging
-
Create the virtual environment:
uv sync
-
Prepare data: Convert pose-estimation H5 files to CSV and draw a stratified training sample:
python src/preprocessing/convert_h5_to_csv.py --dataset_dir <path> python src/preprocessing/sample_kpms_training_set.py \ --dataset_dir <path> --export_dir <path> --n_samples 200
-
Initialize a KPMS project and train:
python src/kpms_training/create_kpms_project.py \ --project_name my_project --kpms_dir <path> \ --videos_dir <path> --poses_csv_dir <path> python src/kpms_training/kpms_training.py \ --project_name my_project --model_name model_v1 \ --kpms_dir <path> --videos_dir <path> --poses_csv_dir <path> -
Run inference on new data:
python src/kpms_inference/kpms_inference.py \ --project_name my_project --model_name model_v1 \ --kpms_dir <path> --poses_csv_dir <path> -
Extract features and evaluate:
python src/feature_extraction/feature_extraction.py \ --project_name my_project --model_name model_v1 \ --kpms_dir <path> --dataset_dir <path> \ --adj_metadata_path <metadata.csv> --output_dir <path> python src/model_evaluation/model_evaluation.py \ --input_csv <features.csv> --xcat_json <categories.json> \ --output_path <results.csv>
Job submission scripts for each pipeline stage are in scripts/templates/. Each template contains placeholder paths and resource configurations for NVIDIA A100 GPUs:
| Template | Purpose |
|---|---|
convert_h5_to_csv.sh |
H5 to CSV conversion |
create_kpms_project.sh |
Project initialization and PCA |
kpms_training.sh |
Model training (task ID = seed) |
kpms_inference.sh |
Batch inference |
kpms_model_comparison.sh |
EML score computation |
model_evaluation.sh |
Nested CV regression |