Multimodal machine learning for depression subtype discovery
This project applies unsupervised machine learning (K-Means clustering, PCA, t-SNE) to the DAIC-WOZ Depression Database to identify hidden subtypes and biomarker patterns associated with Major Depressive Disorder (MDD).
- 2 distinct depression subtypes discovered
- Statistically significant correlation with clinical labels (χ² = 6.44, p = 0.0112)
- Analyzed 33 clinical interviews with multimodal features
- Combined text (TF-IDF) + acoustic (COVAREP) features (396 dimensions)
- Can unsupervised algorithms detect meaningful latent subtypes of MDD patients?
- What multimodal biomarkers (speech acoustics + text) define these subtypes?
- Do discovered clusters correlate with PHQ-8 depression severity?
- Can dimensionality reduction (PCA) reveal interpretable patterns?
Gold standard clinical dataset from USC Institute for Creative Technologies (AVEC 2017):
- 189 clinical interviews (107 training, 82 validation/test)
- Modalities: Audio transcripts, COVAREP acoustic features, facial Action Units
- Labels: PHQ-8 depression scores (0-24), binary classification (threshold ≥ 10)
- Our analysis: 33 sessions (14 depressed, 19 healthy)
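The binary label described above can be derived from the PHQ-8 total with an inclusive threshold. The following is a minimal sketch, not the project's code: the function name `phq8_binary` and the participant IDs and scores are hypothetical and exist only for illustration.

```python
def phq8_binary(score: int, threshold: int = 10) -> int:
    """Return 1 (depressed) if the PHQ-8 total meets the threshold, else 0."""
    if not 0 <= score <= 24:
        raise ValueError(f"PHQ-8 total must be in 0..24, got {score}")
    # The threshold is inclusive: a score of exactly 10 counts as depressed
    return int(score >= threshold)


# Hypothetical scores for illustration only (not taken from DAIC-WOZ)
example_scores = {"P_a": 3, "P_b": 10, "P_c": 17, "P_d": 9}
labels = {pid: phq8_binary(score) for pid, score in example_scores.items()}
print(labels)
```

Note that a score of 9 maps to 0 and a score of 10 maps to 1, which is the boundary case worth checking when reproducing the label counts.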
See DAIC_WOZ_MINIMAL.md for quick setup or DAIC_WOZ_SETUP.md for comprehensive guide.
Quick download:
.\scripts\download_daicwoz.ps1

DAIC-WOZ Transcripts + COVAREP Acoustic Features
↓
Feature Extraction (TF-IDF + Statistical Aggregation)
↓
Feature Fusion (100 text + 296 acoustic = 396 features)
↓
Normalization (StandardScaler)
↓
PCA Dimensionality Reduction (396 → 27 components, 95% variance)
↓
t-SNE Visualization (2D embeddings)
↓
K-Means Clustering (k=2 optimal)
↓
Statistical Validation (Chi-square test vs PHQ-8 labels)
↓
Results: p = 0.0112 (significant)
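The first steps of the pipeline above (feature extraction and fusion) might look roughly like the sketch below. It runs on synthetic stand-ins rather than DAIC-WOZ files, so the transcripts, the frame matrices, and the helper name `aggregate_frames` are assumptions. The TF-IDF settings and the four statistics follow the values this README reports; using the population standard deviation (NumPy's default) is also an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

rng = np.random.default_rng(0)


def aggregate_frames(frames: np.ndarray) -> np.ndarray:
    """Collapse (n_frames, n_features) into mean/std/min/max per feature."""
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),   # population std (ddof=0); an assumption
        frames.min(axis=0),
        frames.max(axis=0),
    ])


# Synthetic stand-ins for four sessions (not real transcripts or COVAREP data)
transcripts = [
    "i feel tired most days",
    "i feel fine and sleep well",
    "most days i sleep badly",
    "work is fine and i feel well",
]
frame_matrices = [
    rng.normal(size=(int(rng.integers(50, 100)), 74)) for _ in transcripts
]

# Text branch: unigrams + bigrams, capped at 100 terms
vectorizer = TfidfVectorizer(
    max_features=100, ngram_range=(1, 2), min_df=2, max_df=0.8
)
X_text = vectorizer.fit_transform(transcripts).toarray()

# Acoustic branch: 74 features x 4 statistics = 296 values per session
X_acoustic = np.vstack([aggregate_frames(f) for f in frame_matrices])

# Fusion: horizontal concatenation of both modalities
X = np.hstack([X_text, X_acoustic])
print(X_text.shape, X_acoustic.shape, X.shape)
```

On a toy corpus this small the text branch yields far fewer than 100 terms; the 100-term cap only becomes binding with a realistic vocabulary.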
# Clone the repository
git clone https://github.com/param20h/MDD-biomarker-discovery-project.git
cd MDD-biomarker-discovery-project
# Install dependencies
pip install -r requirements.txt
# Download CSV splits and sample sessions
.\scripts\download_daicwoz.ps1
# Or download more training sessions
.\scripts\download_training_sessions.ps1
See DAIC_WOZ_MINIMAL.md for detailed setup.
# Open Jupyter notebook
jupyter notebook notebooks/03_DAICWOZ_unsupervised.ipynb
# Run all cells to reproduce results
# Results: 2 clusters, p=0.0112, significant correlation with PHQ-8
- Notebook: notebooks/03_DAICWOZ_unsupervised.ipynb
- Research Paper: docs/paper/research_paper_template.md
| Metric | Value | Interpretation |
|---|---|---|
| Optimal k | 2 | Two distinct subtypes |
| Silhouette Score | 0.168 | Weak but positive separation (range -1 to 1) |
| Davies-Bouldin | 1.871 | Moderate compactness (lower is better) |
| Calinski-Harabasz | 8.3 | Moderate density (higher is better) |

| Test | Value | Result |
|---|---|---|
| Chi-square (χ²) | 6.44 | - |
| p-value | 0.0112 | Significant (p < 0.05) |
| Degrees of freedom | 1 | - |
- Participants: 33 (14 depressed, 19 healthy)
- Features: 396 (100 text + 296 acoustic) → 27 via PCA
- Cluster sizes: 14 vs 19 participants
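The reported chi-square statistic can be checked by hand. The README does not list the contingency table, so the 2×2 table below is an assumption: it is the table consistent with the stated margins (clusters of 14 and 19, labels of 14 depressed and 19 healthy) that reproduces χ² ≈ 6.44 when Yates' continuity correction is applied. SciPy's `chi2_contingency` applies that correction by default for 2×2 tables, which would explain the match, but the actual table in the notebook may differ.

```python
import math

# Assumed table (rows: clusters, columns: depressed / healthy).
# Inferred from the reported margins; not taken from the project's output.
observed = [[10, 4],
            [4, 15]]

row_totals = [sum(r) for r in observed]         # cluster sizes: [14, 19]
col_totals = [sum(c) for c in zip(*observed)]   # label counts:  [14, 19]
n = sum(row_totals)                             # 33 participants

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        # Yates' continuity correction: shrink |O - E| by 0.5 (never below 0)
        diff = max(abs(observed[i][j] - expected) - 0.5, 0.0)
        chi2 += diff ** 2 / expected

# With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x/2))
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # chi2 = 6.44, p = 0.0112
```

Without the continuity correction the same table gives a larger statistic (about 8.37), so it matters which variant is reported alongside the p-value.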
Conclusion: Unsupervised clustering identified two candidate depression subtypes whose cluster membership is significantly associated with the clinical PHQ-8 labels (χ² = 6.44, df = 1, p = 0.0112).
MDD-biomarker-discovery-project/
│
├── data/  # Data directory (gitignored)
│   ├── splits/  # CSV train/dev/test splits
│   │   └── train_split_Depression_AVEC2017.csv
│   └── raw/  # DAIC-WOZ session folders
│       ├── 300_P/
│       │   ├── 300_TRANSCRIPT.csv
│       │   └── 300_COVAREP.csv
│       └── ...
│
├── notebooks/  # Jupyter notebooks
│   └── 03_DAICWOZ_unsupervised.ipynb  # Main analysis (32 cells)
│
├── scripts/  # Download & utility scripts
│   ├── download_daicwoz.ps1  # Download CSV splits + samples
│   └── download_training_sessions.ps1  # Download training sessions
│
├── docs/  # Documentation
│   └── paper/
│       └── research_paper_template.md  # Research paper with results
│
├── src/  # Source code (unused - analysis in notebook)
│   └── ...
│
├── DAIC-WOZ.md  # Dataset overview
├── DAIC_WOZ_MINIMAL.md  # Quick setup guide
├── DAIC_WOZ_SETUP.md  # Comprehensive setup guide
│
├── .gitignore  # Git ignore (excludes data/)
├── requirements.txt  # Python dependencies
└── README.md  # This file
- Scikit-Learn - K-Means clustering, PCA, t-SNE, StandardScaler
- NumPy - Numerical computing
- SciPy - Statistical tests (chi-square)
- TF-IDF Vectorizer - Text feature extraction (unigrams + bigrams)
- COVAREP Features - 74 acoustic features (F0, NAQ, QOQ, H1H2, PSP, MDQ, etc.)
- Matplotlib - Static plots (PCA scree, elbow method)
- Seaborn - Heatmaps and statistical visualizations
- Pandas - Data manipulation and analysis
- Text (TF-IDF): 100 features, unigrams+bigrams, min_df=2, max_df=0.8
- Acoustic (COVAREP): 296 features (74 × 4 statistics: mean/std/min/max)
- Multimodal feature fusion (horizontal concatenation)
- StandardScaler normalization (zero mean, unit variance)
- PCA: 396 → 27 components (95.4% variance retained)
- t-SNE: 2D visualization (perplexity=11, adapted for small dataset)
- K-Means: Tested k=2-6, optimal k=2 (silhouette optimization)
- Initialization: k-means++, n_init=20
- Chi-square test: Cluster vs PHQ-8 binary labels
- Metrics: Silhouette, Davies-Bouldin, Calinski-Harabasz
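The preprocessing, reduction, and clustering steps listed above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the notebook's code: the feature matrix is synthetic random data with an artificial shift, `random_state=42` is arbitrary, and the resulting scores and component counts will not match the values reported in this README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fused 33 x 396 matrix (not DAIC-WOZ data)
rng = np.random.default_rng(42)
X = rng.normal(size=(33, 396))
X[:14] += 0.75  # shift one group so there is some structure to find

# Zero mean, unit variance per feature
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components reaching that variance share
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

# 2D embedding for plotting only; perplexity must stay below n_samples
embedding = TSNE(n_components=2, perplexity=11, random_state=42).fit_transform(X_pca)

# Scan k = 2..6 and record the three internal validity metrics
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=20, random_state=42)
    labels = km.fit_predict(X_pca)
    scores[k] = {
        "silhouette": silhouette_score(X_pca, labels),
        "davies_bouldin": davies_bouldin_score(X_pca, labels),
        "calinski_harabasz": calinski_harabasz_score(X_pca, labels),
    }

# Pick k by the highest silhouette score
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(X_pca.shape, best_k, scores[best_k])
```

Clustering is run on the PCA output rather than on the t-SNE embedding, since t-SNE distorts global distances and is generally better suited to visualization than to distance-based clustering.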
- 2 distinct depression subtypes identified through unsupervised learning
- Statistical significance: χ² = 6.44, p = 0.0112 < 0.05
- Multimodal approach: Combined text and acoustic features outperform single-modality
- Clinical validation: Clusters correlate with PHQ-8 gold standard labels
- Dimensionality reduction: PCA effectively reduced 396 features to 27 while retaining 95% variance
- Objective biomarkers for depression can be extracted from speech and text
- Hidden heterogeneity exists within MDD that unsupervised methods can reveal
- Personalized treatment potential based on subtype characteristics
- Gratch, J., et al. (2014). The Distress Analysis Interview Corpus of human and computer interviews. LREC.
- Valstar, M., et al. (2016). AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge. ACM ICMI.
- Degottex, G., et al. (2014). COVAREP - A collaborative voice analysis repository for speech technologies. IEEE ICASSP.
- Paramjit - Machine Learning
- Dinesh - Contributor
- Jaivardhan - Contributor

Research Project
MIT License - For research and educational purposes.
- USC Institute for Creative Technologies - DAIC-WOZ Depression Database (AVEC 2017)
- COVAREP Team - Acoustic feature extraction toolkit
Crisis Resources:
- National Suicide Prevention Lifeline: 988
- Crisis Text Line: Text HOME to 741741
"In the silence of data, we find the voice of invisible pain."