DAIC-WOZ Dataset - Minimal Setup

📊 What You Need

189 clinical interviews (session IDs 300-492; the range has gaps, so not every ID exists)

  • Labels: PHQ-8 depression scores, binarized at a total score of 10 (0 = not depressed, 1 = depressed if score ≥ 10)
  • Data: Interview transcripts + acoustic features
  • Splits: Official train/dev/test

📥 Download (Minimum Required)

1. CSV Files (5 KB total - CRITICAL)

http://dcapswoz.ict.usc.edu/wwwdaicwoz/train_split_Depression_AVEC2017.csv
http://dcapswoz.ict.usc.edu/wwwdaicwoz/dev_split_Depression_AVEC2017.csv
http://dcapswoz.ict.usc.edu/wwwdaicwoz/test_split_Depression_AVEC2017.csv

Save to: data/splits/

2. Session Data (Start Small)

Download 10-20 training sessions first (~4-8 GB, at roughly 0.3-0.4 GB per session):

http://dcapswoz.ict.usc.edu/wwwdaicwoz/300_P.zip (327M)
http://dcapswoz.ict.usc.edu/wwwdaicwoz/301_P.zip (403M)
...

Extract to: data/raw/300_P/, data/raw/301_P/, etc.
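
The download-and-extract step above can be scripted. The following is a minimal sketch, not part of the original setup: the helper names (`session_url`, `extract_session`, `fetch_sessions`) are hypothetical, and it assumes the URLs can be fetched without a login and that each archive holds its files at the top level rather than inside its own `300_P/` folder.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# Base of the session URLs listed above.
BASE_URL = "http://dcapswoz.ict.usc.edu/wwwdaicwoz"


def session_url(session_id: int) -> str:
    """Build the archive URL for one session, e.g. 300 -> .../300_P.zip."""
    return f"{BASE_URL}/{session_id}_P.zip"


def extract_session(zip_path: Path, raw_dir: Path) -> Path:
    """Extract <ID>_P.zip into raw_dir/<ID>_P/ and return that folder."""
    target = raw_dir / zip_path.stem  # "300_P.zip" -> "300_P"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def fetch_sessions(session_ids, raw_dir: Path = Path("data/raw")) -> None:
    """Download and extract each session, skipping folders that already exist."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    for sid in session_ids:
        if (raw_dir / f"{sid}_P").is_dir():
            continue  # already extracted on a previous run
        zip_path = raw_dir / f"{sid}_P.zip"
        urlretrieve(session_url(sid), zip_path)  # assumption: no login needed
        extract_session(zip_path, raw_dir)
        zip_path.unlink()  # delete the archive to reclaim disk space
```

Calling `fetch_sessions(range(300, 310))` would cover the first ten sessions. If the server requires credentials, download the archives manually and call `extract_session` on each one.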

📁 Folder Structure

data/
├── splits/
│   ├── train_split_Depression_AVEC2017.csv  ← IDs + labels
│   ├── dev_split_Depression_AVEC2017.csv
│   └── test_split_Depression_AVEC2017.csv
│
└── raw/
    ├── 300_P/
    │   ├── 300_TRANSCRIPT.csv     ← Use this (text)
    │   └── 300_COVAREP.csv        ← Use this (74 acoustic features)
    ├── 301_P/
    └── ...
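
Before running the notebook, it can help to confirm that the layout above is complete. The sketch below lists expected files that are absent. The function name `missing_files` is hypothetical, and it checks only the two per-session files this guide uses.

```python
from pathlib import Path

EXPECTED_SPLITS = [
    "train_split_Depression_AVEC2017.csv",
    "dev_split_Depression_AVEC2017.csv",
    "test_split_Depression_AVEC2017.csv",
]


def missing_files(data_dir: Path) -> list[str]:
    """Return relative paths of expected files that are not present."""
    missing = [
        f"splits/{name}"
        for name in EXPECTED_SPLITS
        if not (data_dir / "splits" / name).is_file()
    ]
    # Each session folder is named <ID>_P and should hold the two files we use.
    for session_dir in sorted((data_dir / "raw").glob("*_P")):
        sid = session_dir.name.split("_")[0]
        for suffix in ("TRANSCRIPT", "COVAREP"):
            name = f"{sid}_{suffix}.csv"
            if not (session_dir / name).is_file():
                missing.append(f"raw/{session_dir.name}/{name}")
    return missing
```

An empty list from `missing_files(Path("data"))` means every downloaded session has both files and all three split CSVs are in place.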

🎯 What to Use

For your unsupervised learning project:

  1. Text: XXX_TRANSCRIPT.csv - Interview transcripts
  2. Audio: XXX_COVAREP.csv - 74 acoustic features (F0, MFCC, jitter, etc.)

Ignore video/facial files for now (optional later).
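
As a rough illustration of reading the two files, the sketch below pulls the participant's speech out of a transcript and averages each acoustic feature over a session. Several details are assumptions rather than facts stated in this guide: that the transcript is tab-separated with `speaker` and `value` columns and labels participant rows `Participant`, and that the COVAREP file is comma-separated with no header and one row per frame. Mean pooling is only one simple way to get a fixed-length vector per session; the function names are hypothetical.

```python
import csv


def participant_text(transcript_file) -> str:
    """Join the participant's utterances into one string.

    Assumes a tab-separated file whose header has 'speaker' and 'value'
    columns, with participant rows labeled 'Participant'.
    """
    reader = csv.DictReader(transcript_file, delimiter="\t")
    return " ".join(
        row["value"].strip()
        for row in reader
        if row["speaker"] == "Participant" and row["value"]
    )


def covarep_means(covarep_file, n_features: int = 74) -> list[float]:
    """Average each acoustic feature over all frames of one session.

    Assumes a headerless, comma-separated file with one row per frame.
    Returns an empty list if no valid rows are found.
    """
    sums = [0.0] * n_features
    frames = 0
    for row in csv.reader(covarep_file):
        if len(row) != n_features:
            continue  # skip blank or malformed rows
        for i, cell in enumerate(row):
            sums[i] += float(cell)
        frames += 1
    return [s / frames for s in sums] if frames else []
```

Both functions take an open file object, for example `open("data/raw/300_P/300_COVAREP.csv", newline="")`. Inspect the first few lines of each file before relying on the assumed layout.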

⚡ Quick Start

# 1. Create folders
cd "m:\5th sem\ML2-project"
New-Item -ItemType Directory -Path "data\splits", "data\raw" -Force

# 2. Download CSV files manually to data/splits/

# 3. Download 10 sessions
# Download 300_P.zip through 309_P.zip from the URLs above
# Extract each to its own folder: data/raw/300_P/, data/raw/301_P/, etc.

# 4. Run notebook
jupyter notebook notebooks/03_DAICWOZ_analysis.ipynb

📊 CSV Format

train_split_Depression_AVEC2017.csv:

Participant_ID,PHQ8_Binary,PHQ8_Score,Gender
300,0,3,Male
301,1,15,Female
...
  • PHQ8_Binary: 0=No depression, 1=Depression
  • PHQ8_Score: 0-24 (≥10 = depression threshold)
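
The relationship between the two label columns can be checked while loading. This is a minimal sketch using the sample rows shown above; `load_labels` is a hypothetical helper, and it relies only on the three column names listed in this section.

```python
import csv
import io

# Sample rows copied from the format shown above.
SAMPLE = """Participant_ID,PHQ8_Binary,PHQ8_Score,Gender
300,0,3,Male
301,1,15,Female
"""


def load_labels(csv_file, threshold: int = 10) -> dict[int, int]:
    """Map Participant_ID -> PHQ8_Binary, checking it against PHQ8_Score."""
    labels = {}
    for row in csv.DictReader(csv_file):
        pid = int(row["Participant_ID"])
        binary = int(row["PHQ8_Binary"])
        score = int(row["PHQ8_Score"])
        # The binary label should be 1 exactly when the score meets the threshold.
        if binary != int(score >= threshold):
            raise ValueError(f"participant {pid}: label {binary} vs score {score}")
        labels[pid] = binary
    return labels


labels = load_labels(io.StringIO(SAMPLE))
print(labels)  # {300: 0, 301: 1}
```

For the real file, pass `open("data/splits/train_split_Depression_AVEC2017.csv", newline="")` instead of the in-memory sample.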

💾 Storage

  • Minimal (10 sessions): ~4 GB
  • Training set only: ~50 GB
  • Full dataset: ~85 GB

Start with 10-20 sessions, then download more as needed.