Commit cbf759d

Update notes.md
1 parent ece3b21 commit cbf759d

1 file changed

Lines changed: 24 additions & 24 deletions

notes.md

@@ -265,17 +265,17 @@ news2_features_patient.csv ← ML ready (patient-level aggregates, imputed
265265

266266
3. **Preparing Timestamp-Level ML Features**
267267

268-
- **Pipeline (make_timestamp_features.py)**:
269-
1. Start from news2_scores.csv (all vitals + NEWS2 + escalation labels).
270-
- Parse charttime as datetime.
271-
- Sort by subject_id, charttime.
272-
2. Create missingness flags for each vital (before fills).
273-
3. Forward-fill (LOCF) per subject; optionally backward-fill initial missingness or leave it as NaN. Do not use the population median.
274-
4. Create carried-forward flags (binary indicator: 1 if the value came from LOCF). These help the model distinguish observed values from values assumed stable, and exploit missingness patterns (e.g. vitals are measured more frequently when patients deteriorate).
275-
5. **Compute rolling windows (1h, 4h, 24h)**: mean, min, max, std, count, slope, AUC.
276-
6. Compute time since last observation (`time_since_last_obs`) for each vital (staleness).
277-
7. Convert textual escalation/risk labels → numeric ordinal encoding (Low=0, Low-Medium=1, Medium=2, High=3) for ML. This keeps things simple: one column, easy to track in feature importance.
278-
8. Save news2_features_timestamp.csv.
268+
**Pipeline (make_timestamp_features.py)**:
269+
1. Start from news2_scores.csv (all vitals + NEWS2 + escalation labels).
270+
- Parse charttime as datetime.
271+
- Sort by subject_id, charttime.
272+
2. Create missingness flags for each vital (before fills).
273+
3. Forward-fill (LOCF) per subject; optionally backward-fill initial missingness or leave it as NaN. Do not use the population median.
274+
4. Create carried-forward flags (binary indicator: 1 if the value came from LOCF). These help the model distinguish observed values from values assumed stable, and exploit missingness patterns (e.g. vitals are measured more frequently when patients deteriorate).
275+
5. **Compute rolling windows (1h, 4h, 24h)**: mean, min, max, std, count, slope, AUC.
276+
6. Compute time since last observation (`time_since_last_obs`) for each vital (staleness).
277+
7. Convert textual escalation/risk labels → numeric ordinal encoding (Low=0, Low-Medium=1, Medium=2, High=3) for ML. This keeps things simple: one column, easy to track in feature importance.
278+
8. Save news2_features_timestamp.csv.
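Steps 2 to 4, 6 and 7 of the pipeline above could be sketched roughly as follows. This is a minimal illustration, not the contents of `make_timestamp_features.py`: the vital column names (`heart_rate`, `spo2`), the label column name (`risk`), the output column suffixes, and the choice to report staleness in hours are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical column names; the real ones in news2_scores.csv may differ.
VITALS = ["heart_rate", "spo2"]
RISK_ORDER = {"Low": 0, "Low-Medium": 1, "Medium": 2, "High": 3}


def add_timestamp_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["charttime"] = pd.to_datetime(df["charttime"])
    df = df.sort_values(["subject_id", "charttime"]).reset_index(drop=True)

    for vital in VITALS:
        observed = df[vital].notna()

        # Step 2: missingness flag, taken before any fill.
        df[f"{vital}_missing"] = (~observed).astype(int)

        # Step 6: hours since this vital was last actually observed
        # (NaN until the subject's first observation).
        last_seen = df["charttime"].where(observed).groupby(df["subject_id"]).ffill()
        df[f"{vital}_time_since_last_obs"] = (
            (df["charttime"] - last_seen).dt.total_seconds() / 3600.0
        )

        # Step 3: LOCF within each subject; leading gaps stay NaN.
        df[vital] = df.groupby("subject_id")[vital].ffill()

        # Step 4: 1 only where a missing value was filled by LOCF.
        df[f"{vital}_carried"] = ((~observed) & df[vital].notna()).astype(int)

    # Step 7: ordinal encoding of the textual risk label.
    df["risk_numeric"] = df["risk"].map(RISK_ORDER)
    return df


example = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "charttime": ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00",
                  "2024-01-01 00:00", "2024-01-01 00:30"],
    "heart_rate": [80, None, None, None, 70],
    "spo2": [None, 97, None, 95, None],
    "risk": ["Low", "Medium", "High", "Low-Medium", "Low"],
})
out = add_timestamp_features(example)
```

Grouping by `subject_id` before the fill keeps one patient's last value from leaking into the next patient's first rows, which is why subject 2's initial `heart_rate` stays NaN in this example.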
279279

280280
- **Rationale**:
281281
- Trees can leverage trends and missingness.
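The rolling-window step (step 5) might look something like the sketch below for a single vital. It assumes a hypothetical `heart_rate` column, time-based windows that look backwards from each row, slope as a least-squares fit in units per hour, and AUC by the trapezoid rule; whether the windows run over raw or forward-filled values is not stated in the notes, so the sketch simply uses whatever is in the column.

```python
import numpy as np
import pandas as pd

WINDOWS = ["1h", "4h", "24h"]


def _hours(index: pd.DatetimeIndex) -> np.ndarray:
    """Elapsed hours from the first timestamp in the window."""
    return np.asarray((index - index[0]).total_seconds()) / 3600.0


def _slope(window: pd.Series) -> float:
    """Least-squares slope (units per hour); NaN with fewer than two points."""
    window = window.dropna()
    if len(window) < 2:
        return np.nan
    t = _hours(window.index)
    if t.max() == t.min():
        return np.nan
    return float(np.polyfit(t, window.to_numpy(dtype=float), 1)[0])


def _auc(window: pd.Series) -> float:
    """Trapezoid-rule area under the curve (unit-hours); 0 for a single point."""
    window = window.dropna()
    if len(window) < 2:
        return 0.0
    t = _hours(window.index)
    y = window.to_numpy(dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(t)))


def add_rolling_features(df: pd.DataFrame, vital: str) -> pd.DataFrame:
    df = df.copy()
    df["charttime"] = pd.to_datetime(df["charttime"])
    df = df.sort_values(["subject_id", "charttime"]).reset_index(drop=True)

    parts = []
    for _, group in df.groupby("subject_id", sort=False):
        series = group.set_index("charttime")[vital]
        feats = pd.DataFrame(index=group.index)
        for w in WINDOWS:
            roll = series.rolling(w)  # window is (t - w, t] for each row
            for stat in ("mean", "min", "max", "std", "count"):
                feats[f"{vital}_{stat}_{w}"] = getattr(roll, stat)().to_numpy()
            # raw=False passes a Series that keeps its datetime index.
            feats[f"{vital}_slope_{w}"] = roll.apply(_slope, raw=False).to_numpy()
            feats[f"{vital}_auc_{w}"] = roll.apply(_auc, raw=False).to_numpy()
        parts.append(feats)
    return df.join(pd.concat(parts))


example = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "charttime": ["2024-01-01 00:00", "2024-01-01 00:30",
                  "2024-01-01 02:00", "2024-01-01 00:00"],
    "heart_rate": [80, 90, 100, 60],
})
out = add_rolling_features(example, "heart_rate")
```

Because each window only looks backwards from the current `charttime`, these features do not use future observations.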
@@ -285,19 +285,19 @@ news2_features_patient.csv ← ML ready (patient-level aggregates, imputed
285285

286286
4. **Preparing Patient-Level ML Features**
287287

288-
- **Pipeline (make_patient_features.py)**:
289-
1. Start from news2_scores.csv.
290-
2. **Group by patient**: Aggregate vitals per patient timeline (median, mean, min, max per vital).
291-
3. **Median imputation**: Fill missing values for each vital using the patient-specific median (so their profile isn’t biased by others). If a patient never had a vital recorded, fall back to the population median.
292-
4. **% Missing per vital**: Track the proportion of missing values per vital before imputation (e.g. HR missing in 30% of a patient’s rows = 0.3). Missingness itself may signal clinical patterns (e.g. some vitals are only measured in deteriorating patients).
293-
5. **Encode risk/escalation labels**: Ordinal encoding (Low=0, Low-Medium=1, Medium=2, High=3), then calculate summary stats per patient: max risk (highest escalation they reached), median risk (typical risk level), and % time at High risk (the fraction of their trajectory spent at High).
294-
6. **Output**: news2_features_patient.csv (compact, one row per patient, ML-ready summary).
288+
**Pipeline (make_patient_features.py)**:
289+
1. Start from news2_scores.csv.
290+
2. **Group by patient**: Aggregate vitals per patient timeline (median, mean, min, max per vital).
291+
3. **Median imputation**: Fill missing values for each vital using the patient-specific median (so their profile isn’t biased by others). If a patient never had a vital recorded, fall back to the population median.
292+
4. **% Missing per vital**: Track the proportion of missing values per vital before imputation (e.g. HR missing in 30% of a patient’s rows = 0.3). Missingness itself may signal clinical patterns (e.g. some vitals are only measured in deteriorating patients).
293+
5. **Encode risk/escalation labels**: Ordinal encoding (Low=0, Low-Medium=1, Medium=2, High=3), then calculate summary stats per patient: max risk (highest escalation they reached), median risk (typical risk level), and % time at High risk (the fraction of their trajectory spent at High).
294+
6. **Output**: news2_features_patient.csv (compact, one row per patient, ML-ready summary).
295295

296-
- **Rationale**:
297-
- Median imputation preserves patient-specific patterns without introducing bias from other patients.
298-
- % Missing captures signal from incomplete measurement patterns.
299-
- Ordinal risk encoding simplifies downstream ML model input while retaining interpretability. Together, the three risk summaries (max, median, % time at High) describe a patient’s escalation profile across their stay. Proportion features (like % High) are standard numeric features, not encoded categories.
300-
- This is enough for the model; optional metrics like streaks, AUC, or rolling windows are not needed for the patient summary.
296+
**Rationale**:
297+
- Median imputation preserves patient-specific patterns without introducing bias from other patients.
298+
- % Missing captures signal from incomplete measurement patterns.
299+
- Ordinal risk encoding simplifies downstream ML model input while retaining interpretability. Together, the three risk summaries (max, median, % time at High) describe a patient’s escalation profile across their stay. Proportion features (like % High) are standard numeric features, not encoded categories.
300+
- This is enough for the model; optional metrics like streaks, AUC, or rolling windows are not needed for the patient summary.
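A rough sketch of the patient-level pipeline above, under stated assumptions: the column names (`heart_rate`, `spo2`, `risk`) and output suffixes are hypothetical, imputation is applied to rows before aggregating (one possible reading of steps 2 and 3), and "% time at High risk" is approximated as the fraction of rows at High, which only equals time spent if rows are roughly evenly spaced.

```python
import pandas as pd

# Hypothetical column names; the real ones in news2_scores.csv may differ.
VITALS = ["heart_rate", "spo2"]
RISK_ORDER = {"Low": 0, "Low-Medium": 1, "Medium": 2, "High": 3}


def make_patient_features(df: pd.DataFrame) -> pd.DataFrame:
    # Step 4: proportion of missing rows per vital, measured before imputation.
    pct_missing = (
        df[VITALS].isna().groupby(df["subject_id"]).mean().add_suffix("_pct_missing")
    )

    # Step 3: patient-specific median first, population median as fallback
    # for patients who never had the vital recorded.
    filled = df.copy()
    for vital in VITALS:
        patient_median = df.groupby("subject_id")[vital].transform("median")
        filled[vital] = filled[vital].fillna(patient_median).fillna(df[vital].median())

    # Step 2: per-patient aggregates, with flattened column names.
    agg = filled.groupby("subject_id")[VITALS].agg(["median", "mean", "min", "max"])
    agg.columns = [f"{vital}_{stat}" for vital, stat in agg.columns]

    # Step 5: ordinal encoding, then per-patient risk summaries.
    risk = df["risk"].map(RISK_ORDER)
    risk_stats = risk.groupby(df["subject_id"]).agg(
        max_risk="max",
        median_risk="median",
        pct_time_high=lambda r: (r == RISK_ORDER["High"]).mean(),
    )

    # Step 6: one row per patient.
    return agg.join(pct_missing).join(risk_stats).reset_index()


example = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "heart_rate": [80, None, 100, 60, 70],
    "spo2": [None, None, None, 95, 97],
    "risk": ["Low", "High", "High", "Low", "Medium"],
})
out = make_patient_features(example)
```

In this example patient 1 never has `spo2` recorded, so their `spo2` aggregates come from the population median while `spo2_pct_missing` stays at 1.0, keeping the imputation visible to the model.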
301301

302302

303303
5. **ML Model Selection**
@@ -318,7 +318,7 @@ news2_features_patient.csv ← ML ready (patient-level aggregates, imputed
318318
### Validation Issue & Fix: GCS → Level of Consciousness
319319
**Problem Identified:**
320320
- `score_vital` incorrectly ignored `level_of_consciousness` when computing NEWS2 scores.
321-
- Reason:
321+
- **Reason**:
322322
1. `compute_news2_score` passes `value = row.get("level_of_consciousness", pd.NA)`.
323323
2. If the row dictionary does not contain `level_of_consciousness` yet (common in synthetic test cases), `value=pd.NA`.
324324
3. Original code had `if pd.isna(value): return 0` at the top of `score_vital`.
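The effect of that early return can be reproduced in a small sketch. Everything here is illustrative: the function bodies, the `gcs_total` key, the 0 = alert encoding, and the GCS 15 = alert mapping are assumptions, and the actual fix is not shown in this excerpt. The only domain fact relied on is that NEWS2 scores consciousness as 0 when alert and 3 otherwise.

```python
import pandas as pd


def score_vital_with_early_return(name, value):
    """Stand-in for the original `score_vital` behaviour described above."""
    if pd.isna(value):
        return 0  # a missing value is silently scored as normal
    if name == "level_of_consciousness":
        return 0 if value == 0 else 3  # assumed encoding: 0 = alert
    raise NotImplementedError(name)


def consciousness_from_row(row):
    """Hypothetical helper: use level_of_consciousness if present,
    otherwise derive it from a GCS total (assumption: 15 = alert)."""
    value = row.get("level_of_consciousness", pd.NA)
    if pd.isna(value):
        gcs = row.get("gcs_total", pd.NA)
        if pd.isna(gcs):
            return pd.NA
        value = 0 if gcs == 15 else 1
    return value


# A synthetic test row that carries a GCS total but no level_of_consciousness.
row = {"gcs_total": 9}

# Path described in steps 1 to 3: the key is absent, so value is pd.NA and scores 0.
masked = score_vital_with_early_return(
    "level_of_consciousness", row.get("level_of_consciousness", pd.NA)
)

# Deriving the value from GCS first gives the impaired-consciousness score.
derived = score_vital_with_early_return(
    "level_of_consciousness", consciousness_from_row(row)
)
```

Here `masked` is 0 and `derived` is 3 for the same row, which is the kind of discrepancy the validation surfaced.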
