Implemented Steps 2–4 (missingness, LOCF, carried-forward flags) and drafted Step 5 rolling features

SimonYip22 · SimonYip22 · commit 3487b42d02dc · 2025-09-17T00:28:33.000+01:00
•	Added add_missingness_flags() to create *_missing columns for each vital.
•	Implemented apply_locf() with forward-fill + backfill per (subject_id, stay_id).
•	Added add_carried_forward_flags() using missingness flags to distinguish true vs imputed values.
•	Began add_rolling_features() to compute rolling-window stats (mean, min, max, std, slope, AUC) for 1h/4h/24h windows on numeric vitals.
•	Verified output with df.head() checks after each step.
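
A minimal sketch of how Steps 2–4 interact, run on a toy frame. The values and the single-vital setup are invented for illustration; only the column names and the flag/fill logic mirror this commit. It shows two things the flags are meant to capture: a naturally repeated value (stay 20) is not marked as carried, and a vital that is never recorded in a stay (stay 30) stays NaN, so it is flagged missing but not carried.

```python
import numpy as np
import pandas as pd

# Toy data (made up): three stays, one vital.
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2, 3],
    "stay_id": [10, 10, 10, 20, 20, 30],
    "heart_rate": [np.nan, 80.0, np.nan, 70.0, 70.0, np.nan],
})

vitals = ["heart_rate"]
keys = ["subject_id", "stay_id"]

# Step 2: flag gaps before any filling
for v in vitals:
    toy[f"{v}_missing"] = toy[v].isna().astype(int)

# Step 3: forward-fill, then backfill leading gaps, within each stay
toy[vitals] = toy.groupby(keys)[vitals].ffill()
toy[vitals] = toy.groupby(keys)[vitals].bfill()

# Step 4: a value is "carried" if it exists now but was missing before the fill
for v in vitals:
    toy[f"{v}_carried"] = (toy[v].notna() & (toy[f"{v}_missing"] == 1)).astype(int)

print(toy)
```

Note that backfilled leading values (first row of stay 10) are also flagged as carried, since the flag only records "was missing, now filled".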
diff --git a/notes.md b/notes.md
@@ -427,4 +427,67 @@ Only Step 1 was implemented today; Steps 2–8 remain.
   - Add `_missing` columns for each vital before LOCF.  
   - Confirm flags align with actual NaNs.  
 - If possible, progress into **Step 3 (LOCF imputation)** and **Step 4 (Carried-forward flags)**.  
-- Keep using small previews (`.head()`, `.isna().sum()`) to verify correctness.  
+- Keep using small previews (`.head()`, `.isna().sum()`) to verify correctness.  
+
+---
+
+## Day 5 Notes - Missingness, Carried-Forward Flags & Rolling Features
+
+### Goals
+- Continue building `make_timestamp_features.py` pipeline.  
+- **Extend Step 2 → Step 5**:
+  - **Step 2**: Add missingness flags.
+  - **Step 3**: Apply forward-filling (LOCF).
+  - **Step 4**: Add carried-forward flags.
+  - **Step 5**: Start rolling window features (mean, min, max, std, slope, AUC).  
+
+### What We Did
+#### Step 2: Missingness Flags
+- Implemented `add_missingness_flags(df)` to generate new columns like `respiratory_rate_missing`, `spo2_missing`, etc.  
+- **Logic**: for each vital, `df[v].isna().astype(int)` creates a flag column where `1 = missing` and `0 = observed`.  
+- Called after loading + sorting the CSV with `load_and_sort_data(INPUT_FILE)`.  
+- Verified output by printing `df.head()`.
+#### Step 3: LOCF (Forward- and Back-Fill)
+- Wrote `apply_locf(df)` to handle missing values by carrying the last observed measurement forward (`ffill`) within each patient stay (`groupby(['subject_id', 'stay_id'])`).  
+- Added an extra `.bfill()` so the very first row of each stay (if missing) is backfilled with the next available measurement.  
+- Leaves no missing values for the chosen vitals, except where a vital was never recorded anywhere in a stay (nothing to fill from).
+#### Step 4: Carried-Forward Flags
+- Added `add_carried_forward_flags(df)` to track which values in the filled dataset are real vs imputed.  
+- Used missingness flags from Step 2 as ground truth:  
+  - Carried = `value is not NaN after fill` **AND** `was missing before fill`.  
+- Output = new columns like `respiratory_rate_carried`, `spo2_carried`, etc.  
+- This avoids the problem of falsely marking naturally repeated values as carried-forward.
+#### Step 5: Rolling Features (in progress)
+- Started `add_rolling_features(df)` to compute rolling-window statistics on numeric vitals (`respiratory_rate`, `spo2`, `temperature`, `systolic_bp`, `heart_rate`).  
+- **Window sizes**: 1h, 4h, 24h.  
+- **Stats**: mean, min, max, std, slope (trend), AUC (cumulative exposure).  
+- For each vital × window combination, new feature columns are created, e.g.:
+  - `respiratory_rate_roll1h_mean`  
+  - `spo2_roll24h_slope`  
+- Implemented slope with a simple linear regression on index order; AUC is currently a plain sum of the values in the window (not yet weighted by time).  
+- Still clarifying whether slope/AUC should be computed on true timestamps (`charttime_numeric`) or just index order.  
+
+### Reflections
+#### Challenges
+- **Pandas syntax**:  
+  - Still feels overwhelming, especially with groupby, rolling, and applying custom functions.  
+  - Feels like "watching a chess grandmaster" without yet knowing the moves.  
+- **Redundant flags**:  
+  - Initially thought missingness flags already made carried-forward redundant.  
+  - **Learned they complement each other**: missing = gaps before filling, carried = which values were filled in.
+- **Rolling features**:  
+  - Hard to see how loops systematically build columns.  
+  - `charttime_numeric` looked confusing since we’re not yet using real timestamps in slope/AUC.
+#### Solutions & Learnings
+- Breaking code into **bite-sized functions** helps (e.g., Step 2–4 each modular).  
+- Printing `df.head()` after each step is essential for debugging.  
+- Carried-forward vs missingness flags = subtle but distinct concepts.  
+- Nested loops (`for v in vitals, for w in windows`) → systematic way to generate features.  
+- Recognised unused code (`rolling_features = []`, `charttime_numeric` placeholder).  
+
+### Next Steps
+- Finish **Step 5**:
+  - Decide whether slope/AUC should use real timestamps or simple index order.
+  - Simplify code by removing unused prep.  
+- Validate with small test DataFrame to confirm columns behave as expected.
+- **Move on to Step 6**: **time since last observation** once rolling features are stable.  
diff --git a/src/ml-data-prep/make_timestamp_features.py b/src/ml-data-prep/make_timestamp_features.py
@@ -47,18 +47,143 @@ def load_and_sort_data(input_file: Path) -> pd.DataFrame:
     print("Data loaded and sorted. Sample:")
     print(df.head())
 
-# ------------------------------
-# Step 2: Create missingness flags
-# ------------------------------
+# ----------------------------------------------------
+# Step 2: Create missingness flags before filling
+# ----------------------------------------------------
 def add_missingness_flags(df: pd.DataFrame) -> pd.DataFrame:
     vitals = [
         "respiratory_rate", "spo2", "supplemental_o2",
         "temperature", "systolic_bp", "heart_rate",
-        "level_of_consciousness"
+        "level_of_consciousness", "co2_retainer"
     ]
+    # loops through each vital sign column
     for v in vitals:
-        flag_col = f"{v}_missing"
-        df[flag_col] = df[v].isna().astype(int)
+        flag_col = f"{v}_missing" # name of new flag column created
+        df[flag_col] = df[v].isna().astype(int) # checks if value is NaN, returns a boolean, then converts to int (1 if NaN, else 0)
+                                                # store in new column df[flag_col]
+    return df 
+
+if __name__ == "__main__":
+    df = load_and_sort_data(INPUT_FILE) # df is loaded and sorted, then returned
+    df = add_missingness_flags(df) # add the *_missing columns; the updated df is returned
+    print("Data with missingness flags. Sample:")
+    print(df.head())
+
+# -------------------------------------
+# Step 3: LOCF forward-fill (plus backfill) per stay
+# -------------------------------------
+# Missingness flags already created in step 2 so the ML model knows which values were originally missing
+def apply_locf(df: pd.DataFrame) -> pd.DataFrame:
+    vitals = [
+        "respiratory_rate", "spo2", "supplemental_o2",
+        "temperature", "systolic_bp", "heart_rate",
+        "level_of_consciousness", "co2_retainer"
+    ]
+    # Group by subject_id and stay_id to ensure filling is done within each patient's hospital stay
+    # Then .ffill() and .bfill() are applied inside each group independently.
+    
+    # Forward-fill per subject_id + stay_id
+    # if row missing, fill with last available value
+    df[vitals] = df.groupby(["subject_id", "stay_id"])[vitals].ffill()
+
+    # Also backfill leading missing values (rows before the first observation in each stay)
+    # using the next available value; a vital never recorded in a stay remains NaN
+    df[vitals] = df.groupby(["subject_id", "stay_id"])[vitals].bfill()
+
     return df
 
 
+if __name__ == "__main__":
+    df = load_and_sort_data(INPUT_FILE)
+    df = add_missingness_flags(df)
+    df = apply_locf(df)
+    print("Data after LOCF applied. Sample:")
+    print(df.head(20))
+
+# -------------------------------------
+# Step 4: Create carried-forward flags
+# -------------------------------------
+# Marks which non-NaN values in the final dataset are actually imputed from LOCF instead of observed vitals
+# Using the _missing flags from Step 2 as ground truth, this avoids mislabeling repeated natural values as carried-forward.
+# Missingness flags are before filling, carried-forward flags are after filling
+def add_carried_forward_flags(df: pd.DataFrame) -> pd.DataFrame:
+    vitals = [
+        "respiratory_rate", "spo2", "supplemental_o2",
+        "temperature", "systolic_bp", "heart_rate",
+        "level_of_consciousness", "co2_retainer"
+    ]
+
+    for v in vitals:
+        carried_col = f"{v}_carried" # name of new carried forward flag column
+        missing_col = f"{v}_missing" # name of existing missingness flag column from Step 2
+        
+        # df[v].notna() → checks if the final value in this column is not NaN (so it exists after filling).
+        # (df[missing_col] == 1) → checks if that same row was missing before fill.
+        # & → logical AND operator, so both conditions must be true, if value exists and it was missing before
+        df[carried_col] = (
+            (df[v].notna()) & (df[missing_col] == 1)
+        ).astype(int) # Convert boolean to int (1 if filled by LOCF/backfill, 0 if observed or still NaN)
+
+    return df # Returns the same DataFrame with the new _carried columns added.
+
+if __name__ == "__main__":
+    df = load_and_sort_data(INPUT_FILE)
+    df = add_missingness_flags(df)
+    df = apply_locf(df)
+    df = add_carried_forward_flags(df)
+    print("Data with carried-forward flags. Sample:")
+    print(df.head(20))
+
+# -------------------------------------
+# Step 5: Compute rolling window features
+# -------------------------------------
+# Number of vitals = 5 (respiratory_rate, spo2, temperature, systolic_bp, heart_rate)
+# Number of windows = 3 (1h, 4h, 24h)
+# Number of stats per window = 6 (mean, min, max, std, slope, AUC)
+# 5 vitals x 3 windows x 6 stats = 90 new feature columns
+
+# NumPy is used for numerical operations like slope and AUC calculations
+import numpy as np
+
+def add_rolling_features(df: pd.DataFrame) -> pd.DataFrame:
+    # These 5 vitals will have rolling windows computed (numeric ones only)
+    vitals = ["respiratory_rate", "spo2", "temperature", "systolic_bp", "heart_rate"]
+    # Time window sizes in hours (3 total)
+    windows = [1, 4, 24]
+    # Stats to compute per window (6 total); listed for reference only, not used below
+    stats = ["mean", "min", "max", "std", "slope", "auc"]
+
+    # Convert charttime to numeric timestamp (placeholder: slope/AUC below do not use it yet)
+    df['charttime_numeric'] = df['charttime'].astype('int64') / 1e9  # seconds since epoch
+
+    # Loop through every vital and window size
+    for v in vitals:
+        for w in windows:
+            # Time-based window per stay; `on=` needs the grouped DataFrame, so select the vital after rolling
+            roll = df.groupby(['subject_id', 'stay_id']).rolling(
+                f"{w}h", on='charttime', min_periods=1
+            )[v]
+            # Mean, min, max → capture magnitude.
+            # Std → capture variability.
+            # Slope → capture trend/direction.
+            # AUC → capture cumulative exposure/risk over time.
+
+            # Compute stats
+            df[f"{v}_roll{w}h_mean"] = roll.mean().reset_index(level=[0,1], drop=True)
+            df[f"{v}_roll{w}h_min"] = roll.min().reset_index(level=[0,1], drop=True)
+            df[f"{v}_roll{w}h_max"] = roll.max().reset_index(level=[0,1], drop=True)
+            df[f"{v}_roll{w}h_std"] = roll.std().reset_index(level=[0,1], drop=True)
+
+            # Slope via linear regression on index order (0, 1, 2, ...), not on real timestamps
+            def slope_func(x):
+                if len(x) < 2: return np.nan
+                t = np.arange(len(x))
+                return np.polyfit(t, x, 1)[0]
+            df[f"{v}_roll{w}h_slope"] = roll.apply(slope_func, raw=False).reset_index(level=[0,1], drop=True)
+
+            # AUC proxy: plain sum of values in the window (not yet weighted by delta time)
+            df[f"{v}_roll{w}h_auc"] = roll.apply(lambda x: np.nansum(x), raw=False).reset_index(level=[0,1], drop=True)
+
+    # Drop temporary numeric timestamp
+    df = df.drop(columns=['charttime_numeric'])
+    return df
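
The notes leave open whether slope/AUC should use real timestamps or index order. The sketch below shows what the timestamp-based option might look like for a single stay; it is not part of this commit. It assumes the vital is a Series indexed by `charttime`, and the helpers `hours_since_start`, `time_slope` and `time_auc` are hypothetical names introduced here. Slope is a least-squares fit against elapsed hours, and AUC uses the trapezoidal rule, which is one possible choice and not necessarily the one the pipeline will adopt.

```python
import numpy as np
import pandas as pd

def hours_since_start(idx: pd.DatetimeIndex) -> np.ndarray:
    """Elapsed hours relative to the first timestamp in the window."""
    return (idx - idx[0]).total_seconds().to_numpy() / 3600.0

def time_slope(window: pd.Series) -> float:
    """Least-squares slope in units per hour; NaN with fewer than two points."""
    if len(window) < 2:
        return np.nan
    t = hours_since_start(window.index)
    return float(np.polyfit(t, window.to_numpy(), 1)[0])

def time_auc(window: pd.Series) -> float:
    """Trapezoidal area in unit-hours; 0.0 when there is only one point."""
    if len(window) < 2:
        return 0.0
    t = hours_since_start(window.index)
    y = window.to_numpy()
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(t)))

# Toy, irregularly spaced heart-rate readings for one stay (values are made up)
times = pd.to_datetime(["2025-01-01 00:00", "2025-01-01 01:00", "2025-01-01 03:00"])
hr = pd.Series([60.0, 70.0, 90.0], index=times)

# raw=False passes each window as a Series, so the timestamps are available
roll = hr.rolling("4h", min_periods=1)
slope = roll.apply(time_slope, raw=False)
auc = roll.apply(time_auc, raw=False)

print(pd.DataFrame({"hr": hr, "slope_per_h": slope, "auc": auc}))
```

With index-order regression the same three readings would give a slope of 15 per step, because the 2-hour gap is treated like the 1-hour gap; using elapsed hours gives 10 per hour.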