
Commit 3487b42

Implemented Steps 2–4 (missingness, LOCF, carried-forward flags) and drafted Step 5 rolling features
• Added add_missingness_flags() to create *_missing columns for each vital.
• Implemented apply_locf() with forward-fill + backfill per (subject_id, stay_id).
• Added add_carried_forward_flags() using missingness flags to distinguish true vs imputed values.
• Began add_rolling_features() to compute rolling-window stats (mean, min, max, std, slope, AUC) for 1h/4h/24h windows on numeric vitals.
• Verified output with df.head() checks after each step.
1 parent 4c978eb commit 3487b42

2 files changed

Lines changed: 195 additions & 7 deletions

File tree

notes.md

Lines changed: 64 additions & 1 deletion
@@ -427,4 +427,67 @@ Only Step 1 was implemented today; Steps 2–8 remain.
427427
- Add `_missing` columns for each vital before LOCF.
428428
- Confirm flags align with actual NaNs.
429429
- If possible, progress into **Step 3 (LOCF imputation)** and **Step 4 (Carried-forward flags)**.
430-
- Keep using small previews (`.head()`, `.isna().sum()`) to verify correctness.
430+
- Keep using small previews (`.head()`, `.isna().sum()`) to verify correctness.
431+
432+
---
433+
434+
## Day 5 Notes - Missingness, Carried-Forward Flags & Rolling Features
435+
436+
### Goals
437+
- Continue building `make_timestamp_features.py` pipeline.
438+
- **Extend Step 2 → Step 5**:
439+
- **Step 2**: Add missingness flags.
440+
- **Step 3**: Apply forward-filling (LOCF).
441+
- **Step 4**: Add carried-forward flags.
442+
- **Step 5**: Start rolling window features (mean, min, max, std, slope, AUC).
443+
444+
### What We Did
445+
#### Step 2: Missingness Flags
446+
- Implemented `add_missingness_flags(df)` to generate new columns like `respiratory_rate_missing`, `spo2_missing`, etc.
447+
- **Logic**: for each vital, `df[v].isna().astype(int)` creates a flag column where `1 = missing` and `0 = observed`.
448+
- Called after loading + sorting the CSV with `load_and_sort_data(INPUT_FILE)`.
449+
- Verified output by printing `df.head()`.
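
The flag logic above can be checked on a toy frame. This is a minimal sketch, not the pipeline's own function: `add_missingness_flags_demo` and the toy values are hypothetical, and only the `isna().astype(int)` rule and the `_missing` suffix come from the notes.

```python
import numpy as np
import pandas as pd

# Toy data: column names mirror the notes, values are made up for illustration.
toy = pd.DataFrame({
    "respiratory_rate": [18.0, np.nan, 20.0],
    "spo2": [np.nan, 97.0, 96.0],
})


def add_missingness_flags_demo(df: pd.DataFrame, vitals: list) -> pd.DataFrame:
    """Add a <vital>_missing column (1 = missing, 0 = observed) per vital."""
    out = df.copy()  # leave the caller's frame untouched
    for v in vitals:
        out[f"{v}_missing"] = out[v].isna().astype(int)
    return out


flagged = add_missingness_flags_demo(toy, ["respiratory_rate", "spo2"])

# Confirm flags align with actual NaNs: flag totals should equal NaN counts.
nan_counts = toy.isna().sum()
flag_counts = flagged[["respiratory_rate_missing", "spo2_missing"]].sum()
print(flagged)
```

Comparing `nan_counts` with `flag_counts` is a stricter check than eyeballing `df.head()`, since the first rows may contain no gaps at all.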
450+
#### Step 3: LOCF (Forward- and Back-Fill)
451+
- Wrote `apply_locf(df)` to handle missing values by carrying the last observed measurement forward (`ffill`) within each patient stay (`groupby(['subject_id', 'stay_id'])`).
452+
- Added an extra `.bfill()` so the very first row of each stay (if missing) is backfilled with the next available measurement.
453+
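- Leaves no missing values for the chosen vitals, except where a vital was never recorded during a stay (then there is nothing to fill from and the NaNs remain).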
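
A small sketch of the per-stay fill, using made-up values. It assumes the same grouping keys as the notes (`subject_id`, `stay_id`), and it also shows the one case where NaNs survive: a vital that is missing for an entire stay.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): stay 20 never records heart_rate.
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "stay_id":    [10, 10, 10, 20, 20],
    "heart_rate": [np.nan, 80.0, np.nan, np.nan, np.nan],
    "spo2":       [97.0, np.nan, 95.0, np.nan, 99.0],
})
vitals = ["heart_rate", "spo2"]
keys = ["subject_id", "stay_id"]

filled = toy.copy()
# Forward-fill inside each stay, so values never leak across patients.
filled[vitals] = filled.groupby(keys)[vitals].ffill()
# Back-fill only matters for leading gaps that ffill could not reach.
filled[vitals] = filled.groupby(keys)[vitals].bfill()

# NaNs left over after both fills, per vital.
remaining = filled[vitals].isna().sum()
print(filled)
print(remaining)
```

Here `heart_rate` for stay 20 stays NaN after both fills, so a check with `.isna().sum()` after Step 3 is still worth running on the real data.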
454+
#### Step 4: Carried-Forward Flags
455+
- Added `add_carried_forward_flags(df)` to track which values in the filled dataset are real vs imputed.
456+
- Used missingness flags from Step 2 as ground truth:
457+
- Carried = `value is not NaN after fill` **AND** `was missing before fill`.
458+
- Output = new columns like `respiratory_rate_carried`, `spo2_carried`, etc.
459+
- This avoids the problem of falsely marking naturally repeated values as carried-forward.
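
To illustrate why the Step 2 flags are used as ground truth, the sketch below runs the rule on a single toy stay (values are made up). A genuinely repeated reading of 96 is left unflagged, while filled rows are flagged.

```python
import numpy as np
import pandas as pd

# One toy stay: rows 1 and 2 are real, identical readings; rows 0 and 3 are gaps.
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "stay_id": [10, 10, 10, 10],
    "spo2": [np.nan, 96.0, 96.0, np.nan],
})
keys = ["subject_id", "stay_id"]

# Step 2: record the gaps before any filling happens.
toy["spo2_missing"] = toy["spo2"].isna().astype(int)

# Step 3: forward-fill, then back-fill, within the stay.
toy["spo2"] = toy.groupby(keys)["spo2"].ffill()
toy["spo2"] = toy.groupby(keys)["spo2"].bfill()

# Step 4: carried = has a value now AND was missing before the fill.
toy["spo2_carried"] = (
    toy["spo2"].notna() & (toy["spo2_missing"] == 1)
).astype(int)
print(toy)
```

Note that row 0 is flagged even though it was back-filled rather than carried forward, so with this rule `_carried` effectively means "imputed by either fill".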
460+
#### Step 5: Rolling Features (in progress)
461+
- Started `add_rolling_features(df)` to compute rolling-window statistics on numeric vitals (`respiratory_rate`, `spo2`, `temperature`, `systolic_bp`, `heart_rate`).
462+
- **Window sizes**: 1h, 4h, 24h.
463+
- **Stats**: mean, min, max, std, slope (trend), AUC (cumulative exposure).
464+
- For each vital × window combination, new feature columns are created, e.g.:
465+
- `respiratory_rate_roll1h_mean`
466+
- `spo2_roll24h_slope`
467+
- Implemented slope with a simple linear regression on index order; AUC is currently a plain sum of the values in the window (not yet weighted by the time between readings).
468+
- Still clarifying whether slope/AUC should be computed on true timestamps (`charttime_numeric`) or just index order.
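
For the open question about timestamps versus index order, the sketch below compares the two on irregularly spaced readings. It is one possible approach, not the pipeline's implementation: `slope_per_hour` and `auc_trapezoid` are hypothetical helpers, times are assumed to be seconds since some reference point, and the AUC uses the trapezoid rule.

```python
import numpy as np


def slope_per_hour(values, times_s):
    """Least-squares slope of values against elapsed time, in units per hour."""
    y = np.asarray(values, dtype=float)
    t = np.asarray(times_s, dtype=float)
    if len(y) < 2:
        return np.nan
    hours = (t - t[0]) / 3600.0
    if hours.max() == hours.min():  # all readings share one timestamp
        return np.nan
    return float(np.polyfit(hours, y, 1)[0])


def auc_trapezoid(values, times_s):
    """Trapezoid-rule area under the curve, in value-hours."""
    y = np.asarray(values, dtype=float)
    hours = np.asarray(times_s, dtype=float) / 3600.0
    if len(y) < 2:
        return 0.0
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(hours)))


# Readings at 0, 10 and 60 minutes that rise by exactly 6 units per hour.
times_s = [0.0, 600.0, 3600.0]
values = [90.0, 91.0, 96.0]

slope_time = slope_per_hour(values, times_s)                     # uses real spacing
slope_index = float(np.polyfit(np.arange(len(values)), values, 1)[0])  # ignores spacing
auc = auc_trapezoid(values, times_s)
plain_sum = float(np.nansum(values))                             # what a bare sum gives
print(slope_time, slope_index, auc, plain_sum)
```

With uneven spacing, the index-order slope is expressed per reading rather than per hour, and a plain sum grows with the number of readings rather than with elapsed time, which is the main argument for switching to real timestamps.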
469+
470+
### Reflections
471+
#### Challenges
472+
- **Pandas syntax**:
473+
- Still feels overwhelming, especially with groupby, rolling, and applying custom functions.
474+
- Feels like "watching a chess grandmaster" without yet knowing the moves.
475+
- **Redundant flags**:
476+
- Initially thought missingness flags already made carried-forward redundant.
477+
- **Learned they complement each other**: missing = gaps before filling, carried = which values were filled in.
478+
- **Rolling features**:
479+
- Hard to see how loops systematically build columns.
480+
- `charttime_numeric` looked confusing since we’re not yet using real timestamps in slope/AUC.
481+
#### Solutions & Learnings
482+
- Breaking code into **bite-sized functions** helps (e.g., Step 2–4 each modular).
483+
- Printing `df.head()` after each step is essential for debugging.
484+
- Carried-forward vs missingness flags = subtle but distinct concepts.
485+
- Nested loops (`for v in vitals, for w in windows`) → systematic way to generate features.
486+
- Recognised unused code (`rolling_features = []`, `charttime_numeric` placeholder).
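
The nested-loop idea can be seen in isolation by generating only the column names. This sketch uses the vitals, windows, stats and naming pattern from the notes; `itertools.product` is just a compact stand-in for the nested `for` loops.

```python
from itertools import product

vitals = ["respiratory_rate", "spo2", "temperature", "systolic_bp", "heart_rate"]
windows = [1, 4, 24]  # hours
stats = ["mean", "min", "max", "std", "slope", "auc"]

# product() walks every (vital, window, stat) combination, the same order
# as three nested for-loops would.
feature_names = [
    f"{v}_roll{w}h_{s}" for v, w, s in product(vitals, windows, stats)
]

print(len(feature_names))
print(feature_names[:3])
```

Listing the names first makes it easy to confirm the expected 5 × 3 × 6 = 90 columns before any statistics are computed.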
487+
488+
### Next Steps
489+
- Finish **Step 5**:
490+
- Decide whether slope/AUC should use real timestamps or simple index order.
491+
- Simplify code by removing unused prep.
492+
- Validate with small test DataFrame to confirm columns behave as expected.
493+
- **Move on to Step 6**: **time since last observation** once rolling features are stable.
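
For the planned validation on a small test DataFrame, one option is a frame whose rolling means can be worked out by hand. The sketch below is an assumption-laden reference check, not the pipeline code: the data are invented, and the per-stay loop is a deliberately simple way to get a 1-hour time-based window that stays aligned with the original row index.

```python
import pandas as pd

# Two toy stays with irregular chart times (hypothetical values).
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "stay_id": [10, 10, 10, 20, 20],
    "charttime": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:30", "2024-01-01 02:00",
        "2024-01-01 00:00", "2024-01-01 00:45",
    ]),
    "heart_rate": [80.0, 90.0, 100.0, 60.0, 70.0],
})

parts = []
for _, stay in toy.groupby(["subject_id", "stay_id"]):
    # Time-based window: each row sees readings from the preceding hour
    # (start exclusive, current row inclusive) within its own stay.
    rolled = stay.rolling(
        pd.Timedelta(hours=1), on="charttime", min_periods=1
    )["heart_rate"].mean()
    parts.append(rolled)  # keeps the original row index

# Index-aligned assignment puts each result back on its source row.
toy["heart_rate_roll1h_mean"] = pd.concat(parts)
print(toy)
```

Worked by hand, the expected means are 80, 85, 100 for stay 10 (the 02:00 reading has no neighbours within the hour) and 60, 65 for stay 20, so any leakage between stays or misaligned rows shows up immediately.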

src/ml-data-prep/make_timestamp_features.py

Lines changed: 131 additions & 6 deletions
@@ -47,18 +47,143 @@ def load_and_sort_data(input_file: Path) -> pd.DataFrame:
4747
print("Data loaded and sorted. Sample:")
4848
print(df.head())
4949

50-
# ------------------------------
51-
# Step 2: Create missingness flags
52-
# ------------------------------
50+
# ----------------------------------------------------
51+
# Step 2: Create missingness flags before filling
52+
# ----------------------------------------------------
5353
def add_missingness_flags(df: pd.DataFrame) -> pd.DataFrame:
5454
    vitals = [
5555
        "respiratory_rate", "spo2", "supplemental_o2",
5656
        "temperature", "systolic_bp", "heart_rate",
57-
        "level_of_consciousness"
57+
        "level_of_consciousness", "co2_retainer"
5858
    ]
59+
    # loops through each vital sign column
5960
    for v in vitals:
60-
        flag_col = f"{v}_missing"
61-
        df[flag_col] = df[v].isna().astype(int)
61+
        flag_col = f"{v}_missing" # name of new flag column created
62+
        df[flag_col] = df[v].isna().astype(int) # checks if value is NaN, returns a boolean, then converts to int (1 if NaN, else 0)
63+
        # store in new column df[flag_col]
64+
    return df
65+
66+
if __name__ == "__main__":
67+
    df = load_and_sort_data(INPUT_FILE) # df is loaded and sorted, then returned
68+
    df = add_missingness_flags(df) # adds the *_missing flag columns and returns the updated df
69+
    print("Data with missingness flags. Sample:")
70+
    print(df.head())
71+
72+
# -------------------------------------
73+
# Step 3: LOCF forward-fill per subject
74+
# -------------------------------------
75+
# Missingness flags already created in step 2 so the ML model knows which values were originally missing
76+
def apply_locf(df: pd.DataFrame) -> pd.DataFrame:
77+
    vitals = [
78+
        "respiratory_rate", "spo2", "supplemental_o2",
79+
        "temperature", "systolic_bp", "heart_rate",
80+
        "level_of_consciousness", "co2_retainer"
81+
    ]
82+
    # Group by subject_id and stay_id to ensure filling is done within each patient's hospital stay
83+
    # Then .ffill() and .bfill() are applied inside each group independently.
84+
85+
    # Forward-fill per subject_id + stay_id
86+
    # if row missing, fill with last available value
87+
    df[vitals] = df.groupby(["subject_id", "stay_id"])[vitals].ffill()
88+
89+
    # Also backfill the very first missing values (first row per patient)
90+
    # if first row missing, fill with next available value
91+
    df[vitals] = df.groupby(["subject_id", "stay_id"])[vitals].bfill()
92+
6293
    return df
6394

6495

96+
if __name__ == "__main__":
97+
    df = load_and_sort_data(INPUT_FILE)
98+
    df = add_missingness_flags(df)
99+
    df = apply_locf(df)
100+
    print("Data after LOCF applied. Sample:")
101+
    print(df.head(20))
102+
103+
# -------------------------------------
104+
# Step 4: Create carried-forward flags
105+
# -------------------------------------
106+
# Marks which non-NaN values in the final dataset are actually imputed from LOCF instead of observed vitals
107+
# Using the _missing flags from Step 2 as ground truth, this avoids mislabeling repeated natural values as carried-forward.
108+
# Missingness flags are before filling, carried-forward flags are after filling
109+
def add_carried_forward_flags(df: pd.DataFrame) -> pd.DataFrame:
110+
    vitals = [
111+
        "respiratory_rate", "spo2", "supplemental_o2",
112+
        "temperature", "systolic_bp", "heart_rate",
113+
        "level_of_consciousness", "co2_retainer"
114+
    ]
115+
116+
    for v in vitals:
117+
        carried_col = f"{v}_carried" # name of new carried forward flag column
118+
        missing_col = f"{v}_missing" # name of existing missingness flag column from Step 2
119+
120+
        # df[v].notna() → checks if the final value in this column is not NaN (so it exists after filling).
121+
        # (df[missing_col] == 1) → checks if that same row was missing before fill.
122+
        # & → logical AND operator, so both conditions must be true, if value exists and it was missing before
123+
        df[carried_col] = (
124+
            (df[v].notna()) & (df[missing_col] == 1)
125+
        ).astype(int) # Convert boolean to int (1 if carried forward (LOCF), 0 if observed naturally)
126+
127+
    return df # Returns the same DataFrame with the new _carried columns added.
128+
129+
if __name__ == "__main__":
130+
    df = load_and_sort_data(INPUT_FILE)
131+
    df = add_missingness_flags(df)
132+
    df = apply_locf(df)
133+
    df = add_carried_forward_flags(df)
134+
    print("Data with carried-forward flags. Sample:")
135+
    print(df.head(20))
136+
137+
# -------------------------------------
138+
# Step 5: Compute rolling window features
139+
# -------------------------------------
140+
# Number of vitals = 5 (respiratory_rate, spo2, temperature, systolic_bp, heart_rate)
141+
# Number of windows = 3 (1h, 4h, 24h)
142+
# Number of stats per window = 6 (mean, min, max, std, slope, AUC)
143+
# 5 vitals x 3 windows x 6 stats = 90 new feature columns per row
144+
145+
# NumPy is used for numerical operations like slope and AUC calculations
146+
import numpy as np
147+
148+
def add_rolling_features(df: pd.DataFrame) -> pd.DataFrame:
149+
    # These 5 vitals will have rolling windows computed (numeric ones only)
150+
    vitals = ["respiratory_rate", "spo2", "temperature", "systolic_bp", "heart_rate"]
151+
    # Time window sizes in hours (3 total)
152+
    windows = [1, 4, 24]
153+
    # Stats to compute per window (6 total)
154+
    stats = ["mean", "min", "max", "std", "slope", "auc"]
155+
156+
    # Convert charttime to numeric timestamp (placeholder: not yet used by slope/AUC below)
157+
    df['charttime_numeric'] = df['charttime'].astype('int64') / 1e9 # seconds since epoch
158+
159+
    # Loop through every vital and window size
160+
    for v in vitals:
161+
        for w in windows:
162+
            # Time-based window per stay. 'charttime' stays in the selection because on= needs a DataFrame column.
163+
            roll = df.groupby(['subject_id', 'stay_id'])[['charttime', v]].rolling(
164+
                pd.Timedelta(hours=w), on='charttime', min_periods=1
165+
            )
166+
            # Mean, min, max → capture magnitude.
167+
            # Std → capture variability.
168+
            # Slope → capture trend/direction.
169+
            # AUC → capture cumulative exposure/risk over time.
170+
171+
            # Compute stats (select column v, then drop the group levels to realign with df's index)
172+
            df[f"{v}_roll{w}h_mean"] = roll.mean()[v].reset_index(level=[0,1], drop=True)
173+
            df[f"{v}_roll{w}h_min"] = roll.min()[v].reset_index(level=[0,1], drop=True)
174+
            df[f"{v}_roll{w}h_max"] = roll.max()[v].reset_index(level=[0,1], drop=True)
175+
            df[f"{v}_roll{w}h_std"] = roll.std()[v].reset_index(level=[0,1], drop=True)
176+
177+
            # Slope via linear regression on index order (simple approach)
178+
            def slope_func(x):
179+
                if len(x) < 2: return np.nan
180+
                t = np.arange(len(x))
181+
                return np.polyfit(t, x, 1)[0]
182+
            df[f"{v}_roll{w}h_slope"] = roll.apply(slope_func, raw=False)[v].reset_index(level=[0,1], drop=True)
183+
184+
            # AUC placeholder: plain sum of values in the window (not yet weighted by delta time)
185+
            df[f"{v}_roll{w}h_auc"] = roll.apply(lambda x: np.nansum(x), raw=False)[v].reset_index(level=[0,1], drop=True)
186+
187+
    # Drop temporary numeric timestamp
188+
    df = df.drop(columns=['charttime_numeric'])
189+
    return df
