Commit e2449bc

Day 2: NEWS2 pipeline updates, bug fixes, and defensive coding
- Fixed supplemental O2 KeyError
- Added GCS and CO2 merges safely
- Updated notes.md with Day 2 summary
- Ensured no duplicate columns and all vitals present
1 parent 58468c4 commit e2449bc

9 files changed

Lines changed: 27337 additions & 15 deletions

data/news2_vitals_with_co2.csv

Lines changed: 6011 additions & 0 deletions
Large diffs are not rendered by default.

news2_patient_summary.csv

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+subject_id,min_news2_score,max_news2_score,mean_news2_score,median_news2_score,total_records
+10000032,2,6,3.764705882352941,4.0,17
+10001217,2,8,3.7205882352941178,4.0,68
+10001725,2,4,2.5428571428571427,2.0,35
+10002428,2,12,4.3402625820568925,4.0,914
+10002495,2,11,3.780141843971631,3.0,282
+10002930,2,7,2.8412698412698414,2.0,63
+10003046,2,11,5.3896103896103895,5.0,77
+10003400,2,12,3.743318831572405,3.0,1609
+10004235,2,11,5.098265895953757,5.0,173
+10004422,2,9,3.485074626865672,3.0,268
+10004457,2,8,3.7560975609756095,3.0,41
+10004720,2,10,2.911864406779661,2.0,295
+10004733,2,10,3.7960526315789473,3.0,304
+10005348,2,8,3.4065934065934065,3.0,91
+10005817,2,13,4.080491132332878,4.0,733
+10005866,2,11,4.75,5.0,72
+10005909,2,8,3.4404761904761907,3.0,168
+10006053,2,13,5.654545454545454,5.0,55
+10006580,2,6,3.52,3.0,25
+10007058,2,5,2.330188679245283,2.0,106
+10007795,2,8,5.225806451612903,6.0,31
+10007818,2,12,4.198871650211566,4.0,709
+10007928,2,10,4.612244897959184,5.0,98
+10008287,2,5,2.6136363636363638,2.5,44
+10008454,2,10,4.024242424242424,4.0,165
+10009035,2,9,4.8,5.0,35
+10009049,2,8,4.65,5.0,40
+10009628,2,12,3.929411764705882,3.0,85
+10010471,2,10,2.865079365079365,2.0,252
+10010867,2,11,3.55699481865285,2.0,386
+10011398,2,7,3.1578947368421053,3.0,38
+10012552,2,7,3.0454545454545454,2.0,110
+10012853,2,10,3.4831460674157304,2.0,178
+10013049,2,5,2.9047619047619047,2.0,42
+10014078,2,10,4.223076923076923,4.0,130
+10014354,2,10,2.8333333333333335,2.0,348
+10014729,2,6,3.2588235294117647,3.0,85
+10015272,2,10,4.061538461538461,4.0,65
+10015860,2,7,3.4727272727272727,3.0,55
+10015931,2,11,4.059139784946237,4.0,372
+10016150,2,6,3.1666666666666665,2.0,12
+10016742,2,9,2.796133567662566,2.0,569
+10016810,2,7,3.4948453608247423,3.0,97
+10017492,2,8,3.227272727272727,2.5,44
+10018081,2,11,4.679702048417132,5.0,537
+10018328,2,9,2.786279683377309,2.0,379
+10018423,2,9,3.7333333333333334,3.0,45
+10018501,2,5,2.608108108108108,2.0,74
+10018845,2,6,2.652173913043478,2.0,46
+10019003,2,11,4.270983213429257,4.0,417
+10019172,2,7,2.952,2.0,125
+10019385,2,8,3.2222222222222223,3.0,63
+10019568,2,9,3.046511627906977,2.0,43
+10019777,2,8,2.5751445086705202,2.0,346
+10019917,2,7,2.953488372093023,2.0,43
+10020187,2,9,2.7524271844660193,2.0,206
+10020306,2,11,5.029411764705882,5.0,102
+10020640,2,9,5.2105263157894735,5.0,76
+10020740,2,12,4.7883522727272725,5.0,704
+10020786,2,11,5.11864406779661,5.0,59
+10020944,2,10,3.8657243816254416,4.0,283
+10021118,2,8,3.3043478260869565,3.0,69
+10021312,2,9,3.2777777777777777,2.0,54
+10021487,2,12,5.414721723518851,5.0,557
+10021666,2,9,3.962962962962963,4.0,27
+10021938,2,5,2.3917525773195876,2.0,97
+10022017,2,10,3.349344978165939,2.0,229
+10022041,2,9,3.676056338028169,4.0,71
+10022281,2,7,3.186046511627907,3.0,43
+10022880,2,9,3.0,2.5,50
+10023117,2,13,4.530537830446673,4.0,1097
+10023239,2,10,4.026315789473684,3.0,190
+10023771,2,8,3.25531914893617,2.0,94
+10024043,2,6,2.7857142857142856,2.0,84
+10025463,2,10,4.521739130434782,5.0,23
+10025612,2,11,4.724444444444444,5.0,225
+10026255,2,11,4.577464788732394,5.0,71
+10026354,2,7,3.7674418604651163,4.0,43
+10026406,2,6,3.3684210526315788,3.0,19
+10027445,2,10,4.263513513513513,4.0,148
+10027602,2,13,5.658536585365853,5.0,451
+10029291,2,10,3.888468809073724,4.0,529
+10029484,2,7,4.761904761904762,5.0,21
+10031404,2,9,4.238095238095238,3.0,63
+10031757,2,11,3.973913043478261,3.0,115
+10032725,2,8,3.311111111111111,2.0,90
+10035185,2,6,3.142857142857143,3.0,42
+10035631,2,12,4.790496760259179,5.0,463
+10036156,2,7,3.025,3.0,40
+10037861,2,14,5.581818181818182,5.0,330
+10037928,2,8,3.8524590163934427,3.0,61
+10037975,2,14,5.247191011235955,5.0,178
+10038081,2,10,3.8073394495412844,3.0,218
+10038933,2,13,5.252631578947368,5.0,190
+10038992,2,10,3.8666666666666667,4.0,45
+10038999,2,11,4.049180327868853,3.0,366
+10039708,2,12,4.27789046653144,4.0,986
+10039831,2,8,4.931034482758621,5.0,58
+10039997,2,9,4.304347826086956,4.0,46
+10040025,2,10,2.7676767676767677,2.0,396
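
The columns in this summary file suggest a per-patient aggregation of the per-timestamp scores. A minimal sketch of how such a summary could be produced with pandas follows; the tiny `scores` frame is made-up illustration data, and reading the real input from `news2_scores.csv` is left out. This is not necessarily how `compute_news2.py` does it.

```python
import pandas as pd

# Hypothetical per-timestamp scores; the real input would be news2_scores.csv.
scores = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "news2_score": [2, 4, 6, 3, 5],
})

# Named aggregation yields one row per patient with the summary column names.
summary = (
    scores.groupby("subject_id")["news2_score"]
    .agg(
        min_news2_score="min",
        max_news2_score="max",
        mean_news2_score="mean",
        median_news2_score="median",
        total_records="count",  # counts non-null scores per patient
    )
    .reset_index()
)

print(summary.to_csv(index=False))
```

Using `"count"` ignores rows with a missing score; `"size"` would count every row instead, which matters if some timestamps could not be scored.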

news2_scores.csv

Lines changed: 20814 additions & 0 deletions
Large diffs are not rendered by default.

notes.md

Lines changed: 103 additions & 8 deletions
@@ -2,7 +2,7 @@
 
 ## Day 1: NEWS2 Data Extraction and Preliminary Scoring
 
-## Pipeline Overview
+### Pipeline Overview
 
 Raw CSVs (chartevents.csv, etc.)
 
@@ -24,13 +24,13 @@ compute_news2.py
 
 Final NEWS2 scores per patient
 
-## Goals
+### Goals
 - Extract relevant vital signs from PhysioNet.org MIMIC-IV Clinical Database Demo synthetic dataset for NEWS2 scoring.
 - Identify and flag CO₂ retainers to ensure accurate oxygen scoring.
 - Implement basic NEWS2 scoring functions in Python.
 - Establish a pipeline that is extendable for future interoperability with real-world clinical data.
 
-## What We Did
+### What We Did
 1. **Dataset Preparation**
 - Downloaded synthetic dataset `mimic-iv-clinical-database-demo-2.2.zip` and unzipped CSV files.
 - Explored `chartevents.csv` and other relevant CSVs to identify required vitals.
@@ -56,7 +56,7 @@ Final NEWS2 scores per patient
 - Functions to compute individual vital scores
 - Pandas used to process CSV and calculate total NEWS2 scores
 
-## Reflections
+### Reflections
 - **Challenges:**
 - Understanding GCS scoring and mapping three separate components to level of consciousness.
 - Determining FiO₂ representation in dataset (0.21 vs. 21%).
@@ -71,23 +71,23 @@ Final NEWS2 scores per patient
 - Tuples `(min, max, score)` provide flexible, readable threshold definitions for each vital.
 - CO₂ retainer pipeline ensures accurate NEWS2 oxygen scoring now and for future datasets.
 
-## Issues Encountered
+### Issues Encountered
 - Confusion around GCS mapping and timestamp alignment.
 - Initial uncertainty about FiO₂ and temperature units.
 - Need to verify CO₂ retainer thresholds and data format.
 - Feeling overwhelmed by the complexity of clinical data pipelines and Python functions.
 
-## Lessons Learned
+### Lessons Learned
 - Extracting and standardising clinical data is a critical and time-consuming first step.
 - Structuring data in CSVs with consistent headers simplifies downstream processing.
 - Python dictionaries and tuple-based thresholds are powerful for flexible clinical scoring functions.
 - Documenting assumptions (temperature units, FiO₂ thresholds) is essential for reproducibility.
 
-## Future Interoperability Considerations
+### Future Interoperability Considerations
 - Pipeline designed to support ingestion of FHIR-based EHR data for future integration.
 - Potential extension: map standardized FHIR resources to predictive EWS pipeline for real-world applicability.
 
-## CO₂ Retainer Validation and NEWS2 Scoring Documentation
+### CO₂ Retainer Validation and NEWS2 Scoring Documentation
 1. **Objective:** Identify CO₂ retainers to ensure correct oxygen scoring.
 2. **Methodology:**
 - All ABG measurements in `chartevents.csv` examined.
@@ -98,3 +98,98 @@ Final NEWS2 scores per patient
 4. **Future-proofing:**
 - CO₂ retainer thresholds remain documented in code.
 - Future datasets will automatically flag and score retainers according to NEWS2 rules.
+
+
+
+## Day 2: NEWS2 Pipeline Development
+
+### Goals
+- Finalise **Phase 1** of the NEWS2 scoring pipeline.
+- Ensure robust extraction, computation, and output of NEWS2 scores from raw vital sign data.
+- Handle missing data, standardise column names, and prevent errors caused by merges or absent measurements.
+- Create clean, wide-format CSV outputs: `news2_scores.csv` (per-timestamp) and `news2_patient_summary.csv` (per-patient summary).
+
+### What We Did
+1. **Updated `extract_news2_vitals.py`:**
+   - Included missing `systolic_bp` itemids.
+   - Added alternate names for vitals to capture all relevant measurements.
+   - Produced `news2_vitals_with_co2.csv` with columns:
+     `subject_id, hadm_id, stay_id, caregiver_id, charttime, storetime, itemid, value, valuenum, valueuom, warning, label, co2_retainer`.
+2. **Updated `compute_news2.py`:**
+   - Pivoted long-format vitals to wide format, ensuring all **expected NEWS2 vitals** (`respiratory_rate, spo2, supplemental_o2, temperature, systolic_bp, heart_rate`) exist as columns.
+   - Safely merged **GCS components**, computing `gcs_total` and `level_of_consciousness`.
+   - Safely merged **CO₂ retainer** information.
+   - Fixed **supplemental O₂** issues:
+     - Checked if column exists before filling.
+     - Filled missing rows with `0` (Room air).
+     - Only merged if not already present to prevent duplication.
+
+3. **Handled duplicates and column conflicts:**
+   - Avoided `_x` / `_y` suffixes by careful merge logic:
+     - GCS merge only added `gcs_total` and `level_of_consciousness`.
+     - Supplemental O₂ merged only if missing.
+     - CO₂ retainer merge ensured no overlap.
+
+4. **Added human-readable labels:**
+   - `consciousness_label`, `co2_retainer_label`, `supplemental_o2_label`.
+   - Ensured columns exist before applying transformations to prevent KeyErrors.
+   - The label assignment is deliberately redundant (it appears in two blocks) to **ensure the script runs safely** even if:
+     - `level_of_consciousness` is missing (no GCS rows for some patients)
+     - `co2_retainer` is missing
+     - `supplemental_o2` is missing
+   - Purpose:
+     - Guarantee idempotency
+     - Prevent KeyErrors
+     - Keep CSV outputs consistent and complete
+
+5. **Computed NEWS2 scores per row:**
+   - Applied scoring rules for each vital.
+   - Calculated `news2_score`, `risk`, `monitoring_freq`, and `response`.
+   - Validated that **no scores exceeded 20**.
+
+6. **Created outputs:**
+   - `news2_scores.csv` – full dataset with scores and all vital measurements.
+   - `news2_patient_summary.csv` – per-patient summary with `min_news2_score, max_news2_score, mean_news2_score, median_news2_score, total_records`.
+
+7. **Implemented defensive coding & sanity checks:**
+   - Missing vitals counted per row (`missing_vitals` column).
+   - All merges and transformations check column existence.
+   - Default values (0 or False) used for missing data to maintain dataset integrity.
+
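
The pivot step in item 2 can be sketched as follows. This is a minimal illustration, not the code in `compute_news2.py`: the sample rows are invented, and it assumes `label` already holds the canonical vital names and `valuenum` the numeric value. `reindex` is one way to guarantee that every expected vital exists as a column even when the data never recorded it.

```python
import pandas as pd

EXPECTED_VITALS = [
    "respiratory_rate", "spo2", "supplemental_o2",
    "temperature", "systolic_bp", "heart_rate",
]

# Hypothetical long-format input: one measurement per row.
long_df = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "charttime": ["2180-01-01 08:00"] * 3 + ["2180-01-01 09:00"],
    "label": ["heart_rate", "spo2", "respiratory_rate", "heart_rate"],
    "valuenum": [88.0, 96.0, 18.0, 110.0],
})

# One row per (patient, timestamp), one column per vital.
wide = long_df.pivot_table(
    index=["subject_id", "charttime"],
    columns="label",
    values="valuenum",
    aggfunc="first",
)

# Add any expected vital that never appeared (filled with NaN) and fix the column order.
wide = wide.reindex(columns=EXPECTED_VITALS).reset_index()
wide.columns.name = None

# Per the notes: no recorded supplemental O2 is treated as 0 (room air).
wide["supplemental_o2"] = wide["supplemental_o2"].fillna(0).astype(int)

print(wide)
```

Because the column set is fixed up front, later scoring code can index `wide["temperature"]` without a `KeyError`, even for datasets with no temperature rows.
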
+### Reflections
+**Challenges**:
+- KeyError on `supplemental_o2` when merging due to missing FiO₂ measurements.
+- Duplicate columns (`_x`, `_y`) after merges.
+- Missing GCS components for some patients.
+- Missing NEWS2 vitals in pivot.
+**Solutions & Learnings** (in the same order as the challenges):
+- Conditional merge and default fill (0). Always check column existence before accessing or transforming it in merged datasets.
+- Merge only necessary columns, avoid re-merging existing ones. Thoughtful merge design prevents downstream confusion and simplifies CSV outputs.
+- Added missing columns with `pd.NA` and computed `gcs_total` safely. Defensive coding is critical when working with real-world clinical data.
+- Added all expected vitals as NA before merges. Preemptive handling of expected columns reduces errors during scoring.
+
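
The GCS-related solutions above can be sketched roughly like this. Everything here is an assumption for illustration: the component column names (`gcs_eye`, `gcs_verbal`, `gcs_motor`), the helper `merge_gcs`, and the mapping of `gcs_total` to `level_of_consciousness` (15 gives 0, anything lower gives 3, unknown defaults to 0). The point is the shape of the merge: add absent columns as `pd.NA`, merge only the two output columns, and drop them first so a rerun cannot create `_x` / `_y` suffixes.

```python
import pandas as pd

KEYS = ["subject_id", "charttime"]
GCS_COMPONENTS = ["gcs_eye", "gcs_verbal", "gcs_motor"]  # hypothetical names
GCS_OUTPUTS = ["gcs_total", "level_of_consciousness"]


def merge_gcs(wide: pd.DataFrame, gcs: pd.DataFrame) -> pd.DataFrame:
    """Left-merge GCS totals onto the wide vitals table without duplicate columns."""
    gcs = gcs.copy()
    # Add any component column that is absent so the sum below cannot raise KeyError.
    for col in GCS_COMPONENTS:
        if col not in gcs.columns:
            gcs[col] = pd.NA
    parts = gcs[GCS_COMPONENTS].apply(pd.to_numeric, errors="coerce")
    # min_count: the total is NaN unless all three components are present.
    gcs["gcs_total"] = parts.sum(axis=1, min_count=len(GCS_COMPONENTS))
    # Assumed mapping: 15 -> 0, below 15 -> 3; NaN compares False, so unknown -> 0.
    gcs["level_of_consciousness"] = (gcs["gcs_total"] < 15).astype(int) * 3
    # Drop earlier outputs so repeated runs do not produce _x / _y columns.
    wide = wide.drop(columns=GCS_OUTPUTS, errors="ignore")
    merged = wide.merge(gcs[KEYS + GCS_OUTPUTS], on=KEYS, how="left")
    # Timestamps without any GCS row get the default of 0.
    merged["level_of_consciousness"] = (
        merged["level_of_consciousness"].fillna(0).astype(int)
    )
    return merged


wide = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "charttime": ["t1", "t2", "t3"],
    "heart_rate": [80, 95, 70],
})
gcs = pd.DataFrame({
    "subject_id": [1, 2],
    "charttime": ["t1", "t2"],
    "gcs_eye": [4, 3],
    "gcs_verbal": [5, 4],
    "gcs_motor": [6, 5],
})
once = merge_gcs(wide, gcs)
twice = merge_gcs(once, gcs)  # same columns as `once`
print(twice)
```

Defaulting an unknown consciousness level to 0 keeps the pipeline running but can understate risk, so it may be worth flagging such rows rather than silently scoring them.
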
+### Issues Encountered
+- Missing itemids in `extract_news2_vitals.py`.
+- KeyError when accessing non-existent supplemental O₂ or GCS columns.
+- Duplicate columns after merging GCS and supplemental O₂.
+- Variations in vital naming and units.
+- Some timestamps had missing vital measurements.
+
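
One way to handle the naming and unit variations listed above is a small alias table applied during extraction. The sketch below is illustrative only: the alias entries and the `normalise` helper are hypothetical, not the mapping used in `extract_news2_vitals.py`, and the raw labels should be checked against the dataset's own item dictionary.

```python
# Illustrative alias table: lower-cased raw label -> (canonical name, unit tag).
ALIASES = {
    "heart rate": ("heart_rate", None),
    "respiratory rate": ("respiratory_rate", None),
    "o2 saturation pulseoxymetry": ("spo2", None),
    "temperature celsius": ("temperature", "C"),
    "temperature fahrenheit": ("temperature", "F"),
    "non invasive blood pressure systolic": ("systolic_bp", None),
    "arterial blood pressure systolic": ("systolic_bp", None),
}


def normalise(label, value):
    """Return (canonical_name, value) or None when the label is not a NEWS2 vital."""
    key = label.strip().lower()
    if key not in ALIASES:
        return None
    name, unit = ALIASES[key]
    if unit == "F":
        # NEWS2 temperature bands are in Celsius, so convert Fahrenheit readings.
        value = (value - 32.0) * 5.0 / 9.0
    return name, value


print(normalise("Heart Rate", 88))
print(normalise("Temperature Fahrenheit", 98.6))
print(normalise("Unrelated Label", 1))
```

Keeping the aliases in one table makes it easy to add an itemid or label that was missed, which was one of the issues noted above.
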
+### Lessons Learned
+- Always **validate column existence** before transformations or merges.
+- Merge only necessary columns to prevent duplicates.
+- Filling missing data with safe defaults ensures pipeline stability.
+- Defensive coding allows robust handling of incomplete real-world datasets.
+- Maintaining clean, standardised column names simplifies both computation and human-readable output.
+
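
As a small illustration of validating column existence before a transformation, the sketch below counts missing vitals per row without assuming any particular column is present. The helper name `count_missing_vitals` and the sample frame are hypothetical; the notes only state that a `missing_vitals` column is produced, and the count would need to run before any default fill.

```python
import pandas as pd

EXPECTED_VITALS = [
    "respiratory_rate", "spo2", "supplemental_o2",
    "temperature", "systolic_bp", "heart_rate",
]


def count_missing_vitals(df: pd.DataFrame) -> pd.Series:
    """Per-row count of expected vitals that are NaN or whose column is absent."""
    present = [c for c in EXPECTED_VITALS if c in df.columns]
    absent = len(EXPECTED_VITALS) - len(present)  # whole columns that do not exist
    return df[present].isna().sum(axis=1) + absent


# Hypothetical rows: only two of the six expected vitals exist as columns.
df = pd.DataFrame({"heart_rate": [80.0, None], "spo2": [97.0, 95.0]})
df["missing_vitals"] = count_missing_vitals(df)
print(df)
```
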
+### Extra Considerations / Documentation Points
+- The pipeline now fully supports **Phase 1** outputs and can be run repeatedly on updated CSVs.
+- All merges are idempotent – repeated runs will not create duplicates.
+- All human-readable labels (`consciousness_label`, `co2_retainer_label`, `supplemental_o2_label`) are always generated.
+- **Defensive coding for human-readable labels**:
+  - Two blocks exist in the code assigning `consciousness_label`, `co2_retainer_label`, and `supplemental_o2_label`.
+  - Redundancy ensures the script runs safely even if some columns are missing (`level_of_consciousness`, `co2_retainer`, `supplemental_o2`).
+  - Guarantees idempotency and prevents KeyErrors on incomplete datasets.
+  - Possible improvement: combine both into a single block that creates defaults and assigns labels in one step.
+- Outputs `news2_scores.csv` and `news2_patient_summary.csv` are fully consistent with the pipeline’s intended design.
+- Next steps (Phase 2) could include visualisation, predictive modelling, or integrating NEWS2 trajectories into a dashboard.
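
The single-block idea for the labels could look roughly like the sketch below. It is an assumption-laden illustration: the `LABEL_SPECS` table, the `add_labels` helper, and all label texts except "Room air" (which the notes give for `0`) are hypothetical. Each source column is created with its default if missing, remaining gaps are filled, and the label is derived in the same pass, so running it twice gives the same result.

```python
import pandas as pd

# (source column, default value, label column, value -> text). Label texts are illustrative.
LABEL_SPECS = [
    ("level_of_consciousness", 0, "consciousness_label", {0: "Alert", 3: "Not alert"}),
    ("co2_retainer", False, "co2_retainer_label", {False: "No", True: "Yes"}),
    ("supplemental_o2", 0, "supplemental_o2_label", {0: "Room air", 1: "Supplemental O2"}),
]


def add_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Create defaults for missing source columns and assign labels in one step."""
    out = df.copy()
    for src, default, label_col, mapping in LABEL_SPECS:
        if src not in out.columns:
            out[src] = default          # whole column absent: use the default
        out[src] = out[src].fillna(default)  # individual gaps: use the default
        out[label_col] = out[src].map(mapping)
    return out


# Hypothetical input with only one of the three source columns present.
df = pd.DataFrame({"subject_id": [1, 2], "supplemental_o2": [1, 0]})
labelled = add_labels(df)
relabelled = add_labels(labelled)  # second run changes nothing
print(labelled)
```
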

src/api.py

Whitespace-only changes.
