Conversation
When id_col is set, apply fill and gap masking within each entity group rather than globally, so values from one patient/admission cannot propagate into another. Also exclude id_col from numeric_cols to avoid treating it as a feature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: add mimic_iii loader
There was a problem hiding this comment.
Pull request overview
This PR enhances temporal imputation robustness (especially forward/backward fill with max-gap masking), improves MIMIC ingestion schema/parse behavior (including Parquet handling), adds a new MIMIC-III loader, and substantially expands ingestion test coverage for FHIR and MIMIC loaders.
Changes:
- Refactors
Imputer.transformforward/backward fill to preserve input row order during gap masking and avoid user-column collisions via a UUID sentinel. - Adjusts
MimicLoader._load_tableto validate/parse datetimes only when appropriate (notably for Parquet and non-chunked reads). - Adds
MimicIIILoaderand extensive new tests for FHIR/MIMIC ingestion and temporal windowing/imputation edge cases.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
clinops/temporal/imputation.py |
Sentinel-based order preservation + refactored gap-masking fill path. |
clinops/ingest/mimic.py |
Refined schema validation/datetime parsing conditions around Parquet and chunking. |
clinops/ingest/mimic_iii.py |
New MIMIC-III loader (flat dir, uppercase columns normalization, ICD-9 behavior). |
clinops/ingest/__init__.py |
Exposes MimicIIILoader in the public ingest API. |
tests/test_temporal/test_temporal.py |
Adds targeted tests for pivot behavior and many imputer edge cases (gaps, ordering, grouping). |
tests/test_ingest/test_ingest.py |
Adds expanded FHIRLoader tests (LOINC filtering, conditions edge cases, JSON/NDJSON/JSONL parsing). |
tests/test_ingest/test_mimic_tables.py |
Expands MimicTableLoader tests (hadm filtering, Parquet loading, missing tables reporting). |
tests/test_ingest/test_mimic_iii.py |
New comprehensive MimicIIILoader test suite with minimal fixtures and error handling checks. |
.gitignore |
Ignores issue-pyOpenSci.md. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| numeric_cols = [ | ||
| c for c in df.select_dtypes(include=[np.number]).columns | ||
| if c != self.id_col | ||
| ] |
There was a problem hiding this comment.
When per_patient=True, fit() populates _patient_means / _patient_medians but transform() only checks _fill_values. This makes a fitted per-patient MEAN/MEDIAN imputer behave as “unfitted” (logs a warning) and falls back to global stats from the transform input, which is incorrect and contradicts the per_patient parameter docs. Consider applying fillna per-group using the stored per-patient stats (and only warning when neither global nor per-patient stats are available).
| df = self._load("prescriptions") | ||
| df = self._filter(df, subject_ids, hadm_ids) | ||
| if drugs is not None: | ||
| pattern = "|".join(drugs) | ||
| df = df[df["drug"].str.contains(pattern, case=False, na=False)] | ||
| return df |
There was a problem hiding this comment.
prescriptions(drugs=...) builds a single regex via pattern = "|".join(drugs) and uses str.contains. This can (a) change semantics vs MimicLoader.prescriptions() (which does exact case-insensitive matching) despite the stated goal of a shared interface, and (b) produce incorrect matches / regex errors when a drug name contains regex metacharacters. Consider aligning with MimicLoader by using exact matching (e.g., normalizing case and using isin), or if substring matching is intended, escape each term (or use regex=False with a loop) to avoid regex injection.
| else: | ||
| df = pd.read_csv(path, low_memory=False) | ||
|
|
||
| if isinstance(df, pd.DataFrame) and not ( | ||
| self._cfg.chunk_size and table_name in {"chartevents", "labevents"} | ||
| ): | ||
| df = _normalise_columns(df) | ||
| if table_name == "diagnoses_icd" and "icd9_code" in df.columns: | ||
| df["icd9_code"] = df["icd9_code"].astype(str) | ||
| df = self._parse_datetimes(df, table_name) |
There was a problem hiding this comment.
For diagnoses_icd, casting icd9_code to str after read_csv won’t preserve leading zeros if pandas inferred the column as numeric (a common case when codes are all digits). To avoid silently changing codes, consider reading the ICD column as string at parse time (e.g., dtype={"ICD9_CODE": "string"} before normalization, or a converters entry) so values round-trip exactly.
| If set, large tables (``chartevents``, ``labevents``) return a chunked | ||
| reader instead of loading fully into memory. |
There was a problem hiding this comment.
The class docstring says chunk_size causes large tables to “return a chunked reader instead of loading fully into memory”, but _load() currently materializes all chunks into a single DataFrame via pd.concat(...). Either update the documentation to reflect that chunking is only used as a memory-friendly read/processing strategy (but still loads fully), or change the implementation/API to actually return a TextFileReader/iterator like the MIMIC-IV loader docs describe.
| If set, large tables (``chartevents``, ``labevents``) return a chunked | |
| reader instead of loading fully into memory. | |
| If set, large tables (for example ``chartevents`` and ``labevents``) | |
| are read and filtered in chunks to reduce peak memory usage during | |
| loading, but the chunks are still concatenated and returned as a | |
| fully materialized :class:`pandas.DataFrame`. |
This pull request introduces several improvements and fixes across the data ingestion and temporal imputation modules, as well as expands test coverage for FHIR and MIMIC table ingestion. The most significant changes include enhanced handling of forward/backward fill imputation with large time gaps, improved schema validation logic for MIMIC ingestion, and comprehensive new tests for FHIR and MIMIC data loaders.
Temporal Imputation Improvements:
Imputer.transformto use a UUID-based sentinel column, ensuring row order is preserved after group operations and preventing accidental overwriting of user data. The logic now robustly handles cases with amax_gap_hoursthreshold, masking large gaps and restoring NaNs where appropriate. This also includes a refactor to track which values were originally missing before filling. [1] [2] [3] [4] [5]MIMIC Ingestion Enhancements:
MimicLoader._load_tableto ensure that validation and datetime parsing are only performed when appropriate, particularly when chunked reading is not used or when reading from Parquet files.MimicIIILoaderto the public API inclinops.ingest.__init__.py, making it available for import.FHIR and MIMIC Loader Test Coverage:
FHIRLoader, including filtering observations by LOINC code, handling conditions with and without codings, loading from various file formats (JSON, NDJSON, JSONL), and robust handling of malformed or mismatched resource types.MimicTableLoaderto cover filtering by hospital admission IDs, loading from Parquet files, handling missing files, and ensuring the summary correctly reports missing tables with zero rows. [1] [2]