Feat/v0.2 by chaitanyakasaraneni · Pull Request #6 · chaitanyakasaraneni/clinops

chaitanyakasaraneni · 2026-04-08T14:08:03Z

This pull request introduces several improvements and fixes across the data ingestion and temporal imputation modules, as well as expands test coverage for FHIR and MIMIC table ingestion. The most significant changes include enhanced handling of forward/backward fill imputation with large time gaps, improved schema validation logic for MIMIC ingestion, and comprehensive new tests for FHIR and MIMIC data loaders.

Temporal Imputation Improvements:

Enhanced the forward and backward fill strategies in Imputer.transform to use a UUID-based sentinel column, ensuring row order is preserved after group operations and preventing accidental overwriting of user data. The logic now robustly handles cases with a max_gap_hours threshold, masking large gaps and restoring NaNs where appropriate. This also includes a refactor to track which values were originally missing before filling. [1] [2] [3] [4] [5]

MIMIC Ingestion Enhancements:

Improved schema validation in MimicLoader._load_table to ensure that validation and datetime parsing are only performed when appropriate, particularly when chunked reading is not used or when reading from Parquet files.
Added MimicIIILoader to the public API in clinops.ingest.__init__.py, making it available for import.

FHIR and MIMIC Loader Test Coverage:

Added extensive new tests for the FHIRLoader, including filtering observations by LOINC code, handling conditions with and without codings, loading from various file formats (JSON, NDJSON, JSONL), and robust handling of malformed or mismatched resource types.
Expanded tests for MimicTableLoader to cover filtering by hospital admission IDs, loading from Parquet files, handling missing files, and ensuring the summary correctly reports missing tables with zero rows. [1] [2]

When id_col is set, apply fill and gap masking within each entity group rather than globally, so values from one patient/admission cannot propagate into another. Also exclude id_col from numeric_cols to avoid treating it as a feature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: add mimic_iii loader

Copilot

Pull request overview

This PR enhances temporal imputation robustness (especially forward/backward fill with max-gap masking), improves MIMIC ingestion schema/parse behavior (including Parquet handling), adds a new MIMIC-III loader, and substantially expands ingestion test coverage for FHIR and MIMIC loaders.

Changes:

Refactors Imputer.transform forward/backward fill to preserve input row order during gap masking and avoid user-column collisions via a UUID sentinel.
Adjusts MimicLoader._load_table to validate/parse datetimes only when appropriate (notably for Parquet and non-chunked reads).
Adds MimicIIILoader and extensive new tests for FHIR/MIMIC ingestion and temporal windowing/imputation edge cases.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`clinops/temporal/imputation.py`	Sentinel-based order preservation + refactored gap-masking fill path.
`clinops/ingest/mimic.py`	Refined schema validation/datetime parsing conditions around Parquet and chunking.
`clinops/ingest/mimic_iii.py`	New MIMIC-III loader (flat dir, uppercase columns normalization, ICD-9 behavior).
`clinops/ingest/__init__.py`	Exposes `MimicIIILoader` in the public ingest API.
`tests/test_temporal/test_temporal.py`	Adds targeted tests for pivot behavior and many imputer edge cases (gaps, ordering, grouping).
`tests/test_ingest/test_ingest.py`	Adds expanded FHIRLoader tests (LOINC filtering, conditions edge cases, JSON/NDJSON/JSONL parsing).
`tests/test_ingest/test_mimic_tables.py`	Expands MimicTableLoader tests (hadm filtering, Parquet loading, missing tables reporting).
`tests/test_ingest/test_mimic_iii.py`	New comprehensive MimicIIILoader test suite with minimal fixtures and error handling checks.
`.gitignore`	Ignores `issue-pyOpenSci.md`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-08T14:15:23Z

+        numeric_cols = [
+            c for c in df.select_dtypes(include=[np.number]).columns
+            if c != self.id_col
+        ]


When per_patient=True, fit() populates _patient_means / _patient_medians but transform() only checks _fill_values. This makes a fitted per-patient MEAN/MEDIAN imputer behave as “unfitted” (logs a warning) and falls back to global stats from the transform input, which is incorrect and contradicts the per_patient parameter docs. Consider applying fillna per-group using the stored per-patient stats (and only warning when neither global nor per-patient stats are available).

Copilot · 2026-04-08T14:15:23Z

+        df = self._load("prescriptions")
+        df = self._filter(df, subject_ids, hadm_ids)
+        if drugs is not None:
+            pattern = "|".join(drugs)
+            df = df[df["drug"].str.contains(pattern, case=False, na=False)]
+        return df


prescriptions(drugs=...) builds a single regex via pattern = "|".join(drugs) and uses str.contains. This can (a) change semantics vs MimicLoader.prescriptions() (which does exact case-insensitive matching) despite the stated goal of a shared interface, and (b) produce incorrect matches / regex errors when a drug name contains regex metacharacters. Consider aligning with MimicLoader by using exact matching (e.g., normalizing case and using isin), or if substring matching is intended, escape each term (or use regex=False with a loop) to avoid regex injection.

Copilot · 2026-04-08T14:15:23Z

+        else:
+            df = pd.read_csv(path, low_memory=False)
+
+        if isinstance(df, pd.DataFrame) and not (
+            self._cfg.chunk_size and table_name in {"chartevents", "labevents"}
+        ):
+            df = _normalise_columns(df)
+            if table_name == "diagnoses_icd" and "icd9_code" in df.columns:
+                df["icd9_code"] = df["icd9_code"].astype(str)
+            df = self._parse_datetimes(df, table_name)


For diagnoses_icd, casting icd9_code to str after read_csv won’t preserve leading zeros if pandas inferred the column as numeric (a common case when codes are all digits). To avoid silently changing codes, consider reading the ICD column as string at parse time (e.g., dtype={"ICD9_CODE": "string"} before normalization, or a converters entry) so values round-trip exactly.

Copilot · 2026-04-08T14:15:24Z

+        If set, large tables (``chartevents``, ``labevents``) return a chunked
+        reader instead of loading fully into memory.


The class docstring says chunk_size causes large tables to “return a chunked reader instead of loading fully into memory”, but _load() currently materializes all chunks into a single DataFrame via pd.concat(...). Either update the documentation to reflect that chunking is only used as a memory-friendly read/processing strategy (but still loads fully), or change the implementation/API to actually return a TextFileReader/iterator like the MIMIC-IV loader docs describe.

Suggested change

If set, large tables (``chartevents``, ``labevents``) return a chunked

reader instead of loading fully into memory.

If set, large tables (for example ``chartevents`` and ``labevents``)

are read and filtered in chunks to reduce peak memory usage during

loading, but the chunks are still concatenated and returned as a

fully materialized :class:`pandas.DataFrame`.

chaitanyakasaraneni and others added 11 commits March 1, 2026 17:06

feat: add mimic_iii loader

9d05db5

tests: update test coverage and minor bug fix in temporal

bcf1eb1

update: fix suggestions

35b74db

fix: update as per suggestions

1a8e057

fix: update tests as per suggestions

e49c351

bug fix in imputation

cc9d29c

update conditional checks and sort

1c43af7

fix bugs in imputation and mimic-iii

f3cf938

use UUID based sentinel

c4f39de

Merge pull request #4 from chaitanyakasaraneni/feat/MIMIC-III-loader

bd60d3a

feat: add mimic_iii loader

Copilot AI review requested due to automatic review settings April 8, 2026 14:08

chaitanyakasaraneni merged commit f1f86fb into main Apr 8, 2026
2 of 3 checks passed

chaitanyakasaraneni deleted the feat/v0.2 branch April 8, 2026 14:08

Copilot started reviewing on behalf of chaitanyakasaraneni April 8, 2026 14:08 View session

Copilot AI reviewed Apr 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/v0.2#6

Feat/v0.2#6
chaitanyakasaraneni merged 11 commits intomainfrom
feat/v0.2

chaitanyakasaraneni commented Apr 8, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		If set, large tables (``chartevents``, ``labevents``) return a chunked
		reader instead of loading fully into memory.

-        If set, large tables (``chartevents``, ``labevents``) return a chunked
-        reader instead of loading fully into memory.
+        If set, large tables (for example ``chartevents`` and ``labevents``)
+        are read and filtered in chunks to reduce peak memory usage during
+        loading, but the chunks are still concatenated and returned as a
+        fully materialized :class:`pandas.DataFrame`.

Conversation

chaitanyakasaraneni commented Apr 8, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants