Skip to content

Feat/v0.2#6

Merged
chaitanyakasaraneni merged 11 commits intomainfrom
feat/v0.2
Apr 8, 2026
Merged

Feat/v0.2#6
chaitanyakasaraneni merged 11 commits intomainfrom
feat/v0.2

Conversation

@chaitanyakasaraneni
Copy link
Copy Markdown
Owner

This pull request introduces several improvements and fixes across the data ingestion and temporal imputation modules, as well as expands test coverage for FHIR and MIMIC table ingestion. The most significant changes include enhanced handling of forward/backward fill imputation with large time gaps, improved schema validation logic for MIMIC ingestion, and comprehensive new tests for FHIR and MIMIC data loaders.

Temporal Imputation Improvements:

  • Enhanced the forward and backward fill strategies in Imputer.transform to use a UUID-based sentinel column, ensuring row order is preserved after group operations and preventing accidental overwriting of user data. The logic now robustly handles cases with a max_gap_hours threshold, masking large gaps and restoring NaNs where appropriate. This also includes a refactor to track which values were originally missing before filling. [1] [2] [3] [4] [5]

MIMIC Ingestion Enhancements:

  • Improved schema validation in MimicLoader._load_table to ensure that validation and datetime parsing are only performed when appropriate, particularly when chunked reading is not used or when reading from Parquet files.
  • Added MimicIIILoader to the public API in clinops.ingest.__init__.py, making it available for import.

FHIR and MIMIC Loader Test Coverage:

  • Added extensive new tests for the FHIRLoader, including filtering observations by LOINC code, handling conditions with and without codings, loading from various file formats (JSON, NDJSON, JSONL), and robust handling of malformed or mismatched resource types.
  • Expanded tests for MimicTableLoader to cover filtering by hospital admission IDs, loading from Parquet files, handling missing files, and ensuring the summary correctly reports missing tables with zero rows. [1] [2]

chaitanyakasaraneni and others added 11 commits March 1, 2026 17:06
When id_col is set, apply fill and gap masking within each entity group
rather than globally, so values from one patient/admission cannot
propagate into another. Also exclude id_col from numeric_cols to avoid
treating it as a feature.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 8, 2026 14:08
@chaitanyakasaraneni chaitanyakasaraneni merged commit f1f86fb into main Apr 8, 2026
2 of 3 checks passed
@chaitanyakasaraneni chaitanyakasaraneni deleted the feat/v0.2 branch April 8, 2026 14:08
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances temporal imputation robustness (especially forward/backward fill with max-gap masking), improves MIMIC ingestion schema/parse behavior (including Parquet handling), adds a new MIMIC-III loader, and substantially expands ingestion test coverage for FHIR and MIMIC loaders.

Changes:

  • Refactors Imputer.transform forward/backward fill to preserve input row order during gap masking and avoid user-column collisions via a UUID sentinel.
  • Adjusts MimicLoader._load_table to validate/parse datetimes only when appropriate (notably for Parquet and non-chunked reads).
  • Adds MimicIIILoader and extensive new tests for FHIR/MIMIC ingestion and temporal windowing/imputation edge cases.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
clinops/temporal/imputation.py Sentinel-based order preservation + refactored gap-masking fill path.
clinops/ingest/mimic.py Refined schema validation/datetime parsing conditions around Parquet and chunking.
clinops/ingest/mimic_iii.py New MIMIC-III loader (flat dir, uppercase columns normalization, ICD-9 behavior).
clinops/ingest/__init__.py Exposes MimicIIILoader in the public ingest API.
tests/test_temporal/test_temporal.py Adds targeted tests for pivot behavior and many imputer edge cases (gaps, ordering, grouping).
tests/test_ingest/test_ingest.py Adds expanded FHIRLoader tests (LOINC filtering, conditions edge cases, JSON/NDJSON/JSONL parsing).
tests/test_ingest/test_mimic_tables.py Expands MimicTableLoader tests (hadm filtering, Parquet loading, missing tables reporting).
tests/test_ingest/test_mimic_iii.py New comprehensive MimicIIILoader test suite with minimal fixtures and error handling checks.
.gitignore Ignores issue-pyOpenSci.md.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +125 to +128
numeric_cols = [
c for c in df.select_dtypes(include=[np.number]).columns
if c != self.id_col
]
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When per_patient=True, fit() populates _patient_means / _patient_medians but transform() only checks _fill_values. This makes a fitted per-patient MEAN/MEDIAN imputer behave as “unfitted” (logs a warning) and falls back to global stats from the transform input, which is incorrect and contradicts the per_patient parameter docs. Consider applying fillna per-group using the stored per-patient stats (and only warning when neither global nor per-patient stats are available).

Copilot uses AI. Check for mistakes.
Comment on lines +400 to +405
df = self._load("prescriptions")
df = self._filter(df, subject_ids, hadm_ids)
if drugs is not None:
pattern = "|".join(drugs)
df = df[df["drug"].str.contains(pattern, case=False, na=False)]
return df
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prescriptions(drugs=...) builds a single regex via pattern = "|".join(drugs) and uses str.contains. This can (a) change semantics vs MimicLoader.prescriptions() (which does exact case-insensitive matching) despite the stated goal of a shared interface, and (b) produce incorrect matches / regex errors when a drug name contains regex metacharacters. Consider aligning with MimicLoader by using exact matching (e.g., normalizing case and using isin), or if substring matching is intended, escape each term (or use regex=False with a loop) to avoid regex injection.

Copilot uses AI. Check for mistakes.
Comment on lines +561 to +570
else:
df = pd.read_csv(path, low_memory=False)

if isinstance(df, pd.DataFrame) and not (
self._cfg.chunk_size and table_name in {"chartevents", "labevents"}
):
df = _normalise_columns(df)
if table_name == "diagnoses_icd" and "icd9_code" in df.columns:
df["icd9_code"] = df["icd9_code"].astype(str)
df = self._parse_datetimes(df, table_name)
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For diagnoses_icd, casting icd9_code to str after read_csv won’t preserve leading zeros if pandas inferred the column as numeric (a common case when codes are all digits). To avoid silently changing codes, consider reading the ICD column as string at parse time (e.g., dtype={"ICD9_CODE": "string"} before normalization, or a converters entry) so values round-trip exactly.

Copilot uses AI. Check for mistakes.
Comment on lines +166 to +167
If set, large tables (``chartevents``, ``labevents``) return a chunked
reader instead of loading fully into memory.
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class docstring says chunk_size causes large tables to “return a chunked reader instead of loading fully into memory”, but _load() currently materializes all chunks into a single DataFrame via pd.concat(...). Either update the documentation to reflect that chunking is only used as a memory-friendly read/processing strategy (but still loads fully), or change the implementation/API to actually return a TextFileReader/iterator like the MIMIC-IV loader docs describe.

Suggested change
If set, large tables (``chartevents``, ``labevents``) return a chunked
reader instead of loading fully into memory.
If set, large tables (for example ``chartevents`` and ``labevents``)
are read and filtered in chunks to reduce peak memory usage during
loading, but the chunks are still concatenated and returned as a
fully materialized :class:`pandas.DataFrame`.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants