clinops 🏥

Clinical ML Pipeline Toolkit — production-grade data loading, preprocessing, and time-series feature engineering for healthcare AI research.

Docs: https://clinops.readthedocs.io/ | PyPI: https://pypi.org/project/clinops/

Every healthcare AI project starts with the same two weeks of plumbing: loading MIMIC-IV tables without hitting memory limits, clipping physiologically impossible values before they corrupt your model, normalizing glucose from mmol/L to mg/dL across sites, building time-series windows that handle clinical missingness correctly, and splitting data without leaking patients across folds. clinops packages those hard-won patterns into a single, well-tested library so your first notebook is actual science.

Built from production experience in clinical and genomic data engineering across multi-cloud environments.

v0.1 Modules

| Module | What it does |
| --- | --- |
| `clinops.ingest` | Loaders for MIMIC-IV, FHIR R4, and flat CSV/Parquet with schema validation. Includes `MimicTableLoader` with pre-built schemas for the five tables researchers always need. |
| `clinops.temporal` | Sliding/tumbling windows, gap-aware imputation, lag features, cohort alignment |
| `clinops.preprocess` | Outlier clipping with physiological bounds, unit normalization (mg/dL ↔ mmol/L etc.), ICD-9→10 mapping |
| `clinops.split` | Temporal, patient-level, and stratified patient train/test splitting |

Roadmap: clinops.monitor (drift detection, data quality) and clinops.orchestrate (GCS/S3, Step Functions) are planned for v0.2.

Quickstart

pip install clinops

clinops.ingest

MimicTableLoader — pre-built schemas, no manual ColumnSpec required

MimicTableLoader wraps MimicLoader and exposes the five MIMIC-IV tables researchers use in every project with fully validated schemas out of the box. No ColumnSpec definitions, no schema boilerplate.

from clinops.ingest import MimicTableLoader

tbl = MimicTableLoader("/data/mimic-iv-2.2")

# ICU vitals — charttime parsed as datetime automatically
charts = tbl.chartevents(subject_ids=[10000032, 10000980])

# Lab results: reference range columns are dropped by default (sparse in most exports); with_ref_range=True keeps them
labs = tbl.labevents(subject_ids=[10000032], with_ref_range=True)

# Hospital admissions with mortality flag
adm = tbl.admissions(subject_ids=[10000032])

# ICD-9/10 diagnoses — primary_only keeps only seq_num == 1
dx = tbl.diagnoses_icd(subject_ids=[10000032], primary_only=True)

# ICU stays — with_los_band adds <1d / 1-3d / 3-7d / >7d column
stays = tbl.icustays(subject_ids=[10000032], with_los_band=True)

Audit a new MIMIC download without loading full tables into memory:

tbl.summary()
#        table  rows_sampled  columns  null_rate_pct
#  chartevents         10000       23           8.41
#    labevents         10000       12           4.17
#   admissions         10000       15           6.02
# diagnoses_icd        10000        5           0.00
#     icustays         10000        8           2.31

MimicLoader — full control

For advanced filtering and chunk-based loading of large tables, use MimicLoader directly:

from clinops.ingest import MimicLoader

loader = MimicLoader("/data/mimic-iv-2.2")

charts = loader.chartevents(
    subject_ids=[10000032, 10000980],
    start_time="2150-01-01",
    end_time="2150-01-10",
)
labs  = loader.labevents(subject_ids=[10000032, 10000980])
stays = loader.icustays(subject_ids=[10000032, 10000980])

Load FHIR R4 resources

from clinops.ingest import FHIRLoader

loader   = FHIRLoader("/data/fhir_export")
obs      = loader.observations(category="vital-signs")
patients = loader.patients()

Validate any flat clinical export

from clinops.ingest import FlatFileLoader, ClinicalSchema, ColumnSpec

schema = ClinicalSchema(
    name="vitals",
    columns=[
        ColumnSpec("subject_id", nullable=False),
        ColumnSpec("heart_rate", min_value=0,  max_value=300),
        ColumnSpec("spo2",       min_value=50, max_value=100),
    ]
)
df = FlatFileLoader("vitals.csv", schema=schema).load()
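
To make the validation idea concrete, here is a minimal standard-library sketch of range and nullability checks. It is not how clinops implements `ClinicalSchema`; `SPECS` and `validate_rows` are hypothetical names that simply mirror the `ColumnSpec` arguments used above.

```python
# Hypothetical rule table mirroring the ColumnSpec arguments above.
SPECS = {
    "subject_id": {"nullable": False},
    "heart_rate": {"min_value": 0, "max_value": 300},
    "spo2": {"min_value": 50, "max_value": 100},
}


def validate_rows(rows, specs):
    """Return a list of (row_index, column, problem) tuples; empty means valid."""
    problems = []
    for i, row in enumerate(rows):
        for column, spec in specs.items():
            value = row.get(column)
            if value is None:
                # Missing values are only a problem for non-nullable columns.
                if not spec.get("nullable", True):
                    problems.append((i, column, "missing"))
                continue
            if "min_value" in spec and value < spec["min_value"]:
                problems.append((i, column, "below minimum"))
            if "max_value" in spec and value > spec["max_value"]:
                problems.append((i, column, "above maximum"))
    return problems


rows = [
    {"subject_id": 1, "heart_rate": 72, "spo2": 98},
    {"subject_id": None, "heart_rate": 350, "spo2": 97},
]
problems = validate_rows(rows, SPECS)
```

Collecting every violation before reporting, rather than failing on the first one, tends to be more useful when auditing a new export.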

clinops.preprocess

Clip physiologically impossible values

Standard statistical outlier methods (z-score, IQR) are a poor fit for clinical data: a heart rate of 180 in a patient with SVT is clinically meaningful and should not be removed. ClinicalOutlierClipper instead uses published physiological bounds to clip, null, or flag values that are impossible regardless of patient state.

from clinops.preprocess import ClinicalOutlierClipper

clipper = ClinicalOutlierClipper(action="clip")  # or "null" or "flag"
clean_df = clipper.fit_transform(vitals_df)

print(clipper.report())
#      column  low_outliers  high_outliers  pct_outliers  bound_low  bound_high
#  heart_rate             0              3         0.012          0         300
#        spo2             1              0         0.004         50         100

Built-in bounds cover 20 vitals and labs (heart_rate, spo2, sbp, glucose, creatinine, ph, wbc, and more). Add site-specific ranges with add_bounds().
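
The three actions can be illustrated with a short standard-library sketch. This is an assumption about the general technique, not the clinops implementation: `BOUNDS` reuses only the two ranges from the example report above, `apply_bounds` is a hypothetical helper, and "flag" is modelled here as a list of booleans.

```python
import math

# Illustrative bounds only, copied from the example report above.
BOUNDS = {"heart_rate": (0, 300), "spo2": (50, 100)}


def apply_bounds(column, values, action="clip"):
    """Handle physiologically impossible values (hypothetical helper)."""
    if action not in {"clip", "null", "flag"}:
        raise ValueError(f"unknown action: {action}")
    low, high = BOUNDS[column]
    result = []
    for v in values:
        in_range = low <= v <= high
        if action == "flag":
            result.append(not in_range)            # True marks an impossible value
        elif in_range:
            result.append(v)                       # plausible values pass through
        elif action == "clip":
            result.append(min(max(v, low), high))  # pull back to the nearest bound
        else:  # "null"
            result.append(math.nan)                # treat as missing
    return result


heart_rates = [72, 180, 999]
clipped = apply_bounds("heart_rate", heart_rates, action="clip")
flags = apply_bounds("heart_rate", heart_rates, action="flag")
nulled = apply_bounds("heart_rate", heart_rates, action="null")
```

Note that the reading of 180 is left untouched under every action, since it lies inside the physiological range even though a z-score rule might reject it.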

Normalize units across sites

Multi-site studies routinely mix mg/dL and mmol/L for the same lab, or °F and °C for temperature. UnitNormalizer detects non-standard units via a companion unit column and converts in-place.

from clinops.preprocess import UnitNormalizer

# df has a "glucose" column and a "glucose_unit" column (mixed "mg/dL" / "mmol/L")
normalizer = UnitNormalizer(column_unit_map={"glucose": "glucose_unit"})
df = normalizer.transform(df)
# All glucose values now in mg/dL; glucose_unit column updated

print(normalizer.report())
#   column from_unit to_unit  n_converted
#  glucose    mmol/L   mg/dL          142

30 registered conversions covering glucose, creatinine, bilirubin, haemoglobin, calcium, temperature, weight, and height.
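
As a worked example of one such conversion, the sketch below normalizes glucose to mg/dL. The factor follows from the molar mass of glucose (about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL); the function name is hypothetical and the factor clinops registers may be rounded differently.

```python
# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass ~180.16 g/mol).
GLUCOSE_MMOL_TO_MGDL = 18.016


def glucose_to_mgdl(value, unit):
    """Return a glucose value in mg/dL given its recorded unit (hypothetical helper)."""
    if unit == "mg/dL":
        return value
    if unit == "mmol/L":
        return value * GLUCOSE_MMOL_TO_MGDL
    raise ValueError(f"unsupported glucose unit: {unit}")


# (value, unit) pairs as they might appear in a mixed-site export
rows = [(5.5, "mmol/L"), (99.0, "mg/dL")]
normalized = [round(glucose_to_mgdl(v, u), 1) for v, u in rows]
```

Raising on an unrecognized unit, rather than passing the value through, avoids silently mixing scales in the output column.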

Harmonize ICD-9 and ICD-10 codes

MIMIC-III uses ICD-9, MIMIC-IV mixes both versions, and many real-world datasets span the October 2015 transition. ICDMapper converts ICD-9-CM codes to ICD-10-CM and adds chapter-level groupings for ML features.

from clinops.preprocess import ICDMapper

mapper = ICDMapper()

# Map a mixed-version DataFrame to ICD-10 in-place
df = mapper.harmonize(df, code_col="icd_code", version_col="icd_version")

# Add chapter-level grouping (e.g. "Diseases of the circulatory system")
df["chapter"] = mapper.chapter_series(df["icd_code"])

# Map a single code
mapper.map_code("4280")   # → "I509"

Ships with ~60 curated high-frequency mappings. Load the full CMS GEM file (~72,000 mappings) with ICDMapper.from_gem_file(path).
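
Conceptually, harmonization is a dictionary lookup followed by a chapter lookup. The sketch below is a simplified illustration under stated assumptions, not the ICDMapper implementation: the mapping table contains only the single example shown above, the chapter table is deliberately incomplete, and both function names are hypothetical.

```python
# Tiny illustrative table; the single entry is the example shown above.
ICD9_TO_ICD10 = {"4280": "I509"}

# A few ICD-10-CM chapters keyed by leading letter (deliberately incomplete).
CHAPTER_BY_LETTER = {
    "I": "Diseases of the circulatory system",
    "J": "Diseases of the respiratory system",
    "K": "Diseases of the digestive system",
}


def harmonize_code(code, version):
    """Return an ICD-10 code, or None when no ICD-9 mapping is known."""
    code = code.replace(".", "").upper()  # store codes without the decimal point
    if version == 10:
        return code
    return ICD9_TO_ICD10.get(code)


def chapter_of(icd10_code):
    """Return the chapter name for an ICD-10 code, or "Unknown"."""
    if not icd10_code:
        return "Unknown"
    return CHAPTER_BY_LETTER.get(icd10_code[0], "Unknown")


mapped = harmonize_code("428.0", version=9)
chapter = chapter_of(mapped)
```

Letter-based chapter lookup only works for chapters that span whole letters; chapters that share a letter (for example C and D codes) would need range checks instead.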

clinops.temporal

Build temporal feature windows

from clinops.temporal import TemporalWindower, ImputationStrategy

windower = TemporalWindower(
    window_hours=24,
    step_hours=6,
    imputation=ImputationStrategy.FORWARD_FILL,
    min_observations=3,
)

windows = windower.fit_transform(
    df=charts,
    id_col="subject_id",
    time_col="charttime",
    feature_cols=["heart_rate", "spo2", "resp_rate", "map"],
)
# → DataFrame: subject_id | window_start | window_end | heart_rate | spo2 | ...

Long-format input (MIMIC native itemid × valuenum)

windows = windower.fit_transform(
    df=charts,
    id_col="subject_id",
    time_col="charttime",
    item_col="itemid",    # auto-pivots to wide format
    value_col="valuenum",
)
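
For intuition about what a sliding window with a minimum observation count does, here is a standard-library sketch for a single patient and a single feature. Aggregating by mean is an assumption made for illustration (the text above does not say how clinops aggregates within a window), and `sliding_windows` is a hypothetical function.

```python
from datetime import datetime, timedelta


def sliding_windows(observations, window_hours, step_hours, min_observations=1):
    """Return (window_start, window_end, mean_value) over sorted (time, value) pairs."""
    if not observations:
        return []
    window = timedelta(hours=window_hours)
    step = timedelta(hours=step_hours)
    start = observations[0][0]
    last_time = observations[-1][0]
    out = []
    while start <= last_time:
        end = start + window
        # Half-open interval [start, end) so no observation is counted twice per step
        values = [v for t, v in observations if start <= t < end]
        if len(values) >= min_observations:
            out.append((start, end, sum(values) / len(values)))
        start += step
    return out


t0 = datetime(2150, 1, 1)
obs = [(t0 + timedelta(hours=h), hr) for h, hr in [(0, 80), (3, 90), (7, 100), (13, 110)]]
windows = sliding_windows(obs, window_hours=12, step_hours=6, min_observations=2)
```

Here the third candidate window (hours 12 to 24) holds a single observation and is dropped by the `min_observations=2` filter, leaving two overlapping windows.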

Add lag and rolling features

from clinops.temporal import LagFeatureBuilder

enriched = LagFeatureBuilder(
    lags=[1, 2, 4],
    rolling_windows=[4, 8],
    id_col="subject_id",
).fit_transform(windows)
# Adds: heart_rate_lag1, heart_rate_roll4_mean, heart_rate_roll4_std, ...
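
The lag and rolling columns can be understood through a small standard-library sketch over one patient's ordered window values. It is an illustration only: `add_lag_and_rolling` is a hypothetical helper, and requiring a full window before emitting a rolling mean is an assumption rather than documented clinops behaviour.

```python
from statistics import mean


def add_lag_and_rolling(values, lag, window):
    """Return (lagged, rolling_mean) lists aligned with values; None where undefined."""
    # Lag: the value observed `lag` steps earlier
    lagged = [values[i - lag] if i >= lag else None for i in range(len(values))]
    # Rolling mean: average of the current value and the window - 1 before it
    rolling = [
        mean(values[i - window + 1 : i + 1]) if i + 1 >= window else None
        for i in range(len(values))
    ]
    return lagged, rolling


heart_rate = [80, 84, 88, 92]
lag1, roll2 = add_lag_and_rolling(heart_rate, lag=1, window=2)
```

In a multi-patient table these computations have to run within each patient's rows, which is why the builder above takes an `id_col`; otherwise one patient's last value would leak into the next patient's first lag.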

Align a cohort to an anchor event (e.g. ICU admission)

from clinops.temporal import CohortAligner

aligned = CohortAligner(
    anchor_col="intime",
    max_hours_before=0,
    max_hours_after=48,
).align(events_df=charts, anchor_df=stays)
# → filtered to 48h post-admission, with hours_from_anchor column
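
The alignment step amounts to computing each event's offset from the anchor and filtering on it. The following standard-library sketch shows that logic for one patient; `align_to_anchor` is a hypothetical function, and treating both window boundaries as inclusive is an assumption.

```python
from datetime import datetime, timedelta


def align_to_anchor(events, anchor_time, max_hours_before, max_hours_after):
    """Keep (time, value) events inside the window and attach hours_from_anchor."""
    aligned = []
    for time, value in events:
        hours = (time - anchor_time).total_seconds() / 3600
        if -max_hours_before <= hours <= max_hours_after:
            aligned.append({"time": time, "value": value, "hours_from_anchor": hours})
    return aligned


intime = datetime(2150, 1, 1, 8, 0)
events = [
    (intime - timedelta(hours=2), 95),   # before admission: dropped
    (intime + timedelta(hours=6), 88),   # inside the window: kept
    (intime + timedelta(hours=60), 76),  # beyond 48 h: dropped
]
aligned = align_to_anchor(events, intime, max_hours_before=0, max_hours_after=48)
```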

Imputation strategies

Clinical data has unique missingness patterns that standard ML windowing gets wrong. clinops provides strategies tuned for clinical context:

| Strategy | Best for |
| --- | --- |
| `FORWARD_FILL` | Slowly changing vitals; carries the last observation forward |
| `BACKWARD_FILL` | Values recorded with lag |
| `LINEAR` | Continuous signals with regular sampling |
| `MEAN` / `MEDIAN` | Fit on training set, apply to test (no leakage) |
| `INDICATOR` | Adds a `{col}_missing` binary column so the model can learn from missingness |
| `NONE` | Leave NaN in place |

from clinops.temporal import Imputer, ImputationStrategy

imputer = Imputer(ImputationStrategy.MEAN, per_patient=True, id_col="subject_id")
imputer.fit(train_windows)
test_windows = imputer.transform(test_windows)
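
The reason for the separate fit and transform steps is that the fill value must come from training data only. A minimal standard-library sketch of that pattern, combined with a missingness indicator, might look like this; `MeanImputer` is a hypothetical class and not the clinops `Imputer`.

```python
import math
from statistics import mean


class MeanImputer:
    """Learn a fill value from training data only, then reuse it elsewhere."""

    def fit(self, values):
        observed = [v for v in values if not math.isnan(v)]
        self.fill_value = mean(observed)  # statistic computed on train only
        return self

    def transform(self, values):
        """Return (imputed_values, missing_indicator)."""
        missing = [math.isnan(v) for v in values]
        filled = [self.fill_value if m else v for v, m in zip(values, missing)]
        return filled, missing


train = [70.0, 80.0, math.nan, 90.0]
test = [math.nan, 100.0]
mean_imputer = MeanImputer().fit(train)
filled, missing = mean_imputer.transform(test)
```

The test value of 100.0 never influences the fill value, which is what keeps the test set from leaking into preprocessing.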

clinops.split

A random row-level split with scikit-learn's train_test_split is wrong for clinical ML: it leaks future observations into training and splits individual patients across the train and test sets, so the model memorises patient-specific patterns rather than generalising.

Temporal split — no future leakage

from clinops.split import TemporalSplitter

result = TemporalSplitter(cutoff="2155-01-01", time_col="charttime").split(df)
# or auto-compute cutoff from the data:
result = TemporalSplitter(train_frac=0.8, time_col="charttime").split(df)

print(result.summary())
# Train: 38,400 rows (80.0%)
# Test:   9,600 rows (20.0%)
# cutoff: 2155-01-01 00:00:00
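
One way to picture how a cutoff can be derived from `train_frac` is to sort by time and take the timestamp at the corresponding position. The sketch below is an assumption about the general approach, not the TemporalSplitter implementation; `temporal_split` is a hypothetical function, and rows that share the cutoff timestamp go to the test set.

```python
from datetime import datetime


def temporal_split(rows, time_key, train_frac):
    """Split rows at the timestamp below which roughly train_frac of rows fall."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    # Clamp the index so train_frac=1.0 does not run past the end
    cutoff_index = min(int(len(ordered) * train_frac), len(ordered) - 1)
    cutoff = ordered[cutoff_index][time_key]
    train = [r for r in ordered if r[time_key] < cutoff]
    test = [r for r in ordered if r[time_key] >= cutoff]
    return train, test, cutoff


rows = [{"charttime": datetime(2150, 1, d), "hr": 70 + d} for d in range(1, 11)]
train, test, cutoff = temporal_split(rows, "charttime", train_frac=0.8)
```

Every training row precedes every test row, so nothing recorded after the cutoff can inform the model.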

Patient-level split — no leakage across admissions

from clinops.split import PatientSplitter

result = PatientSplitter(id_col="subject_id", test_size=0.2).split(df)

# Guaranteed: no patient appears in both splits
assert not set(result.train["subject_id"]) & set(result.test["subject_id"])
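
The guarantee comes from shuffling patient IDs rather than rows. A minimal standard-library sketch of that idea follows; `patient_split` and its `seed` parameter are hypothetical and not part of the PatientSplitter signature shown above.

```python
import random


def patient_split(rows, id_key, test_size, seed=0):
    """Assign whole patients (not rows) to train or test."""
    patient_ids = sorted({r[id_key] for r in rows})
    rng = random.Random(seed)  # local generator keeps the split reproducible
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_size))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in rows if r[id_key] not in test_ids]
    test = [r for r in rows if r[id_key] in test_ids]
    return train, test


# 10 hypothetical patients with 3 rows each
rows = [{"subject_id": pid, "visit": v} for pid in range(10) for v in range(3)]
train, test = patient_split(rows, "subject_id", test_size=0.2)
```

Because `test_size` applies to patients, the row-level proportions only match it when patients contribute similar numbers of rows.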

Stratified patient split — preserves outcome rate

Critical for imbalanced clinical endpoints (in-hospital mortality is typically 5–15%). Stratifies on a binary outcome while respecting patient boundaries.

from clinops.split import StratifiedPatientSplitter

result = StratifiedPatientSplitter(
    id_col="subject_id",
    outcome_col="hospital_expire_flag",
    test_size=0.2,
).split(df)

print(result.summary())
# Train: 32,000 rows (80.0%)
# Test:   8,000 rows (20.0%)
# population_outcome_rate: 0.0821
# train_outcome_rate:      0.0819
# test_outcome_rate:       0.0826
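
Stratifying at the patient level can be sketched as a per-class shuffle of patient IDs. The example below is a simplified illustration, not the StratifiedPatientSplitter implementation: it assumes one binary label per patient has already been derived (how clinops reduces multiple admissions to one label is not stated here), and the function name and `seed` parameter are hypothetical.

```python
import random


def stratified_patient_split(patient_outcomes, test_size, seed=0):
    """Split patient IDs so each outcome class keeps roughly the same share in test.

    patient_outcomes maps patient ID -> binary outcome (one label per patient).
    """
    rng = random.Random(seed)
    train_ids, test_ids = set(), set()
    for label in (0, 1):
        ids = sorted(pid for pid, y in patient_outcomes.items() if y == label)
        rng.shuffle(ids)
        n_test = round(len(ids) * test_size)  # same fraction taken from each class
        test_ids.update(ids[:n_test])
        train_ids.update(ids[n_test:])
    return train_ids, test_ids


# 100 hypothetical patients, 10 of whom have the outcome
outcomes = {pid: int(pid < 10) for pid in range(100)}
train_ids, test_ids = stratified_patient_split(outcomes, test_size=0.2)
train_rate = sum(outcomes[p] for p in train_ids) / len(train_ids)
test_rate = sum(outcomes[p] for p in test_ids) / len(test_ids)
```

With a 10% outcome rate, an unstratified split of 20 test patients could easily contain zero or five positives; sampling within each class pins the rate in both splits.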

Installation

Requires Python 3.12+.

pip install clinops             # core
pip install "clinops[fhir]"     # adds FHIR R4 loader (quotes keep shells such as zsh from expanding the brackets)
pip install "clinops[gcp]"      # adds GCP extras (for v0.2)
pip install -e ".[dev]"         # development

Supported sources

| Source | Format |
| --- | --- |
| MIMIC-IV v2.0–v2.2 | CSV, CSV.GZ, Parquet |
| FHIR R4 | JSON Bundle, NDJSON |
| Flat files | CSV, CSV.GZ, Parquet |

Generative AI usage disclosure

This library was developed with assistance from Claude (Anthropic) as an AI pair-programming tool. Specifically, Claude was used for:

- Generating test cases: unit test drafts for edge cases (e.g., empty DataFrames, single-patient cohorts, missing unit columns), all reviewed and validated against real expected outputs
- Drafting docstrings: inline documentation for public methods, reviewed and corrected for accuracy by the author
- Iterating on error messages: improving the clarity of ValueError and logger.warning messages

All generated code was reviewed, tested, and validated by the author prior to release.

Contributing

See CONTRIBUTING.md. Run pytest tests/ -v and ruff check clinops/ before opening a PR.

Citation

@software{kasaraneni2026clinops,
  author  = {Kasaraneni, Chaitanya},
  title   = {clinops: Clinical ML Pipeline Toolkit},
  year    = {2026},
  url     = {https://github.com/chaitanyakasaraneni/clinops},
  version = {0.1.7}
}

A companion JOSS paper is in preparation.

License

Apache 2.0 — see LICENSE.
