This repository contains a small end-to-end ML demo for clinical risk scoring on synthetic tabular patient data.
Core idea:
- generate a synthetic patient dataset (age, BMI, blood pressure, simple risk factors);
- define a binary target `high_risk` using a hidden formula (for demo purposes only);
- build a reproducible ML pipeline in Python:
  - data generation and loading,
  - feature preprocessing (scikit-learn pipeline),
  - model training and evaluation,
  - metrics logging to JSON,
  - a minimal FastAPI service exposing `/predict`.
All configuration (data paths, features, target, model type and params) is stored
in `configs/default.yaml`. Models and metrics are saved to `models/` and `metrics/`.
The dataset is fully synthetic and does not contain any real medical data. The goal of this project is to demonstrate an engineering-grade ML pipeline rather than to build a clinically validated tool.
This repository is an educational ML project for clinical risk scoring on tabular data.
The idea:
- generate a synthetic patient dataset: age, body mass index, blood pressure, several binary flags (`smoker`, `diabetes`, ...);
- define a binary target `high_risk` (0/1) via a hidden formula;
- build an end-to-end pipeline:
  - data generation and loading (`src/data`),
  - feature preprocessing (scaling via an sklearn pipeline),
  - model training (logistic regression),
  - computing metrics and saving them to `metrics/`,
  - a simple FastAPI API (`/predict`) that takes a JSON request and returns a `risk_score` and a risk class.

The project is meant to demonstrate an engineering approach to ML:
YAML configs, code split into layers (`data/`, `features/`, `models/`, `api`),
automated tests, and reproducible experiments.
The data are fully synthetic and not intended for any real medical use.
## Project structure

```
clinical-risk-scorer-demo/
├─ data/
│  ├─ raw/                      # optional raw data
│  └─ processed/                # train.csv, test.csv (generated)
├─ configs/
│  └─ default.yaml              # data paths, feature list, target, model config
├─ src/
│  ├─ __init__.py
│  ├─ paths.py                  # central project/data/models/metrics paths
│  ├─ data/
│  │  ├─ generate_synthetic.py  # synthetic patient generator
│  │  └─ dataset.py             # train/test loading via YAML config
│  ├─ features/
│  │  └─ preprocess.py          # sklearn preprocessing pipeline
│  ├─ models/
│  │  ├─ train.py               # train model, compute metrics, save artifacts
│  │  └─ evaluate.py            # standalone evaluation, write metrics.json
│  └─ cli.py                    # CLI: generate-data, train, evaluate
├─ api/
│  └─ app.py                    # FastAPI app: /health, /predict
├─ notebooks/
│  └─ 01_exploration.ipynb      # basic EDA for the synthetic dataset
├─ tests/
│  ├─ test_data_generation.py   # unit test for data generator
│  ├─ test_pipeline_smoke.py    # end-to-end pipeline smoke test
│  └─ test_api_predict.py       # smoke test for /predict
├─ models/                      # trained sklearn pipelines (.joblib)
├─ metrics/
│  ├─ metrics.json              # example metrics file
│  └─ ...                       # metrics from training/evaluation
├─ docs/
│  ├─ Overview_RU.md            # Russian technical overview
│  └─ Overview_EN.md            # English technical overview
├─ README.md
├─ CHANGELOG.md
├─ LICENSE
├─ requirements.txt
├─ requirements-dev.txt
├─ .gitignore
└─ pytest.ini
```
More detailed explanations live in `docs/Overview_EN.md` and `docs/Overview_RU.md`.
## Data and configuration
Synthetic data are generated by:

```bash
python -m src.data.generate_synthetic --n-samples 2000
```

This creates:
- `data/processed/train.csv`
- `data/processed/test.csv`
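The real generator and its hidden risk formula live in `src/data/generate_synthetic.py`. Purely as an illustration of the approach, a generator of this kind might look like the sketch below; the coefficients, noise, threshold, and exact column set here are invented for illustration:

```python
import numpy as np
import pandas as pd

def generate_synthetic_patients(n_samples: int = 2000, seed: int = 42) -> pd.DataFrame:
    """Illustrative sketch only; the actual generator may differ."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(18, 90, n_samples),
        "bmi": rng.normal(27, 5, n_samples).round(1),
        "systolic_bp": rng.normal(130, 18, n_samples).round(0),
        "smoker": rng.integers(0, 2, n_samples),
        "diabetes": rng.integers(0, 2, n_samples),
    })
    # A hidden linear "risk formula" plus noise, thresholded into a binary target.
    score = (0.03 * df["age"] + 0.05 * df["bmi"] + 0.02 * df["systolic_bp"]
             + 0.8 * df["smoker"] + 1.0 * df["diabetes"]
             + rng.normal(0, 1, n_samples))
    df["high_risk"] = (score > np.quantile(score, 0.7)).astype(int)
    return df
```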
The config file `configs/default.yaml` defines:
- where the train/test CSVs are located (`data.train_path`, `data.test_path`);
- which column is the ID (`data.id_column`);
- which column is the binary target (`data.target_column`);
- the list of feature columns (`data.feature_columns`);
- the model type and hyperparameters (`model.type`, `model.params`).

You can change the feature set, target name, or model configuration by editing this YAML file.
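As a quick illustration, the config can be consumed from Python roughly like this (a minimal sketch; the exact key layout is inferred from the list above and may differ from the real file):

```python
import yaml  # PyYAML

# Load the project config; key layout assumed from the description above.
with open("configs/default.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["data"]["train_path"])       # e.g. data/processed/train.csv
print(cfg["data"]["target_column"])    # e.g. high_risk
print(cfg["data"]["feature_columns"])  # list of feature column names
print(cfg["model"]["type"], cfg["model"]["params"])
```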
## How to run
1. Installation

From the project root:

```bash
pip install -r requirements.txt
```

Optionally, install development tools (black, mypy, Jupyter, etc.):

```bash
pip install -r requirements-dev.txt
```
2. Generate synthetic data

Using the CLI:

```bash
python -m src.cli generate-data --n-samples 2000
```

This will create `data/processed/train.csv` and `data/processed/test.csv`.

You can adjust:
- `--n-samples`: number of synthetic patients;
- `--train-ratio`: train/test split ratio;
- `--seed`: random seed.
3. Train the model

```bash
python -m src.cli train --config default.yaml
```

This will:
- load train/test data according to `configs/default.yaml`;
- build a preprocessing + LogisticRegression pipeline;
- fit the model;
- compute metrics on the test set;
- save the model to `models/model_logistic_regression.joblib` and the metrics to `metrics/metrics_logistic_regression.json`.
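Conceptually, the training step reduces to something like the following sketch (the actual implementation lives in `src/models/train.py`; the feature list and metric set here are assumptions):

```python
import json
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("data/processed/train.csv")
test = pd.read_csv("data/processed/test.csv")
features = ["age", "bmi", "systolic_bp", "smoker", "diabetes"]  # assumed feature list

# Scaling + logistic regression in a single sklearn pipeline.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(train[features], train["high_risk"])

# Evaluate on the held-out test set.
proba = pipe.predict_proba(test[features])[:, 1]
metrics = {
    "roc_auc": float(roc_auc_score(test["high_risk"], proba)),
    "accuracy": float(accuracy_score(test["high_risk"], (proba > 0.5).astype(int))),
}

joblib.dump(pipe, "models/model_logistic_regression.joblib")
with open("metrics/metrics_logistic_regression.json", "w") as f:
    json.dump(metrics, f, indent=2)
```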
4. Evaluate a saved model

```bash
python -m src.cli evaluate \
    --config default.yaml \
    --model-path model_logistic_regression.joblib \
    --output metrics.json
```

This re-evaluates an existing model on the current test set and writes a compact `metrics/metrics.json` file.
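A standalone evaluation conceptually reduces to reloading the saved pipeline and re-scoring the current test set (a sketch; the real logic is in `src/models/evaluate.py`, and the feature list is assumed):

```python
import json
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

# Reload the saved sklearn pipeline and recompute metrics.
pipe = joblib.load("models/model_logistic_regression.joblib")
test = pd.read_csv("data/processed/test.csv")
features = ["age", "bmi", "systolic_bp", "smoker", "diabetes"]  # assumed feature list

proba = pipe.predict_proba(test[features])[:, 1]
with open("metrics/metrics.json", "w") as f:
    json.dump({"roc_auc": float(roc_auc_score(test["high_risk"], proba))}, f, indent=2)
```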
## API usage (FastAPI /predict)
Start the FastAPI app:

```bash
uvicorn api.app:app --reload
```

Health check:

```bash
curl http://127.0.0.1:8000/health
```
Prediction example:

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "age": 65,
    "sex": 1,
    "bmi": 30.5,
    "smoker": 1,
    "diabetes": 1,
    "systolic_bp": 150,
    "heart_rate": 82,
    "cholesterol": 6.2
  }'
```
A typical response:

```json
{
  "risk_score": 0.78,
  "risk_label": "high",
  "threshold": 0.5,
  "model_name": "model_logistic_regression.joblib"
}
```
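The same request can also be sent programmatically, e.g. with `requests` (assuming the service is running locally on port 8000 as above):

```python
import requests

payload = {
    "age": 65, "sex": 1, "bmi": 30.5, "smoker": 1, "diabetes": 1,
    "systolic_bp": 150, "heart_rate": 82, "cholesterol": 6.2,
}
resp = requests.post("http://127.0.0.1:8000/predict", json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # {"risk_score": ..., "risk_label": ..., ...}
```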
Input and output schemas are defined in `api/app.py` via Pydantic models:
- `PatientFeatures`: request body;
- `PredictionResponse`: response body.
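The field names and types below are inferred from the request/response examples above; the actual definitions in `api/app.py` may differ:

```python
from pydantic import BaseModel

class PatientFeatures(BaseModel):
    """Request body for /predict (fields inferred from the curl example)."""
    age: int
    sex: int
    bmi: float
    smoker: int
    diabetes: int
    systolic_bp: float
    heart_rate: float
    cholesterol: float

class PredictionResponse(BaseModel):
    """Response body for /predict (fields inferred from the sample response)."""
    risk_score: float
    risk_label: str
    threshold: float
    model_name: str
```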
## Tests
To run all tests:

```bash
pytest -vv
```
What is covered:
- `tests/test_data_generation.py`: unit test for `generate_synthetic_patients` (shape, required columns, binary `high_risk`, basic sanity checks).
- `tests/test_pipeline_smoke.py`: end-to-end smoke test:
  - generate synthetic data,
  - run training,
  - verify metrics are in a valid range,
  - check that model and metrics artifacts are created.
- `tests/test_api_predict.py`: smoke test for the FastAPI app (see the sketch below):
  - prepare data and train a model,
  - import `api.app`,
  - call `/health` and `/predict`,
  - check the response shape and that `risk_score` ∈ [0, 1].
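As an illustration, the API smoke test might be structured roughly like this (a sketch assuming a standard FastAPI `TestClient` setup and an existing trained model artifact; the actual test may differ):

```python
from fastapi.testclient import TestClient

from api.app import app  # assumes data has been generated and a model trained

client = TestClient(app)

def test_health():
    assert client.get("/health").status_code == 200

def test_predict_smoke():
    resp = client.post("/predict", json={
        "age": 65, "sex": 1, "bmi": 30.5, "smoker": 1, "diabetes": 1,
        "systolic_bp": 150, "heart_rate": 82, "cholesterol": 6.2,
    })
    assert resp.status_code == 200
    body = resp.json()
    # The score must be a valid probability.
    assert 0.0 <= body["risk_score"] <= 1.0
```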
## Limitations and disclaimer
- The dataset is fully synthetic, generated by `src/data/generate_synthetic.py`. The underlying "risk formula" is hand-crafted for demonstration purposes only.
- The model has no clinical validation and must not be used for any real-world medical decisions or patient care.
- This project is intended purely as a technical demo of:
  - structuring a tabular ML pipeline,
  - using configs and sklearn pipelines,
  - logging metrics and exposing a minimal prediction API.
## Changelog and license
See `CHANGELOG.md` for release notes and version history.
Licensed under the MIT License; see `LICENSE`.
## Author
Clinical Risk Scorer Demo was designed and implemented by Svetlana Romanova as an educational ML engineering project.