Commit 8a6f3d5

v2.1.0: Improved data pipeline and added the CI/CD pipeline

1 parent 52736fe

17 files changed: 1,151 additions & 426 deletions
.github/workflows/pipeline.yml

Lines changed: 69 additions & 0 deletions (new file)

```yaml
name: CardioSense AI Clinical Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Linting dependencies
        run: |
          pip install flake8
      - name: Lint with flake8
        # Stop build if there are Python syntax errors or undefined names
        run: |
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Clinical stack
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Clinical Validation (Pytest)
        run: |
          export PYTHONPATH=$PYTHONPATH:.
          pytest tests/

  model-audit:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Verify Model Ingest (Smoke Test)
        # Verifies the preprocessor entry point can run and produce its artifacts
        run: |
          export PYTHONPATH=$PYTHONPATH:.
          python src/data/preprocessor.py  # Verify preprocessor logic specifically

  docker-build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Clinical Docker Image
        run: |
          docker build -t cardiosense-api:latest .
```

Dockerfile

Lines changed: 32 additions & 0 deletions (new file)

```dockerfile
# Use lightweight Python 3.12 image
FROM python:3.12-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app

# Set working directory
WORKDIR /app

# Install system dependencies (required for some ML packages)
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy only requirements first to leverage Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy essential project directories and files
COPY src/ ./src/
COPY api/ ./api/
COPY models/ ./models/
COPY data/ ./data/

# Open port 8000 for FastAPI
EXPOSE 8000

# Default command: launch clinical inference gateway
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

README.md

Lines changed: 6 additions & 7 deletions

```diff
@@ -1,5 +1,7 @@
 # CardioSense AI: Clinical Decision Support System

+[![CardioSense AI Clinical Pipeline](https://github.com/khanz9664/CardioSense-AI/actions/workflows/pipeline.yml/badge.svg)](https://github.com/khanz9664/CardioSense-AI/actions/workflows/pipeline.yml)
+
 <p align="center">
   <img src="app/assets/logo.png" width="200" alt="CardioSense AI Banner">
 </p>
@@ -112,20 +114,17 @@ uvicorn api.main:app --host 0.0.0.0 --port 8000

 ---

-## Core Intelligence Modules
-
+- **The Preprocessing Pipeline**: Utilizes a Scikit-Learn `ColumnTransformer` with `StandardScaler` for vitals and `OneHotEncoder` for categorical clinical markers, ensuring training-inference consistency.
 - **The Safety & Trust Framework**: Implements ACC/AHA guideline overrides and out-of-distribution (OOD) monitoring.
 - **The Optimization Engine**: Calculates the "Least Effort Path" to risk reduction using clinical cost-weights.
-- **The Explainability Engine**: Powered by SHAP for local feature-level contributions and model reasoning.
+- **The Explainability Engine**: Powered by SHAP and LIME for local feature-level contributions and model reasoning.

 ---

-## Technical Performance (v2.1.0)
-
 | Metric | Score | Professional Interpretation |
 | :--- | :--- | :--- |
-| **Clinical Accuracy** | **91.80%** | Optimized via Optuna for high stability. |
-| **ROC-AUC Score** | **0.9589** | Exceptional ability to distinguish risk from health. |
+| **Clinical Accuracy** | **90.16%** | Production-grade stability via robust preprocessing. |
+| **ROC-AUC Score** | **0.9524** | Exceptional ability to distinguish risk from health. |
 | **Safety Engine** | **Active** | Standard-of-care guardrails for critical vitals. |
 | **Auditability** | **Enabled** | Full request tracing and model state hashing. |
```

app/main.py

Lines changed: 35 additions & 17 deletions

```diff
@@ -131,8 +131,13 @@
 def load_clinical_vFINAL_wow():
     X_REF_PATH = os.path.join(BASE_DIR, "models", "X_reference.joblib")
     predictor = HeartDiseasePredictor(MODEL_PATH, preprocessor_path=PREPROCESSOR_PATH)
-    explainer = HeartDiseaseExplainer(predictor.model, X_reference_path=X_REF_PATH)
-    simulator = HeartDiseaseSimulator(predictor.model)
+    # Explainer for SHAP/LIME
+    explainer = HeartDiseaseExplainer(
+        predictor.model,
+        preprocessor=predictor.preprocessor,
+        X_reference_path=X_REF_PATH
+    )
+    simulator = HeartDiseaseSimulator(predictor.model, preprocessor=predictor.preprocessor)
     recommender = HeartDiseaseRecommender()
     safety_engine = HeartDiseaseSafetyEngine()
```
```diff
@@ -337,7 +342,7 @@ def user_input_features():
     shap_vals = explainer.get_explanations(input_df)

     # Auto-generate Model Reasoning Summary
-    reasoning_summary = explainer.get_reasoning_summary(shap_vals, input_df.columns)
+    reasoning_summary = explainer.get_reasoning_summary(shap_vals)

     st.info(f" **Model Reasoning Summary:** {reasoning_summary}")
```
```diff
@@ -347,12 +352,13 @@ def user_input_features():
     # Prepare SHAP Waterfall Plot for PDF
     with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
         fig_tmp, ax_tmp = plt.subplots(figsize=(10, 6))
-        # Reconstruct clean Explanation to bypass SHAP internal library IndexErrors
+        # Use Transformed Data to match SHAP value dimensions
+        transformed_sample = predictor.preprocessor.transform(input_df)
         clean_exp_pdf = shap.Explanation(
             values=shap_vals.values[0],
             base_values=float(shap_vals.base_values[0]),
-            data=input_df.iloc[0].values,
-            feature_names=input_df.columns.tolist()
+            data=transformed_sample.iloc[0].values,
+            feature_names=transformed_sample.columns.tolist()
         )
         # Switch to Bar plot for local attribution (identical data, higher stability)
         shap.plots.bar(clean_exp_pdf, max_display=14, show=False)
```
```diff
@@ -417,12 +423,13 @@ def user_input_features():
     st.subheader("Local Interpretability (SHAP & LIME)")

     st.write("**SHAP Waterfall Base Contribution**")
-    # Reconstruct clean Explanation for UI stability
+    # Use Transformed Data to match SHAP value dimensions
+    transformed_sample = predictor.preprocessor.transform(input_df)
     clean_exp_ui = shap.Explanation(
         values=shap_vals.values[0],
         base_values=float(shap_vals.base_values[0]),
-        data=input_df.iloc[0].values,
-        feature_names=input_df.columns.tolist()
+        data=transformed_sample.iloc[0].values,
+        feature_names=transformed_sample.columns.tolist()
     )
     fig_wf, ax_wf = plt.subplots(figsize=(8, 6))
     # Switch to Bar plot for local attribution (identical data, higher stability)
```
```diff
@@ -581,22 +588,28 @@ def norm_radar(val, feat):
     st.markdown("---")
     st.subheader("Algorithmic Feature Analysis")

-    fa_col1, fa_col2, fa_col3 = st.columns([1, 1, 1])
+    fa_col1, fa_col2, fa_col3 = st.columns([1.1, 1.1, 1])
     with fa_col1:
         st.write("**Native Feature Importance**")
+        # Show all features as requested
         imp_df = pd.DataFrame(list(fa['importance'].items()), columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=True)
-        fig_imp, ax_imp = plt.subplots(figsize=(4.2, 2.8))  # 70% of 6x4
+        fig_imp, ax_imp = plt.subplots(figsize=(9, 8))
         ax_imp.barh(imp_df['Feature'], imp_df['Importance'], color='#339af0')
-        ax_imp.set_xlabel("Relative Importance")
+        ax_imp.set_xlabel("Relative Importance", fontsize=9)
+        ax_imp.tick_params(axis='y', labelsize=8)
+        ax_imp.tick_params(axis='x', labelsize=8)
         st.pyplot(fig_imp)

     with fa_col2:
         st.write("**Permutation Importance**")
         if 'permutation_importance' in fa:
+            # Show all features as requested
             perm_df = pd.DataFrame(list(fa['permutation_importance'].items()), columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=True)
-            fig_perm, ax_perm = plt.subplots(figsize=(6, 4))
+            fig_perm, ax_perm = plt.subplots(figsize=(9, 8))
             ax_perm.barh(perm_df['Feature'], perm_df['Importance'], color='#fcc419')
-            ax_perm.set_xlabel("Mean AUC Drop")
+            ax_perm.set_xlabel("Mean AUC Drop", fontsize=9)
+            ax_perm.tick_params(axis='y', labelsize=8)
+            ax_perm.tick_params(axis='x', labelsize=8)
             st.pyplot(fig_perm)
         else:
             st.info("Permutation importance unavailable.")
```
```diff
@@ -622,12 +635,17 @@ def norm_radar(val, feat):

     st.write("**Feature Correlation Heatmap**")
     corr_df = pd.DataFrame(fa['correlation'])
-    fig_corr, ax_corr = plt.subplots(figsize=(6, 3.5))
+    # Significant size expansion for readability
+    fig_corr, ax_corr = plt.subplots(figsize=(12, 7))
     sns.heatmap(corr_df, annot=True, fmt=".2f", cmap="vlag", center=0, ax=ax_corr,
                 cbar_kws={'label': 'Pearson Correlation'},
-                annot_kws={"size": 7}, linewidths=0.5)
+                annot_kws={"size": 5}, linewidths=0.2)
+
+    # Rotate labels for better readability
+    plt.xticks(rotation=45, ha='right', fontsize=8)
+    plt.yticks(fontsize=8)

-    col_c1, col_c2, col_c3 = st.columns([1, 2, 1])
+    col_c1, col_c2, col_c3 = st.columns([1, 4, 1])  # Expand display column
     with col_c2:
         st.pyplot(fig_corr, width='stretch')
```

docs/ARCHITECTURE.md

Lines changed: 15 additions & 9 deletions

````diff
@@ -49,19 +49,25 @@ graph TB

 ## 2. The Training & Optimization Pipeline

-We employ **XGBoost** as the primary engine, optimized via **Optuna** to ensure medical-grade accuracy.
+We employ **XGBoost** as the primary engine, optimized via **Optuna** and supported by a **Production Preprocessing Pipeline** to ensure medical-grade accuracy and inference stability.
+
+1. **Robust Feature Engineering**:
+   * **Numerical Normalization**: `StandardScaler` is applied to all continuous vitals (`age`, `trestbps`, `chol`, `thalach`, `oldpeak`) to prevent feature dominance and ensure gradient stability.
+   * **Categorical Encoding**: `OneHotEncoder(drop='if_binary')` converts clinical categorical markers (`sex`, `cp`, `fbs`, `restecg`, `exang`, `slope`, `ca`, `thal`) into a sparse, machine-readable format.
+2. **Pipeline Orchestration**: The entire transformation is wrapped in a Scikit-Learn `Pipeline`. This ensures that the exact same mathematical shifts are applied during real-time inference as were used during training, eliminating training-serving skew.

 ```mermaid
 graph LR
-    A[(Raw Clinical Data)] --> B[Unified Preprocessing]
-    B --> C{Optuna Meta-Learner}
-    C --> D[XGBoost Hyper-Tuning]
-    D --> E[Cross-Validation]
-    E --> C
-    C --> F[Optimized Model / .joblib]
-    C --> G[Metadata & Metrics / .json]
+    A[(Raw Clinical Data)] --> B[ColumnTransformer]
+    B --> C[StandardScaler / OHE]
+    C --> D{Optuna Meta-Learner}
+    D --> E[XGBoost Hyper-Tuning]
+    E --> F[Cross-Validation]
+    F --> D
+    D --> G[Optimized Model / .joblib]
+    D --> H[Fitted Preprocessor / .joblib]

-    style C fill:#f9f,stroke:#333,stroke-width:2px
+    style D fill:#f9f,stroke:#333,stroke-width:2px
 ```

 ---
````
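The feature engineering described above (`StandardScaler` for the continuous vitals, `OneHotEncoder(drop='if_binary')` for the categorical markers) can be sketched with Scikit-Learn directly. The column lists come from the documentation; the two-row training frame is illustrative only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age", "trestbps", "chol", "thalach", "oldpeak"]
CATEGORICAL = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

preprocessor = ColumnTransformer([
    # Z-score scaling for continuous vitals
    ("num", StandardScaler(), NUMERIC),
    # One-hot encoding; binary markers keep a single column via drop='if_binary'
    ("cat", OneHotEncoder(drop="if_binary"), CATEGORICAL),
])

# Fit on training data only; reuse the *fitted* object at inference time
train = pd.DataFrame({
    "age": [54, 61], "trestbps": [130, 140], "chol": [239, 250],
    "thalach": [150, 120], "oldpeak": [1.2, 2.6],
    "sex": [0, 1], "cp": [0, 2], "fbs": [0, 1], "restecg": [0, 1],
    "exang": [0, 1], "slope": [1, 2], "ca": [0, 1], "thal": [2, 3],
})
X_train = preprocessor.fit_transform(train)
X_infer = preprocessor.transform(train.head(1))  # identical math at serving time

print(X_train.shape[1] == X_infer.shape[1])  # True
```

Applying the same fitted transformer at both stages is what eliminates the training-serving skew the document describes.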

docs/DEVELOPMENT.md

Lines changed: 17 additions & 3 deletions

````diff
@@ -34,9 +34,12 @@ python main.py
 ```
 This script will:
 1. Load raw data from `data/raw/`.
-2. Run the `Unified Preprocessor`.
+2. Run the **Robust Preprocessing Pipeline** (`ColumnTransformer` with `StandardScaler` and `OneHotEncoder`).
 3. Use **Optuna** to find the best hyperparameters.
-4. Save the artifacts (`.joblib`) and metadata (`.json`) to the `models/` directory.
+4. Save the artifacts:
+   - `models/heart_disease_model.joblib`: The optimized XGBoost learner.
+   - `models/preprocessor.joblib`: The fitted Scikit-Learn preprocessing pipeline.
+   - `models/model_metadata.json`: The clinical metrics and versioning data.

 ### Launching the Dashboard (Streamlit)
 ```bash
@@ -72,7 +75,18 @@ PYTHONPATH=. .venv/bin/pytest tests/

 ---

-## 5. Dependencies
+## 5. CI/CD & Automated Clinical Pipelines
+
+Every push to the `main` branch triggers an automated **Clinical Decision Guardrail Pipeline** via GitHub Actions:
+
+1. **Job 1: Linting**: Ensures code quality and clinical-grade standards using `flake8`.
+2. **Job 2: Clinical Testing**: Automates the full `pytest` suite across the Safety, API, and Simulator modules.
+3. **Job 3: Model Ingest Audit**: Verifies that new clinical data patterns correctly traverse the `ColumnTransformer` preprocessing layer.
+4. **Job 4: Docker Build**: Packages the FastAPI inference gateway into a production-ready container (`Dockerfile`) to ensure deployment portability.
+
+---
+
+## 6. Dependencies

 - **Modeling**: `xgboost`, `scikit-learn`, `optuna`, `joblib`.
 - **Explainability**: `shap`, `matplotlib`, `seaborn`.
````
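The artifact layout listed in step 4 implies a load-then-predict pattern at inference time. A self-contained sketch of that round trip, using a stand-in model and a temporary directory in place of the real `models/` artifacts (file names from the document; everything else is illustrative):

```python
import json
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

models_dir = Path(tempfile.mkdtemp())  # stand-in for the repo's models/

# Train a toy learner (stand-in for the optimized XGBoost model)
X = pd.DataFrame({"age": [40, 65, 50, 70], "chol": [180, 260, 200, 300]})
y = [0, 1, 0, 1]
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Persist artifacts mirroring the documented layout
joblib.dump(model, models_dir / "heart_disease_model.joblib")
joblib.dump(scaler, models_dir / "preprocessor.joblib")
(models_dir / "model_metadata.json").write_text(json.dumps({"version": "v2.1.0"}))

# Inference time: load the fitted preprocessor and model, then predict
pre = joblib.load(models_dir / "preprocessor.joblib")
clf = joblib.load(models_dir / "heart_disease_model.joblib")
pred = clf.predict(pre.transform(X.head(1)))
print(pred.shape)  # (1,)
```

Keeping the preprocessor and model as separate artifacts, as the document does, lets the serving layer reuse the exact fitted transformation without retraining.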

docs/PAPER.md

Lines changed: 12 additions & 5 deletions

```diff
@@ -28,7 +28,14 @@ CardioSense AI addresses these gaps by implementing a **Post-Hoc Attribution Lay

 ## 2. Methodology & Mathematical Foundations

-### 2.1 The Core Intelligence Engine (XGBoost)
+### 2.1 Robust Preprocessing Pipeline
+To ensure model stability and training-inference consistency, we implement a **Scikit-Learn Pipeline** architecture:
+1. **Feature Normalization**: Numerical vitals ($x_{num} \in \{\text{age, trestbps, chol, thalach, oldpeak}\}$) are transformed using **Z-score normalization** (StandardScaler):
+$$z = \frac{x - \mu}{\sigma}$$
+2. **Categorical Encoding**: Nominal features ($x_{cat} \in \{\text{sex, cp, fbs, restecg, exang, slope, ca, thal}\}$) are transformed via **One-Hot Encoding (OHE)** into a sparse binary vector space.
+3. **Pipeline Consistency**: The transformation parameters ($\mu, \sigma$) are fitted exclusively on the training set and persisted in the `preprocessor.joblib` artifact to eliminate data leakage.
+
+### 2.2 The Core Intelligence Engine (XGBoost)
 We utilize **eXtreme Gradient Boosting (XGBoost)**, which optimizes the following regularized objective function:

 $$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k)$$
```
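The Z-score normalization in §2.1 can be checked numerically against `StandardScaler` (toy blood-pressure values; `StandardScaler` uses the population standard deviation, matching NumPy's default):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[120.0], [140.0], [160.0]])  # e.g. trestbps readings
scaler = StandardScaler().fit(x)

# StandardScaler implements z = (x - mu) / sigma with mu, sigma fitted on x
mu, sigma = x.mean(), x.std()
z_manual = (x - mu) / sigma
z_scaler = scaler.transform(x)

print(np.allclose(z_manual, z_scaler))  # True
```

Persisting the fitted `mu` and `sigma` (inside the scaler object) and reusing them at inference time is what §2.1's "Pipeline Consistency" point guarantees.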
```diff
@@ -75,10 +82,10 @@ CardioSense AI was validated on the **UCI Cleveland Heart Disease** dataset ($N=
 ### 3.2 Performance Metrics (v2.1.0)
 | Metric | Score | Clinical Interpretation |
 | :--- | :--- | :--- |
-| **Accuracy** | **91.80%** | Comprehensive Diagnostic Reliability |
-| **ROC-AUC** | **0.9589** | Exceptional Discrimination Power |
-| **Recall (Sensitivity)** | **96.43%** | Minimized False Negatives (Patient Safety) |
-| **Brier Score** | **0.0787** | High Calibration Integrity |
+| **Accuracy** | **90.16%** | Comprehensive Diagnostic Reliability |
+| **ROC-AUC** | **0.9524** | Exceptional Discrimination Power |
+| **Recall (Sensitivity)** | **92.86%** | Minimized False Negatives (Patient Safety) |
+| **Brier Score** | **0.0841** | High Calibration Integrity |

 ![Performance Summary](../app/assets/App_Screenshots/10.png)
```