Commit 8a6f3d5

v2.1.0: Improved data pipeline and added the CI/CD pipeline

1 parent 52736fe

17 files changed: 1,151 additions & 426 deletions
.github/workflows/pipeline.yml

Lines changed: 69 additions & 0 deletions (new file)

```yaml
name: CardioSense AI Clinical Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Linting dependencies
        run: |
          pip install flake8
      - name: Lint with flake8
        # Stop build if there are Python syntax errors or undefined names
        run: |
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install Clinical stack
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Clinical Validation (Pytest)
        run: |
          export PYTHONPATH=$PYTHONPATH:.
          pytest tests/

  model-audit:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Verify Model Ingest (Smoke Test)
        # Verifies the preprocessor entry point can run and produce its artifacts
        run: |
          export PYTHONPATH=$PYTHONPATH:.
          python src/data/preprocessor.py  # Verify preprocessor logic specifically

  docker-build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Clinical Docker Image
        run: |
          docker build -t cardiosense-api:latest .
```

Dockerfile

Lines changed: 32 additions & 0 deletions (new file)

```dockerfile
# Use lightweight Python 3.12 image
FROM python:3.12-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app

# Set working directory
WORKDIR /app

# Install system dependencies (required for some ML packages)
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy only requirements first to leverage Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy essential project directories and files
COPY src/ ./src/
COPY api/ ./api/
COPY models/ ./models/
COPY data/ ./data/

# Open port 8000 for FastAPI
EXPOSE 8000

# Default command: launch clinical inference gateway
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

README.md

Lines changed: 6 additions & 7 deletions

```diff
@@ -1,5 +1,7 @@
 # CardioSense AI: Clinical Decision Support System

+[![CardioSense AI Clinical Pipeline](https://github.com/khanz9664/CardioSense-AI/actions/workflows/pipeline.yml/badge.svg)](https://github.com/khanz9664/CardioSense-AI/actions/workflows/pipeline.yml)
+
 <p align="center">
   <img src="app/assets/logo.png" width="200" alt="CardioSense AI Banner">
 </p>
@@ -112,20 +114,17 @@ uvicorn api.main:app --host 0.0.0.0 --port 8000

 ---

-## Core Intelligence Modules
-
+- **The Preprocessing Pipeline**: Utilizes a Scikit-Learn `ColumnTransformer` with `StandardScaler` for vitals and `OneHotEncoder` for categorical clinical markers, ensuring training-inference consistency.
 - **The Safety & Trust Framework**: Implements ACC/AHA guideline overrides and out-of-distribution (OOD) monitoring.
 - **The Optimization Engine**: Calculates the "Least Effort Path" to risk reduction using clinical cost-weights.
-- **The Explainability Engine**: Powered by SHAP for local feature-level contributions and model reasoning.
+- **The Explainability Engine**: Powered by SHAP and LIME for local feature-level contributions and model reasoning.

 ---

-## Technical Performance (v2.1.0)
-
 | Metric | Score | Professional Interpretation |
 | :--- | :--- | :--- |
-| **Clinical Accuracy** | **91.80%** | Optimized via Optuna for high stability. |
-| **ROC-AUC Score** | **0.9589** | Exceptional ability to distinguish risk from health. |
+| **Clinical Accuracy** | **90.16%** | Production-grade stability via robust preprocessing. |
+| **ROC-AUC Score** | **0.9524** | Exceptional ability to distinguish risk from health. |
 | **Safety Engine** | **Active** | Standard-of-care guardrails for critical vitals. |
 | **Auditability** | **Enabled** | Full request tracing and model state hashing. |
```

app/main.py

Lines changed: 35 additions & 17 deletions

```diff
@@ -131,8 +131,13 @@
 def load_clinical_vFINAL_wow():
     X_REF_PATH = os.path.join(BASE_DIR, "models", "X_reference.joblib")
     predictor = HeartDiseasePredictor(MODEL_PATH, preprocessor_path=PREPROCESSOR_PATH)
-    explainer = HeartDiseaseExplainer(predictor.model, X_reference_path=X_REF_PATH)
-    simulator = HeartDiseaseSimulator(predictor.model)
+    # Explainer for SHAP/LIME
+    explainer = HeartDiseaseExplainer(
+        predictor.model,
+        preprocessor=predictor.preprocessor,
+        X_reference_path=X_REF_PATH
+    )
+    simulator = HeartDiseaseSimulator(predictor.model, preprocessor=predictor.preprocessor)
     recommender = HeartDiseaseRecommender()
     safety_engine = HeartDiseaseSafetyEngine()
```
```diff
@@ -337,7 +342,7 @@ def user_input_features():
     shap_vals = explainer.get_explanations(input_df)

     # Auto-generate Model Reasoning Summary
-    reasoning_summary = explainer.get_reasoning_summary(shap_vals, input_df.columns)
+    reasoning_summary = explainer.get_reasoning_summary(shap_vals)

     st.info(f" **Model Reasoning Summary:** {reasoning_summary}")
```
```diff
@@ -347,12 +352,13 @@ def user_input_features():
     # Prepare SHAP Waterfall Plot for PDF
     with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmpfile:
         fig_tmp, ax_tmp = plt.subplots(figsize=(10, 6))
-        # Reconstruct clean Explanation to bypass SHAP internal library IndexErrors
+        # Use Transformed Data to match SHAP value dimensions
+        transformed_sample = predictor.preprocessor.transform(input_df)
         clean_exp_pdf = shap.Explanation(
             values=shap_vals.values[0],
             base_values=float(shap_vals.base_values[0]),
-            data=input_df.iloc[0].values,
-            feature_names=input_df.columns.tolist()
+            data=transformed_sample.iloc[0].values,
+            feature_names=transformed_sample.columns.tolist()
         )
         # Switch to Bar plot for local attribution (identical data, higher stability)
         shap.plots.bar(clean_exp_pdf, max_display=14, show=False)
```
```diff
@@ -417,12 +423,13 @@ def user_input_features():
     st.subheader("Local Interpretability (SHAP & LIME)")

     st.write("**SHAP Waterfall Base Contribution**")
-    # Reconstruct clean Explanation for UI stability
+    # Use Transformed Data to match SHAP value dimensions
+    transformed_sample = predictor.preprocessor.transform(input_df)
     clean_exp_ui = shap.Explanation(
         values=shap_vals.values[0],
         base_values=float(shap_vals.base_values[0]),
-        data=input_df.iloc[0].values,
-        feature_names=input_df.columns.tolist()
+        data=transformed_sample.iloc[0].values,
+        feature_names=transformed_sample.columns.tolist()
     )
     fig_wf, ax_wf = plt.subplots(figsize=(8, 6))
     # Switch to Bar plot for local attribution (identical data, higher stability)
```
```diff
@@ -581,22 +588,28 @@ def norm_radar(val, feat):
     st.markdown("---")
     st.subheader("Algorithmic Feature Analysis")

-    fa_col1, fa_col2, fa_col3 = st.columns([1, 1, 1])
+    fa_col1, fa_col2, fa_col3 = st.columns([1.1, 1.1, 1])
     with fa_col1:
         st.write("**Native Feature Importance**")
+        # Show all features as requested
         imp_df = pd.DataFrame(list(fa['importance'].items()), columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=True)
-        fig_imp, ax_imp = plt.subplots(figsize=(4.2, 2.8))  # 70% of 6x4
+        fig_imp, ax_imp = plt.subplots(figsize=(9, 8))
         ax_imp.barh(imp_df['Feature'], imp_df['Importance'], color='#339af0')
-        ax_imp.set_xlabel("Relative Importance")
+        ax_imp.set_xlabel("Relative Importance", fontsize=9)
+        ax_imp.tick_params(axis='y', labelsize=8)
+        ax_imp.tick_params(axis='x', labelsize=8)
         st.pyplot(fig_imp)

     with fa_col2:
         st.write("**Permutation Importance**")
         if 'permutation_importance' in fa:
+            # Show all features as requested
             perm_df = pd.DataFrame(list(fa['permutation_importance'].items()), columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=True)
-            fig_perm, ax_perm = plt.subplots(figsize=(6, 4))
+            fig_perm, ax_perm = plt.subplots(figsize=(9, 8))
             ax_perm.barh(perm_df['Feature'], perm_df['Importance'], color='#fcc419')
-            ax_perm.set_xlabel("Mean AUC Drop")
+            ax_perm.set_xlabel("Mean AUC Drop", fontsize=9)
+            ax_perm.tick_params(axis='y', labelsize=8)
+            ax_perm.tick_params(axis='x', labelsize=8)
             st.pyplot(fig_perm)
         else:
             st.info("Permutation importance unavailable.")
```
```diff
@@ -622,12 +635,17 @@ def norm_radar(val, feat):

     st.write("**Feature Correlation Heatmap**")
     corr_df = pd.DataFrame(fa['correlation'])
-    fig_corr, ax_corr = plt.subplots(figsize=(6, 3.5))
+    # Significant size expansion for readability
+    fig_corr, ax_corr = plt.subplots(figsize=(12, 7))
     sns.heatmap(corr_df, annot=True, fmt=".2f", cmap="vlag", center=0, ax=ax_corr,
                 cbar_kws={'label': 'Pearson Correlation'},
-                annot_kws={"size": 7}, linewidths=0.5)
+                annot_kws={"size": 5}, linewidths=0.2)
+
+    # Rotate labels for better readability
+    plt.xticks(rotation=45, ha='right', fontsize=8)
+    plt.yticks(fontsize=8)

-    col_c1, col_c2, col_c3 = st.columns([1, 2, 1])
+    col_c1, col_c2, col_c3 = st.columns([1, 4, 1])  # Expand display column
     with col_c2:
         st.pyplot(fig_corr, width='stretch')
```

docs/ARCHITECTURE.md

Lines changed: 15 additions & 9 deletions

````diff
@@ -49,19 +49,25 @@ graph TB

 ## 2. The Training & Optimization Pipeline

-We employ **XGBoost** as the primary engine, optimized via **Optuna** to ensure medical-grade accuracy.
+We employ **XGBoost** as the primary engine, optimized via **Optuna** and supported by a **Production Preprocessing Pipeline** to ensure medical-grade accuracy and inference stability.
+
+1. **Robust Feature Engineering**:
+   * **Numerical Normalization**: `StandardScaler` is applied to all continuous vitals (`age`, `trestbps`, `chol`, `thalach`, `oldpeak`) to prevent feature dominance and ensure gradient stability.
+   * **Categorical Encoding**: `OneHotEncoder(drop='if_binary')` converts clinical categorical markers (`sex`, `cp`, `fbs`, `restecg`, `exang`, `slope`, `ca`, `thal`) into a sparse, machine-readable format.
+2. **Pipeline Orchestration**: The entire transformation is wrapped in a Scikit-Learn `Pipeline`. This ensures that the exact same mathematical shifts are applied during real-time inference as were used during training, eliminating training-serving skew.

 ```mermaid
 graph LR
-    A[(Raw Clinical Data)] --> B[Unified Preprocessing]
-    B --> C{Optuna Meta-Learner}
-    C --> D[XGBoost Hyper-Tuning]
-    D --> E[Cross-Validation]
-    E --> C
-    C --> F[Optimized Model / .joblib]
-    C --> G[Metadata & Metrics / .json]
+    A[(Raw Clinical Data)] --> B[ColumnTransformer]
+    B --> C[StandardScaler / OHE]
+    C --> D{Optuna Meta-Learner}
+    D --> E[XGBoost Hyper-Tuning]
+    E --> F[Cross-Validation]
+    F --> D
+    D --> G[Optimized Model / .joblib]
+    D --> H[Fitted Preprocessor / .joblib]

-    style C fill:#f9f,stroke:#333,stroke-width:2px
+    style D fill:#f9f,stroke:#333,stroke-width:2px
 ```

 ---
````
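The feature engineering described above (`StandardScaler` for the continuous vitals, `OneHotEncoder(drop='if_binary')` for the categorical markers) can be sketched with Scikit-Learn directly. The column lists come from the documentation; the two-row training frame is illustrative only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age", "trestbps", "chol", "thalach", "oldpeak"]
CATEGORICAL = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

preprocessor = ColumnTransformer([
    # Z-score scaling for continuous vitals
    ("num", StandardScaler(), NUMERIC),
    # One-hot encoding; binary markers keep a single column via drop='if_binary'
    ("cat", OneHotEncoder(drop="if_binary"), CATEGORICAL),
])

# Fit on training data only; reuse the *fitted* object at inference time
train = pd.DataFrame({
    "age": [54, 61], "trestbps": [130, 140], "chol": [239, 250],
    "thalach": [150, 120], "oldpeak": [1.2, 2.6],
    "sex": [0, 1], "cp": [0, 2], "fbs": [0, 1], "restecg": [0, 1],
    "exang": [0, 1], "slope": [1, 2], "ca": [0, 1], "thal": [2, 3],
})
X_train = preprocessor.fit_transform(train)
X_infer = preprocessor.transform(train.head(1))  # identical math at serving time

print(X_train.shape[1] == X_infer.shape[1])  # True
```

Applying the same fitted transformer at both stages is what eliminates the training-serving skew the document describes.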

docs/DEVELOPMENT.md

Lines changed: 17 additions & 3 deletions

````diff
@@ -34,9 +34,12 @@ python main.py
 ```
 This script will:
 1. Load raw data from `data/raw/`.
-2. Run the `Unified Preprocessor`.
+2. Run the **Robust Preprocessing Pipeline** (`ColumnTransformer` with `StandardScaler` and `OneHotEncoder`).
 3. Use **Optuna** to find the best hyperparameters.
-4. Save the artifacts (`.joblib`) and metadata (`.json`) to the `models/` directory.
+4. Save the artifacts:
+   - `models/heart_disease_model.joblib`: The optimized XGBoost learner.
+   - `models/preprocessor.joblib`: The fitted Scikit-Learn preprocessing pipeline.
+   - `models/model_metadata.json`: The clinical metrics and versioning data.

 ### Launching the Dashboard (Streamlit)
 ```bash
@@ -72,7 +75,18 @@ PYTHONPATH=. .venv/bin/pytest tests/

 ---

-## 5. Dependencies
+## 5. CI/CD & Automated Clinical Pipelines
+
+Every push to the `main` branch triggers an automated **Clinical Decision Guardrail Pipeline** via GitHub Actions:
+
+1. **Job 1: Linting**: Ensures code quality and clinical-grade standards using `flake8`.
+2. **Job 2: Clinical Testing**: Automates the full `pytest` suite across the Safety, API, and Simulator modules.
+3. **Job 3: Model Ingest Audit**: Verifies that new clinical data patterns correctly traverse the `ColumnTransformer` preprocessing layer.
+4. **Job 4: Docker Build**: Packages the FastAPI inference gateway into a production-ready container (`Dockerfile`) to ensure deployment portability.
+
+---
+
+## 6. Dependencies

 - **Modeling**: `xgboost`, `scikit-learn`, `optuna`, `joblib`.
 - **Explainability**: `shap`, `matplotlib`, `seaborn`.
````
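The artifact layout listed in step 4 implies a load-then-predict pattern at inference time. A self-contained sketch of that round trip, using a stand-in model and a temporary directory in place of the real `models/` artifacts (file names from the document; everything else is illustrative):

```python
import json
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

models_dir = Path(tempfile.mkdtemp())  # stand-in for the repo's models/

# Train a toy learner (stand-in for the optimized XGBoost model)
X = pd.DataFrame({"age": [40, 65, 50, 70], "chol": [180, 260, 200, 300]})
y = [0, 1, 0, 1]
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Persist artifacts mirroring the documented layout
joblib.dump(model, models_dir / "heart_disease_model.joblib")
joblib.dump(scaler, models_dir / "preprocessor.joblib")
(models_dir / "model_metadata.json").write_text(json.dumps({"version": "v2.1.0"}))

# Inference time: load the fitted preprocessor and model, then predict
pre = joblib.load(models_dir / "preprocessor.joblib")
clf = joblib.load(models_dir / "heart_disease_model.joblib")
pred = clf.predict(pre.transform(X.head(1)))
print(pred.shape)  # (1,)
```

Keeping the preprocessor and model as separate artifacts, as the document does, lets the serving layer reuse the exact fitted transformation without retraining.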

docs/PAPER.md

Lines changed: 12 additions & 5 deletions

```diff
@@ -28,7 +28,14 @@ CardioSense AI addresses these gaps by implementing a **Post-Hoc Attribution Lay

 ## 2. Methodology & Mathematical Foundations

-### 2.1 The Core Intelligence Engine (XGBoost)
+### 2.1 Robust Preprocessing Pipeline
+To ensure model stability and training-inference consistency, we implement a **Scikit-Learn Pipeline** architecture:
+1. **Feature Normalization**: Numerical vitals ($x_{num} \in \{\text{age, trestbps, chol, thalach, oldpeak}\}$) are transformed using **Z-score normalization** (StandardScaler):
+$$z = \frac{x - \mu}{\sigma}$$
+2. **Categorical Encoding**: Nominal features ($x_{cat} \in \{\text{sex, cp, fbs, restecg, exang, slope, ca, thal}\}$) are transformed via **One-Hot Encoding (OHE)** into a sparse binary vector space.
+3. **Pipeline Consistency**: The transformation parameters ($\mu, \sigma$) are fitted exclusively on the training set and persisted in the `preprocessor.joblib` artifact to eliminate data leakage.
+
+### 2.2 The Core Intelligence Engine (XGBoost)
 We utilize **eXtreme Gradient Boosting (XGBoost)**, which optimizes the following regularized objective function:

 $$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k)$$
```
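The Z-score normalization in §2.1 can be checked numerically against `StandardScaler` (toy blood-pressure values; `StandardScaler` uses the population standard deviation, matching NumPy's default):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[120.0], [140.0], [160.0]])  # e.g. trestbps readings
scaler = StandardScaler().fit(x)

# StandardScaler implements z = (x - mu) / sigma with mu, sigma fitted on x
mu, sigma = x.mean(), x.std()
z_manual = (x - mu) / sigma
z_scaler = scaler.transform(x)

print(np.allclose(z_manual, z_scaler))  # True
```

Persisting the fitted `mu` and `sigma` (inside the scaler object) and reusing them at inference time is what §2.1's "Pipeline Consistency" point guarantees.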
```diff
@@ -75,10 +82,10 @@ CardioSense AI was validated on the **UCI Cleveland Heart Disease** dataset ($N=
 ### 3.2 Performance Metrics (v2.1.0)
 | Metric | Score | Clinical Interpretation |
 | :--- | :--- | :--- |
-| **Accuracy** | **91.80%** | Comprehensive Diagnostic Reliability |
-| **ROC-AUC** | **0.9589** | Exceptional Discrimination Power |
-| **Recall (Sensitivity)** | **96.43%** | Minimized False Negatives (Patient Safety) |
-| **Brier Score** | **0.0787** | High Calibration Integrity |
+| **Accuracy** | **90.16%** | Comprehensive Diagnostic Reliability |
+| **ROC-AUC** | **0.9524** | Exceptional Discrimination Power |
+| **Recall (Sensitivity)** | **92.86%** | Minimized False Negatives (Patient Safety) |
+| **Brier Score** | **0.0841** | High Calibration Integrity |

 ![Performance Summary](../app/assets/App_Screenshots/10.png)
```