
AIris Security — Machine Learning Module

Hybrid risk scoring engine combining an XGBoost attack predictor, an NLP payload classifier, and CVE-enriched features.

For complete technical coverage of every file under ml/, see ml/ML_MODULE_DOCUMENTATION.md.


📌 Overview

The ML module provides the intelligent core of AIris Security. It is used in two ways:

  1. Embedded in the backend: ml_service.py loads trained models at startup and runs run_hybrid_ml() after every scan
  2. Standalone microservice: inference_api.py exposes a FastAPI endpoint on port 9000 (optional)

Core capabilities:

| Capability | Implementation |
| --- | --- |
| Payload classification | TF-IDF + Logistic Regression (payload_classifier.joblib) |
| Attack type prediction | Multi-class classifier (attack_predictor.joblib) |
| Risk scoring | Hybrid XGBoost + NLP blend, then scanner-evidence boosts |
| CVE context | NVD data enrichment via parse_cve_data.py |

🗂️ Structure

ml/
├── src/
│   ├── __init__.py
│   ├── build_payload_dataset.py    Build labelled payload CSV from raw sources
│   ├── build_attack_dataset.py     Build attack feature dataset from scan fixtures
│   ├── data_ingest.py              Data loading and validation utilities
│   ├── feature_pipeline.py         Feature extraction shared by training + inference
│   ├── parse_cve_data.py           Parse NVD JSON feeds → processed CSV
│   ├── train_payload_classifier.py Train TF-IDF + LogReg payload classifier
│   ├── train_attack_predictor.py   Train multi-class attack predictor
│   ├── inference.py                Inference engine (used by backend ml_service)
│   └── inference_api.py            Optional standalone FastAPI ML microservice
│
├── models/                         Saved models (output of training scripts)
│   ├── payload_classifier.joblib
│   ├── attack_predictor.joblib
│   └── attack_label_encoder.joblib
│
├── data/
│   ├── raw/                        Source data (not committed)
│   │   ├── cve/                    NVD JSON feeds
│   │   └── payloads/               Raw payload text files
│   └── processed/                  Cleaned CSVs ready for training
│       ├── payloads.csv
│       └── attack_features.csv
│
├── tests/
│   ├── test_inference.py           Unit tests for inference.py
│   └── test_real_site_simulation.py  Integration tests against fixture scan data
│
├── notebooks/                      EDA and evaluation notebooks
├── reports/                        Generated metrics / confusion matrices
├── requirements.txt
└── README.md

⚙️ Installation

cd ml
python -m venv .venv

# Windows
.venv\Scripts\Activate.ps1
# Linux / Mac
# source .venv/bin/activate

pip install -r requirements.txt

Key dependencies:

numpy>=1.21
pandas>=1.3
scikit-learn>=1.0
xgboost>=1.5
joblib>=1.1
nltk>=3.6
tqdm>=4.62
matplotlib>=3.4

📦 Data Preparation

1. CVE data

Download NVD JSON feeds and parse them:

# Download (example — 2023 feed)
# https://nvd.nist.gov/vuln/data-feeds

python src/parse_cve_data.py --input data/raw/cve/ --output data/processed/

Output columns: cve_id, description, severity, cvss_score, attack_vector
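
A minimal sketch of the extraction step, assuming the NVD JSON 1.1 feed layout (`CVE_Items[].cve` plus `impact.baseMetricV3.cvssV3`); the real parse_cve_data.py may handle more feed versions and edge cases:

```python
import csv
import json
from pathlib import Path

FIELDS = ["cve_id", "description", "severity", "cvss_score", "attack_vector"]

def extract_rows(feed):
    """Pull the output columns listed above out of a parsed NVD 1.1 feed dict."""
    rows = []
    for item in feed.get("CVE_Items", []):
        cve = item["cve"]
        cvss = item.get("impact", {}).get("baseMetricV3", {}).get("cvssV3", {})
        rows.append({
            "cve_id": cve["CVE_data_meta"]["ID"],
            "description": cve["description"]["description_data"][0]["value"],
            "severity": cvss.get("baseSeverity", "UNKNOWN"),
            "cvss_score": cvss.get("baseScore", 0.0),
            "attack_vector": cvss.get("attackVector", "UNKNOWN"),
        })
    return rows

def parse_feed_file(feed_path, out_csv):
    """Read one downloaded feed file and write its rows to a CSV."""
    rows = extract_rows(json.loads(Path(feed_path).read_text(encoding="utf-8")))
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

CVEs lacking a CVSS v3 block fall back to `UNKNOWN` / `0.0` rather than being dropped, so downstream feature code can decide how to treat them.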

2. Payload dataset

Sources: PayloadsAllTheThings, SecLists, synthetically generated benign samples.

python src/build_payload_dataset.py
# Writes: data/processed/payloads.csv

Sample rows:

| payload | label |
| --- | --- |
| `' OR '1'='1' --` | SQLI |
| `<script>alert(1)</script>` | XSS |
| `../../etc/passwd` | PATH_TRAVERSAL |
| `normal search query` | BENIGN |

3. Attack feature dataset

Built from scan result fixtures:

python src/build_attack_dataset.py
# Writes: data/processed/attack_features.csv

Features: open_port_count, critical_port_flag, nikto_warning_count, ssl_issues_flag, dir_critical_count, cve_avg_severity, ...


🧠 Model Training

Payload Classifier

python src/train_payload_classifier.py
  • Algorithm: TF-IDF vectoriser + Logistic Regression
  • Input: Raw payload strings
  • Output classes: SQLI, XSS, PATH_TRAVERSAL, COMMAND_INJECTION, BENIGN
  • Saved to: models/payload_classifier.joblib
  • Typical accuracy: ~92 % on held-out test set
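
The core recipe can be sketched as a scikit-learn pipeline; the actual training script works on data/processed/payloads.csv and likely adds train/test splitting and evaluation, and the sample data and n-gram settings here are illustrative:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def train_payload_classifier(payloads, labels):
    """Fit a TF-IDF + Logistic Regression pipeline on labelled payload strings."""
    # Character n-grams cope better with obfuscated payloads than word tokens.
    clf = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
        ("logreg", LogisticRegression(max_iter=1000)),
    ])
    clf.fit(payloads, labels)
    return clf

# Toy illustration only; the real script trains on the processed CSV.
model = train_payload_classifier(
    ["' OR '1'='1' --", "<script>alert(1)</script>",
     "../../etc/passwd", "normal search query"] * 5,
    ["SQLI", "XSS", "PATH_TRAVERSAL", "BENIGN"] * 5,
)
joblib.dump(model, "payload_classifier.joblib")
```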

Attack Predictor

python src/train_attack_predictor.py
  • Algorithm: Multi-class XGBoost classifier
  • Input: Numerical scan features from feature_pipeline.py
  • Output classes: SQLi, XSS, RCE, Path Traversal, Weak SSL, Open Port, NONE, ...
  • Saved to: models/attack_predictor.joblib, models/attack_label_encoder.joblib

🔮 Inference

Embedded mode (used by backend)

from ml.src.inference import predict_attack

result = predict_attack(findings, scanner_results)
# Returns: {attack_type, risk_score, confidence, predicted_attack_types, cve_context}

Standalone microservice (optional)

uvicorn ml.src.inference_api:app --host 0.0.0.0 --port 9000

Input

{
  "scan": {
    "open_ports": [22, 80, 443],
    "nikto_warnings": 5,
    "ssl_issues": true,
    "dir_critical_count": 2,
    "scanner_text": "Possible SQLi detected in /search?q=",
    "cve_list": [{"severity": 9.8}, {"severity": 7.5}]
  }
}

Output

{
  "predicted_attack": "SQLi",
  "probabilities": {"SQLi": 0.91, "XSS": 0.05, "NONE": 0.04},
  "risk_score": 84,
  "explanation": "High SQL-related signals found."
}

🔀 Hybrid Risk Scoring (Backend Integration)

backend/app/services/ml_service.py implements the full hybrid pipeline:

Scanner findings
      │
      ├── XGBoost attack predictor  →  attack_prediction / confidence
      └── NLP Payload classifier    →  payload confidence score
                  │
            Hybrid risk blend
                  │
         Scanner evidence boosts:
           • SSL weak ciphers    +3 each  (cap +15)
           • Deprecated TLS      +5 flat
           • Exposed dir paths   +2 each  (cap +20)
           • Critical dir finds  +10 flat
                  │
           clamp [0, 100]
                  │
              risk_score
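
The boost-and-clamp stage of the diagram can be sketched in pure Python (function and parameter names are illustrative, not the actual ml_service.py API):

```python
def evidence_boost(weak_ciphers=0, deprecated_tls=False,
                   exposed_paths=0, critical_dir_findings=False):
    """Scanner-evidence boosts with the per-category caps shown above."""
    boost = min(weak_ciphers * 3, 15)             # +3 per weak cipher, cap +15
    boost += 5 if deprecated_tls else 0           # flat +5
    boost += min(exposed_paths * 2, 20)           # +2 per exposed path, cap +20
    boost += 10 if critical_dir_findings else 0   # flat +10
    return boost

def final_risk_score(blended_score, **evidence):
    """Apply evidence boosts to the hybrid blend, then clamp to [0, 100]."""
    return max(0, min(100, round(blended_score + evidence_boost(**evidence))))
```

For example, a hybrid blend of 72 with two weak ciphers and deprecated TLS yields 72 + 6 + 5 = 83, while a high blend plus many exposed paths saturates at 100.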

🎯 Advanced ML Output Features

AIris provides 6 comprehensive ML metrics (1 baseline + 5 enhanced features) for deep security analysis:

1. Risk Score (Baseline)

  • Range: 0-100
  • Purpose: Overall threat severity assessment
  • Calculation: Hybrid blend of ML confidence (60%) + scanner evidence (40%)
  • Output: Integer score with color-coded severity (Critical/High/Medium/Low)
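
A sketch of the 60/40 blend, assuming both signals are normalised to [0, 1]; the severity band cut-offs are illustrative, not the module's exact thresholds:

```python
def baseline_risk_score(ml_confidence, scanner_evidence):
    """Blend ML confidence (60%) with scanner evidence (40%); both in [0, 1]."""
    return round(100 * (0.6 * ml_confidence + 0.4 * scanner_evidence))

def severity_band(score):
    # Assumed band edges for the colour-coded severity labels.
    if score >= 80:
        return "Critical"
    if score >= 60:
        return "High"
    if score >= 40:
        return "Medium"
    return "Low"
```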

2. Severity Distribution ✨ NEW

  • Purpose: Shape of risk - how findings are distributed across severity levels
  • Output Structure:
{
  "critical": 5,
  "high": 8,
  "medium": 12,
  "low": 6,
  "informational": 3,
  "total": 34,
  "percentages": {"critical": 14.7, "high": 23.5, ...},
  "shape": "top-heavy"
}
  • Shape Classifications:
    • top-heavy: ≥50% critical/high findings (urgent action required)
    • balanced: Mixed distribution (steady remediation)
    • low-heavy: ≥50% low/info findings (hardened target)
  • Use Case: Understand risk concentration and prioritize remediation efforts
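
The shape classification follows directly from the thresholds above; a sketch in pure Python (function name hypothetical):

```python
def severity_shape(counts):
    """Classify the distribution shape from per-severity finding counts."""
    total = sum(counts.values())
    if total == 0:
        return "balanced"
    top_share = (counts.get("critical", 0) + counts.get("high", 0)) / total
    low_share = (counts.get("low", 0) + counts.get("informational", 0)) / total
    if top_share >= 0.5:
        return "top-heavy"   # urgent action required
    if low_share >= 0.5:
        return "low-heavy"   # hardened target
    return "balanced"        # steady remediation
```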

3. Attack Surface Score ✨ NEW

  • Range: 0-100
  • Purpose: Structural exposure measurement (independent of attack likelihood)
  • Scoring Breakdown:
    • Port exposure (30 pts): Each open port = +3 points
    • Path exposure (25 pts): Exposed web paths = +2.5 points each
    • Protocol weakness (25 pts): Deprecated TLS + weak ciphers = +5 points each
    • Service visibility (20 pts): Services with CVEs = +4 points each
  • Output: Integer score with exposure level (Minimal/Low/Moderate/High)
  • Use Case: Measure attack surface before implementing security controls
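
The scoring breakdown above can be sketched as four capped components summed to a maximum of 100 (names and the exposure-level cut-offs are assumptions):

```python
def attack_surface_score(open_ports=0, exposed_paths=0,
                         protocol_weaknesses=0, services_with_cves=0):
    """Sum the four capped exposure components described above."""
    score = min(open_ports * 3, 30)             # port exposure (max 30)
    score += min(exposed_paths * 2.5, 25)       # path exposure (max 25)
    score += min(protocol_weaknesses * 5, 25)   # deprecated TLS / weak ciphers (max 25)
    score += min(services_with_cves * 4, 20)    # service visibility (max 20)
    return round(score)

def exposure_level(score):
    # Illustrative band edges for Minimal/Low/Moderate/High.
    if score >= 70:
        return "High"
    if score >= 40:
        return "Moderate"
    if score >= 15:
        return "Low"
    return "Minimal"
```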

4. Threat Category ✨ NEW

  • Purpose: Attack type classification with model uncertainty analysis
  • Output Structure:
{
  "primary": {"attack_type": "SQL Injection", "probability": 0.68},
  "secondary": {"attack_type": "XSS", "probability": 0.22},
  "uncertainty_gap": 0.46,
  "confidence_level": "high"
}
  • Confidence Levels:
    • high: Gap ≥ 0.4 (clear primary threat)
    • medium: Gap 0.2-0.4 (monitor evolving threats)
    • low: Gap < 0.2 (multiple attack vectors likely)
  • Use Case: Understand model certainty and prepare for multiple attack scenarios
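
The uncertainty analysis is just a ranking plus a gap threshold; a self-contained sketch (the real module's output may carry more fields):

```python
def threat_category(probabilities):
    """Rank attack-type probabilities and grade model certainty by the gap."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    primary = ranked[0]
    secondary = ranked[1] if len(ranked) > 1 else (None, 0.0)
    gap = primary[1] - secondary[1]
    if gap >= 0.4:
        level = "high"       # clear primary threat
    elif gap >= 0.2:
        level = "medium"     # monitor evolving threats
    else:
        level = "low"        # multiple attack vectors likely
    return {
        "primary": {"attack_type": primary[0], "probability": primary[1]},
        "secondary": {"attack_type": secondary[0], "probability": secondary[1]},
        "uncertainty_gap": round(gap, 2),
        "confidence_level": level,
    }
```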

5. Exploitability Index ✨ NEW

  • Range: 0-100
  • Purpose: CVSS-inspired ease-of-exploitation metric
  • Scoring Factors:
    • Access Complexity (0-40): Remote service exposure
    • Authentication Bypass (0-30): Auth vulnerability detection
    • Impact Severity (0-30): CVE CVSS scores
  • Output Structure:
{
  "score": 78,
  "level": "high",
  "factors": {
    "access_complexity": 30,
    "authentication_required": 20,
    "impact_score": 28
  }
}
  • Levels: Critical (80+), High (60-79), Medium (40-59), Low (<40)
  • Use Case: Assess immediate exploitation risk and prioritize patching windows
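
Combining the three capped factors and the level bands gives a sketch like the following (how each factor is derived from scan data is not shown here):

```python
def exploitability_index(access_complexity, authentication_bypass, impact_severity):
    """Cap each factor at its maximum, sum, and map the total to a level."""
    score = (min(access_complexity, 40)       # remote service exposure (0-40)
             + min(authentication_bypass, 30) # auth vulnerability detection (0-30)
             + min(impact_severity, 30))      # CVE CVSS scores (0-30)
    if score >= 80:
        level = "critical"
    elif score >= 60:
        level = "high"
    elif score >= 40:
        level = "medium"
    else:
        level = "low"
    return {
        "score": score,
        "level": level,
        "factors": {
            "access_complexity": access_complexity,
            "authentication_required": authentication_bypass,
            "impact_score": impact_severity,
        },
    }
```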

6. Remediation Priorities ✨ NEW

  • Purpose: Prescriptive ranked action list with estimated risk reduction
  • Output Structure:
[
  {
    "priority": 1,
    "category": "Patch/Update",
    "finding_count": 8,
    "severity_breakdown": {"critical": 3, "high": 5},
    "estimated_risk_reduction": 35,
    "actions": [
      "Update Apache to 2.4.59 (CVE-2024-1234)",
      "Patch OpenSSL to 3.0.14 (CVE-2024-5678)"
    ]
  }
]
  • Categories:
    • Patch/Update: CVE-related vulnerabilities requiring software updates
    • Configuration: SSL/TLS settings, headers, server misconfigurations
    • Access Control: Exposed paths, open ports, permission issues
    • Input Validation: SQLi, XSS, and other injection vulnerabilities
  • Priority Calculation:
    • Critical findings = 10 points each
    • High findings = 5 points each
    • Medium findings = 2 points each
    • Boosted by exploitability level (1.3x-1.5x multiplier)
  • Use Case: Create actionable remediation roadmap with estimated impact
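
A sketch of the per-category priority calculation; the exact mapping from exploitability level to the 1.3x-1.5x boost band is an assumption:

```python
def category_priority_score(critical=0, high=0, medium=0, exploitability_level="low"):
    """Weight finding counts, then boost when the target is easy to exploit."""
    base = critical * 10 + high * 5 + medium * 2
    # Assumed mapping of the documented 1.3x-1.5x multiplier band.
    multiplier = {"critical": 1.5, "high": 1.3}.get(exploitability_level, 1.0)
    return round(base * multiplier)
```

Categories would then be sorted by this score descending to produce the ranked priority list.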

🧪 Testing

cd ml
pytest tests/

Tests cover:

  • test_inference.py — unit tests for predict_attack() with fixture data
  • test_real_site_simulation.py — integration test simulating a full scan result

📊 Model Evaluation

After training, evaluation reports are saved to reports/:

  • model_performance.html — interactive metrics (accuracy, F1, ROC)
  • confusion_matrix.png — multi-class confusion matrix

Payload classifier metrics (typical):

| Metric | Value |
| --- | --- |
| Accuracy | ~92 % |
| F1 (macro) | ~0.91 |
| Precision | ~0.93 |
| Recall | ~0.90 |

📜 Dataset Licensing & Ethics

All training data is sourced from public, open-licence repositories:

| Source | Licence | Use |
| --- | --- | --- |
| PayloadsAllTheThings | MIT | Malicious payload samples |
| NVD / NIST | Public domain (US Govt) | CVE severity statistics |
| Synthetic benign samples | N/A (self-generated) | Balance payload dataset |
| Kaggle SQLi/XSS datasets | CC0 / public | Additional payload labels |

No proprietary, private, or personally identifiable data is used. No payloads are executed against real systems. Models detect attacks — they do not generate them.


Last updated: March 2026 · v2.1.0