AIris Security ML Module Documentation

This document is the detailed reference for the ML module, covering the files currently tracked under ml/.

1. Module Purpose

ml/ provides:

  • payload intent classification
  • attack class prediction from payload + scanner-derived features
  • risk score generation
  • CVE enrichment helper logic
  • optional FastAPI inference endpoint

2. Current Tracked Structure

ml/
  .gitignore
  ML_MODULE_DOCUMENTATION.md
  README.md
  requirements.txt

  data/
    raw/
      README.md
      sample_scan.json
      payloads/
        normal/
          normal_payloads.txt
        sqli/
          Auth_Bypass.txt
          Generic_ErrorBased.txt
          Generic_TimeBased.txt
          Generic_UnionSelect.txt
          SQLi_Polyglots.txt
        xss/
          MarioXSSVectors.txt
          XSS_Polyglots.txt
          xss_alert.txt
          xss_alert_identifiable.txt

  src/
    __init__.py
    attack_graph.py
    build_attack_dataset.py
    build_payload_dataset.py
    cve_lookup.py
    data_ingest.py
    feature_pipeline.py
    inference.py
    inference_api.py
    parse_cve_data.py
    train_attack_predictor.py
    train_payload_classifier.py

  tests/
    __init__.py
    test_inference.py
    test_real_site_simulation.py

Notes:

  • Generated artifacts under ml/models/, ml/data/processed/, and ml/reports/, as well as notebooks, are produced at training or run time and are mostly untracked.

3. Runtime and Training Flow

  1. Build payload corpus with src/build_payload_dataset.py.
  2. Parse CVE feed maps with src/parse_cve_data.py.
  3. Train payload classifier with src/train_payload_classifier.py.
  4. Build attack-feature dataset with src/build_attack_dataset.py.
  5. Train attack predictor with src/train_attack_predictor.py.
  6. Load trained artifacts in src/inference.py and call predict_attack(scan_input).
  7. Optionally expose POST /predict through src/inference_api.py.
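
Step 3 above can be sketched as follows. This is a minimal illustration that assumes scikit-learn is installed. The inline corpus, the SQLi and XSS label names, and the vectorizer settings are assumptions made for the example, not values taken from src/train_payload_classifier.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny inline corpus standing in for data/processed/payloads_clean.csv
# (illustrative rows only; the real file is built in step 1).
texts = [
    "GET /index.html",
    "username=alice&page=2",
    "search=red shoes",
    "' OR '1'='1' --",
    "1 UNION SELECT username, password FROM users",
    "admin' AND SLEEP(5) --",
    "<script>alert(1)</script>",
    "<img src=x onerror=alert(1)>",
    "<svg onload=alert(document.cookie)>",
]
# "Normal" appears in this document; "SQLi" and "XSS" are assumed label names.
labels = ["Normal"] * 3 + ["SQLi"] * 3 + ["XSS"] * 3

# TF-IDF features feeding a logistic regression classifier.
# Character n-grams are an assumption chosen because payloads are symbol-heavy.
payload_model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("clf", LogisticRegression(max_iter=1000)),
])
payload_model.fit(texts, labels)

# Per-class probabilities for an unseen payload.
probs = payload_model.predict_proba(["id=1 UNION SELECT null, version()"])[0]
for label, prob in zip(payload_model.named_steps["clf"].classes_, probs):
    print(f"{label}: {prob:.3f}")
```

The real trainer additionally prints a classification report and persists the fitted pipeline to models/payload_classifier.joblib. Both steps are omitted here to keep the sketch free of side effects.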

4. Detailed File-by-File Reference

4.1 ML Root Files

| File | What It Contains | Detailed Behavior |
| --- | --- | --- |
| ml/.gitignore | Ignore policy | Excludes generated models, processed data, notebooks/reports artifacts where configured. |
| ml/ML_MODULE_DOCUMENTATION.md | ML module knowledge base | This document. |
| ml/README.md | ML usage guide | Setup, training commands, inference usage, and high-level pipeline description. |
| ml/requirements.txt | Dependency list | Runtime/training dependencies: numpy, pandas, sklearn, xgboost, joblib, fastapi, uvicorn, nltk, pytest, networkx. |

4.2 Raw Data Files (ml/data/raw)

| File | What It Contains | Detailed Behavior |
| --- | --- | --- |
| ml/data/raw/README.md | Raw dataset instructions | Documents external dataset sources, expected local placement, and quick setup commands. |
| ml/data/raw/sample_scan.json | Example scan fixture | Small fixture used by ingestion/testing scripts for quick local validation. |
| ml/data/raw/payloads/normal/normal_payloads.txt | Benign payload corpus | Plaintext normal requests used as non-malicious class examples. |
| ml/data/raw/payloads/sqli/Auth_Bypass.txt | SQLi sample set | Auth bypass SQL injection payload strings. |
| ml/data/raw/payloads/sqli/Generic_ErrorBased.txt | SQLi sample set | Error-based SQL injection payload strings. |
| ml/data/raw/payloads/sqli/Generic_TimeBased.txt | SQLi sample set | Time-based SQL injection payload strings. |
| ml/data/raw/payloads/sqli/Generic_UnionSelect.txt | SQLi sample set | UNION SELECT SQL injection payload strings. |
| ml/data/raw/payloads/sqli/SQLi_Polyglots.txt | SQLi polyglot set | Polyglot SQLi payload strings with cross-context behavior. |
| ml/data/raw/payloads/xss/MarioXSSVectors.txt | XSS sample set | XSS payload vectors from public corpus-style lists. |
| ml/data/raw/payloads/xss/XSS_Polyglots.txt | XSS polyglot set | Polyglot XSS payload strings for parser/context stress. |
| ml/data/raw/payloads/xss/xss_alert.txt | XSS sample set | Alert-style script payload examples. |
| ml/data/raw/payloads/xss/xss_alert_identifiable.txt | XSS sample set | Identifiable alert-style payload examples. |

4.3 Source Files (ml/src)

| File | What It Contains | Detailed Behavior |
| --- | --- | --- |
| ml/src/__init__.py | Package marker | Enables package-style imports (from src...). |
| ml/src/build_payload_dataset.py | Payload dataset builder | Loads TXT and optional Kaggle CSV sources, normalizes labels/columns, deduplicates rows, writes data/processed/payloads_clean.csv. Handles missing folders/files with warnings. |
| ml/src/parse_cve_data.py | CVE service-map builder | Parses NVD JSON feeds (recent + modified) from raw payloads folder, extracts CPE/severity/CVSS entries, writes data/processed/cve_service_map.csv. |
| ml/src/train_payload_classifier.py | Payload model trainer | Trains TF-IDF + LogisticRegression pipeline on payloads_clean.csv, prints classification report, saves models/payload_classifier.joblib. |
| ml/src/feature_pipeline.py | Attack-feature preprocessor | Builds ColumnTransformer with numeric StandardScaler + text TfidfVectorizer (payload_text), used by attack predictor training pipeline. |
| ml/src/build_attack_dataset.py | Attack-feature dataset builder | Loads payload model + payload dataset, computes payload probabilities, adds simulated scanner-signal features (open_ports, nikto_warnings, ssl_weak_ciphers, etc.), maps Normal -> NONE, writes data/processed/attack_features.csv. |
| ml/src/train_attack_predictor.py | Attack predictor trainer | Loads attack features, label-encodes attack_label, trains Pipeline(preprocess + XGBClassifier), prints metrics/confusion matrix, saves models/attack_predictor.joblib + models/attack_label_encoder.joblib. |
| ml/src/cve_lookup.py | Service-to-CVE lookup utility | Loads data/processed/cve_service_map.csv if present else empty frame; lookup_service_cves(service_name) returns count/max CVSS for matching CPE strings. |
| ml/src/attack_graph.py | Risk graph utilities | Builds target/subdomain graph (networkx), attaches node risk, computes simple propagated risk (current + 0.1 * neighbor_sum, capped 100). |
| ml/src/inference.py | Main inference engine | Loads model artifacts at import time; extracts scanner features + services; predicts payload class, attack label, confidence; computes risk score and graph risk score; returns structured prediction payload. |
| ml/src/inference_api.py | Minimal inference API wrapper | FastAPI app with POST /predict that passes request dict to predict_attack. |
| ml/src/data_ingest.py | JSON-to-CSV helper | Converts one raw scan JSON into a tabular row (open_ports, warnings, CVE count, payload text, temporary label) and writes CSV output. |
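
The propagated-risk rule listed for ml/src/attack_graph.py (current + 0.1 * neighbor_sum, capped at 100) can be illustrated with a small standalone sketch. The real module builds its graph with networkx. This version uses plain dictionaries so it runs without dependencies. It also assumes undirected edges and a single propagation pass over the original risks, neither of which is confirmed by the source. The function name and the example hosts are hypothetical.

```python
def propagate_risk(node_risk, edges, weight=0.1, cap=100.0):
    """Return own risk + weight * sum of neighbor risks per node, capped at `cap`.

    Assumes undirected edges and one pass over the original (unpropagated) risks.
    """
    neighbors = {node: set() for node in node_risk}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {
        node: min(cap, risk + weight * sum(node_risk[n] for n in neighbors[node]))
        for node, risk in node_risk.items()
    }


# Hypothetical target with two subdomains attached to it.
node_risk = {"example.com": 60.0, "api.example.com": 80.0, "dev.example.com": 95.0}
edges = [("example.com", "api.example.com"), ("example.com", "dev.example.com")]

propagated = propagate_risk(node_risk, edges)
for node, risk in propagated.items():
    print(f"{node}: {risk:.1f}")
```

In this example the root node picks up 10% of both subdomain risks (60 + 17.5 = 77.5), while dev.example.com would reach 101 and is clipped to the cap of 100.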

4.4 Tests (ml/tests)

| File | What It Contains | Detailed Behavior |
| --- | --- | --- |
| ml/tests/__init__.py | Test package marker | Enables package-aware test imports. |
| ml/tests/test_inference.py | Inference contract test | Calls predict_attack with a synthetic dict; asserts expected keys + numeric bounds (risk_score 0..100, confidence 0..1). |
| ml/tests/test_real_site_simulation.py | Simulation test script | Sends a more realistic payload-style input to predict_attack and prints result for manual validation. |

5. Current Inference Contract

Function:

  • predict_attack(scan_input: dict) -> dict in ml/src/inference.py

Primary input keys used by current inference code:

  • payload (string)
  • optional nested scanner blocks: nmap, nikto, sslscan, dirsearch
  • optional graph context: subdomains, subfinder.subdomains, amass.subdomains, subdomain_scan_results
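
To make the input shape concrete, here is a hedged sketch of a scan_input dictionary together with a helper that merges the optional subdomain sources listed above. The helper name collect_subdomains and the inner field names of the nmap block are hypothetical. subdomain_scan_results is left out because its structure is not described in this document.

```python
def collect_subdomains(scan_input):
    """Gather subdomains from the optional graph-context keys, de-duplicated in first-seen order."""
    found = []
    found.extend(scan_input.get("subdomains") or [])
    for tool in ("subfinder", "amass"):
        block = scan_input.get(tool) or {}
        found.extend(block.get("subdomains") or [])
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    return list(dict.fromkeys(found))


scan_input = {
    "payload": "id=1' OR '1'='1",
    "nmap": {"open_ports": [80, 443]},  # inner field names are placeholders
    "subdomains": ["api.example.com"],
    "subfinder": {"subdomains": ["api.example.com", "dev.example.com"]},
}

print(collect_subdomains(scan_input))
```

Because every graph-context key is optional, the helper treats a missing or empty block as "no subdomains" rather than raising.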

Key output fields:

  • payload_prediction
  • attack_prediction
  • confidence (float in 0..1, rounded)
  • risk_score (0..100 int)
  • graph_risk_score
  • scanner_summary
  • explanation
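
The output contract above can be checked mechanically. The following sketch is not taken from ml/tests/test_inference.py. The function name contract_violations and the sample result are hypothetical, and the value types chosen for fields other than confidence and risk_score are assumptions.

```python
EXPECTED_KEYS = {
    "payload_prediction", "attack_prediction", "confidence", "risk_score",
    "graph_risk_score", "scanner_summary", "explanation",
}


def contract_violations(result):
    """Return a list of contract problems; an empty list means the result conforms."""
    problems = [f"missing key: {key}" for key in sorted(EXPECTED_KEYS - set(result))]

    confidence = result.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        problems.append("confidence must be a number in 0..1")

    risk = result.get("risk_score")
    is_int = isinstance(risk, int) and not isinstance(risk, bool)
    if not is_int or not 0 <= risk <= 100:
        problems.append("risk_score must be an int in 0..100")

    return problems


# Hand-written sample shaped like a prediction result (not real model output).
sample = {
    "payload_prediction": "SQLi",
    "attack_prediction": "SQLi",
    "confidence": 0.97,
    "risk_score": 82,
    "graph_risk_score": 85.0,
    "scanner_summary": {},
    "explanation": "illustrative text",
}

print(contract_violations(sample))
```

Returning a list of problems instead of raising on the first one makes it easier to see every violation in a single test run.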

6. Operational Notes

  • ml/src/inference.py loads model artifacts at import time; if any model file is missing, the import itself fails.
  • ml/src/cve_lookup.py degrades gracefully when CVE map file is absent.
  • Most scripts resolve paths relative to Path(__file__) (via ML_ROOT), so they work regardless of the caller's current working directory.
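
The graceful degradation described for ml/src/cve_lookup.py can be sketched as below. The real module loads the map into a data frame and exposes lookup_service_cves(service_name). This standalone version uses the standard csv module instead, passes the rows explicitly, and assumes column names cpe and cvss as well as the returned key names, none of which are confirmed by the source.

```python
import csv
from pathlib import Path

# Inside a module under ml/src/, the ML root would typically be derived as
# Path(__file__).resolve().parents[1]; that exact expression is an assumption.


def load_cve_rows(csv_path):
    """Return CVE map rows, or an empty list when the file is absent."""
    path = Path(csv_path)
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def lookup_service_cves_sketch(service_name, rows):
    """Count rows whose CPE string mentions the service and report the highest CVSS score."""
    needle = service_name.lower()
    scores = [float(row["cvss"]) for row in rows if needle in row["cpe"].lower()]
    return {"cve_count": len(scores), "max_cvss": max(scores, default=0.0)}


# With no map file on disk, the lookup still answers, with zero matches.
rows = load_cve_rows("does/not/exist/cve_service_map.csv")
print(lookup_service_cves_sketch("apache", rows))
```

The key point is that an absent map yields an empty result rather than an exception, which keeps inference usable before step 2 of the flow has been run.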

7. Known Caveats

  • Some data-generation scripts include simulated/random fields for training convenience and are not strict production telemetry pipelines.
  • XGBoost is now the active attack predictor training model; ensure xgboost is installed in the runtime/training environment.
  • ml/tests/test_inference.py currently sends keys such as payload_text, whereas inference expects payload; the test still passes only because inference falls back to a default payload.
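
The key mismatch in the last caveat is easy to miss because nothing fails. The sketch below reproduces the effect in isolation. extract_payload is a hypothetical stand-in for however inference.py reads the payload, and the empty-string default is an assumption; the actual fallback value is not stated in this document.

```python
def extract_payload(scan_input, default=""):
    """Read the payload with a silent fallback, mirroring the behavior the caveat describes."""
    return scan_input.get("payload", default)


# Input shaped like the current test: the text sits under the wrong key.
test_style_input = {"payload_text": "' OR 1=1 --"}
# Input using the key that inference actually reads.
correct_input = {"payload": "' OR 1=1 --"}

print(repr(extract_payload(test_style_input)))  # fallback value; the SQLi string is ignored
print(repr(extract_payload(correct_input)))     # the SQLi string is read
```

Under these assumptions, a test built on the first input only exercises the default-payload path, so its assertions on key presence and numeric bounds say nothing about how a real payload is classified.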