This is the detailed ML module reference for the currently tracked `ml/` tree.

`ml/` provides:
- payload intent classification
- attack class prediction from payload + scanner-derived features
- risk score generation
- CVE enrichment helper logic
- optional FastAPI inference endpoint

Tracked files:

- `ml/`
  - `.gitignore`
  - `ML_MODULE_DOCUMENTATION.md`
  - `README.md`
  - `requirements.txt`
  - `data/`
    - `raw/`
      - `README.md`
      - `sample_scan.json`
      - `payloads/`
        - `normal/`
          - `normal_payloads.txt`
        - `sqli/`
          - `Auth_Bypass.txt`
          - `Generic_ErrorBased.txt`
          - `Generic_TimeBased.txt`
          - `Generic_UnionSelect.txt`
          - `SQLi_Polyglots.txt`
        - `xss/`
          - `MarioXSSVectors.txt`
          - `XSS_Polyglots.txt`
          - `xss_alert.txt`
          - `xss_alert_identifiable.txt`
  - `src/`
    - `__init__.py`
    - `attack_graph.py`
    - `build_attack_dataset.py`
    - `build_payload_dataset.py`
    - `cve_lookup.py`
    - `data_ingest.py`
    - `feature_pipeline.py`
    - `inference.py`
    - `inference_api.py`
    - `parse_cve_data.py`
    - `train_attack_predictor.py`
    - `train_payload_classifier.py`
  - `tests/`
    - `__init__.py`
    - `test_inference.py`
    - `test_real_site_simulation.py`

Notes:
- Generated artifacts in `ml/models/`, `ml/data/processed/`, `ml/reports/`, and notebooks are mostly untracked and produced at runtime/training time.

End-to-end flow:
- Build payload corpus with `src/build_payload_dataset.py`.
- Parse CVE feed maps with `src/parse_cve_data.py`.
- Train payload classifier with `src/train_payload_classifier.py`.
- Build attack-feature dataset with `src/build_attack_dataset.py`.
- Train attack predictor with `src/train_attack_predictor.py`.
- Load trained artifacts in `src/inference.py` and call `predict_attack(scan_input)`.
- Optionally expose `POST /predict` through `src/inference_api.py`.
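
The training-time steps above can be chained in a small driver. This is a minimal sketch, assuming each script runs without command-line arguments; `TRAINING_STEPS`, `build_commands`, and `run_pipeline` are hypothetical names and are not part of the module.

```python
import subprocess
import sys
from pathlib import Path

# Training-time scripts in the order listed above (paths relative to ml/).
TRAINING_STEPS = [
    "src/build_payload_dataset.py",
    "src/parse_cve_data.py",
    "src/train_payload_classifier.py",
    "src/build_attack_dataset.py",
    "src/train_attack_predictor.py",
]


def build_commands(ml_root: Path) -> list[list[str]]:
    """Return one interpreter command per training step, in order."""
    root = ml_root.resolve()
    return [[sys.executable, str(root / step)] for step in TRAINING_STEPS]


def run_pipeline(ml_root: Path, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(ml_root):
        print(" ".join(cmd))
        if not dry_run:
            # Assumption: scripts take no arguments and exit non-zero on failure.
            subprocess.run(cmd, cwd=ml_root.resolve(), check=True)
```

Calling `run_pipeline(Path("ml"))` only prints the commands; passing `dry_run=False` runs them and stops at the first failing step because of `check=True`.
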

| File | What It Contains | Detailed Behavior |
|---|---|---|
| `ml/.gitignore` | Ignore policy | Excludes generated models, processed data, notebooks/reports artifacts where configured. |
| `ml/ML_MODULE_DOCUMENTATION.md` | ML module knowledge base | This document. |
| `ml/README.md` | ML usage guide | Setup, training commands, inference usage, and high-level pipeline description. |
| `ml/requirements.txt` | Dependency list | Runtime/training dependencies: numpy, pandas, sklearn, xgboost, joblib, fastapi, uvicorn, nltk, pytest, networkx. |

| File | What It Contains | Detailed Behavior |
|---|---|---|
| `ml/data/raw/README.md` | Raw dataset instructions | Documents external dataset sources, expected local placement, and quick setup commands. |
| `ml/data/raw/sample_scan.json` | Example scan fixture | Small fixture used by ingestion/testing scripts for quick local validation. |
| `ml/data/raw/payloads/normal/normal_payloads.txt` | Benign payload corpus | Plaintext normal requests used as non-malicious class examples. |
| `ml/data/raw/payloads/sqli/Auth_Bypass.txt` | SQLi sample set | Auth bypass SQL injection payload strings. |
| `ml/data/raw/payloads/sqli/Generic_ErrorBased.txt` | SQLi sample set | Error-based SQL injection payload strings. |
| `ml/data/raw/payloads/sqli/Generic_TimeBased.txt` | SQLi sample set | Time-based SQL injection payload strings. |
| `ml/data/raw/payloads/sqli/Generic_UnionSelect.txt` | SQLi sample set | UNION SELECT SQL injection payload strings. |
| `ml/data/raw/payloads/sqli/SQLi_Polyglots.txt` | SQLi polyglot set | Polyglot SQLi payload strings with cross-context behavior. |
| `ml/data/raw/payloads/xss/MarioXSSVectors.txt` | XSS sample set | XSS payload vectors from public corpus-style lists. |
| `ml/data/raw/payloads/xss/XSS_Polyglots.txt` | XSS polyglot set | Polyglot XSS payload strings for parser/context stress. |
| `ml/data/raw/payloads/xss/xss_alert.txt` | XSS sample set | Alert-style script payload examples. |
| `ml/data/raw/payloads/xss/xss_alert_identifiable.txt` | XSS sample set | Identifiable alert-style payload examples. |
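
The payload text files can be turned into labeled rows with a few lines of standard-library code. This is a minimal sketch, assuming one payload per line and that the parent folder name (`normal`, `sqli`, `xss`) serves as the label; `load_payload_rows` is a hypothetical helper and does not reproduce `build_payload_dataset.py`.

```python
from pathlib import Path


def load_payload_rows(payload_root: Path) -> list[dict]:
    """Collect deduplicated {"payload", "label"} rows from <label>/<file>.txt files."""
    rows: list[dict] = []
    seen: set[tuple[str, str]] = set()
    if not payload_root.is_dir():
        # Warn and return nothing instead of raising on a missing folder.
        print(f"warning: {payload_root} not found, returning no rows")
        return rows
    for txt_file in sorted(payload_root.glob("*/*.txt")):
        label = txt_file.parent.name  # assumption: folder name doubles as the label
        text = txt_file.read_text(encoding="utf-8", errors="ignore")
        for line in text.splitlines():
            payload = line.strip()
            if not payload or (payload, label) in seen:
                continue  # skip blank lines and exact duplicates
            seen.add((payload, label))
            rows.append({"payload": payload, "label": label})
    return rows
```

For this tree the call would be `load_payload_rows(Path("ml/data/raw/payloads"))`.
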

| File | What It Contains | Detailed Behavior |
|---|---|---|
| `ml/src/__init__.py` | Package marker | Enables package-style imports (`from src...`). |
| `ml/src/build_payload_dataset.py` | Payload dataset builder | Loads TXT and optional Kaggle CSV sources, normalizes labels/columns, deduplicates rows, writes `data/processed/payloads_clean.csv`. Handles missing folders/files with warnings. |
| `ml/src/parse_cve_data.py` | CVE service-map builder | Parses NVD JSON feeds (recent + modified) from raw payloads folder, extracts CPE/severity/CVSS entries, writes `data/processed/cve_service_map.csv`. |
| `ml/src/train_payload_classifier.py` | Payload model trainer | Trains TF-IDF + `LogisticRegression` pipeline on `payloads_clean.csv`, prints classification report, saves `models/payload_classifier.joblib`. |
| `ml/src/feature_pipeline.py` | Attack-feature preprocessor | Builds `ColumnTransformer` with numeric `StandardScaler` + text `TfidfVectorizer` (`payload_text`), used by attack predictor training pipeline. |
| `ml/src/build_attack_dataset.py` | Attack-feature dataset builder | Loads payload model + payload dataset, computes payload probabilities, adds simulated scanner-signal features (`open_ports`, `nikto_warnings`, `ssl_weak_ciphers`, etc.), maps `Normal -> NONE`, writes `data/processed/attack_features.csv`. |
| `ml/src/train_attack_predictor.py` | Attack predictor trainer | Loads attack features, label-encodes `attack_label`, trains `Pipeline(preprocess + XGBClassifier)`, prints metrics/confusion matrix, saves `models/attack_predictor.joblib` + `models/attack_label_encoder.joblib`. |
| `ml/src/cve_lookup.py` | Service-to-CVE lookup utility | Loads `data/processed/cve_service_map.csv` if present, else an empty frame; `lookup_service_cves(service_name)` returns count/max CVSS for matching CPE strings. |
| `ml/src/attack_graph.py` | Risk graph utilities | Builds target/subdomain graph (networkx), attaches node risk, computes simple propagated risk (`current + 0.1 * neighbor_sum`, capped at 100). |
| `ml/src/inference.py` | Main inference engine | Loads model artifacts at import time; extracts scanner features + services; predicts payload class, attack label, confidence; computes risk score and graph risk score; returns structured prediction payload. |
| `ml/src/inference_api.py` | Minimal inference API wrapper | FastAPI app with `POST /predict` that passes request dict to `predict_attack`. |
| `ml/src/data_ingest.py` | JSON-to-CSV helper | Converts one raw scan JSON into a tabular row (`open_ports`, warnings, CVE count, payload text, temporary label) and writes CSV output. |
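
The propagation rule listed for `ml/src/attack_graph.py` (current risk plus 0.1 times the sum of neighbor risks, capped at 100) can be illustrated without networkx. This is a minimal sketch using a plain edge list; `propagate_risk` is a hypothetical name, and it assumes an undirected graph and a single pass that reads only the original risk values, which may differ from the module's actual behavior.

```python
def propagate_risk(edges, node_risk, weight=0.1, cap=100.0):
    """Return {node: min(cap, current + weight * sum of neighbor risks)}."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {node: set() for node in node_risk}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    propagated = {}
    for node, adjacent in neighbors.items():
        current = node_risk.get(node, 0.0)  # nodes without a score count as 0
        neighbor_sum = sum(node_risk.get(n, 0.0) for n in adjacent)
        propagated[node] = min(cap, current + weight * neighbor_sum)
    return propagated


edges = [("example.com", "a.example.com"), ("example.com", "b.example.com")]
risk = {"example.com": 40.0, "a.example.com": 90.0, "b.example.com": 60.0}
print(propagate_risk(edges, risk))
```

Here the root node has two neighbors with risks 90 and 60, so its propagated risk is 40 + 0.1 × (90 + 60) = 55. A node that starts at 98 next to a neighbor at 90 would compute 107 and be capped to 100.
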

| File | What It Contains | Detailed Behavior |
|---|---|---|
| `ml/tests/__init__.py` | Test package marker | Enables package-aware test imports. |
| `ml/tests/test_inference.py` | Inference contract test | Calls `predict_attack` with a synthetic dict; asserts expected keys + numeric bounds (`risk_score` 0..100, `confidence` 0..1). |
| `ml/tests/test_real_site_simulation.py` | Simulation test script | Sends a more realistic payload-style input to `predict_attack` and prints result for manual validation. |
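
The contract that `test_inference.py` checks (expected keys plus numeric bounds) can be expressed as a reusable validator that does not need the trained models. This is a minimal sketch; `EXPECTED_KEYS` is taken from the output fields documented for `predict_attack`, and `check_prediction_contract` is a hypothetical helper, not the test's actual code.

```python
EXPECTED_KEYS = {
    "payload_prediction",
    "attack_prediction",
    "confidence",
    "risk_score",
    "graph_risk_score",
    "scanner_summary",
    "explanation",
}


def check_prediction_contract(result: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the result passes."""
    problems = []
    missing = EXPECTED_KEYS - result.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    confidence = result.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range 0..1: {confidence!r}")

    risk_score = result.get("risk_score")
    if not isinstance(risk_score, (int, float)) or not 0 <= risk_score <= 100:
        problems.append(f"risk_score out of range 0..100: {risk_score!r}")
    return problems
```

In a pytest test this could be used as `assert check_prediction_contract(predict_attack(scan_input)) == []`, which reports every violation at once instead of stopping at the first failed assertion.
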

Function: `predict_attack(scan_input: dict) -> dict` in `ml/src/inference.py`

Primary input keys used by current inference code:
- `payload` (string)
- optional nested scanner blocks: `nmap`, `nikto`, `sslscan`, `dirsearch`
- optional graph context: `subdomains`, `subfinder.subdomains`, `amass.subdomains`, `subdomain_scan_results`

Key output fields:
- `payload_prediction`
- `attack_prediction`
- `confidence` (0..1 float, rounded)
- `risk_score` (0..100 int)
- `graph_risk_score`
- `scanner_summary`
- `explanation`
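
The graph-context keys above can appear at several places in the input. This is a minimal sketch of gathering them into one list, assuming the dotted names (`subfinder.subdomains`, `amass.subdomains`) denote nested dicts holding lists of strings; `collect_subdomains` is a hypothetical helper, and `subdomain_scan_results` is left out because its shape is not described here.

```python
def _as_name_list(value) -> list[str]:
    """Return value as a list of strings, or an empty list for anything else."""
    if not isinstance(value, (list, tuple)):
        return []
    return [item for item in value if isinstance(item, str)]


def collect_subdomains(scan_input: dict) -> list[str]:
    """Gather unique subdomain names from the documented graph-context keys."""
    candidates = [scan_input.get("subdomains")]
    for tool in ("subfinder", "amass"):
        block = scan_input.get(tool)
        if isinstance(block, dict):
            candidates.append(block.get("subdomains"))

    ordered: list[str] = []
    for value in candidates:
        for name in _as_name_list(value):
            if name not in ordered:  # keep first-seen order, drop duplicates
                ordered.append(name)
    return ordered


example_input = {
    "payload": "' OR '1'='1",
    "nmap": {},  # inner fields of the scanner blocks are not specified here
    "subdomains": ["a.example.com"],
    "subfinder": {"subdomains": ["a.example.com", "b.example.com"]},
}
print(collect_subdomains(example_input))
```

Every key other than `payload` is optional, so the helper treats missing or malformed blocks as empty rather than raising.
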

Operational notes:
- `ml/src/inference.py` loads model artifacts at import time; missing model files will fail import.
- `ml/src/cve_lookup.py` degrades gracefully when CVE map file is absent.
- Most scripts resolve paths from `Path(__file__)` (`ML_ROOT`) and are resilient to caller cwd.
- Some data-generation scripts include simulated/random fields for training convenience and are not strict production telemetry pipelines.
- XGBoost is now the active attack predictor training model; ensure `xgboost` is installed in the runtime/training environment.
- `ml/tests/test_inference.py` currently sends keys like `payload_text` while inference expects `payload`; test still passes due to default payload fallback behavior.
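
The key mismatch in the last note is easy to miss because a lenient lookup never raises. This is a minimal sketch of the effect, assuming the fallback is an empty string (the real default is not documented here); `extract_payload` and `find_ignored_payload_keys` are hypothetical helpers for illustration.

```python
def extract_payload(scan_input: dict, default: str = "") -> str:
    """Lenient lookup: fall back to a default when `payload` is absent."""
    return str(scan_input.get("payload", default))


def find_ignored_payload_keys(scan_input: dict) -> list[str]:
    """Return payload-like keys that the lenient lookup would silently ignore."""
    return sorted(
        key for key in scan_input
        if key != "payload" and "payload" in key.lower()
    )


test_input = {"payload_text": "<script>alert(1)</script>", "open_ports": 3}
print(repr(extract_payload(test_input)))      # the fallback is used, not the test string
print(find_ignored_payload_keys(test_input))  # flags the misnamed key
```

A check like `find_ignored_payload_keys` in the test, or renaming the test key to `payload`, would make sure the model is actually scored on the intended string.
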