A fully Bayesian, unsupervised anomaly detection system for DNS domains, using PyMC for inference, ArviZ for diagnostics, and well-explained feature modeling.
This system models the maliciousness of DNS domains from features that are aggregated from DNS logs and extracted every hour.
We use a Bayesian generative model with latent variables to estimate the posterior probability that a domain is malicious, even without labeled data.
Each domain is modeled as coming from one of two unknown groups:
- 🟢 Benign
- 🔴 Malicious
Since we don’t have ground truth, we treat the class as a latent variable:
θ ~ Beta(1,1) # prior belief of global malicious rate
malicious[i] ~ Bernoulli(θ) # latent variable per domain
Features depend on whether the domain is malicious or not:
X[i] ~ Normal(mu_0, sigma) if benign
X[i] ~ Normal(mu_1, sigma) if malicious
We model the means (mu_0, mu_1) and std (sigma) of the features as random variables too.
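To make the generative story concrete, here is a minimal simulation sketch that uses only the Python standard library. The function name `simulate_domains` and the parameter values are hypothetical, and mu_0, mu_1 and sigma are held fixed here for illustration, whereas the model itself treats them as random variables. Features are assumed to be independent given the class.

```python
import random

def simulate_domains(n_domains, mu_0, mu_1, sigma, seed=0):
    """Draw synthetic (label, features) pairs from the two-group model."""
    rng = random.Random(seed)
    theta = rng.betavariate(1, 1)  # theta ~ Beta(1, 1): global malicious rate
    labels, features = [], []
    for _ in range(n_domains):
        z = 1 if rng.random() < theta else 0  # malicious[i] ~ Bernoulli(theta)
        means = mu_1 if z == 1 else mu_0
        # X[i] ~ Normal(mu_z, sigma), one independent draw per feature
        x = [rng.gauss(m, s) for m, s in zip(means, sigma)]
        labels.append(z)
        features.append(x)
    return theta, labels, features

# Hypothetical parameter values for two standardized features.
theta, labels, features = simulate_domains(
    n_domains=500, mu_0=[-0.5, -0.5], mu_1=[1.5, 1.0], sigma=[1.0, 1.0]
)
```

Inference runs this story in reverse: only `features` is observed, and `theta`, the group parameters, and each label are recovered as posterior distributions.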
- Extract features per domain per hour
- Standardize them using mean/std
- Use PyMC to build the joint probabilistic model
- Sample from the posterior using NUTS
- Infer:
  - P(domain is malicious)
  - mu_0, mu_1 → how malicious/benign domains tend to behave
- Use arviz to:
  - Diagnose convergence
  - Plot and interpret results
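As a rough illustration of the "Infer" step, the sketch below turns posterior draws of the parameters into P(domain is malicious) with an interval. It is standard-library Python, not the code from the scripts. The `draws` list, the function names, and the 94% interval mass are assumptions for the example. Features are treated as independent given the class, and theta must lie strictly between 0 and 1.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def p_malicious(x, theta, mu_0, mu_1, sigma):
    """P(malicious | x) for a single parameter draw, via Bayes' rule in log space."""
    log_benign = math.log1p(-theta) + sum(
        log_normal_pdf(xi, m, s) for xi, m, s in zip(x, mu_0, sigma)
    )
    log_malicious = math.log(theta) + sum(
        log_normal_pdf(xi, m, s) for xi, m, s in zip(x, mu_1, sigma)
    )
    diff = log_benign - log_malicious  # negative log-odds of "malicious"
    if diff > 0:  # avoid overflow in exp for strongly benign points
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))

def summarize(x, draws, mass=0.94):
    """Posterior mean and equal-tailed interval of P(malicious | x) across draws."""
    probs = sorted(p_malicious(x, **d) for d in draws)
    last = len(probs) - 1
    lo = probs[round((1 - mass) / 2 * last)]
    hi = probs[round((1 + mass) / 2 * last)]
    return sum(probs) / len(probs), lo, hi

# Hypothetical posterior draws for a two-feature model.
draws = [
    {"theta": 0.10, "mu_0": [0.0, 0.0], "mu_1": [2.0, 1.5], "sigma": [1.0, 1.0]},
    {"theta": 0.12, "mu_0": [-0.1, 0.0], "mu_1": [2.1, 1.4], "sigma": [1.0, 1.1]},
    {"theta": 0.08, "mu_0": [0.0, 0.1], "mu_1": [1.9, 1.6], "sigma": [0.9, 1.0]},
]
mean, lo, hi = summarize([2.5, 2.0], draws)
```

Averaging over draws rather than plugging in point estimates is what carries parameter uncertainty through to the per-domain score.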
θ ~ Beta(1,1)
↓
malicious[i] ~ Bernoulli(θ)
↓
┌──────────────┐
│ Feature X │
└──────────────┘
↑
┌───────────────────────────────────┐
│ if malicious[i] == 0 → N(mu_0, σ) │
│ if malicious[i] == 1 → N(mu_1, σ) │
└───────────────────────────────────┘
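NUTS operates on continuous parameters, so a discrete indicator such as malicious[i] is usually either handled by a separate sampler step or summed out of the likelihood. Which route the scripts take is not stated here. As an assumption, the summed-out form can be written as follows, with z_i standing for malicious[i], x_i for the feature vector of domain i, and features treated as independent given the class:

```latex
\begin{aligned}
p(x_i \mid \theta, \mu_0, \mu_1, \sigma)
  &= (1-\theta)\,\mathcal{N}(x_i \mid \mu_0, \sigma)
   + \theta\,\mathcal{N}(x_i \mid \mu_1, \sigma), \\
P(z_i = 1 \mid x_i, \theta, \mu_0, \mu_1, \sigma)
  &= \frac{\theta\,\mathcal{N}(x_i \mid \mu_1, \sigma)}
          {(1-\theta)\,\mathcal{N}(x_i \mid \mu_0, \sigma)
           + \theta\,\mathcal{N}(x_i \mid \mu_1, \sigma)}, \\
\mathcal{N}(x_i \mid \mu_k, \sigma)
  &= \prod_{j} \mathcal{N}(x_{ij} \mid \mu_{kj}, \sigma_j), \qquad k \in \{0, 1\}.
\end{aligned}
```

Under this reading, averaging the second expression over posterior draws of (θ, mu_0, mu_1, σ) would give the per-domain P(domain is malicious).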
The model expects 21 normalized features, such as:
- num_requests, avg_ttl, ttl_range, ttl_entropy
- num_ips, ips_entropy, ip_sharing_count
- FFT features: dominant_frequency, spectral_entropy
- Flags: is_in_TI, is_in_tranco
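The exact feature definitions live in the scripts and are not reproduced here. As a hedged illustration only, the sketch below uses common textbook definitions: Shannon entropy of observed values (one plausible reading of ttl_entropy or ips_entropy), and the dominant frequency and spectral entropy of a request-count series computed with a naive DFT. The function names and the example series are hypothetical, and the code uses only the standard library.

```python
import cmath
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def power_spectrum(series):
    """Power at the positive DFT frequencies of a mean-centered series (naive O(n^2))."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    powers = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(centered)
        )
        powers.append(abs(coeff) ** 2)
    return powers

def spectral_features(series):
    """Return (dominant frequency in cycles per sample, spectral entropy in bits)."""
    powers = power_spectrum(series)
    total = sum(powers)
    if total == 0:
        return 0.0, 0.0  # flat series: no periodic structure
    k_max = max(range(len(powers)), key=powers.__getitem__) + 1
    probs = [p / total for p in powers if p > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return k_max / len(series), entropy

# Hypothetical request counts with a period of 8 samples.
series = [10 + 5 * math.sin(2 * math.pi * t / 8) for t in range(48)]
dominant_frequency, spectral_entropy = spectral_features(series)
ttl_entropy = shannon_entropy([300, 300, 60, 60])
```

A strongly periodic series concentrates its power in one frequency bin and so has low spectral entropy, which is why such features can help flag beaconing-like behaviour.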
See the scripts for full details.
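Since the model expects normalized inputs, a z-score step along the following lines is one way to prepare them. This is a sketch under assumptions, not the code from the scripts: the function names are hypothetical, population standard deviation is used, and constant columns are given a standard deviation of 1.0 to avoid division by zero.

```python
import statistics

def fit_standardizer(rows):
    """Per-column mean and population std from a list of feature rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # A constant column has std 0; fall back to 1.0 so it maps to 0.0.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds

def standardize(row, means, stds):
    """Apply previously fitted means and stds to one feature row."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

# Hypothetical training rows with three features.
train = [[1.0, 10.0, 1.0], [3.0, 30.0, 1.0]]
means, stds = fit_standardizer(train)
scaled = standardize([3.0, 30.0, 1.0], means, stds)
```

For new domains at inference time, the means and stds fitted on the training data should be reused rather than recomputed, so that scores stay comparable with the trained posterior.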
- domain-bayes-unsup.py: Training + inference + diagnostics + plots
- infer_new_domain.py: Predicts P(malicious) for new domains using trained posterior
- batch_infer.py: Predicts for many domains in a CSV
- dns_bayesian_unsup_analysis.ipynb: Jupyter version for exploration
- requirements.txt: All dependencies
- maliciousness_report.csv: Predictions
- run_log.md: Describes each run (timestamped)
pip install -r requirements.txt
python domain-bayes-unsup.py --output_dir results/YYYY-MM-DD
python infer_new_domain.py --num_requests=... --min_ttl=... ...
python batch_infer.py
This README was generated on: 2025-04-07 19:51:57
- No labels needed — fully unsupervised
- Produces uncertainty estimates + credible intervals
- Highly interpretable
- Probabilistic scores (not black-box classifications)
- Feature weights learned from data
- Online / streaming inference with dynamic priors
- Custom priors based on domain heuristics
- Hierarchical model across networks or tenants
- Time-evolving beliefs
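One possible reading of the "dynamic priors" idea is to carry the previous run's posterior for θ forward as the next run's prior, in place of Beta(1,1). The sketch below does this by moment-matching a Beta distribution to posterior draws of θ. It is an assumption about how such an extension might work, not something the scripts are stated to do, and the function name and draws are hypothetical.

```python
import statistics

def beta_from_draws(theta_draws):
    """Moment-match Beta(alpha, beta) to posterior draws of theta."""
    m = statistics.fmean(theta_draws)
    v = statistics.variance(theta_draws)
    # A Beta distribution requires 0 < variance < mean * (1 - mean).
    if not 0 < v < m * (1 - m):
        raise ValueError("draws are not compatible with a Beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# Hypothetical posterior draws of theta from a previous hourly run.
alpha, beta = beta_from_draws([0.08, 0.10, 0.12, 0.09, 0.11])
```

The resulting `alpha` and `beta` could then replace the (1, 1) arguments of the θ prior in the next run, so that each hour starts from what the previous hour learned.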
Developed with ❤️ using PyMC, ArviZ, and NumPy.