Statistics Course Thesis — Cybersecurity
This repository contains the LaTeX source, Python experiments, and compiled PDF for a thesis investigating online algorithms for real-time statistical computation, with applications to intrusion detection and machine learning.
The thesis focuses on streaming estimators (e.g. online mean and variance), change-point detection methods, and online learning techniques relevant to intrusion detection and anomaly detection in network traffic.
├── main.tex # Root LaTeX document
├── main.pdf # Compiled thesis (tracked for convenience)
├── references.bib # BibLaTeX bibliography
├── chapters/ # Chapter source files
│ ├── 1-Introduction.tex
│ ├── 2-Background.tex
│ ├── 3-Estimation.tex # Online estimators (Welford, EMA)
│ ├── 4-Detection.tex # CUSUM, EWMA control charts
│ ├── 5-Learning.tex # SGD, online logistic regression
│ ├── 6-Study.tex # NSL-KDD case study
│ └── 7-Conclusion.tex
├── experiment/ # Python experiment code
│ ├── experiment_nsl_kdd.py
│ └── requirements.txt
└── data/ # NSL-KDD dataset
    ├── KDDTrain+.txt
    └── KDDTest+.txt
- TeX Live 2022+ or equivalent LaTeX distribution
- Biber (for BibLaTeX bibliography processing)
# Recommended: use latexmk for automated builds
latexmk -pdf main.tex
# Manual build (if latexmk unavailable)
pdflatex main.tex
biber main
pdflatex main.tex
pdflatex main.tex

The case study compares batch and online logistic regression on the NSL-KDD intrusion detection dataset.
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r experiment/requirements.txt

python experiment/experiment_nsl_kdd.py \
--data-root data \
--eta 0.01 \
--lambda-reg 1e-4 \
--bootstrap-runs 1000 \
--bootstrap-seed 42 \
--results-json experiment/results/latest_results.json

Key CLI flags:
- `--data-root`, `--train-file`, `--test-file` select alternative NSL-KDD splits.
- `--eta`, `--lambda-reg`, `--threshold` tune the online learner without editing code.
- `--bootstrap-runs`, `--bootstrap-seed` request variability estimates via paired bootstrap resampling of the test stream.
- `--results-json` controls where a machine-readable metrics artifact is stored (use `--no-json` to skip).
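
To make the paired bootstrap idea concrete, here is a minimal sketch using only the standard library. It is not the code in `experiment_nsl_kdd.py`: the function name `paired_bootstrap_diff`, the choice of accuracy as the metric, and the percentile interval are all assumptions made for illustration. The key point is that each resampled index keeps both models' predictions for the same test example together.

```python
import random


def paired_bootstrap_diff(y_true, pred_a, pred_b, runs=1000, seed=42):
    """Bootstrap the accuracy difference (model B minus model A).

    Hypothetical helper: the same resampled indices are applied to both
    models, which is what makes the bootstrap "paired".
    Returns (mean difference, 2.5th percentile, 97.5th percentile).
    """
    rng = random.Random(seed)
    n = len(y_true)
    # Per-example correctness indicators for each model.
    correct_a = [int(p == y) for p, y in zip(pred_a, y_true)]
    correct_b = [int(p == y) for p, y in zip(pred_b, y_true)]
    diffs = []
    for _ in range(runs):
        # Sample n test indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_b - acc_a)
    diffs.sort()
    lower = diffs[int(0.025 * (runs - 1))]
    upper = diffs[int(0.975 * (runs - 1))]
    return sum(diffs) / runs, lower, upper
```

Fixing the seed, as `--bootstrap-seed` does for the real script, makes the resampled intervals reproducible across runs.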
The script prints dataset stats, confusion matrices, and a publication-ready LaTeX table comparing:
- Batch Logistic Regression: trained once on the full training set with `class_weight="balanced"`.
- Online Logistic Regression (SGD): pre-trained with a single pass and updated prequentially on the test stream.
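
The prequential ("test-then-train") protocol used for the online learner can be sketched as follows. This is an illustrative stand-in rather than the repository's implementation: the class `OnlineLogReg` and the function `prequential_accuracy` are hypothetical names, and the parameters `eta`, `lambda_reg`, and `threshold` are assumed to play the same roles as the CLI flags of the same name.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogReg:
    """Hypothetical online logistic regression trained by per-example SGD."""

    def __init__(self, n_features, eta=0.01, lambda_reg=1e-4, threshold=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.eta = eta                # learning rate
        self.lambda_reg = lambda_reg  # L2 penalty on the weights
        self.threshold = threshold    # decision threshold on P(y = 1)

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def predict(self, x):
        return int(self.predict_proba(x) >= self.threshold)

    def update(self, x, y):
        # One SGD step on the L2-regularised log loss of a single example.
        err = self.predict_proba(x) - y
        self.w = [wi - self.eta * (err * xi + self.lambda_reg * wi)
                  for wi, xi in zip(self.w, x)]
        self.b -= self.eta * err


def prequential_accuracy(model, stream):
    """Predict each example first, then use its label to update the model."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test
        model.update(x, y)                     # then train
        total += 1
    return correct / total if total else 0.0
```

Because every example is scored before the model sees its label, the resulting accuracy is an honest estimate of performance on unseen data even though the model keeps learning during evaluation.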
The JSON artifact contains the same metrics, hyperparameters, confusion matrices, runtimes, and (when enabled) bootstrap summaries so that Table 6.1 and the reported confidence intervals can be regenerated directly from the repository.
Note: The executable code currently covers only the logistic-regression study from Chapter 6. The CUSUM/EWMA schemes discussed in Chapter 4 are presented at the theoretical level and do not yet have accompanying simulation scripts in this repository.
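
For readers who want a concrete reference point for the Chapter 4 material, a one-sided (upper) CUSUM can be sketched in a few lines. This is an illustration only and not part of the repository's code; the function name `cusum_upper` and the parameter names `mu0` (in-control mean), `k` (reference value), and `h` (decision threshold) are assumptions following common textbook notation.

```python
def cusum_upper(xs, mu0, k, h):
    """Return the index of the first upper-CUSUM alarm, or None if no alarm.

    The statistic follows S_t = max(0, S_{t-1} + (x_t - mu0 - k)),
    and an alarm is raised the first time S_t exceeds h.
    """
    s = 0.0
    for t, x in enumerate(xs):
        s = max(0.0, s + (x - mu0 - k))
        if s > h:
            return t
    return None
```

With `mu0=0`, `k=0.5`, and `h=4`, a stream that jumps from 0 to 2 accumulates 1.5 per post-change observation, so the alarm fires on the third observation after the shift.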
| Chapter | Topic | Key Algorithms |
|---|---|---|
| 3 | Online Estimation | Welford's algorithm, EMA |
| 4 | Change Detection | CUSUM, EWMA control charts |
| 5 | Online Learning | SGD, online logistic regression |
| 6 | Case Study | NSL-KDD intrusion detection |
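
As a quick orientation for the Chapter 3 algorithms, the sketch below shows Welford's single-pass update for the mean and sample variance alongside an exponential moving average. The names `Welford` and `ema` and the smoothing parameter `alpha` are illustrative choices, not identifiers taken from the thesis or the experiment code.

```python
class Welford:
    """Single-pass running mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old and the new mean, which keeps the update stable.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Unbiased sample variance; defined as 0.0 for fewer than 2 points.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def ema(values, alpha):
    """Exponential moving average, initialised with the first observation."""
    s = None
    out = []
    for x in values:
        s = x if s is None else alpha * x + (1 - alpha) * s
        out.append(s)
    return out
```

Both estimators use constant memory per stream, which is what makes them suitable for real-time traffic monitoring.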
- Thesis content: All rights reserved (academic use permitted with citation)
- Code (`experiment/`): MIT License
- Dataset (`data/`): Public domain (see `data/README.md` for attribution)
See LICENSE.md for full details.
Aldo Ristori
Master of Science in Cybersecurity — Statistics 25/26