Kaldreic/Statistics_Thesis

Online Algorithms for Real-Time Statistical Computations

Statistics Course Thesis — Cybersecurity

This repository contains the LaTeX source, Python experiments, and compiled PDF for a thesis investigating online algorithms for real-time statistical computation, with applications to intrusion detection and machine learning.

Abstract

This project investigates online algorithms for real-time statistical computations and their applications to cybersecurity and machine learning. We focus on streaming estimators (e.g. online mean and variance), change-point detection methods, and online learning techniques relevant for intrusion detection and anomaly detection in network traffic.

Repository Structure

├── main.tex                 # Root LaTeX document
├── main.pdf                 # Compiled thesis (tracked for convenience)
├── references.bib           # BibLaTeX bibliography
├── chapters/                # Chapter source files
│   ├── 1-Introduction.tex
│   ├── 2-Background.tex
│   ├── 3-Estimation.tex     # Online estimators (Welford, EMA)
│   ├── 4-Detection.tex      # CUSUM, EWMA control charts
│   ├── 5-Learning.tex       # SGD, online logistic regression
│   ├── 6-Study.tex          # NSL-KDD case study
│   └── 7-Conclusion.tex
├── experiment/              # Python experiment code
│   ├── experiment_nsl_kdd.py
│   └── requirements.txt
└── data/                    # NSL-KDD dataset
    ├── KDDTrain+.txt
    └── KDDTest+.txt

Building the Thesis

Requirements

  • TeX Live 2022+ or equivalent LaTeX distribution
  • Biber (for BibLaTeX bibliography processing)

Build Commands

# Recommended: use latexmk for automated builds
latexmk -pdf main.tex

# Manual build (if latexmk unavailable)
pdflatex main.tex
biber main
pdflatex main.tex
pdflatex main.tex

Running the Experiment

The case study compares batch and online logistic regression on the NSL-KDD intrusion detection dataset.

Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r experiment/requirements.txt

Run

python experiment/experiment_nsl_kdd.py \
	--data-root data \
	--eta 0.01 \
	--lambda-reg 1e-4 \
	--bootstrap-runs 1000 \
	--bootstrap-seed 42 \
	--results-json experiment/results/latest_results.json

Key CLI flags:

  • --data-root, --train-file, --test-file select alternative NSL-KDD splits.
  • --eta, --lambda-reg, --threshold tune the online learner without editing code.
  • --bootstrap-runs, --bootstrap-seed request variability estimates via paired bootstrap resampling of the test stream.
  • --results-json controls where a machine-readable metrics artifact is stored (use --no-json to skip).
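To make the idea behind --bootstrap-runs and --bootstrap-seed concrete, here is a minimal sketch of a paired bootstrap on a test stream. It is an illustration under assumptions, not the code in experiment_nsl_kdd.py: the function names, the choice of accuracy as the metric, and the percentile interval are all hypothetical. "Paired" means both models are scored on the same resampled indices in every run, so the difference is not inflated by sampling noise between the two.

```python
import random
from typing import List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int], idx: Sequence[int]) -> float:
    """Accuracy restricted to the (possibly repeated) indices in idx."""
    return sum(y_true[i] == y_pred[i] for i in idx) / len(idx)


def paired_bootstrap_diff(
    y_true: Sequence[int],
    pred_a: Sequence[int],
    pred_b: Sequence[int],
    runs: int = 1000,
    seed: int = 42,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) of accuracy(A) - accuracy(B).

    Hypothetical helper: resamples test indices with replacement and
    evaluates both prediction vectors on the same resample each run.
    The bounds are a simple 95% percentile interval.
    """
    rng = random.Random(seed)  # fixed seed makes the summary reproducible
    n = len(y_true)
    diffs: List[float] = []
    for _ in range(runs):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(accuracy(y_true, pred_a, idx) - accuracy(y_true, pred_b, idx))
    diffs.sort()
    lower = diffs[int(0.025 * (runs - 1))]
    upper = diffs[int(0.975 * (runs - 1))]
    return sum(diffs) / runs, lower, upper


# Toy data: model A is always right, model B always predicts class 1.
y_true = [1, 0] * 50
pred_a = list(y_true)
pred_b = [1] * len(y_true)
mean_diff, lower, upper = paired_bootstrap_diff(y_true, pred_a, pred_b)
```

An interval that excludes zero suggests the accuracy gap is not an artifact of which test examples happened to be drawn.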

Expected Output

The script prints dataset statistics, confusion matrices, and a publication-ready LaTeX table comparing:

  1. Batch Logistic Regression: trained once on the full training set with class_weight="balanced".
  2. Online Logistic Regression (SGD): pre-trained with a single pass over the training set, then updated prequentially (predict first, then learn) on the test stream.
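The prequential (test-then-train) protocol used for the online model can be sketched as follows. This is a minimal pure-Python illustration under assumptions, not the repository's implementation: the class and function names are hypothetical, and the parameters eta, lambda_reg, and threshold merely mirror the CLI flags of the same names. The update is plain SGD on the L2-regularized logistic loss.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogReg:
    """Hypothetical online logistic regression trained by SGD."""

    def __init__(self, dim: int, eta: float = 0.01, lambda_reg: float = 1e-4) -> None:
        self.w: List[float] = [0.0] * dim
        self.b = 0.0
        self.eta = eta                # learning rate
        self.lambda_reg = lambda_reg  # L2 penalty on the weights (not the bias)

    def predict_proba(self, x: Sequence[float]) -> float:
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x: Sequence[float], y: int) -> None:
        # Gradient of the log loss w.r.t. the logit is (p - y).
        err = self.predict_proba(x) - y
        self.w = [
            wi - self.eta * (err * xi + self.lambda_reg * wi)
            for wi, xi in zip(self.w, x)
        ]
        self.b -= self.eta * err


def prequential_accuracy(
    model: OnlineLogReg,
    stream: Iterable[Tuple[Sequence[float], int]],
    threshold: float = 0.5,
) -> float:
    """Score each example before the model is allowed to learn from it."""
    correct = 0
    n = 0
    for x, y in stream:
        y_hat = int(model.predict_proba(x) >= threshold)  # test first
        correct += int(y_hat == y)
        model.update(x, y)                                # then train
        n += 1
    return correct / n if n else 0.0


# Toy stream: the sign of the single feature determines the label.
stream = [([1.0], 1), ([-1.0], 0)] * 100
acc = prequential_accuracy(OnlineLogReg(dim=1, eta=0.5), stream)
```

Because every prediction is made before the corresponding label is used, the resulting accuracy is an honest estimate for a model that keeps adapting during deployment.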

The JSON artifact contains the same metrics, hyperparameters, confusion matrices, runtimes, and (when enabled) bootstrap summaries so that Table 6.1 and the reported confidence intervals can be regenerated directly from the repository.

Note: The executable code currently covers only the logistic-regression study from Chapter 6. The CUSUM/EWMA schemes discussed in Chapter 4 are presented at the theoretical level and do not yet have accompanying simulation scripts in this repository.
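For readers who want a feel for the Chapter 4 material in the meantime, the following is a minimal sketch of a one-sided CUSUM detector for an upward mean shift. It is not part of the repository; the function name and the parameters mu0 (in-control mean), k (allowance), and h (decision threshold) are assumptions chosen for illustration. The statistic follows the standard recursion S_t = max(0, S_{t-1} + x_t - mu0 - k), with an alarm when S_t exceeds h.

```python
from typing import Iterable, Optional


def cusum_first_alarm(
    xs: Iterable[float], mu0: float, k: float, h: float
) -> Optional[int]:
    """Return the 0-based index of the first alarm, or None if none is raised.

    Hypothetical helper implementing a one-sided (upper) CUSUM.
    """
    s = 0.0
    for t, x in enumerate(xs):
        # Accumulate evidence of an upward shift; reset at zero.
        s = max(0.0, s + (x - mu0 - k))
        if s > h:
            return t
    return None


# Toy stream: the mean jumps from 0 to 2 at index 20.
toy_stream = [0.0] * 20 + [2.0] * 20
alarm = cusum_first_alarm(toy_stream, mu0=0.0, k=0.5, h=4.0)
```

On this noise-free toy stream the statistic stays at zero before the shift and grows by 1.5 per observation afterwards, so the threshold of 4.0 is crossed on the third post-shift observation.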

Key Topics Covered

Chapter  Topic              Key Algorithms
3        Online Estimation  Welford's algorithm, EMA
4        Change Detection   CUSUM, EWMA control charts
5        Online Learning    SGD, online logistic regression
6        Case Study         NSL-KDD intrusion detection
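As a quick illustration of the Chapter 3 estimators, here is a minimal sketch of Welford's algorithm and a single EMA step. The class and function names are hypothetical and the code is not taken from the repository; it only shows how the mean and variance of a stream can be maintained in constant memory.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean and variance for a scalar stream."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean          # deviation from the old mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    @property
    def variance(self) -> float:
        """Unbiased sample variance (0.0 until two observations are seen)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def ema_update(prev: float, x: float, alpha: float) -> float:
    """One exponential-moving-average step, with alpha in (0, 1]."""
    return alpha * x + (1.0 - alpha) * prev


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
```

Updating m2 with the product of the old and new deviations avoids the catastrophic cancellation of the naive "sum of squares minus squared sum" formula.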

License

  • Thesis content: All rights reserved (academic use permitted with citation)
  • Code (experiment/): MIT License
  • Dataset (data/): Public domain (see data/README.md for attribution)

See LICENSE.md for full details.

Author

Aldo Ristori
Master of Science in Cybersecurity — Statistics 25/26
