This project implements a Transformer-based framework for detecting and reconstructing multi-stage cyber attacks from Security Information and Event Management (SIEM) logs. The goal is to model event sequences at the session level, identify high-risk attack activity, and reconstruct plausible attack chains using attention-based correlations.
The system performs:
- Sequence-level attack detection using a Transformer encoder
- Risk-based supervision derived from SIEM metadata
- Post-hoc attack reconstruction using attention weights
- Greedy decoding and graph-based correlation analysis
- Analyst-friendly visualizations of reconstructed attack chains
The project satisfies all required deliverables of the assignment, including probability-based inference, reconstructed attack chains in JSON format, and visual attack-chain representations.
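One way an encoder can expose attention weights for post-hoc reconstruction is a custom layer built on `nn.MultiheadAttention`, since PyTorch's stock `nn.TransformerEncoder` does not return them directly. The following is a minimal sketch only; class names, dimensions, and the pooling choice are illustrative assumptions, not the project's actual implementation:

```python
import torch
import torch.nn as nn

class AttnEncoder(nn.Module):
    """Single-layer Transformer encoder that also returns attention weights."""
    def __init__(self, d_model=64, nhead=4, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # Self-attention; weights averaged over heads are returned by default.
        a_out, a_weights = self.attn(x, x, x, need_weights=True)
        x = self.norm1(x + a_out)
        x = self.norm2(x + self.ff(x))
        # Mean-pool over positions for a sequence-level classification logit.
        logits = self.head(x.mean(dim=1))
        return logits, a_weights

model = AttnEncoder()
seq = torch.randn(2, 10, 64)      # (batch, seq_len, d_model)
logits, attn = model(seq)
print(logits.shape, attn.shape)   # torch.Size([2, 2]) torch.Size([2, 10, 10])
```

The returned `(batch, seq_len, seq_len)` attention matrix is what later reconstruction steps can consume.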
The experiments use the Advanced SIEM Dataset, a large-scale synthetic SIEM log dataset hosted on Hugging Face:
Dataset URL: https://huggingface.co/datasets/darkknight25/Advanced_SIEM_Dataset
```
Advanced-SIEM-Transformer/
│
├── data/
│   └── 1_data_loading.py        # Preprocessing and sequence construction
│
├── src/
│   └── model/
│       └── transformer.py       # Transformer encoder with attention extraction
│
├── train_transformer.py         # Training, evaluation, and reconstruction
├── reconstructed_chains.py      # Generates reconstructed_chains.json
│
├── processed/                   # Generated preprocessing artifacts (ignored by git)
├── results/                     # Model outputs and figures (ignored by git)
│
├── README.md
└── .gitignore
```
This project requires Python 3.10+ and the following core libraries:
```bash
pip install numpy pandas scikit-learn torch matplotlib networkx
```

Run the preprocessing script to:
- Load the dataset
- Normalize timestamps
- Construct sessions
- Build fixed-length sequences
- Encode features and labels
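The session-windowing idea behind these steps can be sketched roughly as follows; the column names, window length, and labeling rule here are assumptions for illustration, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

SEQ_LEN = 16  # assumed fixed sequence length

def build_sequences(df: pd.DataFrame, seq_len: int = SEQ_LEN):
    """Group events by session, order them in time, and pad/truncate
    each session to a fixed-length sequence of event codes."""
    df = df.sort_values("timestamp")
    sequences, labels = [], []
    for _, session in df.groupby("session_id"):
        codes = session["event_code"].to_numpy()[:seq_len]
        padded = np.zeros(seq_len, dtype=np.int64)
        padded[: len(codes)] = codes
        sequences.append(padded)
        # Label the whole sequence as attack if any event in it is flagged.
        labels.append(int(session["is_attack"].any()))
    return np.stack(sequences), np.array(labels)

events = pd.DataFrame({
    "session_id": [1, 1, 2, 2, 2],
    "timestamp":  [3, 1, 5, 4, 6],
    "event_code": [7, 2, 9, 4, 1],
    "is_attack":  [0, 0, 1, 0, 0],
})
X, y = build_sequences(events)
print(X.shape, y.tolist())  # (2, 16) [0, 1]
```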
```bash
python3 data/1_data_loading.py
```

This step generates artifacts in the processed/ directory, including:
- `sequences_cat.npy`
- `sequences_num.npy`
- `sequence_labels.npy`
- `sequence_event_ids.pkl`
Train the Transformer model, evaluate performance, and generate reconstruction artifacts:
```bash
python3 train_transformer.py
```

This script performs:
- Model training with class-weighted loss
- Evaluation with accuracy, precision, recall, F1-score, and ROC-AUC
- Attention extraction for reconstruction
- Greedy decoding of attack paths
- Graph reconstruction of event correlations
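Class-weighted loss for imbalanced attack labels can be set up along these lines (a sketch using inverse-frequency weights; the project's actual weighting scheme may differ):

```python
import numpy as np
import torch
import torch.nn as nn

# Inverse-frequency class weights from the training labels (assumed binary).
labels = np.array([0, 0, 0, 0, 1])       # toy label distribution
counts = np.bincount(labels, minlength=2)
weights = counts.sum() / (2.0 * counts)  # rarer class gets a larger weight

criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
logits = torch.randn(5, 2)
loss = criterion(logits, torch.tensor(labels))
print(weights, float(loss))              # weights: [0.625 2.5]
```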
Outputs are saved in the results/ directory.
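As a rough illustration of the greedy decoding and graph-reconstruction steps, the sketch below follows the strongest attention edge out of each event and thresholds attention into a networkx digraph. Function names, the threshold, and the toy attention matrix are assumptions, not the project's actual code:

```python
import numpy as np
import networkx as nx

def greedy_chain(attn: np.ndarray, start: int, steps: int = 4):
    """Greedily follow the strongest attention edge from each event,
    skipping events already on the chain to avoid cycles."""
    chain, current = [start], start
    for _ in range(steps):
        weights = attn[current].copy()
        weights[chain] = -np.inf      # mask visited events
        current = int(np.argmax(weights))
        chain.append(current)
    return chain

def attention_graph(attn: np.ndarray, threshold: float = 0.2):
    """Keep only attention edges at or above a threshold as a directed graph."""
    g = nx.DiGraph()
    rows, cols = np.where(attn >= threshold)
    for i, j in zip(rows, cols):
        if i != j:
            g.add_edge(int(i), int(j), weight=float(attn[i, j]))
    return g

attn = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.1, 0.1, 0.6],
    [0.3, 0.3, 0.2, 0.2],
])
print(greedy_chain(attn, start=0, steps=3))  # [0, 1, 2, 3]
g = attention_graph(attn)
```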
Generate analyst-ready reconstructed attack chains in JSON format:
```bash
python3 reconstructed_chains.py
```

This produces:
results/reconstructed_chains.json
The JSON file contains reconstructed attack chains with:
- Ordered event IDs
- Anomaly scores
- Attention-based influence values
- Reconstruction method metadata
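A chain entry might look roughly like the following; the field names are illustrative, not the exact schema emitted by the script:

```json
{
  "chain_id": 1,
  "method": "greedy_attention_decoding",
  "events": [
    {"event_id": "evt-104", "anomaly_score": 0.91, "attention_influence": 0.37},
    {"event_id": "evt-221", "anomaly_score": 0.84, "attention_influence": 0.29}
  ]
}
```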
The following visual artifacts are generated automatically:
| File | Description |
|---|---|
| `confusion_matrix.png` | Classification performance |
| `roc_curve.png` | ROC curve for sequence-level detection |
| `mean_attention.npy` | Mean attention matrix (used for heatmaps) |
| `chain_1_timeline.png` | Timeline view of reconstructed attack chain |
| `chain_1_graph.png` | Graph visualization of reconstructed attack chain |
These figures are suitable for direct inclusion in the assignment report.
- Due to the synthetic and highly imbalanced nature of the dataset, ROC-AUC should be interpreted with caution.
- Attention weights are used as a proxy for event correlation and are not guaranteed to represent true causality.
- Reconstruction is performed post-hoc and does not influence model training.