Security Information and Event Management (SIEM)-inspired system capable of multi-source log fusion, detection, and attack sequence reconstruction using machine learning—specifically, Transformer-based models.

Advanced SIEM Attack Detection and Reconstruction Using Transformers

Assignment Summary

This project implements a Transformer-based framework for detecting and reconstructing multi-stage cyber attacks from Security Information and Event Management (SIEM) logs. The goal is to model event sequences at the session level, identify high-risk attack activity, and reconstruct plausible attack chains using attention-based correlations.

The system performs:

  • Sequence-level attack detection using a Transformer encoder
  • Risk-based supervision derived from SIEM metadata
  • Post-hoc attack reconstruction using attention weights
  • Greedy decoding and graph-based correlation analysis
  • Analyst-friendly visualizations of reconstructed attack chains

The project satisfies all required deliverables of the assignment, including probability-based inference, reconstructed attack chains in JSON format, and visual attack-chain representations.
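
The detector described above can be sketched as a small PyTorch module: embed the encoded events, run them through a Transformer encoder, mean-pool over the sequence, and classify. This is a minimal illustration only; the project's actual model lives in src/model/transformer.py, and the hyperparameters here (vocab size, d_model, number of layers) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Minimal Transformer-encoder sequence classifier (illustrative sketch)."""

    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, tokens):                # tokens: (batch, seq_len) int64 event ids
        h = self.encoder(self.embed(tokens))  # (batch, seq_len, d_model)
        return self.head(h.mean(dim=1))       # mean-pool over time -> (batch, num_classes)

model = SequenceClassifier()
logits = model(torch.randint(0, 1000, (8, 32)))
print(logits.shape)  # torch.Size([8, 2])
```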


Dataset Source

The experiments use the Advanced SIEM Dataset, a large-scale synthetic SIEM log dataset hosted on Hugging Face:

Dataset URL: https://huggingface.co/datasets/darkknight25/Advanced_SIEM_Dataset


Project Structure

Advanced-SIEM-Transformer/
│
├── data/
│   └── 1_data_loading.py          # Preprocessing and sequence construction
│
├── src/
│   └── model/
│       └── transformer.py         # Transformer encoder with attention extraction
│
├── train_transformer.py           # Training, evaluation, and reconstruction
├── reconstructed_chains.py        # Generates reconstructed_chains.json
│
├── processed/                     # Generated preprocessing artifacts (ignored by git)
├── results/                       # Model outputs and figures (ignored by git)
│
├── README.md
└── .gitignore

How to Run

1. Environment Setup

This project requires Python 3.10+ and the following core libraries:

pip install numpy pandas scikit-learn torch matplotlib networkx

2. Preprocessing

Run the preprocessing script to:

  • Load the dataset
  • Normalize timestamps
  • Construct sessions
  • Build fixed-length sequences
  • Encode features and labels

python3 data/1_data_loading.py

This step generates artifacts in the processed/ directory, including:

  • sequences_cat.npy
  • sequences_num.npy
  • sequence_labels.npy
  • sequence_event_ids.pkl
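
The session-and-sequence construction step can be sketched as follows on a toy log: group events by session, order them by timestamp, integer-encode the event types, and pad or truncate each session to a fixed length. The column names (session_id, timestamp, event_type) and the fixed length are hypothetical stand-ins for the dataset's actual schema.

```python
import numpy as np
import pandas as pd

SEQ_LEN = 4  # fixed sequence length (illustrative)
PAD = 0      # padding token id

# Toy SIEM log with a hypothetical schema.
logs = pd.DataFrame({
    "session_id": ["s1", "s1", "s2", "s1", "s2"],
    "timestamp":  [3, 1, 5, 2, 4],
    "event_type": ["login", "scan", "exec", "login", "scan"],
})

# Encode categorical event types as integer ids (0 reserved for padding).
codes = {e: i + 1 for i, e in enumerate(sorted(logs["event_type"].unique()))}
logs["event_id"] = logs["event_type"].map(codes)

# Sort within each session by timestamp, then pad/truncate to SEQ_LEN.
sequences = []
for _, g in logs.sort_values("timestamp").groupby("session_id"):
    ids = g["event_id"].tolist()[:SEQ_LEN]
    sequences.append(ids + [PAD] * (SEQ_LEN - len(ids)))

sequences = np.array(sequences)
print(sequences.shape)  # (2, 4): one fixed-length row per session
```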

3. Training and Evaluation

Train the Transformer model, evaluate performance, and generate reconstruction artifacts:

python3 train_transformer.py

This script performs:

  • Model training with class-weighted loss
  • Evaluation with accuracy, precision, recall, F1-score, and ROC-AUC
  • Attention extraction for reconstruction
  • Greedy decoding of attack paths
  • Graph reconstruction of event correlations

Outputs are saved in the results/ directory.
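
For the class-weighted loss, one common recipe is inverse-frequency weighting passed to CrossEntropyLoss; whether the project derives its weights exactly this way is an assumption, but the sketch shows the mechanic:

```python
import numpy as np
import torch
import torch.nn as nn

# Toy label distribution: attack sequences (class 1) are rare.
labels = np.array([0] * 90 + [1] * 10)

# Inverse-frequency class weights (one common choice; illustrative).
counts = np.bincount(labels)
weights = torch.tensor(len(labels) / (len(counts) * counts), dtype=torch.float32)

criterion = nn.CrossEntropyLoss(weight=weights)  # rare class contributes more per example
logits = torch.randn(4, 2)
targets = torch.tensor([0, 1, 1, 0])
loss = criterion(logits, targets)
print(weights.tolist())  # [~0.556, 5.0]
```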


4. Reconstructed Attack Chains

Generate analyst-ready reconstructed attack chains in JSON format:

python3 reconstructed_chains.py

This produces:

  • results/reconstructed_chains.json

The JSON file contains reconstructed attack chains with:

  • Ordered event IDs
  • Anomaly scores
  • Attention-based influence values
  • Reconstruction method metadata
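
One way to picture the greedy, attention-based decoding: start from a chosen event and repeatedly follow the strongest attention edge to an event not yet in the chain. The decoding rule and the JSON field names below are illustrative assumptions, not the script's exact implementation.

```python
import json
import numpy as np

def greedy_chain(attention, start):
    """Follow the strongest outgoing attention edge to an unvisited event."""
    chain, current = [start], start
    while True:
        scores = attention[current].copy()
        scores[chain] = -np.inf          # never revisit an event
        nxt = int(scores.argmax())
        if scores[nxt] == -np.inf:       # all remaining events visited
            break
        chain.append(nxt)
        current = nxt
    return chain

# Toy mean-attention matrix over 4 events (rows attend to columns).
att = np.array([
    [0.0, 0.8, 0.1, 0.1],
    [0.1, 0.0, 0.7, 0.2],
    [0.2, 0.1, 0.0, 0.7],
    [0.3, 0.3, 0.4, 0.0],
])

chain = greedy_chain(att, start=0)
record = {"events": chain, "method": "greedy_attention"}
print(json.dumps(record))  # {"events": [0, 1, 2, 3], "method": "greedy_attention"}
```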

Example Output Figures

The following visual artifacts are generated automatically:

File                    Description
confusion_matrix.png    Classification performance
roc_curve.png           ROC curve for sequence-level detection
mean_attention.npy      Mean attention matrix (used for heatmaps)
chain_1_timeline.png    Timeline view of a reconstructed attack chain
chain_1_graph.png       Graph visualization of a reconstructed attack chain

These figures are suitable for direct inclusion in the assignment report.


Notes

  • Due to the synthetic and highly imbalanced nature of the dataset, ROC-AUC should be interpreted with caution.
  • Attention weights are used as a proxy for event correlation and are not guaranteed to represent true causality.
  • Reconstruction is performed post-hoc and does not influence model training.
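
The ROC-AUC caveat is worth making concrete: on heavily imbalanced data, ranking metrics can look strong while the alert stream is mostly false positives. This toy example (labels and scores invented for illustration) shows ROC-AUC near 0.91 while average precision sits at 0.1, i.e. only 1 in 10 alerts is a true attack:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy imbalanced setting: 10 attacks among 1000 sequences.
y_true = np.array([0] * 990 + [1] * 10)
y_score = np.concatenate([
    np.full(90, 0.9),   # high-scoring false positives
    np.full(900, 0.1),  # easy true negatives
    np.full(10, 0.8),   # true attacks, scored below the worst false positives
])

print(round(roc_auc_score(y_true, y_score), 3))            # 0.909: looks strong
print(round(average_precision_score(y_true, y_score), 3))  # 0.1: alerts are mostly FPs
```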
