This project implements an anomaly-based Web Application Firewall (WAF) that uses a Transformer (GPT-2) trained exclusively on benign web application traffic. Instead of relying on static signatures, the system learns normal HTTP request patterns and detects anomalous or malicious requests in real time by scoring request sequences with a causal language model (perplexity).
The implementation is intended for security research, demos, and hackathon prototypes — demonstrating how modern language models can be adapted for zero-day detection in web traffic.
- Reverse-proxy deployment using Nginx in front of a vulnerable web application (OWASP Juice Shop).
- Log ingestion from Nginx access logs (batch and live traffic).
- Parsing and normalization of HTTP requests (method, path, parameter ordering, basic sanitization).
- Tokenization of normalized requests into Transformer-friendly sequences.
- Fine-tuning a pre-trained GPT-2 model on benign traffic (GPU-friendly workflow using Google Colab).
- Unsupervised anomaly detection using perplexity scoring of request sequences.
- Real-time, non-blocking inference via a FastAPI microservice.
- Live detection of malicious payloads injected via `curl` or automated scripts.
- Modular pipeline suitable for incremental retraining and experimentation.
High-level flow:
- Browser → Nginx (WAF placement) → Juice Shop
- Nginx → FastAPI WAF service → GPT-2 inference
- Logs → Normalization → Tokenization → Model training
In operation, Nginx acts as the reverse proxy in front of the application and forwards request metadata (and optionally bodies) to the FastAPI WAF service for scoring. All traffic is logged to Nginx access logs; logs are normalized and tokenized for batch training and threshold calibration.
- Python 3.9+ (core services)
- Hugging Face Transformers
- PyTorch (training & inference)
- FastAPI (real-time scoring microservice)
- Nginx (reverse proxy / WAF placement)
- Docker (Juice Shop demo)
- Google Colab (training with GPUs)
- Linux (Fedora development environment)
Below are high-level steps to run the demo locally. Adjust paths and ports to fit your environment.
- Clone OWASP Juice Shop from https://github.com/juice-shop/juice-shop.git
- Start OWASP Juice Shop (Docker):

```bash
docker run --rm -p 3000:3000 bkimminich/juice-shop
```

- Configure Nginx as a reverse proxy
Create a site config that forwards traffic to Juice Shop and forwards metadata to the FastAPI WAF service. Minimal example (add to your nginx sites-available):
```nginx
server {
    listen 80;
    server_name localhost;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;
        access_log /var/log/nginx/juice_access.log combined;
    }
}
```

- Generate benign traffic
- Use browsers, automated Selenium scripts, or `wrk`/`ab` to generate typical user interactions with Juice Shop, creating benign training data.
- Extract and normalize logs
- Run the included normalization pipeline (examples in `parse_logs.py` / `normalize_logs.py`) to transform Nginx access records into the normalized request representation used for tokenization and training.
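The normalization step can be sketched as follows. This is a minimal illustration assuming the Nginx `combined` log format; the function and regex names are hypothetical, not the actual `parse_logs.py` / `normalize_logs.py` API:

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode

# Nginx "combined" format: ip - user [time] "METHOD /path HTTP/x" status bytes "referer" "ua"
LOG_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')

def normalize(log_line):
    """Extract method + path and canonicalize by sorting query parameters."""
    m = LOG_RE.search(log_line)
    if m is None:
        return None  # malformed or non-request line
    method, target = m.group("method"), m.group("target")
    parts = urlsplit(target)
    # deterministic parameter ordering so equivalent requests normalize identically
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(params)
    return f"{method} {parts.path}" + (f"?{query}" if query else "")
```

For example, a logged `GET /search?b=2&a=1` normalizes to `GET /search?a=1&b=2`, so parameter order no longer produces distinct training sequences.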
- Train GPT-2 on Google Colab
- Upload `train_tokens.txt` (or use a mounted Google Drive). Use Hugging Face Transformers + `Trainer` to fine-tune GPT-2 on benign sequences. See the Model Training section below for details.
- Run FastAPI WAF service
```bash
# from the project root
uvicorn waf_service:app --host 0.0.0.0 --port 8000 --workers 1
```

- Test anomaly detection
- Send benign and malicious requests (examples below) and observe scoring and logs.
Why GPT-2
- GPT-2 is a causal language model that estimates the conditional likelihood of token sequences. This supports probabilistic scoring (per-token log-probabilities and sequence perplexity), making it suitable for unsupervised anomaly detection on sequential text-like data such as tokenized HTTP requests.
Tokenization strategy
- Normalize requests into a canonical textual form containing method, path, sorted query parameters, and a deterministic representation of body or headers when included.
- Use a custom tokenizer vocabulary derived from tokenizing normalized requests and merging with GPT-2's tokenizer (or extend via `special_tokens_map.json`) to capture common HTTP atoms (methods, separators, encodings).
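The atom-splitting idea can be sketched as a simple pre-tokenizer. The `HTTP_ATOMS` list and `pre_tokenize` helper below are illustrative, not the project's actual tokenizer code:

```python
import re

# Hypothetical HTTP "atoms" treated as indivisible units before GPT-2 BPE.
# Order matters in the alternation (longer method names first would be safer
# in a real implementation); this is a sketch only.
HTTP_ATOMS = ["DELETE", "POST", "GET", "PUT", "?", "&", "=", "/"]

def pre_tokenize(normalized_request):
    """Split a normalized request into HTTP atoms plus the literal chunks between them."""
    pattern = "(" + "|".join(re.escape(a) for a in HTTP_ATOMS) + ")"
    return [tok for tok in re.split(pattern, normalized_request) if tok.strip()]
```

The resulting atoms could then be registered with the GPT-2 tokenizer via `tokenizer.add_special_tokens({"additional_special_tokens": [...]})` so they map to single vocabulary entries instead of being split by BPE.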
Training process (high level)
- Prepare a dataset of benign sequences: one normalized request per line, tokenized into GPT-2 tokens.
- Use `DataCollatorForLanguageModeling` to create batches for causal LM fine-tuning (no masked LM).
- Use Hugging Face `Trainer` with `GPT2LMHeadModel` and the standard causal LM loss.
- Recommended: mixed precision (fp16) and gradient accumulation if sequences are long.
Threshold selection
- After training, compute the perplexity distribution on a held-out benign validation set. Choose a detection threshold (e.g., mean + k * std) based on your desired false positive rate and operational constraints.
- Store thresholds and use them during live scoring.
Perplexity calculation
- For a given normalized and tokenized request x = (x1, ..., xn), the model produces per-token log-probabilities. The sequence negative log-likelihood is summed across tokens and averaged; perplexity is the exponentiated average:

  PPL(x) = exp( -(1/n) * Σ_{i=1..n} log p(x_i | x_1, ..., x_{i-1}) )
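This formula can be computed directly from the per-token log-probabilities a causal LM produces (a minimal sketch; `sequence_perplexity` is an illustrative name):

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token.

    token_logprobs: log p(x_i | x_1..x_{i-1}) for each token, natural-log base.
    """
    n = len(token_logprobs)
    nll = -sum(token_logprobs)   # sequence negative log-likelihood
    return math.exp(nll / n)     # per-token average, exponentiated
```

As a sanity check: if every token is assigned probability 1/4, the perplexity is 4, regardless of sequence length.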
Threshold derivation
- Calculate perplexity for many benign requests and choose an operational threshold. Typical approach: set threshold to a percentile (e.g., 99th) of benign perplexities or mean + 3σ depending on risk tolerance.
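Both strategies can be sketched with the standard library (an illustrative helper, not the project's actual calibration code):

```python
import statistics

def derive_threshold(benign_perplexities, mode="percentile", k=3.0, pct=99):
    """Pick an operational anomaly threshold from benign perplexity scores.

    mode="percentile": use the pct-th percentile of benign scores.
    otherwise:         use mean + k * standard deviation.
    """
    if mode == "percentile":
        # quantiles(n=100) returns 99 cut points; index pct-1 is the pct-th percentile
        return statistics.quantiles(benign_perplexities, n=100)[pct - 1]
    mu = statistics.mean(benign_perplexities)
    sigma = statistics.stdev(benign_perplexities)
    return mu + k * sigma
```

The percentile form directly bounds the expected false positive rate on benign traffic (about 1% at the 99th percentile); the mean + 3σ form is more conservative when the benign distribution has a long tail.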
Operational behavior (non-blocking)
- The FastAPI service scores requests and logs anomalies to a secure audit store or SIEM but by default forwards traffic to the upstream application (non-blocking). This enables monitoring without impacting availability.
- A separate blocking mode can be enabled to drop or challenge requests with perplexity above a higher enforcement threshold.
Logging & observability
- For each request the WAF logs: normalized request, token length, perplexity, threshold used, detection decision, and metadata (timestamp, source IP).
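The log record described above might be modeled as follows. This is a hypothetical sketch of the schema; the field names follow the example log entries later in this README, and the real `waf_service.py` may differ:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class WafLogEntry:
    request: str        # normalized request line
    token_length: int
    perplexity: float
    threshold: float
    anomaly: bool
    timestamp: float
    source_ip: str

def make_log_entry(normalized, token_length, perplexity, threshold, source_ip):
    return WafLogEntry(
        request=normalized,
        token_length=token_length,
        perplexity=perplexity,
        threshold=threshold,
        anomaly=perplexity > threshold,  # non-blocking: log the decision, do not drop
        timestamp=time.time(),
        source_ip=source_ip,
    )

# emit as a JSON line suitable for shipping to a SIEM
entry = make_log_entry("GET /search?a=1", 9, 620.4, 50.0, "203.0.113.7")
print(json.dumps(asdict(entry)))
```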
Start services (one-liners):
```bash
# Start Juice Shop
docker run --rm -p 3000:3000 bkimminich/juice-shop &

# Start FastAPI WAF (ensure model & tokenizer paths are configured)
uvicorn waf_service:app --host 0.0.0.0 --port 8000

# Ensure Nginx is running and proxying to Juice Shop
sudo systemctl restart nginx
```

Example benign request (curl):

```bash
curl -i http://localhost/   # proxied to Juice Shop
```

Expected WAF log entry (example):
```json
{
  "request": "GET /api/score?item=1",
  "perplexity": 12.3,
  "threshold": 50.0,
  "anomaly": false
}
```

Example malicious request (SQL injection):
```bash
curl -i "http://localhost/search?q=' OR 1=1 --"
```

Expected WAF log entry (example):
```json
{
  "request": "GET /search?q=' OR 1=1 --",
  "perplexity": 620.4,
  "threshold": 50.0,
  "anomaly": true
}
```

Note: the exact numeric values depend on the model, tokenizer, and training data.
- Incremental fine-tuning: continuously update the model on newly observed benign traffic using a FIFO or reservoir of past benign samples.
- Blocking mode enforcement: add configurable blocking or challenge behavior for very high perplexity scores.
- Header and body inspection: extend normalization to include selected headers or body fields (careful with PII and privacy).
- Multi-app support: allow separate models or domain adapters per hosted application.
- Model optimization: quantization, distillation, or ONNX export to reduce inference latency for high-throughput deployments.
- `parse_logs.py` — log parsing helpers.
- `normalize_logs.py` — normalization routines to canonicalize requests.
- `prepare_training_data.py` — tokenization and dataset preparation.
- `waf_service.py` — FastAPI real-time scoring service.
- `generate_benign.sh` — example benign traffic generator for Juice Shop.
- This project is intended for research and demo use. Avoid forwarding or storing sensitive payloads in logs without proper redaction and user consent.
- When deploying in production, ensure model artifacts, logs, and thresholds are protected and rotated according to organizational policy.
Project maintained in this repository. For quick access to the generated README, see README.md.