An Apache Airflow DAG that automates K-Means clustering of server workload telemetry data to identify workload profiles (CPU-bound, memory-bound, GPU-intensive, idle). Runs inside Docker using docker compose.
The `Server_Workload_Clustering` DAG consists of 6 tasks:
start_pipeline → load_data_task → data_preprocessing_task → build_save_model_task → load_model_task → end_pipeline
| Task | Operator | Description |
|---|---|---|
| `start_pipeline` | BashOperator | Logs start time, lists data files |
| `load_data_task` | PythonOperator | Loads `server_workloads.csv` (300 records) |
| `data_preprocessing_task` | PythonOperator | Drops nulls, MinMax-scales 6 features |
| `build_save_model_task` | PythonOperator | Fits K-Means for k = 1..20, saves model |
| `load_model_task` | PythonOperator | Finds optimal k via elbow method, classifies test workloads |
| `end_pipeline` | BashOperator | Verifies model artifacts, logs completion |
Schedule: `0 6 * * *` (daily at 6 AM UTC)
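For orientation, here is a minimal sketch of how such a DAG could be wired up. The task IDs, callable names, model filename, and schedule come from this README; the `start_date`, bash commands, and `op_args` plumbing are illustrative assumptions rather than the repo's exact code.

```python
# dags/airflow.py -- illustrative sketch only; the real DAG in the repo may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from src.lab import load_data, data_preprocessing, build_save_model, load_model_elbow

with DAG(
    dag_id="Server_Workload_Clustering",
    schedule_interval="0 6 * * *",    # daily at 6 AM UTC
    start_date=datetime(2024, 1, 1),  # assumed start date
    catchup=False,
) as dag:
    start_pipeline = BashOperator(
        task_id="start_pipeline",
        bash_command="date && ls -l /opt/airflow/dags/data",  # assumed command
    )
    load_data_task = PythonOperator(
        task_id="load_data_task",
        python_callable=load_data,
    )
    data_preprocessing_task = PythonOperator(
        task_id="data_preprocessing_task",
        python_callable=data_preprocessing,
        op_args=[load_data_task.output],  # XCom from the previous task
    )
    build_save_model_task = PythonOperator(
        task_id="build_save_model_task",
        python_callable=build_save_model,
        op_args=[data_preprocessing_task.output, "workload_model.sav"],
    )
    load_model_task = PythonOperator(
        task_id="load_model_task",
        python_callable=load_model_elbow,
        op_args=["workload_model.sav", build_save_model_task.output],
    )
    end_pipeline = BashOperator(
        task_id="end_pipeline",
        bash_command="ls -l /opt/airflow/dags/model && echo done",  # assumed command
    )

    (start_pipeline >> load_data_task >> data_preprocessing_task
     >> build_save_model_task >> load_model_task >> end_pipeline)
```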
Synthetic server workload telemetry with 300 records and 6 features:
| Feature | Description |
|---|---|
| `cpu_percent` | CPU utilization (%) |
| `memory_percent` | Memory utilization (%) |
| `disk_io_mbps` | Disk I/O throughput (MB/s) |
| `network_mbps` | Network throughput (MB/s) |
| `runtime_seconds` | Job runtime (seconds) |
| `gpu_util_percent` | GPU utilization (%) |
The data contains 4 natural clusters: CPU-bound, memory-bound, GPU-intensive, and idle/lightweight jobs. To regenerate the dataset, run `python3 generate_server_workloads.py` from the `dags/data/` directory.
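For reference, a minimal sketch of what such a generator could look like. The column names and the 300-record, 4-cluster shape come from this README; the cluster centers, noise levels, and random seed are illustrative assumptions, not the repo's actual values.

```python
# generate_server_workloads.py -- illustrative sketch; the repo's generator may differ.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
COLUMNS = ["cpu_percent", "memory_percent", "disk_io_mbps",
           "network_mbps", "runtime_seconds", "gpu_util_percent"]

# Assumed rough cluster centers for the four workload profiles.
PROFILES = {
    "cpu_bound":     [90, 40, 120,  80, 3600,  5],
    "memory_bound":  [35, 85, 200,  60, 5400,  5],
    "gpu_intensive": [45, 55,  80, 150, 7200, 90],
    "idle":          [ 5, 10,   5,   2,  300,  1],
}

frames = []
for center in PROFILES.values():
    # 75 records per profile -> 300 records total, with mild noise around each center.
    noise_scale = np.array(center) * 0.10 + 1.0
    samples = rng.normal(loc=center, scale=noise_scale, size=(75, len(COLUMNS)))
    frames.append(pd.DataFrame(samples, columns=COLUMNS))

df = pd.concat(frames, ignore_index=True).clip(lower=0)
pct_cols = ["cpu_percent", "memory_percent", "gpu_util_percent"]
df[pct_cols] = df[pct_cols].clip(upper=100)
df.sample(frac=1, random_state=42).round(2).to_csv("server_workloads.csv", index=False)
```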
K-Means clustering with the elbow method (via `kneed`) to determine the optimal cluster count.

- `load_data()` – Loads the CSV, serializes it via pickle + base64 for XCom transport.
- `data_preprocessing(data_b64)` – Deserializes, selects features, applies MinMaxScaler.
- `build_save_model(data_b64, filename)` – Fits K-Means for k = 1..20, saves the model, returns the SSE list.
- `load_model_elbow(filename, sse)` – Finds the optimal k via the elbow method, classifies test workloads.
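A condensed sketch of how these four functions could be implemented. The signatures, feature set, k range, and elbow detection via `kneed` come from this README; the container paths and other implementation details are assumptions.

```python
# dags/src/lab.py -- condensed sketch; the repo's implementation may differ in details.
import base64
import pickle

import pandas as pd
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

DATA_DIR = "/opt/airflow/dags/data"    # assumed container paths
MODEL_DIR = "/opt/airflow/dags/model"
FEATURES = ["cpu_percent", "memory_percent", "disk_io_mbps",
            "network_mbps", "runtime_seconds", "gpu_util_percent"]

def load_data():
    """Read the training CSV and serialize it for XCom transport."""
    df = pd.read_csv(f"{DATA_DIR}/server_workloads.csv")
    return base64.b64encode(pickle.dumps(df)).decode("ascii")

def data_preprocessing(data_b64):
    """Drop nulls, keep the six features, MinMax-scale them, re-serialize."""
    df = pickle.loads(base64.b64decode(data_b64)).dropna()
    scaled = MinMaxScaler().fit_transform(df[FEATURES])
    return base64.b64encode(pickle.dumps(scaled)).decode("ascii")

def build_save_model(data_b64, filename):
    """Fit K-Means for k = 1..20, save the last fitted model, return the SSE curve."""
    X = pickle.loads(base64.b64decode(data_b64))
    sse = []
    for k in range(1, 21):
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        sse.append(model.inertia_)
    with open(f"{MODEL_DIR}/{filename}", "wb") as f:
        pickle.dump(model, f)
    return sse

def load_model_elbow(filename, sse):
    """Locate the elbow of the SSE curve and classify the test workloads."""
    with open(f"{MODEL_DIR}/{filename}", "rb") as f:
        model = pickle.load(f)
    k_optimal = KneeLocator(range(1, 21), sse,
                            curve="convex", direction="decreasing").elbow
    test = pd.read_csv(f"{DATA_DIR}/test.csv")
    # Simplification: scaling test data with a fresh scaler; reusing the
    # training scaler would be more rigorous.
    predictions = model.predict(MinMaxScaler().fit_transform(test[FEATURES]))
    return f"Optimal k: {k_optimal}; test workload clusters: {predictions.tolist()}"
```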
```
airflow-workload-clustering/
├── dags/
│   ├── airflow.py                        # DAG definition
│   ├── data/
│   │   ├── server_workloads.csv          # Training data (300 records)
│   │   ├── test.csv                      # Test workloads (5 records)
│   │   └── generate_server_workloads.py  # Dataset generator
│   ├── model/
│   │   └── workload_model.sav            # Saved model (generated by pipeline)
│   └── src/
│       ├── __init__.py
│       └── lab.py                        # ML functions
├── logs/
├── plugins/
├── config/
├── .env
├── docker-compose.yaml
├── setup.sh
└── README.md
```
- Docker Desktop installed and running (4GB+ memory)
- Clone the repo and navigate to it:

  ```bash
  git clone https://github.com/tengli-alaska/airflow-workload-clustering.git
  cd airflow-workload-clustering
  ```

- Run the setup script:

  ```bash
  chmod +x setup.sh
  ./setup.sh
  ```

  This creates the required directories, sets the Airflow user, and verifies all files are in place.

- Initialize the database:

  ```bash
  docker compose up airflow-init
  ```

- Start Airflow:

  ```bash
  docker compose up
  ```

  Wait for the health check output:

  ```
  airflow-webserver-1 | ... "GET /health HTTP/1.1" 200 ...
  ```

- Open the UI: Visit `localhost:8080` and log in with `airflow2` / `airflow2`.

- Trigger the DAG: Toggle `Server_Workload_Clustering` on and click the play button.

- Check results: Click `load_model_task` → Logs to see the optimal cluster count and workload classifications.

- Stop Airflow:

  ```bash
  docker compose down
  ```
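After a successful run, the saved model can also be inspected outside Airflow. A hypothetical snippet (it assumes the pipeline pickled a scikit-learn `KMeans` object to `dags/model/workload_model.sav` and that you supply MinMax-scaled values in the training column order):

```python
# inspect_model.py -- hypothetical helper, not part of the repo.
import pickle

with open("dags/model/workload_model.sav", "rb") as f:
    model = pickle.load(f)

print("Clusters in saved model:", model.n_clusters)

# One MinMax-scaled sample in the training column order:
# cpu, memory, disk_io, network, runtime, gpu
sample = [[0.95, 0.40, 0.30, 0.20, 0.50, 0.05]]
print("Assigned cluster:", model.predict(sample)[0])
```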