
Server Workload Clustering Pipeline with Airflow

An Apache Airflow DAG that automates K-Means clustering of server workload telemetry data to identify workload profiles (CPU-bound, memory-bound, GPU-intensive, idle). Runs inside Docker using docker compose.

Pipeline Overview

The DAG Server_Workload_Clustering consists of 6 tasks:

start_pipeline → load_data_task → data_preprocessing_task → build_save_model_task → load_model_task → end_pipeline

| Task | Operator | Description |
| --- | --- | --- |
| start_pipeline | BashOperator | Logs start time, lists data files |
| load_data_task | PythonOperator | Loads server_workloads.csv (300 records) |
| data_preprocessing_task | PythonOperator | Drops nulls, MinMax scales 6 features |
| build_save_model_task | PythonOperator | Fits K-Means for k=1..20, saves model |
| load_model_task | PythonOperator | Finds optimal k via elbow method, classifies test workloads |
| end_pipeline | BashOperator | Verifies model artifacts, logs completion |

Schedule: 0 6 * * * (daily at 6 AM UTC)
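
A minimal sketch of how these tasks might be wired together in dags/airflow.py. The DAG id, task ids, and schedule come from the table above; the bash commands, callable arguments, and file paths are assumptions, not the repository's exact code:

    # Illustrative sketch of dags/airflow.py (not the repository's exact file)
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    from src.lab import load_data, data_preprocessing, build_save_model, load_model_elbow

    with DAG(
        dag_id="Server_Workload_Clustering",
        schedule_interval="0 6 * * *",      # daily at 6 AM UTC
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        start_pipeline = BashOperator(
            task_id="start_pipeline",
            bash_command="date && ls -l /opt/airflow/dags/data",  # assumed command
        )
        load_data_task = PythonOperator(
            task_id="load_data_task",
            python_callable=load_data,
        )
        data_preprocessing_task = PythonOperator(
            task_id="data_preprocessing_task",
            python_callable=data_preprocessing,
            op_args=[load_data_task.output],            # XCom from the previous task
        )
        build_save_model_task = PythonOperator(
            task_id="build_save_model_task",
            python_callable=build_save_model,
            op_args=[data_preprocessing_task.output, "workload_model.sav"],
        )
        load_model_task = PythonOperator(
            task_id="load_model_task",
            python_callable=load_model_elbow,
            op_args=["workload_model.sav", build_save_model_task.output],
        )
        end_pipeline = BashOperator(
            task_id="end_pipeline",
            bash_command="ls -lh /opt/airflow/dags/model",  # assumed command
        )

        (start_pipeline >> load_data_task >> data_preprocessing_task
         >> build_save_model_task >> load_model_task >> end_pipeline)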

Dataset

Synthetic server workload telemetry with 300 records and 6 features:

| Feature | Description |
| --- | --- |
| cpu_percent | CPU utilization (%) |
| memory_percent | Memory utilization (%) |
| disk_io_mbps | Disk I/O throughput (MB/s) |
| network_mbps | Network throughput (MB/s) |
| runtime_seconds | Job runtime (seconds) |
| gpu_util_percent | GPU utilization (%) |

The data contains 4 natural clusters: CPU-bound, memory-bound, GPU-intensive, and idle/lightweight jobs. To regenerate the dataset, run python3 generate_server_workloads.py from the dags/data/ directory.
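
As a quick sanity check, the CSV can be loaded and summarized with pandas (run from the repository root; the path follows the project structure shown below):

    import pandas as pd

    # Training data: 300 records across the 6 telemetry features listed above
    # (any extra columns depend on the generator script)
    df = pd.read_csv("dags/data/server_workloads.csv")

    print(df.shape)       # roughly (300, 6)
    print(df.describe())  # per-feature value ranges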

ML Model

K-Means clustering, using the elbow method (via the kneed package) to determine the optimal cluster count; a condensed sketch follows the function list below.

Functions (src/lab.py)

  1. load_data() — Loads CSV, serializes via pickle+base64 for XCom transport.
  2. data_preprocessing(data_b64) — Deserializes, selects features, applies MinMaxScaler.
  3. build_save_model(data_b64, filename) — Fits K-Means for k=1..20, saves model, returns SSE list.
  4. load_model_elbow(filename, sse) — Finds optimal k via elbow, classifies test workloads.
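
A condensed, self-contained sketch of what these four steps amount to (illustrative only; the real src/lab.py also classifies the test workloads in test.csv, and its exact signatures may differ):

    import base64
    import pickle

    import pandas as pd
    from kneed import KneeLocator
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import MinMaxScaler

    FEATURES = ["cpu_percent", "memory_percent", "disk_io_mbps",
                "network_mbps", "runtime_seconds", "gpu_util_percent"]

    def load_data(path="dags/data/server_workloads.csv"):
        """Read the CSV and serialize it (pickle + base64) for XCom transport."""
        df = pd.read_csv(path)
        return base64.b64encode(pickle.dumps(df)).decode("ascii")

    def data_preprocessing(data_b64):
        """Deserialize, drop nulls, and MinMax-scale the six features."""
        df = pickle.loads(base64.b64decode(data_b64)).dropna()
        scaled = MinMaxScaler().fit_transform(df[FEATURES])
        return base64.b64encode(pickle.dumps(scaled)).decode("ascii")

    def build_save_model(data_b64, filename="dags/model/workload_model.sav"):
        """Fit K-Means for k = 1..20, save the fitted model, return the SSE curve."""
        X = pickle.loads(base64.b64decode(data_b64))
        sse = []
        for k in range(1, 21):
            model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
            sse.append(model.inertia_)
        with open(filename, "wb") as f:
            pickle.dump(model, f)
        return sse

    def load_model_elbow(filename, sse):
        """Locate the elbow of the SSE curve to pick the optimal k."""
        kl = KneeLocator(range(1, 21), sse, curve="convex", direction="decreasing")
        return kl.elbow  # optimal number of clusters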

Project Structure

airflow-workload-clustering/
├── dags/
│   ├── airflow.py                 # DAG definition
│   ├── data/
│   │   ├── server_workloads.csv   # Training data (300 records)
│   │   ├── test.csv               # Test workloads (5 records)
│   │   └── generate_server_workloads.py  # Dataset generator
│   ├── model/
│   │   └── workload_model.sav     # Saved model (generated by pipeline)
│   └── src/
│       ├── __init__.py
│       └── lab.py                 # ML functions
├── logs/
├── plugins/
├── config/
├── .env
├── docker-compose.yaml
├── setup.sh
└── README.md

How to Run

Prerequisites

  • Docker Desktop installed and running (4GB+ memory)

Steps

  1. Clone the repo and navigate to it:

    git clone https://github.com/tengli-alaska/airflow-workload-clustering.git
    cd airflow-workload-clustering
  2. Run the setup script:

    chmod +x setup.sh
    ./setup.sh

    This creates the required directories, sets the Airflow user, and verifies all files are in place.

  3. Initialize the database:

    docker compose up airflow-init
  4. Start Airflow:

    docker compose up

    Wait for the health check output:

    airflow-webserver-1 | ... "GET /health HTTP/1.1" 200 ...
    
  5. Open the UI: Visit localhost:8080 and log in with airflow2 / airflow2.

  6. Trigger the DAG: Toggle Server_Workload_Clustering on and click the play button.

  7. Check results: Click load_model_task → Logs to see the optimal cluster count and the workload classifications.

  8. Stop Airflow:

    docker compose down
