An Apache Airflow DAG that automates K-Means clustering of server workload telemetry data to identify workload profiles (CPU-bound, memory-bound, GPU-intensive, idle). Runs inside Docker using docker compose.
The `Server_Workload_Clustering` DAG consists of 6 tasks:
start_pipeline → load_data_task → data_preprocessing_task → build_save_model_task → load_model_task → end_pipeline
| Task | Operator | Description |
|---|---|---|
| `start_pipeline` | BashOperator | Logs start time, lists data files |
| `load_data_task` | PythonOperator | Loads `server_workloads.csv` (300 records) |
| `data_preprocessing_task` | PythonOperator | Drops nulls, MinMax-scales 6 features |
| `build_save_model_task` | PythonOperator | Fits K-Means for k = 1..20, saves model |
| `load_model_task` | PythonOperator | Finds optimal k via elbow method, classifies test workloads |
| `end_pipeline` | BashOperator | Verifies model artifacts, logs completion |
Schedule: `0 6 * * *` (daily at 6 AM UTC)
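For orientation, here is a minimal sketch of how such a DAG could be wired up. The task IDs, callable names, model filename, and schedule come from this README; the `start_date`, bash commands, and `op_args` plumbing are illustrative assumptions rather than the repo's exact code.

```python
# dags/airflow.py -- illustrative sketch only; the real DAG in the repo may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from src.lab import load_data, data_preprocessing, build_save_model, load_model_elbow

with DAG(
    dag_id="Server_Workload_Clustering",
    schedule_interval="0 6 * * *",    # daily at 6 AM UTC
    start_date=datetime(2024, 1, 1),  # assumed start date
    catchup=False,
) as dag:
    start_pipeline = BashOperator(
        task_id="start_pipeline",
        bash_command="date && ls -l /opt/airflow/dags/data",  # assumed command
    )
    load_data_task = PythonOperator(
        task_id="load_data_task",
        python_callable=load_data,
    )
    data_preprocessing_task = PythonOperator(
        task_id="data_preprocessing_task",
        python_callable=data_preprocessing,
        op_args=[load_data_task.output],  # XCom from the previous task
    )
    build_save_model_task = PythonOperator(
        task_id="build_save_model_task",
        python_callable=build_save_model,
        op_args=[data_preprocessing_task.output, "workload_model.sav"],
    )
    load_model_task = PythonOperator(
        task_id="load_model_task",
        python_callable=load_model_elbow,
        op_args=["workload_model.sav", build_save_model_task.output],
    )
    end_pipeline = BashOperator(
        task_id="end_pipeline",
        bash_command="ls -l /opt/airflow/dags/model && echo done",  # assumed command
    )

    (start_pipeline >> load_data_task >> data_preprocessing_task
     >> build_save_model_task >> load_model_task >> end_pipeline)
```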
Synthetic server workload telemetry with 300 records and 6 features:
| Feature | Description |
|---|---|
| `cpu_percent` | CPU utilization (%) |
| `memory_percent` | Memory utilization (%) |
| `disk_io_mbps` | Disk I/O throughput (MB/s) |
| `network_mbps` | Network throughput (MB/s) |
| `runtime_seconds` | Job runtime (seconds) |
| `gpu_util_percent` | GPU utilization (%) |
The data contains 4 natural clusters: CPU-bound, memory-bound, GPU-intensive, and idle/lightweight jobs. To regenerate the dataset, run `python3 generate_server_workloads.py` from the `dags/data/` directory.
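For reference, a minimal sketch of what such a generator could look like. The column names and the 300-record, 4-cluster shape come from this README; the cluster centers, noise levels, and random seed are illustrative assumptions, not the repo's actual values.

```python
# generate_server_workloads.py -- illustrative sketch; the repo's generator may differ.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
COLUMNS = ["cpu_percent", "memory_percent", "disk_io_mbps",
           "network_mbps", "runtime_seconds", "gpu_util_percent"]

# Assumed rough cluster centers for the four workload profiles.
PROFILES = {
    "cpu_bound":     [90, 40, 120,  80, 3600,  5],
    "memory_bound":  [35, 85, 200,  60, 5400,  5],
    "gpu_intensive": [45, 55,  80, 150, 7200, 90],
    "idle":          [ 5, 10,   5,   2,  300,  1],
}

frames = []
for center in PROFILES.values():
    # 75 records per profile -> 300 records total, with mild noise around each center.
    noise_scale = np.array(center) * 0.10 + 1.0
    samples = rng.normal(loc=center, scale=noise_scale, size=(75, len(COLUMNS)))
    frames.append(pd.DataFrame(samples, columns=COLUMNS))

df = pd.concat(frames, ignore_index=True).clip(lower=0)
pct_cols = ["cpu_percent", "memory_percent", "gpu_util_percent"]
df[pct_cols] = df[pct_cols].clip(upper=100)
df.sample(frac=1, random_state=42).round(2).to_csv("server_workloads.csv", index=False)
```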
K-Means clustering with the elbow method (via `kneed`) to determine the optimal cluster count.

- `load_data()` – Loads the CSV, serializes it via pickle + base64 for XCom transport.
- `data_preprocessing(data_b64)` – Deserializes, selects features, applies MinMaxScaler.
- `build_save_model(data_b64, filename)` – Fits K-Means for k = 1..20, saves the model, returns the SSE list.
- `load_model_elbow(filename, sse)` – Finds the optimal k via the elbow method, classifies test workloads.
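A condensed sketch of how these four functions could be implemented. The signatures, feature set, k range, and elbow detection via `kneed` come from this README; the container paths and other implementation details are assumptions.

```python
# dags/src/lab.py -- condensed sketch; the repo's implementation may differ in details.
import base64
import pickle

import pandas as pd
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

DATA_DIR = "/opt/airflow/dags/data"    # assumed container paths
MODEL_DIR = "/opt/airflow/dags/model"
FEATURES = ["cpu_percent", "memory_percent", "disk_io_mbps",
            "network_mbps", "runtime_seconds", "gpu_util_percent"]

def load_data():
    """Read the training CSV and serialize it for XCom transport."""
    df = pd.read_csv(f"{DATA_DIR}/server_workloads.csv")
    return base64.b64encode(pickle.dumps(df)).decode("ascii")

def data_preprocessing(data_b64):
    """Drop nulls, keep the six features, MinMax-scale them, re-serialize."""
    df = pickle.loads(base64.b64decode(data_b64)).dropna()
    scaled = MinMaxScaler().fit_transform(df[FEATURES])
    return base64.b64encode(pickle.dumps(scaled)).decode("ascii")

def build_save_model(data_b64, filename):
    """Fit K-Means for k = 1..20, save the last fitted model, return the SSE curve."""
    X = pickle.loads(base64.b64decode(data_b64))
    sse = []
    for k in range(1, 21):
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        sse.append(model.inertia_)
    with open(f"{MODEL_DIR}/{filename}", "wb") as f:
        pickle.dump(model, f)
    return sse

def load_model_elbow(filename, sse):
    """Locate the elbow of the SSE curve and classify the test workloads."""
    with open(f"{MODEL_DIR}/{filename}", "rb") as f:
        model = pickle.load(f)
    k_optimal = KneeLocator(range(1, 21), sse,
                            curve="convex", direction="decreasing").elbow
    test = pd.read_csv(f"{DATA_DIR}/test.csv")
    # Simplification: scaling test data with a fresh scaler; reusing the
    # training scaler would be more rigorous.
    predictions = model.predict(MinMaxScaler().fit_transform(test[FEATURES]))
    return f"Optimal k: {k_optimal}; test workload clusters: {predictions.tolist()}"
```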
```
airflow-workload-clustering/
├── dags/
│   ├── airflow.py                        # DAG definition
│   ├── data/
│   │   ├── server_workloads.csv          # Training data (300 records)
│   │   ├── test.csv                      # Test workloads (5 records)
│   │   └── generate_server_workloads.py  # Dataset generator
│   ├── model/
│   │   └── workload_model.sav            # Saved model (generated by pipeline)
│   └── src/
│       ├── __init__.py
│       └── lab.py                        # ML functions
├── logs/
├── plugins/
├── config/
├── .env
├── docker-compose.yaml
├── setup.sh
└── README.md
```
- Docker Desktop installed and running (4GB+ memory)
- Clone the repo and navigate to it:

  ```bash
  git clone https://github.com/tengli-alaska/airflow-workload-clustering.git
  cd airflow-workload-clustering
  ```

- Run the setup script:

  ```bash
  chmod +x setup.sh
  ./setup.sh
  ```

  This creates the required directories, sets the Airflow user, and verifies all files are in place.

- Initialize the database:

  ```bash
  docker compose up airflow-init
  ```

- Start Airflow:

  ```bash
  docker compose up
  ```

  Wait for the health check output:

  ```
  airflow-webserver-1 | ... "GET /health HTTP/1.1" 200 ...
  ```

- Open the UI: Visit `localhost:8080` and log in with `airflow2` / `airflow2`.

- Trigger the DAG: Toggle `Server_Workload_Clustering` on and click the play button.

- Check results: Click `load_model_task` → Logs to see the optimal cluster count and workload classifications.

- Stop Airflow:

  ```bash
  docker compose down
  ```
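After a successful run, the saved model can also be inspected outside Airflow. A hypothetical snippet (it assumes the pipeline pickled a scikit-learn `KMeans` object to `dags/model/workload_model.sav` and that you supply MinMax-scaled values in the training column order):

```python
# inspect_model.py -- hypothetical helper, not part of the repo.
import pickle

with open("dags/model/workload_model.sav", "rb") as f:
    model = pickle.load(f)

print("Clusters in saved model:", model.n_clusters)

# One MinMax-scaled sample in the training column order:
# cpu, memory, disk_io, network, runtime, gpu
sample = [[0.95, 0.40, 0.30, 0.20, 0.50, 0.05]]
print("Assigned cluster:", model.predict(sample)[0])
```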