This repository contains the experimental infrastructure, benchmark data and analysis scripts for my Master's thesis on sustainability smells in microservice architectures. The thesis defines sustainability smells as recurring architectural or deployment patterns that lead to measurable, unnecessary energy consumption, and validates them through controlled energy experiments.
The System Under Test (SUT) is the OpenTelemetry Demo v2.0.2 ("Astronomy Shop"), a microservice-based e-commerce application with 14 primary services in 11 programming languages.
| ID | Smell | Injection | Validation |
|---|---|---|---|
| C1 | Chatty Services | Replace batch gRPC call with 10 individual calls in the recommendation service | ✅ Validated (system-level) |
| C3 | Missing Caching | Remove TTL-based in-memory cache from the recommendation service | ✅ Validated (system-level) |
| C5 | Nobody Home | Deploy 4 additional idle services that receive no traffic | ❌ Not validated |
| C6 | Missing Resource Limits | Remove resources.limits from all 20 service containers |
Each smell is implemented on a dedicated Git branch (smell/<name>) with a corresponding baseline branch (baseline/<name>).
├── benchmark/
│ ├── scripts/ # Cluster lifecycle & benchmark automation
│ │ ├── createKindCluster.sh
│ │ ├── destroyKindCluster.sh
│ │ ├── installOpenTelemetry.sh
│ │ ├── execute_benchmark.sh # Main benchmark runner (interleaved pairs)
│ │ ├── collect_metrics.sh # Prometheus metric export
│ │ └── metrics_config.sh # PromQL queries for all metrics
│ ├── locust/ # Load generator (TPC-W Shopping Mix)
│ │ ├── locustfile.py
│ │ ├── people.json # Fixed checkout dataset
│ │ └── Dockerfile
│ ├── analysis/ # Statistical analysis pipeline
│ │ ├── 01_load_and_transform.py # Load CSVs → compute 4-component energy model
│ │ ├── 02_statistical_tests.py # Wilcoxon signed-rank + Cohen's d
│ │ ├── 03_visualize.py # Generate thesis charts
│ │ ├── 04_cross_smell_summary.py # Cross-smell comparison table
│ │ └── requirements.txt
│ ├── results/ # Raw + analyzed experiment data
│ │ ├── c1-low-full/ # 10 interleaved pairs per smell × load
│ │ ├── c1-medium-full/
│ │ ├── ...
│ │ └── cross_smell/
│ └── ramp-up/ # Capacity test protocol
├── kubernetes/
│ ├── otel-demo-manifest.yaml # Pre-rendered SUT manifest
│ └── kepler-values.yaml # Kepler Helm values (fake-CPU-meter mode)
├── src/ # OpenTelemetry Demo source (smell injections on branches)
└── thesis.txt # Thesis draft (LaTeX source)
| Branch | Description |
|---|---|
| `main` | Baseline OpenTelemetry Demo + all benchmark infrastructure |
| `baseline/chatty-services` | Baseline for C1 (original batch call) |
| `smell/chatty-services` | C1 injection (10× individual gRPC calls) |
| `baseline/missing-caching` | Baseline for C3 (cache enabled) |
| `smell/missing-caching` | C3 injection (cache removed) |
| `baseline/nobody-home` | Baseline for C5 (no extra services) |
| `smell/nobody-home` | C5 injection (4 idle services added) |
| `baseline/missing-resource-limits` | Baseline for C6 (limits set) |
| `smell/missing-resource-limits` | C6 injection (all limits removed) |
- Hardware: MacBook Pro 14" (Nov 2024), Apple M4, 32 GB RAM
- Cluster: Single-node Kind (Kubernetes in Docker)
- Energy: Kepler v0.9.0 in `KEPLER_FAKE_CPU_METER` mode (model-based estimation)
- Monitoring: kube-prometheus stack (cAdvisor, node-exporter, Grafana) + OTel Demo Prometheus
- Load: Locust (Docker container, outside cluster)
- Runs: 10 interleaved baseline–smell pairs per experiment = 20 runs × 11 experiments = 220 runs total
The analysis pipeline processes the raw experiment data and produces statistical results and charts.
cd benchmark/analysis
pip install -r requirements.txt
# Step 1: Load raw CSVs and compute the 4-component energy model
python 01_load_and_transform.py ../results/c1-low-full
# Step 2: Run Wilcoxon signed-rank tests + Cohen's d
python 02_statistical_tests.py ../results/c1-low-full
# Step 3: Generate charts for the thesis
python 03_visualize.py ../results/c1-low-full
# Step 4: Cross-smell comparison table
python 04_cross_smell_summary.py ../results

Each script reads from and writes to the analysis/ subfolder inside each result directory.
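To illustrate what a paired comparison in step 2 involves, here is a minimal, standard-library-only sketch of a Wilcoxon signed-rank test with an exact two-sided p-value, plus a paired Cohen's d. It is not the code in `02_statistical_tests.py`. The function names are hypothetical, the sample values are made up, and the effect-size variant (mean difference divided by the standard deviation of the differences) is an assumption, since other Cohen's d definitions exist.

```python
from itertools import product
from statistics import mean, stdev


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def wilcoxon_signed_rank(baseline, smell):
    """Return (statistic, exact two-sided p-value) for paired samples."""
    # Zero differences carry no sign information and are dropped.
    diffs = [s - b for b, s in zip(baseline, smell) if s != b]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    statistic = min(w_plus, w_minus)
    total = sum(ranks)
    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2^n patterns (cheap for n = 10 pairs).
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for s, r in zip(signs, ranks) if s)
        if min(plus, total - plus) <= statistic:
            hits += 1
    return statistic, hits / 2 ** len(ranks)


def cohens_d_paired(baseline, smell):
    """Paired effect size: mean difference / sample std. dev. of the differences."""
    diffs = [s - b for b, s in zip(baseline, smell)]
    return mean(diffs) / stdev(diffs)


# Made-up energy values (joules) for 10 baseline-smell pairs, for illustration only.
baseline_joules = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
smell_joules = [104, 105, 103, 107, 101, 110, 98, 108, 111, 108]

statistic, p_value = wilcoxon_signed_rank(baseline_joules, smell_joules)
effect_size = cohens_d_paired(baseline_joules, smell_joules)
print(f"W = {statistic}, p = {p_value:.6f}, d = {effect_size:.3f}")
```

Exhaustive enumeration keeps the p-value exact for small samples such as 10 pairs. For larger samples, a library implementation with a normal approximation would be the more practical choice.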
Note: A single smell × load experiment takes approximately 8 hours (20 runs × ~25 min each). Running one smell across all three load levels takes about 24 hours. The benchmark script (`execute_benchmark.sh`) handles the full lifecycle for each run: cluster creation, SUT deployment, warm-up, measurement, metric export and teardown. Locust must already be running as a Docker container (started once via `createLocustBenchmark.sh`).
cd benchmark/scripts
# Example: C1 (Chatty Services), low load, 10 interleaved pairs
./execute_benchmark.sh \
--smell-branch "smell/chatty-services" \
--baseline-branch "baseline/chatty-services" \
--load-profile low \
--pairs 10 \
  --output-dir "../results/c1-low-full"

The script alternates baseline and smell runs within each pair. For every run it creates a fresh Kind cluster, deploys the SUT from the specified branch, runs the measurement and tears down the cluster again.
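The interleaving described above can be sketched in a few lines of Python. This is only an illustration of the run order and the resulting wall-clock estimate, not the logic of `execute_benchmark.sh`. The function names are hypothetical, and running the baseline first within each pair is an assumption.

```python
def interleaved_schedule(baseline_branch, smell_branch, pairs):
    """Return the run order as (pair_number, role, branch) tuples."""
    schedule = []
    for pair in range(1, pairs + 1):
        # Assumption: baseline runs first within each pair;
        # the real script may use a different order.
        schedule.append((pair, "baseline", baseline_branch))
        schedule.append((pair, "smell", smell_branch))
    return schedule


def estimated_hours(runs, minutes_per_run=25):
    """Rough wall-clock estimate, using the ~25 min per run stated in the note."""
    return runs * minutes_per_run / 60


schedule = interleaved_schedule(
    "baseline/chatty-services", "smell/chatty-services", pairs=10
)
for pair, role, branch in schedule[:4]:
    print(f"pair {pair:2d}  {role:8s}  {branch}")
print(f"{len(schedule)} runs, roughly {estimated_hours(len(schedule)):.1f} hours")
```

Interleaving keeps each baseline run close in time to its paired smell run, so slow drift on the host (for example thermal state or background load) tends to affect both runs of a pair similarly.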
This thesis builds on the OpenTelemetry Demo project and on the energy measurement methodology by Legler et al. The experimental infrastructure was adapted from prior work at the ISE research group at TU Berlin.
The OpenTelemetry Demo source code is licensed under the Apache License 2.0. The benchmark scripts, analysis code and thesis content in this repository are part of a Master's thesis at TU Berlin.