# Model Server Discussion
Sunyanan Choochotkaew edited this page Sep 13, 2022
- Ratio power model
  - uses one representative metric to divide each component's power consumption (CPU, DRAM, GPU) among pods.
- Parameterized power model
  - applies linear-regression weights, learned from empirical results, to multiple relevant metrics for each component.
- ML power model
  - applies a learning/regression model to multiple relevant metrics to estimate each component's power under the profiled conditions (e.g., architecture).
- Generalized dynamic power model
  - applies a learning/regression model to the available metrics to estimate total power from all components, regardless of the running environment.
Each model serves a different purpose:
- the ratio power model presents each pod's power share on each component
- the parameterized power model estimates node-level power for each component when no power measurement is available
- the generalized dynamic power model presents total dynamic power from pod resource usage alone
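The ratio power model can be sketched in a few lines: a pod's share of a component's measured power is proportional to its share of one representative metric. The function name and numbers below are illustrative, not the exporter's actual code.

```python
def ratio_power(node_component_power, pod_metric, total_metric):
    """Split a component's measured power among pods in proportion to one
    representative usage metric (e.g., cpu_cycles for the CPU package)."""
    if total_metric == 0:
        return 0.0
    return node_component_power * pod_metric / total_metric

# Example: node package power is 40 W and the pod used 2e9 of 8e9 cpu cycles,
# so it is assigned a quarter of the package power.
pod_power = ratio_power(40.0, 2e9, 8e9)  # -> 10.0
```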
Pros:
- a more complicated model can discover hidden information

Cons:
- requires enough learning data for each profile

Purposes:
- provide weights for the parameterized power model (or ML power model) and a model for generalized dynamic power
- perform online training
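Providing weights for the parameterized power model amounts to a linear-regression fit of measured component power against usage metrics. Below is a minimal, dependency-free sketch with two assumed features (a real pipeline would use numpy or a scikit-learn regressor); the feature pairing is an illustration, not the model server's actual training code.

```python
def fit_weights(samples, powers):
    """Least-squares fit of power = w1*x1 + w2*x2 from empirical samples.

    samples: list of (x1, x2) usage-metric pairs (e.g., cpu_cycles, cache_misses)
    powers:  measured component power for each sample
    Solves the 2x2 normal equations (X^T X) w = X^T y directly.
    """
    s11 = sum(x1 * x1 for x1, _ in samples)
    s12 = sum(x1 * x2 for x1, x2 in samples)
    s22 = sum(x2 * x2 for _, x2 in samples)
    b1 = sum(x1 * y for (x1, _), y in zip(samples, powers))
    b2 = sum(x2 * y for (_, x2), y in zip(samples, powers))
    det = s11 * s22 - s12 * s12
    w1 = (s22 * b1 - s12 * b2) / det
    w2 = (s11 * b2 - s12 * b1) / det
    return w1, w2
```

The learned weights can then be served to the exporter, which estimates component power as a weighted sum of its usage metrics without any power meter.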
```mermaid
flowchart LR;
  exporter -- unix.sock --> estimator -- ? --> model-server
```
- parameterized power model:

```mermaid
flowchart LR;
  exporter -- API routes --> model-server
```
- ML power model and generalized dynamic power model:
  - with API routes

```mermaid
flowchart LR; exporter -- unix.sock --> estimator -- API routes --> model-server
```

  - with a shared storage system (NFS, Ceph, FUSE, rook.io?)

```mermaid
flowchart LR; exporter -- unix.sock --> estimator --> model-storage ; model-server --> model-storage;
```
| Approach | Pros | Cons |
| --- | --- | --- |
| API route | no third-party dependency | need to handle synchronization ourselves to keep the model up to date after training; need to optimize data (model) transmission ourselves; data passes through the network stack even within the same node |
| Shared storage | leaves file synchronization to a mature shared storage system | requires third-party setup (and might add overhead on the cluster) |
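To make the exporter/estimator hop concrete, here is a toy round trip over a unix socket, using `socket.socketpair` to stand in for `unix.sock`. The JSON message shape and the weights are assumptions for illustration, not the actual wire protocol.

```python
import json
import socket

# In-process stand-in for the exporter <-> estimator unix socket.
exporter_sock, estimator_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Exporter side: send usage metrics for estimation.
request = {"metrics": {"cpu_cycles": 2e9, "cache_misses": 1e6}}
exporter_sock.sendall(json.dumps(request).encode() + b"\n")

# Estimator side: apply (illustrative) learned weights and reply with power.
weights = {"cpu_cycles": 1e-8, "cache_misses": 1e-5}
msg = json.loads(estimator_sock.recv(4096).decode())
power = sum(weights[k] * v for k, v in msg["metrics"].items())
estimator_sock.sendall(json.dumps({"power": power}).encode() + b"\n")

# Exporter receives the estimate.
reply = json.loads(exporter_sock.recv(4096).decode())
print(reply["power"])

exporter_sock.close()
estimator_sock.close()
```

Even in this toy form, the data crosses a socket boundary on every estimate, which is the per-node overhead the table above refers to.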
Pipelines: what are they? What are the differences between them? How do you develop a custom pipeline?
```mermaid
flowchart LR;
  query-data --> pipelines --> models
```
Each pipeline is composed of:
- input: a query metric (e.g., node_energy_stat) and target columns (e.g., cpu_cycles, pkg_energy_in_joules)
- training function: Keras layers, a scikit-learn regressor, ...
- output: pod dynamic power, node power per component (core/dram)
Which pipelines are activated depends on the metrics reported:
- all information is provided: all pipelines are activated
- frequency and architecture are not reported: ML power model pipelines cannot be trained
- only some usage measurements are enabled: only the pipelines that rely on those measurements are activated
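The activation rule above can be sketched as a simple availability check. The pipeline names and required-metric sets here are illustrative, not the server's actual registry.

```python
# Illustrative mapping from pipeline to the metrics it requires for training.
PIPELINES = {
    "ratio":               {"cpu_cycles"},
    "parameterized":       {"cpu_cycles", "cache_misses"},
    "ml_power":            {"cpu_cycles", "cache_misses", "frequency", "architecture"},
    "generalized_dynamic": {"cpu_cycles", "cache_misses"},
}

def active_pipelines(available_metrics):
    """Return the pipelines whose required inputs are all reported."""
    available = set(available_metrics)
    return [name for name, required in PIPELINES.items() if required <= available]

# frequency/architecture not reported -> the ML power model cannot be trained
print(active_pipelines(["cpu_cycles", "cache_misses"]))
```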
To develop a new custom pipeline:
- define the pipeline
  - input: usage metrics (e.g., cpu_cycles) + system metrics (e.g., frequency, architecture)
  - output: parameterized model, ML power model, or dynamic power model
- implement the train function, which receives a Prometheus client for querying data (prom_client[query]):

```python
def train(self, prom_client):
    ...
```
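A minimal end-to-end sketch of such a pipeline, assuming `prom_client` exposes a `query` method that returns rows of metric samples (the real client's interface may differ); the fake client, class name, and metric names are placeholders:

```python
class SimpleRatioPipeline:
    """Toy pipeline: learns a single scale factor mapping one usage metric to
    measured power (a one-feature least-squares fit through the origin)."""

    def __init__(self, usage_metric="cpu_cycles", power_metric="pkg_energy_in_joules"):
        self.usage_metric = usage_metric
        self.power_metric = power_metric
        self.weight = None

    def train(self, prom_client):
        rows = prom_client.query("node_energy_stat")  # assumed interface
        xs = [row[self.usage_metric] for row in rows]
        ys = [row[self.power_metric] for row in rows]
        # least-squares slope through the origin: w = sum(x*y) / sum(x*x)
        self.weight = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self.weight


class FakePromClient:
    """Stand-in for the real Prometheus query client, for illustration only."""
    def query(self, metric):
        return [
            {"cpu_cycles": 1e9, "pkg_energy_in_joules": 10.0},
            {"cpu_cycles": 2e9, "pkg_energy_in_joules": 20.0},
        ]


pipeline = SimpleRatioPipeline()
pipeline.train(FakePromClient())
print(pipeline.weight)  # 1e-08
```

A registered pipeline would be trained online by the model server whenever its required metrics (here, cpu_cycles plus a power measurement) appear in the query results.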