# Model Server Discussion
Sunyanan Choochotkaew edited this page Sep 13, 2022
- Ratio power model
  - uses one representative metric to divide each component's power consumption (CPU, DRAM, GPU) among pods.
- Parameterized power model
  - applies linear-regression weights, learned from empirical results, to multiple relevant metrics for each component.
- ML power model
  - applies a learning/regression model to multiple relevant metrics to estimate each component's power under the profiled conditions (e.g., architecture).
- Generalized dynamic power model
  - applies a learning/regression model to the available metrics to estimate total power from all components, regardless of the running environment.
Each model serves a different purpose:
- the ratio power model presents each pod's power share on each component
- the parameterized power model estimates node-level power for each component when no power measurement is available
- the generalized dynamic power model presents total dynamic power from pod resource usage alone
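The ratio power model can be sketched in a few lines: a pod's share of a component's measured power is proportional to its share of one representative metric. The function name and numbers below are illustrative, not the exporter's actual code.

```python
def ratio_power(node_component_power, pod_metric, total_metric):
    """Split a component's measured power among pods in proportion to one
    representative usage metric (e.g., cpu_cycles for the CPU package)."""
    if total_metric == 0:
        return 0.0
    return node_component_power * pod_metric / total_metric

# Example: node package power is 40 W and the pod used 2e9 of 8e9 cpu cycles,
# so it is assigned a quarter of the package power.
pod_power = ratio_power(40.0, 2e9, 8e9)  # -> 10.0
```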
Pros:
- a more complicated model can discover hidden information

Cons:
- requires enough learning data for each profile

Purposes:
- provide weights for the parameterized power model (or ML power model) and a model for generalized dynamic power
- perform online training
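Providing weights for the parameterized power model amounts to a linear-regression fit of measured component power against usage metrics. Below is a minimal, dependency-free sketch with two assumed features (a real pipeline would use numpy or a scikit-learn regressor); the feature pairing is an illustration, not the model server's actual training code.

```python
def fit_weights(samples, powers):
    """Least-squares fit of power = w1*x1 + w2*x2 from empirical samples.

    samples: list of (x1, x2) usage-metric pairs (e.g., cpu_cycles, cache_misses)
    powers:  measured component power for each sample
    Solves the 2x2 normal equations (X^T X) w = X^T y directly.
    """
    s11 = sum(x1 * x1 for x1, _ in samples)
    s12 = sum(x1 * x2 for x1, x2 in samples)
    s22 = sum(x2 * x2 for _, x2 in samples)
    b1 = sum(x1 * y for (x1, _), y in zip(samples, powers))
    b2 = sum(x2 * y for (_, x2), y in zip(samples, powers))
    det = s11 * s22 - s12 * s12
    w1 = (s22 * b1 - s12 * b2) / det
    w2 = (s11 * b2 - s12 * b1) / det
    return w1, w2
```

The learned weights can then be served to the exporter, which estimates component power as a weighted sum of its usage metrics without any power meter.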
```mermaid
flowchart LR;
  exporter -- unix.sock --> estimator -- ? --> model-server
```
- parameterized power model:

```mermaid
flowchart LR;
  exporter -- API routes --> model-server
```
- ML power model and generalized dynamic power model:
  - with API routes

```mermaid
flowchart LR; exporter -- unix.sock --> estimator -- API routes --> model-server
```

  - with a shared storage system (NFS, Ceph, FUSE, rook.io?)

```mermaid
flowchart LR; exporter -- unix.sock --> estimator --> model-storage ; model-server --> model-storage;
```
| Approach | Pros | Cons |
| --- | --- | --- |
| API route | no third-party dependency | need to handle synchronization ourselves to keep the model up to date after training; need to optimize data (model) transmission ourselves; data passes through the network stack even within the same node |
| Shared storage | leaves file synchronization to a mature shared storage system | requires third-party setup (and might add overhead on the cluster) |
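To make the exporter/estimator hop concrete, here is a toy round trip over a unix socket, using `socket.socketpair` to stand in for `unix.sock`. The JSON message shape and the weights are assumptions for illustration, not the actual wire protocol.

```python
import json
import socket

# In-process stand-in for the exporter <-> estimator unix socket.
exporter_sock, estimator_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Exporter side: send usage metrics for estimation.
request = {"metrics": {"cpu_cycles": 2e9, "cache_misses": 1e6}}
exporter_sock.sendall(json.dumps(request).encode() + b"\n")

# Estimator side: apply (illustrative) learned weights and reply with power.
weights = {"cpu_cycles": 1e-8, "cache_misses": 1e-5}
msg = json.loads(estimator_sock.recv(4096).decode())
power = sum(weights[k] * v for k, v in msg["metrics"].items())
estimator_sock.sendall(json.dumps({"power": power}).encode() + b"\n")

# Exporter receives the estimate.
reply = json.loads(exporter_sock.recv(4096).decode())
print(reply["power"])

exporter_sock.close()
estimator_sock.close()
```

Even in this toy form, the data crosses a socket boundary on every estimate, which is the per-node overhead the table above refers to.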
Pipelines: what are they? What are the differences between them? How do you develop a custom pipeline?
```mermaid
flowchart LR;
  query-data --> pipelines --> models
```
Each pipeline is composed of:
- input: a query metric (e.g., node_energy_stat) and target columns (e.g., cpu_cycles, pkg_energy_in_joules)
- training function: Keras layers, a scikit-learn regressor, ...
- output: pod dynamic power, node power per component (core/dram)
Which pipelines are activated depends on the metrics reported:
- all information is provided: all pipelines are activated
- frequency and architecture are not reported: ML power model pipelines cannot be trained
- only some usage measurements are enabled: only the pipelines that rely on those measurements are activated
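The activation rule above can be sketched as a simple availability check. The pipeline names and required-metric sets here are illustrative, not the server's actual registry.

```python
# Illustrative mapping from pipeline to the metrics it requires for training.
PIPELINES = {
    "ratio":               {"cpu_cycles"},
    "parameterized":       {"cpu_cycles", "cache_misses"},
    "ml_power":            {"cpu_cycles", "cache_misses", "frequency", "architecture"},
    "generalized_dynamic": {"cpu_cycles", "cache_misses"},
}

def active_pipelines(available_metrics):
    """Return the pipelines whose required inputs are all reported."""
    available = set(available_metrics)
    return [name for name, required in PIPELINES.items() if required <= available]

# frequency/architecture not reported -> the ML power model cannot be trained
print(active_pipelines(["cpu_cycles", "cache_misses"]))
```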
To develop a new custom pipeline:
- define the pipeline
  - input: usage metrics (e.g., cpu_cycles) + system metrics (e.g., frequency, architecture)
  - output: parameterized model, ML power model, or dynamic power model
- implement the train function, which receives a Prometheus client for querying data (prom_client[query]):

```python
def train(self, prom_client):
    ...
```
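A minimal end-to-end sketch of such a pipeline, assuming `prom_client` exposes a `query` method that returns rows of metric samples (the real client's interface may differ); the fake client, class name, and metric names are placeholders:

```python
class SimpleRatioPipeline:
    """Toy pipeline: learns a single scale factor mapping one usage metric to
    measured power (a one-feature least-squares fit through the origin)."""

    def __init__(self, usage_metric="cpu_cycles", power_metric="pkg_energy_in_joules"):
        self.usage_metric = usage_metric
        self.power_metric = power_metric
        self.weight = None

    def train(self, prom_client):
        rows = prom_client.query("node_energy_stat")  # assumed interface
        xs = [row[self.usage_metric] for row in rows]
        ys = [row[self.power_metric] for row in rows]
        # least-squares slope through the origin: w = sum(x*y) / sum(x*x)
        self.weight = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self.weight


class FakePromClient:
    """Stand-in for the real Prometheus query client, for illustration only."""
    def query(self, metric):
        return [
            {"cpu_cycles": 1e9, "pkg_energy_in_joules": 10.0},
            {"cpu_cycles": 2e9, "pkg_energy_in_joules": 20.0},
        ]


pipeline = SimpleRatioPipeline()
pipeline.train(FakePromClient())
print(pipeline.weight)  # 1e-08
```

A registered pipeline would be trained online by the model server whenever its required metrics (here, cpu_cycles plus a power measurement) appear in the query results.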