-
Notifications
You must be signed in to change notification settings - Fork 196
feat(disable-power-models): Add DisablePowerModels flag #1946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(disable-power-models): Add DisablePowerModels flag #1946
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1946 +/- ##
==========================================
- Coverage 51.53% 51.31% -0.22%
==========================================
Files 39 39
Lines 3522 3539 +17
==========================================
+ Hits 1815 1816 +1
- Misses 1555 1571 +16
Partials 152 152 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@@ -41,11 +43,14 @@ func EnergyMetricsPromDesc(context string) (descriptions map[string]*prometheus. | |||
} | |||
} else if strings.Contains(name, config.PLATFORM) && platform.IsSystemCollectionSupported() { | |||
source = platform.GetSourceName() | |||
} else if components.IsSystemCollectionSupported() { | |||
} else if strings.Contains(allComponents, name) && components.IsSystemCollectionSupported() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be replaced with name != config..OTHER || name != config.UNCORE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it can be replaced with name != config.Other. Not name != config.Uncore because that is part of node components. I thought this was more clear in saying just for core, dram, pkg, uncore. I am not sure what the source label should be for other. Prior to my implementation OTHER would always be trained_power_model which I think makes sense since it is estimation by doing platform - pkg - dram. If both platform and component power was available via sensors, should Other's source still be trained_power_model?
@KaiyiLiu1234 When Attaching logs for reference:
|
@vprashar2929 My apologies. I forgot to include all my changes. The error should be fixed now. I also cleaned some unnecessary code. For this PR, I think the most straightforward solution is to disabled metrics with source "trained_power_model" from being exported to prometheus. Previous changes included disabling models from being created with DisablePowerModel is enabled which given the current code base is quite complicated and difficult. This PR also fixes the error where metrics that have source "rapl-sysfs" when they should instead have source "trained_power_model". |
@sthaha Another issue I found is that kepler still exports these two metrics: kepler_process_joules_total and kepler_container_joules_total. Not only should these two metrics be removed and not used, but they also are very outdated. Past changes for disabling idle energy do not effect these metrics. These metrics also don't have a source. |
@KaiyiLiu1234 By disabling the power model I don't see any metric that has the source as |
@vprashar2929 Yeah those metrics are very outdated. To give you an idea, if you disable idle energy metrics, kepler_container_joules_total and kepler_process_joules_total will still show their idle metrics even when all other energy metrics do not show idle metrics. Those metrics do not even have a source either. Edit: I think we can fix those two metrics in a different PR. This PR looks to fix rapl and platform. Thoughts? |
After discussing with @sthaha, I updated the total and other energy metrics to not replace energy metrics that use models with 0 (when DisablePowerModels is enabled). @vprashar2929 So basically kepler_container_joules_total and process_joules_total and other will still appear but they will not include any energy metrics that use models. If all energy metrics used to calculate these fabricated metrics use models, then it will just output 0. Note for other energy, if platform is not available, then the output will be 0 (because otherwise 0 - pkg energy - dram energy will be negative) |
Added a config flag to prevent Kepler from resorting to models when power meters like acpi and rapl are not available. Issues: While Node level metrics that rely on models are fully removed, due to the current setup of process metrics, process metrics which rely on node metrics that rely on models are not removed and instead output 0 in prometheus. Signed-off-by: Kaiyi <[email protected]>
…ntainer metrics produced with models Fixed source label to show trained_power_models when models are in use and fully removed metrics which use models when disable models field is turned on. Signed-off-by: Kaiyi <[email protected]>
exported to Prometheus Metrics with source "trained-power-model" are not exported to prometheus when DisablePowerModels flag is enabled. This PR resolves the panic error that occurs when enabling DisablePowerModels. Signed-off-by: Kaiyi <[email protected]>
Metrics that have source "trained_power_model" can be disabled by preventing them from being exported to prometheus. This PR removes verbose code that disables models from being created when DisablePowerModel is enabled as this is unnecessary. Signed-off-by: Kaiyi <[email protected]>
…ulations If DisablePowerModels is enabled, any metrics with source "trained_power_model" that are used in Other and Total Energy Calculations (ex. kepler_node_other_joules_total, kepler_process_other_joules_total, kepler_container_other_joules_total, kepler_container_joules_total, kepler_process_joules_total) are removed from the calculation (by replacing the metric value with 0). Negative values will be replaced with 0. If all metrics used for Other and Total Energy Calculations are from source "trained_power_model", then the Other and Total Energy Calculations will export 0 (they will not be removed from prometheus). Signed-off-by: Kaiyi Liu <[email protected]>
cfd1552
to
f9acea5
Compare
tested and validated. LGTM |
|
Added a config flag to prevent Kepler from resorting to models when power meters like acpi and rapl are not available.
Issues: While Node level metrics that rely on models are fully removed, due to the current setup of process metrics, process metrics which rely on node metrics that rely on models are not removed and instead output 0 in prometheus.