feat(disable-power-models): Add DisablePowerModels flag #1946

Open
wants to merge 5 commits into
base: main

KaiyiLiu1234
Collaborator

Added a config flag to prevent Kepler from resorting to models when power meters like acpi and rapl are not available.

Issues: Node-level metrics that rely on models are fully removed. However, because of how process metrics are currently set up, process metrics that depend on model-based node metrics are not removed; they are exported to Prometheus with a value of 0.
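
As a rough illustration of the behaviour described above, the sketch below picks a source label for a metric and reports whether the metric should be exported at all. The function `sourceFor` and its parameters are hypothetical and are not taken from the Kepler code base; only the label `trained_power_model` comes from this thread.

```go
package main

// modelSource is the source label used in this thread for estimated metrics.
const modelSource = "trained_power_model"

// sourceFor is a hypothetical helper (not a Kepler function). It returns the
// source label for a metric and whether the metric should be exported.
//   - If a hardware sensor is available, the sensor name is the source.
//   - Otherwise, if power models are disabled, the metric is not exported.
//   - Otherwise the metric falls back to the trained power model.
func sourceFor(sensorSupported bool, sensorName string, disablePowerModels bool) (label string, export bool) {
	switch {
	case sensorSupported:
		return sensorName, true
	case disablePowerModels:
		return "", false
	default:
		return modelSource, true
	}
}
```

Under these assumptions, a node without RAPL or ACPI and with the flag enabled would export no model-based node metric, instead of one labelled with a sensor source.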

codecov bot commented Mar 11, 2025

Codecov Report

Attention: Patch coverage is 4.16667% with 23 lines in your changes missing coverage. Please review.

Project coverage is 51.31%. Comparing base (9bdb601) to head (f9acea5).

Files with missing lines              Patch %   Lines
pkg/model/node_component_energy.go    0.00%     14 Missing ⚠️
pkg/model/node_platform_energy.go     0.00%     6 Missing ⚠️
pkg/config/config.go                  25.00%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1946      +/-   ##
==========================================
- Coverage   51.53%   51.31%   -0.22%     
==========================================
  Files          39       39              
  Lines        3522     3539      +17     
==========================================
+ Hits         1815     1816       +1     
- Misses       1555     1571      +16     
  Partials      152      152              


@@ -41,11 +43,14 @@ func EnergyMetricsPromDesc(context string) (descriptions map[string]*prometheus.
 }
 } else if strings.Contains(name, config.PLATFORM) && platform.IsSystemCollectionSupported() {
 source = platform.GetSourceName()
-} else if components.IsSystemCollectionSupported() {
+} else if strings.Contains(allComponents, name) && components.IsSystemCollectionSupported() {
Collaborator

Can this be replaced with name != config.OTHER || name != config.UNCORE?

Collaborator Author

I guess it can be replaced with name != config.OTHER, but not with name != config.UNCORE, because uncore is one of the node components. I thought the current check made it clearer that this branch applies only to core, dram, pkg, and uncore. I am not sure what the source label should be for other. Before my change, OTHER always had the source trained_power_model, which I think makes sense because it is an estimate computed as platform - pkg - dram. If both platform and component power are available via sensors, should Other's source still be trained_power_model?
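
One way to make the intent of this branch explicit is a membership check over the four component names. The sketch below is only an illustration: the constant values and the helper `isNodeComponent` are hypothetical, and the real identifiers in Kepler's `config` package may differ. Note also that a condition of the form `name != A || name != B` holds for every name when A and B differ, so it would not exclude anything.

```go
package main

// Hypothetical metric names standing in for Kepler's config constants.
// The real identifiers and string values may differ.
const (
	core   = "core"
	dram   = "dram"
	pkg    = "package"
	uncore = "uncore"
	other  = "other"
)

// isNodeComponent reports whether name is one of the four component
// metrics discussed in the thread (core, dram, pkg, uncore).
// An exact match avoids the partial matches that a substring test
// against a concatenated string could produce.
func isNodeComponent(name string) bool {
	switch name {
	case core, dram, pkg, uncore:
		return true
	default:
		return false
	}
}
```

With this shape, "other" and "platform" fall through to their own branches without relying on a negated comparison.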

@vprashar2929
Collaborator

vprashar2929 commented Mar 17, 2025

@KaiyiLiu1234 When DISABLE_POWER_MODELS is set to true, Kepler crashes.

Attaching logs for reference:

kepler-dev-1  | + /usr/bin/kepler -address 0.0.0.0:8888 -disable-power-meter=false -v 8 -enable-gpu=false
kepler-dev-1  | Starting kepler
kepler-dev-1  | WARNING: failed to read int from file: open /sys/devices/system/cpu/cpu0/online: no such file or directory
kepler-dev-1  | I0317 16:56:34.349942 1402069 exporter.go:121] Kepler running on version: v0.7.12-111-g03cab9d9
kepler-dev-1  | I0317 16:56:34.350058 1402069 config.go:319] using gCgroup ID in the BPF program: true
kepler-dev-1  | I0317 16:56:34.350092 1402069 config.go:321] kernel version: 6.8
kepler-dev-1  | I0317 16:56:34.350115 1402069 config.go:297] config-dir: /etc/kepler/kepler.config
kepler-dev-1  | I0317 16:56:34.350146 1402069 config.go:282] ENABLE_EBPF_CGROUPID: true
kepler-dev-1  | I0317 16:56:34.350161 1402069 config.go:283] ENABLE_GPU: false
kepler-dev-1  | I0317 16:56:34.350176 1402069 config.go:284] ENABLE_PROCESS_METRICS: true
kepler-dev-1  | I0317 16:56:34.350190 1402069 config.go:285] EXPOSE_HW_COUNTER_METRICS: true
kepler-dev-1  | I0317 16:56:34.350202 1402069 config.go:286] EXPOSE_IRQ_COUNTER_METRICS: true
kepler-dev-1  | I0317 16:56:34.350220 1402069 config.go:287] EXPOSE_BPF_METRICS: true
kepler-dev-1  | I0317 16:56:34.350232 1402069 config.go:288] EXPOSE_COMPONENT_POWER: true
kepler-dev-1  | I0317 16:56:34.350243 1402069 config.go:289] EXPOSE_ESTIMATED_IDLE_POWER_METRICS: false. This only impacts when the power is estimated using pre-prained models. Estimated idle power is meaningful only when Kepler is running on bare-metal or with a single virtual machine (VM) on the node.
kepler-dev-1  | I0317 16:56:34.350262 1402069 config.go:290] EXPERIMENTAL_BPF_SAMPLE_RATE: 0
kepler-dev-1  | I0317 16:56:34.350276 1402069 config.go:291] EXCLUDE_SWAPPER_PROCESS: false
kepler-dev-1  | I0317 16:56:34.350289 1402069 config.go:292] DISABLE_POWER_MODELS: true
kepler-dev-1  | I0317 16:56:34.350373 1402069 power.go:59] use sysfs to obtain power
kepler-dev-1  | I0317 16:56:34.350409 1402069 redfish.go:167] failed to get redfish credential file path
kepler-dev-1  | I0317 16:56:34.350750 1402069 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?

.....

kepler-dev-1  |
kepler-dev-1  | I0317 16:54:58.131398 1398263 metric_collector.go:294] energy from pod/container: name: system_processes/system_processes namespace: system containerid:3a08e89229b4bbd542a3f87ce580e971b9ea7aae91fae9641a8713967b6b8b27
kepler-dev-1  |         Dyn ePkg (mJ): 75 (75) (eCore: 36 (36) eDram: 12 (12) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) platform (mJ): 0 (0)
kepler-dev-1  |         Idle ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) platform (mJ): 0 (0)
kepler-dev-1  |         ResUsage: map[bpf_block_irq:0 (0) bpf_cpu_time_ms:7 (7) bpf_net_rx_irq:2 (2) bpf_net_tx_irq:0 (0) bpf_page_cache_hit:0 (0) cache_miss:3615473 (3615473) cpu_cycles:15963881 (15963881) cpu_instructions:20659057 (20659057) cpu_ref_cycles:15963881 (15963881)]
kepler-dev-1  |
kepler-dev-1  | I0317 16:54:58.131452 1398263 metric_collector.go:297] node energy (mJ):
kepler-dev-1  |         Dyn ePkg (mJ): 41634 (41634) (eCore: 19295 (19295) eDram: 5487 (5487) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) platform (mJ): 0 (0)
kepler-dev-1  |         Idle ePkg (mJ): 0 (0) (eCore: 0 (0) eDram: 0 (0) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0) platform (mJ): 0 (0)
kepler-dev-1  |         ResUsage: map[bpf_block_irq:2939 (5927) bpf_cpu_time_ms:3965 (8007) bpf_net_rx_irq:562 (1038) bpf_net_tx_irq:0 (0) bpf_page_cache_hit:0 (0) cache_miss:163420118 (269898961) cpu_cycles:792542481 (1329698513) cpu_instructions:891725299 (1380155980) cpu_ref_cycles:792542481 (1329698513)]
kepler-dev-1  |
kepler-dev-1  |
kepler-dev-1  | I0317 16:54:58.131459 1398263 metric_collector.go:102] Collector Update elapsed time: 11.377281ms
kepler-dev-1  | panic: runtime error: invalid memory address or nil pointer dereference
kepler-dev-1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x15d2fd7]
kepler-dev-1  |
kepler-dev-1  | goroutine 30 [running]:
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/metrics/utils.collect(...)
kepler-dev-1  |         /workspace/pkg/metrics/utils/utils.go:122
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/metrics/utils.collectEnergy(0xc002ecfd40, {0x1907b60, 0xc0002b6460}, {0x19e2fc2, 0x13}, {0x19d3ab6, 0x7}, {0x0, 0x0})
kepler-dev-1  |         /workspace/pkg/metrics/utils/utils.go:154 +0x697
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/metrics/utils.CollectEnergyMetrics(0xc002ecfd40, {0x1907b60, 0xc0002b6460}, 0xc000336d80)
kepler-dev-1  |         /workspace/pkg/metrics/utils/utils.go:42 +0x153
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/metrics/node.(*collector).Collect(0xc00460fd20, 0xc002ecfd40)
kepler-dev-1  |         /workspace/pkg/metrics/node/metrics.go:81 +0x6f
kepler-dev-1  | github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
kepler-dev-1  |         /workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:456 +0x105
kepler-dev-1  | created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 26
kepler-dev-1  |         /workspace/vendor/github.com/prometheus/client_golang/prometheus/registry.go:548 +0xbab
kepler-dev-1 exited with code 2
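
The trace above shows a nil pointer dereference inside `collect`, but it does not show the root cause. One common way this kind of panic arises in Go is looking up a key that was never added to a map of pointers and then dereferencing the result. The sketch below illustrates a guard against that pattern; the `desc` type and both helpers are hypothetical and do not reproduce Kepler's `utils.go`.

```go
package main

// desc is a hypothetical stand-in for a metric descriptor.
type desc struct {
	Name string
}

// lookup returns the descriptor registered under name and whether a
// usable (non-nil) descriptor exists. A missing key in a map of pointers
// yields nil, so both conditions are checked.
func lookup(descs map[string]*desc, name string) (*desc, bool) {
	d, ok := descs[name]
	if !ok || d == nil {
		return nil, false
	}
	return d, true
}

// describe returns the descriptor's name, or an empty string when no
// descriptor was registered, instead of dereferencing a nil pointer.
func describe(descs map[string]*desc, name string) string {
	d, ok := lookup(descs, name)
	if !ok {
		return ""
	}
	return d.Name
}
```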


@KaiyiLiu1234 KaiyiLiu1234 added the kind/feature New feature or request label Mar 17, 2025
@KaiyiLiu1234
Collaborator Author

@vprashar2929 My apologies, I forgot to include all my changes. The error should be fixed now. I also cleaned up some unnecessary code. For this PR, I think the most straightforward solution is to stop metrics with source "trained_power_model" from being exported to Prometheus. Previous changes prevented the models from being created when DisablePowerModels is enabled, which is quite complicated and difficult given the current code base. This PR also fixes the error where metrics were labelled with source "rapl-sysfs" when they should have had source "trained_power_model".
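
The export-time approach described above can be sketched as a filter over the metrics that are about to be exported. This is a minimal illustration, not the PR's code: the `energyMetric` type and the `exportable` function are hypothetical, and only the two source labels are taken from this thread.

```go
package main

// Source labels mentioned in the thread.
const (
	sourceModel = "trained_power_model"
	sourceRAPL  = "rapl-sysfs"
)

// energyMetric is a hypothetical, simplified stand-in for a metric that
// is about to be exported. It is not a Kepler type.
type energyMetric struct {
	Name   string
	Source string
	Joules float64
}

// exportable returns the metrics that should reach Prometheus.
// When disablePowerModels is true, metrics whose source is the trained
// power model are dropped; all other metrics pass through unchanged.
func exportable(metrics []energyMetric, disablePowerModels bool) []energyMetric {
	kept := make([]energyMetric, 0, len(metrics))
	for _, m := range metrics {
		if disablePowerModels && m.Source == sourceModel {
			continue
		}
		kept = append(kept, m)
	}
	return kept
}
```

Filtering at export time leaves model creation untouched, which matches the reasoning given in the comment: the models still exist, but their output never reaches Prometheus when the flag is set.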

@KaiyiLiu1234
Collaborator Author

@sthaha Another issue I found is that Kepler still exports these two metrics: kepler_process_joules_total and kepler_container_joules_total. These two metrics should be removed and no longer used, and they are also very outdated: past changes for disabling idle energy do not affect them, and they do not have a source label.

@vprashar2929
Collaborator

@KaiyiLiu1234 With the power model disabled, I don't see any metric that has trained_power_model as its source, but I think kepler_container_joules_total and process_joules_total still make use of the power that is accounted for by the power model.

Screenshot 2025-03-18 at 12 52 30 PM

@KaiyiLiu1234
Collaborator Author

KaiyiLiu1234 commented Mar 18, 2025

@vprashar2929 Yeah, those metrics are very outdated. To give you an idea, if you disable idle energy metrics, kepler_container_joules_total and kepler_process_joules_total still include idle energy even though no other energy metric does. Those metrics do not have a source label either. Edit: I think we can fix those two metrics in a different PR. This PR aims to fix the rapl and platform metrics. Thoughts?

@KaiyiLiu1234
Collaborator Author

KaiyiLiu1234 commented Mar 19, 2025

After discussing with @sthaha, I updated the total and other energy metrics so that, when DisablePowerModels is enabled, any energy metric that uses models is replaced with 0 in their calculation. @vprashar2929 So kepler_container_joules_total, process_joules_total, and the other metrics will still appear, but they will not include any energy that comes from models. If every energy metric used to calculate one of these derived metrics comes from models, the derived metric will simply output 0. Note that for other energy, if platform is not available, the output will be 0 (because otherwise 0 - pkg energy - dram energy would be negative).
Screenshot From 2025-03-19 16-46-28
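
The rule for other energy described in this comment can be sketched as follows. The `reading` type and both functions are hypothetical names used only for illustration; the formula platform - pkg - dram, the substitution of 0 for model-based inputs, and the clamping of negative results to 0 are taken from the discussion.

```go
package main

// reading is a hypothetical pair of an energy value and a flag saying
// whether the value came from a trained power model rather than a sensor.
type reading struct {
	Joules    float64
	FromModel bool
}

// usable returns the value that may enter a derived calculation.
// Model-based readings count as 0 when power models are disabled.
func usable(r reading, disablePowerModels bool) float64 {
	if disablePowerModels && r.FromModel {
		return 0
	}
	return r.Joules
}

// otherEnergy computes platform - pkg - dram from the usable values and
// clamps a negative result to 0, so a missing or model-based platform
// reading cannot produce negative energy.
func otherEnergy(platform, pkg, dram reading, disablePowerModels bool) float64 {
	other := usable(platform, disablePowerModels) -
		usable(pkg, disablePowerModels) -
		usable(dram, disablePowerModels)
	if other < 0 {
		return 0
	}
	return other
}
```

For example, with a model-based platform reading and sensor-based pkg and dram readings, the flag turns the platform term into 0, the difference becomes negative, and the result is clamped to 0.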

KaiyiLiu1234 and others added 5 commits March 20, 2025 08:08
Added a config flag to prevent Kepler from resorting to models when
power meters like acpi and rapl are not available.

Issues: While Node level metrics that rely on models are fully removed,
due to the current setup of process metrics, process metrics which rely
on node metrics that rely on models are not removed and instead output 0
in prometheus.

Signed-off-by: Kaiyi <[email protected]>
…ntainer metrics produced with models

Fixed source label to show trained_power_model when models are in use and fully
removed metrics which use models when the disable models field is turned on.

Signed-off-by: Kaiyi <[email protected]>
exported to Prometheus

Metrics with source "trained_power_model" are not exported to Prometheus
when the DisablePowerModels flag is enabled. This PR resolves the panic
that occurs when enabling DisablePowerModels.

Signed-off-by: Kaiyi <[email protected]>
Metrics that have source "trained_power_model" can be disabled by
preventing them from being exported to prometheus. This PR removes
verbose code that disables models from being created when
DisablePowerModel is enabled as this is unnecessary.

Signed-off-by: Kaiyi <[email protected]>
…ulations

If DisablePowerModels is enabled, any metrics with source "trained_power_model" that
are used in Other and Total Energy Calculations (ex. kepler_node_other_joules_total,
kepler_process_other_joules_total, kepler_container_other_joules_total,
kepler_container_joules_total, kepler_process_joules_total) are removed from the
calculation (by replacing the metric value with 0). Negative values will be replaced with
0. If all metrics used for Other and Total Energy Calculations are from source "trained_power_model",
then the Other and Total Energy Calculations will export 0 (they will not be removed from prometheus).

Signed-off-by: Kaiyi Liu <[email protected]>
@vprashar2929
Collaborator

tested and validated. LGTM

@KaiyiLiu1234
Collaborator Author

@sthaha

tested and validated. LGTM

Labels
kind/feature New feature or request
3 participants