-
Notifications
You must be signed in to change notification settings - Fork 222
feat(hwmon): implement device reader for architectures with hwmon sensors #2372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(hwmon): implement device reader for architectures with hwmon sensors #2372
Conversation
|
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2372 +/- ##
==========================================
- Coverage 90.71% 89.52% -1.19%
==========================================
Files 44 45 +1
Lines 3865 4163 +298
==========================================
+ Hits 3506 3727 +221
- Misses 269 323 +54
- Partials 90 113 +23 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
8e0a139 to
f7da3ac
Compare
|
|
|
TODO: Update documentation to include experimental hwmon enable feature. |
|
TODO: Add unit tests for config.go |
|
📊 Profiling reports are ready to be viewed
💻 CPU Comparison with base Kepler💾 Memory Comparison with base Kepler (Inuse)💾 Memory Comparison with base Kepler (Alloc)⬇️ Download the Profiling artifacts from the Actions Summary page 📦 Artifact name: 🔧 Or use GitHub CLI to download artifacts: gh run download 20350200717 -n profile-artifacts-2372 |
|
📊 Profiling reports are ready to be viewed
💻 CPU Comparison with base Kepler💾 Memory Comparison with base Kepler (Inuse)💾 Memory Comparison with base Kepler (Alloc)⬇️ Download the Profiling artifacts from the Actions Summary page 📦 Artifact name: 🔧 Or use GitHub CLI to download artifacts: gh run download 20350294686 -n profile-artifacts-2372 |
f7da3ac to
8b27076
Compare
|
|
|
📊 Profiling reports are ready to be viewed
💻 CPU Comparison with base Kepler💾 Memory Comparison with base Kepler (Inuse)💾 Memory Comparison with base Kepler (Alloc)⬇️ Download the Profiling artifacts from the Actions Summary page 📦 Artifact name: 🔧 Or use GitHub CLI to download artifacts: gh run download 20350934438 -n profile-artifacts-2372 |
…er sensors Implemented hwmon device reader to allow Kepler to acquire power metrics (watts) from hwmon subsystem. Hwmon enablement added as an experimental feature in config. Signed-off-by: Kaiyi Liu <[email protected]>
8b27076 to
0fc4ee5
Compare
|
|
|
📊 Profiling reports are ready to be viewed
💻 CPU Comparison with base Kepler💾 Memory Comparison with base Kepler (Inuse)💾 Memory Comparison with base Kepler (Alloc)⬇️ Download the Profiling artifacts from the Actions Summary page 📦 Artifact name: 🔧 Or use GitHub CLI to download artifacts: gh run download 20361710986 -n profile-artifacts-2372 |
| Zones []string `yaml:"zones"` | ||
| Chips []string `yaml:"chips"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comments briefly explaining what are zones and chips in hwmon.
are these hwmon concepts or arm concepts?
| func sysReadFile(file string) ([]byte, error) { | ||
| f, err := os.Open(file) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| defer func() { _ = f.Close() }() | ||
|
|
||
| // On some machines, hwmon drivers are broken and return EAGAIN. This causes | ||
| // Go's os.ReadFile implementation to poll forever. | ||
| // | ||
| // Since we either want to read data or bail immediately, do the simplest | ||
| // possible read using system call directly. | ||
| b := make([]byte, 128) | ||
| n, err := unix.Read(int(f.Fd()), b) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| if n < 0 { | ||
| return nil, fmt.Errorf("failed to read file: %q, read returned negative bytes value: %d", file, n) | ||
| } | ||
|
|
||
| return b[:n], nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do simple ioutil.ReadFile
in rapl support, procfs also uses filepath and os.ReadFile
https://github.com/prometheus/procfs/blob/master/sysfs/class_powercap.go#L90-L94
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is quite interesting. Counterintuitively, you actually should not use os.ReadFile. I initially did this, but according to node exporter themselves, for reading hwmon drivers, you should use system call directly:
// On some machines, hwmon drivers are broken and return EAGAIN. This causes
// Go's os.ReadFile implementation to poll forever.
//
// Since we either want to read data or bail immediately, do the simplest
// possible read using system call directly.
| func (z *hwmonPowerZone) MaxEnergy() Energy { | ||
| // No maximum for power sensors | ||
| return 0 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this function used anywhere? what is the implication of returning 0 as the MaxPower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not MaxPower it is MaxEnergy. This is for satisfying the EnergyZone interface. Essentially, it means that because this zone only exposes Power, Energy and MaxEnergy will return 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is lebel being read in hwmon_power_meter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes absolutely. We need the label to determine what kind of "zone" this sensor satisfies at the node level (ex. cpu_power, io_power)
| // Returns a map of sensor number -> sensor files | ||
| func (r *sysfsHwmonReader) findSensorsByType(files []os.DirEntry, sensorType string) map[int]map[string]string { | ||
| sensors := make(map[int]map[string]string) | ||
| pattern := regexp.MustCompile(fmt.Sprintf(`^%s(\d+)_(.+)$`, sensorType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this get compiled only at startup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better move it to package level
| var zones []EnergyZone | ||
| for _, entry := range hwmonDirs { | ||
| // check for valid hwmon devices | ||
| if !entry.IsDir() && !isSymlink(filepath.Join(r.basePath, entry.Name())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it needs to be a symlink?
| // Scan for power sensors | ||
| files, err := os.ReadDir(hwmonPath) | ||
| if err != nil { | ||
| return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need better error message
|
|
||
| func (r *sysfsHwmonReader) discoverZones(hwmonPath string) ([]EnergyZone, error) { | ||
| // Get chip name | ||
| chipName, err := r.getChipName(hwmonPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not filter the chips here itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be put there, but I followed rapl's setup which instead placed it in Zones() to be consistent. Please refer to rapl_sysfs_power_meter.go
|
|
||
| // WithChipFilter sets chip names to include for monitoring | ||
| // If empty, all chips are included | ||
| func WithChipFilter(chips []string) HwmonOptionFn { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how will the users decide which chip names are to be included here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the config:
experimental:
hwmon:
enabled: true
zones: ["package", "core"] # Filter by zone names
chips: ["k10temp", "amdgpu"] # Filter by chip names
Empty means everything will be included. Chip names are determined based on how node rapl determines them:
- Try to construct name from device path
- Fall back to /sys/class/hwmon/hwmonX/name file
- Fall back to hwmon directory name (hwmon0, hwmon1, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets clearly document what we mean by a "zone". does hwmon_power_meter use sensor as zone?
hwmon0 << chip
├── power1_input << sensor
└── power2_input << sensor
hwmon1
├── power1_input
└── power2_input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the documentation
Implemented hwmon device reader to allow Kepler to acquire power metrics (watts) from hwmon subsystem. Hwmon enablement added as an experimental feature in config.
Precise changes: