[WIP] Add PSI and Lock Contention kernel metrics for observer anomaly detection #47821
Draft
mbertrone wants to merge 4 commits into q-branch-observer from
Add a host-level Pressure Stall Information (PSI) core check that reads
/proc/pressure/{cpu,memory,io} and emits system.pressure.* metrics.
- Parses avg10, avg60, avg300 and total stall microseconds
- Emits both "some" and "full" variants for memory and io
- Gracefully skips on kernels without PSI support (< 4.20)
- Includes unit tests with fixture-based /proc/pressure parsing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
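As a rough illustration of the parsing described in this commit, here is a minimal Go sketch. It is not the code from this PR: the names psiLine, parsePSI and readPSI are hypothetical, and it assumes the usual PSI file layout of one line per variant, such as "some avg10=0.00 avg60=0.00 avg300=0.00 total=0". A missing file is treated as "PSI not supported" rather than as an error, which is one possible way to skip gracefully on older kernels.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"strconv"
	"strings"
)

// psiLine holds one parsed line of a /proc/pressure/<resource> file.
// The type name and field names are hypothetical, chosen for this sketch.
type psiLine struct {
	Kind   string  // "some" or "full"
	Avg10  float64 // stall percentage over the last 10 seconds
	Avg60  float64 // stall percentage over the last 60 seconds
	Avg300 float64 // stall percentage over the last 300 seconds
	Total  uint64  // cumulative stall time in microseconds
}

// parsePSI parses the full content of one PSI file.
func parsePSI(content string) ([]psiLine, error) {
	var out []psiLine
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // tolerate blank lines
		}
		line := psiLine{Kind: fields[0]}
		if line.Kind != "some" && line.Kind != "full" {
			return nil, fmt.Errorf("unexpected PSI line kind %q", line.Kind)
		}
		for _, kv := range fields[1:] {
			key, val, ok := strings.Cut(kv, "=")
			if !ok {
				return nil, fmt.Errorf("malformed PSI field %q", kv)
			}
			var err error
			switch key {
			case "avg10":
				line.Avg10, err = strconv.ParseFloat(val, 64)
			case "avg60":
				line.Avg60, err = strconv.ParseFloat(val, 64)
			case "avg300":
				line.Avg300, err = strconv.ParseFloat(val, 64)
			case "total":
				line.Total, err = strconv.ParseUint(val, 10, 64)
			}
			if err != nil {
				return nil, fmt.Errorf("parsing %q: %w", kv, err)
			}
		}
		out = append(out, line)
	}
	return out, sc.Err()
}

// readPSI reads and parses a PSI file. supported is false, with a nil
// error, when the file does not exist (assumed here to mean no PSI support).
func readPSI(path string) (lines []psiLine, supported bool, err error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	lines, err = parsePSI(string(data))
	return lines, true, err
}

func main() {
	lines, supported, err := readPSI("/proc/pressure/memory")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if !supported {
		fmt.Println("PSI not available, skipping")
		return
	}
	for _, l := range lines {
		fmt.Printf("%s avg10=%.2f avg60=%.2f avg300=%.2f total=%dus\n",
			l.Kind, l.Avg10, l.Avg60, l.Avg300, l.Total)
	}
}
```

Keeping parsePSI separate from file access is what makes fixture-based tests straightforward: a test can pass a string fixture without touching /proc.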
Add an eBPF-based kernel lock contention check that attaches to lock_contention_begin/end tracepoints and measures per-lock hold times.
- eBPF program tracks lock acquire/release timestamps per TID
- System-probe module exposes aggregated lock contention stats
- Agent check queries system-probe and emits ebpf.lock_contention_ns
- Graceful degradation via IgnoreStartupError for missing tracepoints
- Includes per-CPU array optimization and FD mapping diagnostics
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
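The per-TID timestamp tracking and aggregation described in this commit can be sketched in user-space Go. This is only a simulation of the idea, not the eBPF program or system-probe module from this PR: the tracker type, its methods, and the simplification of at most one in-flight event per TID are assumptions made for illustration. It pairs each begin event with the matching end event and accumulates a count, a total duration, and a maximum duration per lock type.

```go
package main

import "fmt"

// lockStats aggregates events for one lock type (hypothetical type).
type lockStats struct {
	Count     uint64 // number of completed begin/end pairs
	WaitNs    uint64 // sum of durations in nanoseconds
	MaxWaitNs uint64 // longest single duration in nanoseconds
}

// pending records an event that has begun but not yet ended.
type pending struct {
	lockType string
	startNs  uint64
}

// tracker pairs begin/end events per thread ID.
// Assumption for this sketch: at most one in-flight event per TID.
type tracker struct {
	inflight map[uint32]pending
	stats    map[string]*lockStats
}

func newTracker() *tracker {
	return &tracker{
		inflight: make(map[uint32]pending),
		stats:    make(map[string]*lockStats),
	}
}

// begin stores the start timestamp for the given thread.
func (t *tracker) begin(tid uint32, lockType string, nowNs uint64) {
	t.inflight[tid] = pending{lockType: lockType, startNs: nowNs}
}

// end closes the in-flight event for the thread and updates the stats.
// An end without a matching begin is ignored, which can happen when
// tracing starts while a thread is already waiting.
func (t *tracker) end(tid uint32, nowNs uint64) {
	p, ok := t.inflight[tid]
	if !ok {
		return
	}
	delete(t.inflight, tid)
	if nowNs < p.startNs {
		return // guard against clock anomalies
	}
	d := nowNs - p.startNs
	s := t.stats[p.lockType]
	if s == nil {
		s = &lockStats{}
		t.stats[p.lockType] = s
	}
	s.Count++
	s.WaitNs += d
	if d > s.MaxWaitNs {
		s.MaxWaitNs = d
	}
}

func main() {
	tr := newTracker()
	tr.begin(1, "mutex", 100)
	tr.begin(2, "spinlock", 120)
	tr.end(2, 150)
	tr.end(1, 400)
	for _, lt := range []string{"mutex", "spinlock"} {
		s := tr.stats[lt]
		fmt.Printf("%s count=%d total_ns=%d max_ns=%d\n",
			lt, s.Count, s.WaitNs, s.MaxWaitNs)
	}
}
```

In a real eBPF program the in-flight map would live in kernel space and the aggregation would likely use per-CPU storage, as the commit message mentions; the sketch above only shows the pairing logic.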
Resolve build issues from cherry-picking lock contention onto the observer branch:
- Fix WriteAsJSON signature (no request param on this branch)
- Remove noisyneighbor/injector references not present on this branch
- Keep lock_contention_check module registration and config
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Go Package Import Differences
Baseline: 681899c
Static quality checks
❌ Static quality gates prevent the PR from merging.
Summary
- PSI check reading /proc/pressure/{cpu,memory,io} and emitting system.pressure.* metrics
- eBPF lock contention check attaching to lock_contention_begin/end tracepoints, emitting system.lock_contention.{count,wait_time,max_wait} per lock type
- Targets the q-branch-observer branch
These metrics feed into the observer detection engine (BOCPD/RRCF), enabling anomaly detection on kernel-level resource pressure and lock contention. Both signals have clear health semantics for identifying degraded hosts.
New Metrics
PSI (system.pressure.*):
- system.pressure.cpu.some.total
- system.pressure.memory.some.total
- system.pressure.memory.full.total
- system.pressure.io.some.total
- system.pressure.io.full.total
Lock Contention (system.lock_contention.*):
- system.lock_contention.count (tag: lock_type)
- system.lock_contention.wait_time (tag: lock_type)
- system.lock_contention.max_wait (tag: lock_type)
Lock types: spinlock, mutex, rwsem_read, rwsem_write, rwlock_read, rwlock_write
Architecture
- The lock contention module uses IgnoreStartupError for graceful degradation.
- Both checks have conf.yaml.default files and run at 15s intervals.
Tested
- ubuntu-24 VM (kernel 6.8, arm64)
- Under stress-ng load:
  - system.lock_contention.wait_time:avg (changepoint detected)
  - system.lock_contention.count:avg (changepoint detected)
  - system.pressure.memory.some.total:avg (changepoint detected)
  - system.pressure.memory.full.total:avg (changepoint detected)
Test plan
- PSI parsing unit tests (pressure_linux_test.go, 302 lines)
- eBPF type tests (ebpf_types_linux_test.go)