Implement polling-based BMC metrics collection#837
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (16)
💤 Files with no reviewable changes (9)
🚧 Files skipped from review as they are similar to previous changes (7)
📝 WalkthroughWalkthroughThis PR implements polling-based BMC monitoring in place of subscription-based delivery: it removes subscription fields and helpers, adds BMC GetMetricReport/GetEventLog methods and types with Redfish implementations, refactors serverevents into a ticker-driven polling server with a shared Prometheus collector, and updates CLI, CRD, controller, tests, and docs. ChangesPolling-Based Monitoring Transition
Sequence Diagram(s)sequenceDiagram
participant Manager
participant PollingServer
participant KubeClient
participant BMC
participant Collector
Manager->>PollingServer: Start(ctx)
PollingServer->>PollingServer: initialize ticker
loop Every interval
PollingServer->>KubeClient: List BMC CRs
KubeClient-->>PollingServer: BMC list
par Bounded concurrent polls
PollingServer->>BMC: GetMetricReport()
BMC-->>PollingServer: MetricsReport
PollingServer->>Collector: UpdateFromMetricsReport()
PollingServer->>BMC: GetEventLog()
BMC-->>PollingServer: []Event
PollingServer->>Collector: UpdateFromEvent()
end
end
Manager->>PollingServer: ctx.Done()
PollingServer-->>Manager: Stop
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
maxmoehl
left a comment
There was a problem hiding this comment.
Overall the change looks good to me. Please also address the failing checks.
| return nil, fmt.Errorf("no systems found") | ||
| } | ||
|
|
||
| system := systems[0] |
There was a problem hiding this comment.
In general we do support multiple systems, but with #825 we are filtering out non-bootable systems. This should be extended to include all bootable systems, maybe a shared helper makes sense to not duplicate the condition for a valid system if we extend that in the future.
/cc @afritzler
| for { | ||
| select { | ||
| case <-ticker.C: | ||
| s.pollAllBMCs(ctx) |
There was a problem hiding this comment.
We should consider limiting for how long the poll runs, if it is slow and the interval is low we might be overwhelming BMCs. So something like this:
| s.pollAllBMCs(ctx) | |
| s.pollAllBMCs(context.WithTimeout(ctx, s.interval)) |
But you will have to handle the cancel function properly.
f237cc1 to
4e30b43
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
bmc/redfish_metrics_test.go (1)
1-322:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix formatting issue flagged by pipeline.
The pipeline failure indicates uncommitted formatting changes detected after build. The spacing/alignment of
@odata.idfields in the test's JSON mock responses differs from the expected format.Run
make lint-fixto apply the correct formatting before committing.As per coding guidelines: "Run
make lint-fixandmake testafter editing Go source files"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bmc/redfish_metrics_test.go` around lines 1 - 322, The pipeline is failing due to formatting differences in the test JSON responses (the "@odata.id" keys/spacing) in the Redfish tests; open the test file and normalize formatting for the JSON map literals (e.g., the map entries in the handlers that include "@odata.id" such as in the handlers for "/redfish/v1/", "/redfish/v1/TelemetryService", "/redfish/v1/TelemetryService/MetricReports", "/redfish/v1/TelemetryService/MetricReports/Report1", and the various "/redfish/v1/Systems..." handlers), then run the repository formatter/linter fix command (make lint-fix) and re-run tests (make test) before committing to ensure the formatting changes are applied and the pipeline passes.
🧹 Nitpick comments (2)
bmc/redfish.go (1)
1236-1237: 💤 Low valueConsider making the event log time window configurable.
The 10-minute cutoff is hardcoded. Depending on the polling interval (configurable via
--metrics-polling-interval, default 120s), a 10-minute window may collect many duplicate events or miss recent events if the interval is much longer. Consider making this configurable or deriving it from the polling interval (e.g.,2 * pollingInterval).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bmc/redfish.go` around lines 1236 - 1237, The hardcoded 10-minute cutoff (cutoff := time.Now().Add(-10 * time.Minute)) should be made configurable or derived from the polling interval; update the logic in bmc/redfish.go that computes cutoff to use a configurable duration (e.g., add an EventLogWindow setting) or compute window := 2 * metricsPollingInterval and then cutoff := time.Now().Add(-window). Modify the code paths that call this cutoff computation (look for the cutoff variable and the function/method that filters recent entries) to read the new config or polling interval value and validate a sane minimum and maximum.internal/serverevents/server.go (1)
127-149: ⚡ Quick winAdd context cancellation checks in polling loop.
When the context is cancelled (e.g., during shutdown), the spawned goroutines will continue running until
pollBMCcompletes. This could delay shutdown if BMC operations are slow. Consider checkingctx.Done()before spawning each goroutine or passing the context check into the semaphore acquisition.🛑 Proposed fix to respect context cancellation
for i := range bmcList.Items { bmcObj := &bmcList.Items[i] if !bmcObj.DeletionTimestamp.IsZero() { continue } + + // Check if context is cancelled before spawning more goroutines + select { + case <-ctx.Done(): + break + default: + } wg.Add(1) go func(bmc *metalv1alpha1.BMC) { defer wg.Done() semaphore <- struct{}{} defer func() { <-semaphore }() s.pollBMC(ctx, bmc) }(bmcObj) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/serverevents/server.go` around lines 127 - 149, The polling loop should respect context cancellation: before scheduling a goroutine for bmcObj check ctx.Done()/ctx.Err() and skip starting the worker if cancelled, and inside the goroutine use a cancellable semaphore acquire (select on sending to semaphore vs <-ctx.Done()) so you can return early instead of blocking; keep using wg to track started goroutines and ensure you call wg.Done() on early exit. Specifically update the loop around wg, semaphore and the anonymous goroutine that calls s.pollBMC(ctx, bmc) to perform the pre-spawn context check and make semaphore acquisition context-aware so shutdown can proceed promptly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bmc/redfish.go`:
- Around line 1241-1243: The code silently ignores errors from time.Parse when
converting entry.Created to entryTime (entryTime, _ = time.Parse(time.RFC3339,
entry.Created)), which lets malformed timestamps bypass the 10-minute filter;
update the parsing logic in the same function to check and handle the parse
error: call time.Parse and capture the error, and on error either log the
failure (using the existing logger) and skip the entry or treat the entry as
invalid by continuing to the next record; ensure you still use
entryTime.IsZero() only for genuinely empty timestamps and not for parse
failures so the 10-minute filter behaves correctly.
- Around line 1176-1187: Update the loop that builds your local MetricValue
slice from report.MetricValues to extract structured fields instead of
stringifying: for each entry in report.MetricValues, attempt to read the metric
identifier, property path, and value (e.g., use report.MetricId / MetricID,
report.MetricProperty, and report.MetricValue or the corresponding keys on the
reported element) and populate MetricID, MetricProperty and MetricValue with
those extracted values (fall back to the current fmt.Sprintf behavior only if
the typed fields are absent); modify the code inside the existing for i := 0; i
< len(report.MetricValues); i++ loop (and the MetricValue struct population) to
prefer the real fields from the gofish MetricReport element instead of hardcoded
"Metric%d", report.ODataID, and fmt.Sprintf("%v", ...).
---
Outside diff comments:
In `@bmc/redfish_metrics_test.go`:
- Around line 1-322: The pipeline is failing due to formatting differences in
the test JSON responses (the "@odata.id" keys/spacing) in the Redfish tests;
open the test file and normalize formatting for the JSON map literals (e.g., the
map entries in the handlers that include "@odata.id" such as in the handlers for
"/redfish/v1/", "/redfish/v1/TelemetryService",
"/redfish/v1/TelemetryService/MetricReports",
"/redfish/v1/TelemetryService/MetricReports/Report1", and the various
"/redfish/v1/Systems..." handlers), then run the repository formatter/linter fix
command (make lint-fix) and re-run tests (make test) before committing to ensure
the formatting changes are applied and the pipeline passes.
---
Nitpick comments:
In `@bmc/redfish.go`:
- Around line 1236-1237: The hardcoded 10-minute cutoff (cutoff :=
time.Now().Add(-10 * time.Minute)) should be made configurable or derived from
the polling interval; update the logic in bmc/redfish.go that computes cutoff to
use a configurable duration (e.g., add an EventLogWindow setting) or compute
window := 2 * metricsPollingInterval and then cutoff := time.Now().Add(-window).
Modify the code paths that call this cutoff computation (look for the cutoff
variable and the function/method that filters recent entries) to read the new
config or polling interval value and validate a sane minimum and maximum.
In `@internal/serverevents/server.go`:
- Around line 127-149: The polling loop should respect context cancellation:
before scheduling a goroutine for bmcObj check ctx.Done()/ctx.Err() and skip
starting the worker if cancelled, and inside the goroutine use a cancellable
semaphore acquire (select on sending to semaphore vs <-ctx.Done()) so you can
return early instead of blocking; keep using wg to track started goroutines and
ensure you call wg.Done() on early exit. Specifically update the loop around wg,
semaphore and the anonymous goroutine that calls s.pollBMC(ctx, bmc) to perform
the pre-spawn context check and make semaphore acquisition context-aware so
shutdown can proceed promptly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9e4c9647-59cc-4742-92a4-80dd82bcd3c0
📒 Files selected for processing (14)
api/v1alpha1/bmc_types.gobmc/bmc.gobmc/mock/server/data/index.jsonbmc/redfish.gobmc/redfish_metrics_test.gocmd/main.goconfig/crd/bases/metal.ironcore.dev_bmcs.yamlinternal/controller/bmc_controller.gointernal/controller/bmc_controller_test.gointernal/controller/suite_test.gointernal/serverevents/metrics.gointernal/serverevents/server.gointernal/serverevents/subscription.gotest/serverevents/main.go
💤 Files with no reviewable changes (7)
- internal/controller/suite_test.go
- api/v1alpha1/bmc_types.go
- test/serverevents/main.go
- internal/serverevents/subscription.go
- config/crd/bases/metal.ironcore.dev_bmcs.yaml
- internal/controller/bmc_controller_test.go
- internal/controller/bmc_controller.go
- Implement multi-system support in GetEventLog: now queries all bootable systems (those with BootSourceOverrideTarget) instead of just the first system, consistent with GetSystems filtering from PR #825 - Simplify goroutine in pollAllBMCs: move WaitGroup and semaphore management into pollBMC function signature - Add poll timeout protection: wrap pollAllBMCs calls with context.WithTimeout to prevent slow polls from overwhelming BMCs - Fix test mock to include BootSourceOverrideTarget field - Fix linter errors: explicitly ignore json.Encode errors in test mocks Addresses review comments from @maxmoehl on PR #837. Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
4e30b43 to
e138e0e
Compare
|
Fixed metric value extraction to use structured fields from Redfish MetricReport instead of hardcoded values. Changes:
All tests pass (26/26) and no lint issues. |
Closes #813 Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
Remove subscription link fields from generated manifests and docs. Fix whitespace formatting in test file. Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
- Implement multi-system support in GetEventLog: now queries all bootable systems (those with BootSourceOverrideTarget) instead of just the first system, consistent with GetSystems filtering from PR #825 - Simplify goroutine in pollAllBMCs: move WaitGroup and semaphore management into pollBMC function signature - Add poll timeout protection: wrap pollAllBMCs calls with context.WithTimeout to prevent slow polls from overwhelming BMCs - Fix test mock to include BootSourceOverrideTarget field - Fix linter errors: explicitly ignore json.Encode errors in test mocks Addresses review comments from @maxmoehl on PR #837. Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
Replace hardcoded metric identifiers with actual field extraction from gofish MetricReport. Now extracts MetricID, MetricProperty, MetricValue, and Timestamp from Redfish response instead of using generic values. Preserves backward compatibility with defensive fallbacks for missing fields. Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
Regenerate BMCStatus applyconfiguration to remove subscription link fields that were removed in earlier commits of this PR. Signed-off-by: Stefan Hipfel <stefan.hipfel@sap.com>
7ddbf85 to
9a81d7e
Compare
|
Rebased on latest main to fix codegen check. The branch was rebased on main (which includes PR #860 for applyconfiguration generation) and the generated applyconfiguration files have been updated to reflect the removal of subscription link fields from BMCStatus. All checks should now pass:
|
|
I'm not sure I agree with the decision to remove the existing Event Subscription - is there a reason? As far as I am concerned the implementation of it is correct and it is a more desirable way to get metrics than polling. I do acknowledge the problem that vendors are not supporting it very well, but that is a vendor problem and not ours. If a user has a server that implements it correctly we are effectively stopping them from using a better way and therefore I personally would keep it as a configurable option. The control over telemetry streaming is also going to be improved in redfish from looking at the work in progress document "Redfish Telemetry Streaming and Reporting Bundle" and the Open Compute Project Conference Talk |
maybe we discuss this briefly in our next meeting, to know how I should move forward |
Closes #813
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests
Chores
Documentation