
Conversation

@maorfr
Member

@maorfr maorfr commented Dec 7, 2025

testing #7896

Currently we run the API server and the background monitors in the same processes and pods. This makes it more difficult to understand the performance characteristics of those two things, and it also means that when the monitors need to be restarted, in-flight API requests have to be aborted. To improve that, this change moves the monitors to a different pod, while trying to minimize code changes:

  • A new START_MONITORS environment variable is added to control whether the monitors are started. It is set to false in the existing pods and to true in the new ones (see the sketch after this list).

  • The leader election used for the monitors is only started in the new pods, i.e., when START_MONITORS is true.

  • The leader election used for applying migrations is only started in the old pods, i.e., when START_MONITORS is false.
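
As a rough sketch of how such a flag is typically wired (the struct shape, the envconfig tag and the default value below are assumptions for illustration, not the literal code in this PR):

    package main

    import (
        "log"

        "github.com/kelseyhightower/envconfig"
    )

    // Sketch only: read a START_MONITORS flag from the environment and use it
    // to gate the monitor-related startup work.
    type options struct {
        StartMonitors bool `envconfig:"START_MONITORS" default:"true"`
    }

    func main() {
        var opts options
        if err := envconfig.Process("", &opts); err != nil {
            log.Fatal(err)
        }
        if opts.StartMonitors {
            // Monitor pods: start the monitors' leader election and the
            // cluster/host state monitors.
        } else {
            // API pods: start only the leader election used for migrations.
        }
    }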

The new pods are started by a new assisted-monitors deployment that is almost identical to the existing one; it even starts the API listener, but it doesn't have an Envoy sidecar and the API port isn't exposed.

In the future the logic of main.go should probably also be separated: one entry point for the API server and another for the monitors, but that will require much larger changes that we don't want to make now.

Related: https://issues.redhat.com/browse/MGMT-19704

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 7, 2025
@openshift-ci-robot

openshift-ci-robot commented Dec 7, 2025

@maorfr: This pull request references MGMT-19704 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this: (the PR description above, quoted in full)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2025
@coderabbitai

coderabbitai bot commented Dec 7, 2025

Walkthrough

Adds an Options.StartMonitors flag to conditionally start background monitors and alter leader-election/migration startup behavior; adds an assisted-monitors Deployment and ServiceMonitor plus new resource parameters and START_MONITORS env wiring in OpenShift templates.

Changes

  • Application entrypoint (cmd/main.go): Added StartMonitors bool to Options. Conditional startup: when true, create/start the monitor elector, register cluster/host monitors into backgroundThreads and wire the health middleware to it; when false, skip the monitor elector, initialize startupLeader differently and run autoMigrationWithLeader(startupLeader, db, log).

  • OpenShift monitoring manifest (openshift/template-monitoring.yaml): Added a new ServiceMonitor resource servicemonitor-assisted-monitors-${NAMESPACE} (monitoring.coreos.com/v1) with endpoints (interval 30s, path /metrics, port assisted-svc, scheme http), namespaceSelector ${NAMESPACE}, and selector app: assisted-monitor.

  • OpenShift template and deployment (openshift/template.yaml): Added parameters for monitors resource sizing: MONITORS_REPLICAS_COUNT, MONITORS_MEMORY_LIMIT, MONITORS_CPU_LIMIT, MONITORS_EPHEMERAL_STORAGE_LIMIT, MONITORS_MEMORY_REQUEST, MONITORS_CPU_REQUEST, MONITORS_EPHEMERAL_STORAGE_REQUEST. Introduced the assisted-monitors Deployment with resources, probes, env (including START_MONITORS), volumes and mounts; injected the START_MONITORS env into assisted-installer template sections.

  • Manifest file listing (go.mod): No substantive code changes; listed in manifest.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Inspect conditional startup paths and lifecycle management in cmd/main.go (monitor elector vs startupLeader, backgroundThreads registration, error handling).
  • Verify auto-migration invocation placement and correctness when StartMonitors=false.
  • Validate ServiceMonitor selector/namespace and scrape configuration in openshift/template-monitoring.yaml.
  • Review assisted-monitors Deployment resource requests/limits, probes, env consistency (START_MONITORS), volume/secret handling, and template parameter usage in openshift/template.yaml.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 22689de and 1457324.

📒 Files selected for processing (1)
  • cmd/main.go (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/main.go
🔇 Additional comments (5)
cmd/main.go (5)

188-188: LGTM! Clean feature flag addition.

The StartMonitors flag provides a clear mechanism to control monitor behavior, with a sensible default that maintains backward compatibility.


586-598: Previous nil pointer issue has been correctly resolved.

The conditional logic now properly initializes both lead and startupLeader in all code paths:

  • When StartMonitors=true: real leader elector for monitors, dummy for migrations
  • When StartMonitors=false: real leader elector for migrations, dummy for monitors

This cleanly separates the leader election responsibilities between monitor pods and API pods as intended.
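
A self-contained sketch of that split, using stand-in types (the real elector constructors, and what the project's DummyElector actually does, are assumptions here rather than code taken from the PR):

    package main

    import "fmt"

    // elector is a stand-in for the project's leader elector interface.
    type elector interface {
        Name() string
    }

    // leaseElector stands in for a real Kubernetes lease-based elector.
    type leaseElector struct{ role string }

    func (e leaseElector) Name() string { return "lease elector for " + e.role }

    // dummyElector stands in wherever no election is needed in a given pod.
    type dummyElector struct{}

    func (dummyElector) Name() string { return "dummy elector" }

    // wireElectors mirrors the branching described above: monitor pods run the
    // election for the monitors, API pods run the election only for migrations.
    func wireElectors(startMonitors bool) (lead, startupLeader elector) {
        if startMonitors {
            return leaseElector{role: "monitors"}, dummyElector{}
        }
        return dummyElector{}, leaseElector{role: "migrations"}
    }

    func main() {
        lead, startupLeader := wireElectors(false)
        fmt.Println(lead.Name(), "/", startupLeader.Name())
    }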


641-655: LGTM! Clean conditional monitor instantiation.

The state monitors are correctly created and started only when StartMonitors=true, achieving the PR objective of isolating these workloads into dedicated pods. The backgroundThreads collection clearly tracks only these monitors, maintaining a clean separation from other background workers.
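
To make that pattern concrete, here is a generic sketch of gating periodic monitor threads behind the flag and tracking them in a slice (the ticker-based monitor type below is a stand-in, not the repository's pkg/thread package):

    package main

    import (
        "fmt"
        "time"
    )

    // monitor is a stand-in for a periodically running background thread.
    type monitor struct {
        name     string
        interval time.Duration
        work     func()
    }

    func (m *monitor) start() {
        go func() {
            for range time.Tick(m.interval) {
                m.work()
            }
        }()
    }

    // startStateMonitors creates, starts and tracks the state monitors only
    // when the flag is set; API pods get back an empty slice.
    func startStateMonitors(startMonitors bool) []*monitor {
        var backgroundThreads []*monitor
        if !startMonitors {
            return backgroundThreads
        }
        for _, m := range []*monitor{
            {name: "cluster-state-monitor", interval: 30 * time.Second, work: func() { fmt.Println("cluster tick") }},
            {name: "host-state-monitor", interval: 30 * time.Second, work: func() { fmt.Println("host tick") }},
        } {
            m.start()
            backgroundThreads = append(backgroundThreads, m)
        }
        return backgroundThreads
    }

    func main() {
        fmt.Println("tracked threads:", len(startStateMonitors(false)))
    }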


791-792: LGTM! Health check correctly scoped to state monitors.

The health middleware now receives backgroundThreads, which contains only the state monitors (when running) or is empty (in API pods). This is consistent with the previous behavior of monitoring only the state monitors' health, not all background workers.


662-662: Verify leader elector handling in release syncer.

The RunOpenshiftReleaseSyncerIfNeeded function receives both lead and startupLeader, where one will always be a DummyElector depending on the StartMonitors flag. Ensure the function correctly handles this scenario—likely using startupLeader for initial synchronization and lead for ongoing background sync.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 7, 2025
@openshift-ci openshift-ci bot requested review from danielerez and eranco74 December 7, 2025 07:37
@openshift-ci

openshift-ci bot commented Dec 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maorfr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2025
@maorfr maorfr force-pushed the separate_pod_for_cluster_monitor_2 branch from 9657937 to 91ceb5f on December 7, 2025 07:44
Currently we run the API server and the background monitors in the same
processes and pods. This makes it more difficult to understand the
performance characteristics of those two things, and it also means that
when the monitors need to be restarted, in-flight API requests have to
be aborted. To improve that, this change moves the monitors to a
different pod, while trying to minimize code changes:

- A new `START_MONITORS` environment variable is added to control
  whether the monitors are started. This is set to `false` in the
  existing pods, and to `true` in the new ones.

- The leader election used for the monitors is only started in the new
  pods, i.e., when `START_MONITORS` is `true`.

- The leader election used for applying migrations is only started in
  the old pods, i.e., when `START_MONITORS` is `false`.

The new pods are started by a new `assisted-monitors` deployment that
is almost identical to the existing one; it even starts the API
listener, but it doesn't have an Envoy sidecar and the API port isn't
exposed.

In the future the logic of `main.go` should probably also be separated:
one entry point for the API server and another for the monitors, but
that will require much larger changes that we don't want to make now.

Related: https://issues.redhat.com/browse/MGMT-19704
Signed-off-by: Juan Hernandez <[email protected]>
@maorfr maorfr force-pushed the separate_pod_for_cluster_monitor_2 branch from 91ceb5f to 22689de on December 7, 2025 07:48
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/main.go (1)

639-653: Consider adding clusterEventsUploader to health monitoring.

The clusterEventsUploader (line 634) is started unconditionally but not added to backgroundThreads, meaning its health is not monitored via the health endpoint. If this thread fails, the failure won't be detected by health checks, potentially affecting observability.

If health monitoring is desired for the events uploader, apply this diff:

+	backgroundThreads = append(backgroundThreads, clusterEventsUploader)
+
 	if Options.StartMonitors {
 		clusterStateMonitor := thread.New(
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9657937 and 22689de.

📒 Files selected for processing (1)
  • cmd/main.go (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/main.go
🧬 Code graph analysis (1)
cmd/main.go (2)
internal/cluster/cluster.go (1)
  • Config (147-158)
pkg/app/middleware.go (1)
  • WithHealthMiddleware (39-59)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Red Hat Konflux / assisted-service-rhel9-acm-ds-main-on-pull-request
  • GitHub Check: Red Hat Konflux / assisted-service-saas-main-on-pull-request
🔇 Additional comments (2)
cmd/main.go (2)

187-188: LGTM!

The StartMonitors field is correctly added with an appropriate default value that maintains backward compatibility.


789-790: Verify health check behavior with empty backgroundThreads.

When StartMonitors=false (API server pods), backgroundThreads will be empty. Based on the middleware implementation, an empty thread list means health checks will always return OK without validating any background processes. Confirm this is the intended behavior for separating API and monitor concerns.
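
Purely as an illustration of the vacuous case being asked about here (this is not the actual WithHealthMiddleware code from pkg/app/middleware.go):

    package main

    import "net/http"

    // healthy is a stand-in for whatever liveness check the real middleware
    // performs on each tracked background thread.
    type healthy interface {
        IsAlive() bool
    }

    // healthHandler fails if any tracked thread is unhealthy; with an empty
    // slice the loop body never runs, so the endpoint always answers 200 OK.
    func healthHandler(threads []healthy) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            for _, t := range threads {
                if !t.IsAlive() {
                    w.WriteHeader(http.StatusInternalServerError)
                    return
                }
            }
            w.WriteHeader(http.StatusOK) // vacuously healthy when nothing is tracked
        }
    }

    func main() {
        // API pods register no background threads, so this always reports OK.
        http.Handle("/health", healthHandler(nil))
        _ = http.ListenAndServe(":8080", nil)
    }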

@codecov

codecov bot commented Dec 7, 2025

Codecov Report

❌ Patch coverage is 0% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.48%. Comparing base (48fa982) to head (1457324).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
cmd/main.go 0.00% 26 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8492      +/-   ##
==========================================
- Coverage   43.49%   43.48%   -0.02%     
==========================================
  Files         411      411              
  Lines       71076    71085       +9     
==========================================
- Hits        30917    30913       -4     
- Misses      37406    37417      +11     
- Partials     2753     2755       +2     
Files with missing lines Coverage Δ
cmd/main.go 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes


@openshift-ci

openshift-ci bot commented Dec 7, 2025

@maorfr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/edge-subsystem-kubeapi-aws 1457324 link true /test edge-subsystem-kubeapi-aws
ci/prow/edge-subsystem-aws 1457324 link true /test edge-subsystem-aws
ci/prow/edge-e2e-metal-assisted-4-20 1457324 link true /test edge-e2e-metal-assisted-4-20
ci/prow/edge-e2e-ai-operator-ztp 1457324 link true /test edge-e2e-ai-operator-ztp

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
