
Conversation

@maorfr
Member

@maorfr maorfr commented Dec 7, 2025

testing #7896

Currently we run the API server and the background monitors in the same processes and pods. This makes it more difficult to understand the performance characteristics of those two things, and it also means that when the monitors need to be restarted, in-flight API requests have to be aborted. To improve that, this change moves the monitors to a different pod, while trying to minimize code changes:

  • A new START_MONITORS environment variable is added to control whether the monitors are started. It is set to false in the existing pods and to true in the new ones (see the sketch after this list).

  • The leader election used for the monitors is only started in the new pods, i.e., when START_MONITORS is true.

  • The leader election used for applying migrations is only started in the old pods, i.e., when START_MONITORS is false.
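
As a rough sketch of how such a flag is typically wired (the struct shape, the envconfig tag and the default value below are assumptions for illustration, not the literal code in this PR):

    package main

    import (
        "log"

        "github.com/kelseyhightower/envconfig"
    )

    // Sketch only: read a START_MONITORS flag from the environment and use it
    // to gate the monitor-related startup work.
    type options struct {
        StartMonitors bool `envconfig:"START_MONITORS" default:"true"`
    }

    func main() {
        var opts options
        if err := envconfig.Process("", &opts); err != nil {
            log.Fatal(err)
        }
        if opts.StartMonitors {
            // Monitor pods: start the monitors' leader election and the
            // cluster/host state monitors.
        } else {
            // API pods: start only the leader election used for migrations.
        }
    }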

The new pods are started by a new assisted-monitors deployment that is almost identical to the existing one; it even starts the API listener, but it doesn't have an Envoy sidecar and the API port isn't exposed.

In the future the logic of main.go should probably also be separated: one entry point for the API server and another for the monitors, but that will require much larger changes that we don't want to make now.

Related: https://issues.redhat.com/browse/MGMT-19704

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 7, 2025
@openshift-ci-robot

openshift-ci-robot commented Dec 7, 2025

@maorfr: This pull request references MGMT-19704 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this: (the PR description above, quoted in full)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2025
@coderabbitai

coderabbitai bot commented Dec 7, 2025

Walkthrough

Adds an Options.StartMonitors flag to conditionally start background monitors and alter leader-election/migration startup behavior; adds an assisted-monitors Deployment and ServiceMonitor plus new resource parameters and START_MONITORS env wiring in OpenShift templates.

Changes

  • Application entrypoint (cmd/main.go): Added StartMonitors bool to Options. Conditional startup: when true, create/start the monitor elector, register cluster/host monitors into backgroundThreads and wire the health middleware to it; when false, skip the monitor elector, initialize startupLeader differently and run autoMigrationWithLeader(startupLeader, db, log).

  • OpenShift monitoring manifest (openshift/template-monitoring.yaml): Added a new ServiceMonitor resource servicemonitor-assisted-monitors-${NAMESPACE} (monitoring.coreos.com/v1) with endpoints (interval 30s, path /metrics, port assisted-svc, scheme http), namespaceSelector ${NAMESPACE}, and selector app: assisted-monitor.

  • OpenShift template and deployment (openshift/template.yaml): Added parameters for monitors resource sizing: MONITORS_REPLICAS_COUNT, MONITORS_MEMORY_LIMIT, MONITORS_CPU_LIMIT, MONITORS_EPHEMERAL_STORAGE_LIMIT, MONITORS_MEMORY_REQUEST, MONITORS_CPU_REQUEST, MONITORS_EPHEMERAL_STORAGE_REQUEST. Introduced the assisted-monitors Deployment with resources, probes, env (including START_MONITORS), volumes and mounts; injected the START_MONITORS env into assisted-installer template sections.

  • Manifest file listing (go.mod): No substantive code changes; listed in manifest.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Inspect conditional startup paths and lifecycle management in cmd/main.go (monitor elector vs startupLeader, backgroundThreads registration, error handling).
  • Verify auto-migration invocation placement and correctness when StartMonitors=false.
  • Validate ServiceMonitor selector/namespace and scrape configuration in openshift/template-monitoring.yaml.
  • Review assisted-monitors Deployment resource requests/limits, probes, env consistency (START_MONITORS), volume/secret handling, and template parameter usage in openshift/template.yaml.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 22689de and 1457324.

📒 Files selected for processing (1)
  • cmd/main.go (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/main.go
🔇 Additional comments (5)
cmd/main.go (5)

188-188: LGTM! Clean feature flag addition.

The StartMonitors flag provides a clear mechanism to control monitor behavior, with a sensible default that maintains backward compatibility.


586-598: Previous nil pointer issue has been correctly resolved.

The conditional logic now properly initializes both lead and startupLeader in all code paths:

  • When StartMonitors=true: real leader elector for monitors, dummy for migrations
  • When StartMonitors=false: real leader elector for migrations, dummy for monitors

This cleanly separates the leader election responsibilities between monitor pods and API pods as intended.
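
A self-contained sketch of that split, using stand-in types (the real elector constructors, and what the project's DummyElector actually does, are assumptions here rather than code taken from the PR):

    package main

    import "fmt"

    // elector is a stand-in for the project's leader elector interface.
    type elector interface {
        Name() string
    }

    // leaseElector stands in for a real Kubernetes lease-based elector.
    type leaseElector struct{ role string }

    func (e leaseElector) Name() string { return "lease elector for " + e.role }

    // dummyElector stands in wherever no election is needed in a given pod.
    type dummyElector struct{}

    func (dummyElector) Name() string { return "dummy elector" }

    // wireElectors mirrors the branching described above: monitor pods run the
    // election for the monitors, API pods run the election only for migrations.
    func wireElectors(startMonitors bool) (lead, startupLeader elector) {
        if startMonitors {
            return leaseElector{role: "monitors"}, dummyElector{}
        }
        return dummyElector{}, leaseElector{role: "migrations"}
    }

    func main() {
        lead, startupLeader := wireElectors(false)
        fmt.Println(lead.Name(), "/", startupLeader.Name())
    }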


641-655: LGTM! Clean conditional monitor instantiation.

The state monitors are correctly created and started only when StartMonitors=true, achieving the PR objective of isolating these workloads into dedicated pods. The backgroundThreads collection clearly tracks only these monitors, maintaining a clean separation from other background workers.
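
To make that pattern concrete, here is a generic sketch of gating periodic monitor threads behind the flag and tracking them in a slice (the ticker-based monitor type below is a stand-in, not the repository's pkg/thread package):

    package main

    import (
        "fmt"
        "time"
    )

    // monitor is a stand-in for a periodically running background thread.
    type monitor struct {
        name     string
        interval time.Duration
        work     func()
    }

    func (m *monitor) start() {
        go func() {
            for range time.Tick(m.interval) {
                m.work()
            }
        }()
    }

    // startStateMonitors creates, starts and tracks the state monitors only
    // when the flag is set; API pods get back an empty slice.
    func startStateMonitors(startMonitors bool) []*monitor {
        var backgroundThreads []*monitor
        if !startMonitors {
            return backgroundThreads
        }
        for _, m := range []*monitor{
            {name: "cluster-state-monitor", interval: 30 * time.Second, work: func() { fmt.Println("cluster tick") }},
            {name: "host-state-monitor", interval: 30 * time.Second, work: func() { fmt.Println("host tick") }},
        } {
            m.start()
            backgroundThreads = append(backgroundThreads, m)
        }
        return backgroundThreads
    }

    func main() {
        fmt.Println("tracked threads:", len(startStateMonitors(false)))
    }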


791-792: LGTM! Health check correctly scoped to state monitors.

The health middleware now receives backgroundThreads, which contains only the state monitors (when running) or is empty (in API pods). This is consistent with the previous behavior of monitoring only the state monitors' health, not all background workers.


662-662: Verify leader elector handling in release syncer.

The RunOpenshiftReleaseSyncerIfNeeded function receives both lead and startupLeader, where one will always be a DummyElector depending on the StartMonitors flag. Ensure the function correctly handles this scenario—likely using startupLeader for initial synchronization and lead for ongoing background sync.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 7, 2025
@openshift-ci openshift-ci bot requested review from danielerez and eranco74 December 7, 2025 07:37
@openshift-ci

openshift-ci bot commented Dec 7, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: maorfr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2025
@maorfr maorfr force-pushed the separate_pod_for_cluster_monitor_2 branch from 9657937 to 91ceb5f on December 7, 2025 07:44
Currently we run the API server and the background monitors in the same
processes and pods. This makes it more difficult to understand the
performance characteristics of those two things, and it also means that
when the monitors need to be restarted, in-flight API requests have to
be aborted. To improve that, this change moves the monitors to a
different pod, while trying to minimize code changes:

- A new `START_MONITORS` environment variable is added to control
  whether the monitors are started. This is set to `false` in the
  existing pods, and to `true` in the new ones.

- The leader election used for the monitors is only started in the new
  pods, i.e., when `START_MONITORS` is `true`.

- The leader election used for applying migrations is only started in
  the old pods, i.e., when `START_MONITORS` is `false`.

The new pods are started by a new `assisted-monitors` deployment that
is almost identical to the existing one; it even starts the API
listener, but it doesn't have an Envoy sidecar and the API port isn't
exposed.

In the future the logic of `main.go` should probably also be separated:
one entry point for the API server and another for the monitors, but
that will require much larger changes that we don't want to make now.

Related: https://issues.redhat.com/browse/MGMT-19704
Signed-off-by: Juan Hernandez <[email protected]>
@maorfr maorfr force-pushed the separate_pod_for_cluster_monitor_2 branch from 91ceb5f to 22689de on December 7, 2025 07:48
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/main.go (1)

639-653: Consider adding clusterEventsUploader to health monitoring.

The clusterEventsUploader (line 634) is started unconditionally but not added to backgroundThreads, meaning its health is not monitored via the health endpoint. If this thread fails, the failure won't be detected by health checks, potentially affecting observability.

If health monitoring is desired for the events uploader, apply this diff:

+	backgroundThreads = append(backgroundThreads, clusterEventsUploader)
+
 	if Options.StartMonitors {
 		clusterStateMonitor := thread.New(
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9657937 and 22689de.

📒 Files selected for processing (1)
  • cmd/main.go (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/main.go
🧬 Code graph analysis (1)
cmd/main.go (2)
internal/cluster/cluster.go (1)
  • Config (147-158)
pkg/app/middleware.go (1)
  • WithHealthMiddleware (39-59)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Red Hat Konflux / assisted-service-rhel9-acm-ds-main-on-pull-request
  • GitHub Check: Red Hat Konflux / assisted-service-saas-main-on-pull-request
🔇 Additional comments (2)
cmd/main.go (2)

187-188: LGTM!

The StartMonitors field is correctly added with an appropriate default value that maintains backward compatibility.


789-790: Verify health check behavior with empty backgroundThreads.

When StartMonitors=false (API server pods), backgroundThreads will be empty. Based on the middleware implementation, an empty thread list means health checks will always return OK without validating any background processes. Confirm this is the intended behavior for separating API and monitor concerns.
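
Purely as an illustration of the vacuous case being asked about here (this is not the actual WithHealthMiddleware code from pkg/app/middleware.go):

    package main

    import "net/http"

    // healthy is a stand-in for whatever liveness check the real middleware
    // performs on each tracked background thread.
    type healthy interface {
        IsAlive() bool
    }

    // healthHandler fails if any tracked thread is unhealthy; with an empty
    // slice the loop body never runs, so the endpoint always answers 200 OK.
    func healthHandler(threads []healthy) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            for _, t := range threads {
                if !t.IsAlive() {
                    w.WriteHeader(http.StatusInternalServerError)
                    return
                }
            }
            w.WriteHeader(http.StatusOK) // vacuously healthy when nothing is tracked
        }
    }

    func main() {
        // API pods register no background threads, so this always reports OK.
        http.Handle("/health", healthHandler(nil))
        _ = http.ListenAndServe(":8080", nil)
    }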

@codecov

codecov bot commented Dec 7, 2025

Codecov Report

❌ Patch coverage is 0% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.48%. Comparing base (48fa982) to head (1457324).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
cmd/main.go 0.00% 26 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8492      +/-   ##
==========================================
- Coverage   43.49%   43.48%   -0.02%     
==========================================
  Files         411      411              
  Lines       71076    71085       +9     
==========================================
- Hits        30917    30913       -4     
- Misses      37406    37417      +11     
- Partials     2753     2755       +2     
Files with missing lines Coverage Δ
cmd/main.go 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes


@openshift-ci

openshift-ci bot commented Dec 7, 2025

@maorfr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/edge-subsystem-kubeapi-aws 1457324 link true /test edge-subsystem-kubeapi-aws
ci/prow/edge-subsystem-aws 1457324 link true /test edge-subsystem-aws
ci/prow/edge-e2e-metal-assisted-4-20 1457324 link true /test edge-e2e-metal-assisted-4-20
ci/prow/edge-e2e-ai-operator-ztp 1457324 link true /test edge-e2e-ai-operator-ztp

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
