@zaneb
Member

@zaneb zaneb commented Nov 4, 2025

Fix a problem where a partial failure of creating the Cluster object in the agent-based installer client could result in an inconsistent cluster config.

Because the client exits on failure and relies on systemd to restart it, it effectively operates like a distributed system. Cluster creation has three steps: creating the Cluster object, applying the install-config overrides, and adding each additional manifest. If any of these steps fails, it must be retried idempotently. This was not happening previously: after any failure past the first step there were no retries, because the new instance of the client would see that the Cluster exists and skip the remaining operations. As a result, we could progress to installing a cluster with only part of the user-supplied configuration applied.

With this change, we always either eventually apply the full config as provided or never progress.
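
The retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Cluster`, `fakeAPI`, `runOnce` and the failure flag are hypothetical stand-ins for the real client and the assisted-service API. The point it shows is that each run fetches or creates the cluster and then always attempts the two remaining steps, both of which are safe to repeat.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a hypothetical stand-in for the service's cluster model.
type Cluster struct {
	ID        string
	Overrides string
	Manifests map[string]string
}

// fakeAPI is a hypothetical in-memory substitute for the remote API.
type fakeAPI struct {
	cluster      *Cluster
	failManifest bool // simulates a failure during manifest registration
}

var errTransient = errors.New("transient API failure")

// getOrCreate returns the existing cluster, or creates one if none exists.
func (a *fakeAPI) getOrCreate() *Cluster {
	if a.cluster == nil {
		a.cluster = &Cluster{ID: "c1", Manifests: map[string]string{}}
	}
	return a.cluster
}

// applyOverrides is idempotent: it writes only when the value differs.
func (a *fakeAPI) applyOverrides(c *Cluster, want string) error {
	if c.Overrides != want {
		c.Overrides = want
	}
	return nil
}

// registerManifests is idempotent: identical manifests are left alone and
// a name clash with different content is reported as an error.
func (a *fakeAPI) registerManifests(c *Cluster, want map[string]string) error {
	if a.failManifest {
		return errTransient
	}
	for name, content := range want {
		if existing, ok := c.Manifests[name]; ok && existing != content {
			return fmt.Errorf("manifest %s exists with different content", name)
		}
		c.Manifests[name] = content
	}
	return nil
}

// runOnce models a single invocation of the client. It does not return
// early when the cluster already exists; every step runs on every attempt.
func runOnce(a *fakeAPI, overrides string, manifests map[string]string) error {
	c := a.getOrCreate()
	if err := a.applyOverrides(c, overrides); err != nil {
		return err
	}
	return a.registerManifests(c, manifests)
}

func main() {
	api := &fakeAPI{failManifest: true}
	manifests := map[string]string{"a.yaml": "kind: ConfigMap"}

	// First run fails part-way; the process would exit and be restarted.
	fmt.Println("first run:", runOnce(api, `{"fips":true}`, manifests))

	// The restarted run finds the existing cluster and finishes the job.
	api.failManifest = false
	fmt.Println("after restart:", runOnce(api, `{"fips":true}`, manifests))
}
```

In this sketch, running `runOnce` any number of additional times after success leaves the state unchanged, which is the property the restart-based retry relies on.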

List all the issues related to this PR

  • OCPBUGS-56913

  • New Feature

  • Enhancement

  • Bug fix

  • Tests

  • Documentation

  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 4, 2025
@openshift-ci-robot

@zaneb: This pull request references Jira Issue OCPBUGS-56913, which is invalid:

  • expected the bug to target either version "4.21." or "openshift-4.21.", but it targets "4.20" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 4, 2025
@openshift-ci openshift-ci bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 4, 2025
@coderabbitai

coderabbitai bot commented Nov 4, 2025

Walkthrough

Adds idempotent application of AgentClusterInstall installConfig overrides and idempotent registration of extra manifests; adjusts client flow to continue when an existing cluster is found; and expands unit and subsystem tests to cover these behaviors and error cases.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Client flow update: cmd/agentbasedinstaller/client/main.go | Continue processing when an existing cluster is found; assign existing or newly registered cluster to a local modelsCluster; call installConfig overrides application and extra-manifests registration; update logging and control flow. |
| Overrides & manifests logic: cmd/agentbasedinstaller/register.go | Add ApplyInstallConfigOverrides(ctx, log, bmInventory, cluster, agentClusterInstallPath) (*models.Cluster, error) to read AgentClusterInstall annotations, normalize JSON for semantic comparison, no-op if identical, and update via V2UpdateClusterInstallConfig when needed. Make RegisterExtraManifests idempotent by listing existing manifests, downloading content, comparing (normalized) content, skipping identical entries, and erroring on conflicts. Add normalizeJSON helper and related error handling and logging. |
| Unit tests for new behavior: cmd/agentbasedinstaller/register_test.go | Add comprehensive tests and mocks for ApplyInstallConfigOverrides and RegisterExtraManifests, covering valid/invalid overrides, idempotency, manifest equality/differences, and API error paths; introduce NewMockInstallConfigTransport and NewMockManifestTransport helpers and in-memory filesystem usage. |
| Subsystem tests: subsystem/agent_based_installer_client_test.go | Add test cases exercising retry/idempotent behavior for installConfig overrides and extra manifests across restarts and applying overrides to existing clusters; import os. |
| Test-suite transport case: cmd/agentbasedinstaller/agentbasedinstaller_suite_test.go | Extend mock transport Submit to handle manifests.V2ListClusterManifestsParams and return an empty manifests payload to support list paths in tests. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay attention to JSON normalization and comparison correctness in ApplyInstallConfigOverrides (including redaction/logging).
  • Verify idempotency and conflict detection in RegisterExtraManifests (listing, download, content map, create behavior).
  • Review new tests and mock transports for accurate simulation of API behaviors and edge cases.
  • Confirm client/main.go control-flow changes correctly propagate updated models.Cluster in all paths.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from andfasano and javipolo November 4, 2025 08:27
@openshift-ci

openshift-ci bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zaneb
Once this PR has been reviewed and has the lgtm label, please assign adriengentil for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zaneb
Member Author

zaneb commented Nov 4, 2025

/jira refresh
/cc @bfournie

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Nov 4, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
cmd/agentbasedinstaller/register.go (1)

152-156: Normalize overrides before comparing

Right now we compare the annotation string to cluster.InstallConfigOverrides verbatim. The API often normalizes JSON (e.g., compacts whitespace or reorders keys), so semantically identical overrides can compare unequal. On a restart, that forces us to re-run V2UpdateClusterInstallConfig every time, defeating the idempotency this PR is aiming for and causing unnecessary writes. Please canonicalize both values before comparing—e.g., unmarshal into map[string]any (or use json.Compact on both) and then compare—so we only post when the payload truly changes.
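
One way to do the canonicalization suggested here is sketched below. It is a minimal illustration, not the code from this PR: the bodies of `normalizeJSON` and the `sameJSON` wrapper are assumptions, as is treating an empty string as already normalized. It relies on the documented behaviour of `encoding/json` that `json.Marshal` emits map keys in sorted order with no insignificant whitespace.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// normalizeJSON parses s and re-encodes it, so that two documents that
// differ only in whitespace or object key order produce the same string.
// An empty input is returned unchanged (an assumption of this sketch).
func normalizeJSON(s string) (string, error) {
	if s == "" {
		return "", nil
	}
	var v any
	if err := json.Unmarshal([]byte(s), &v); err != nil {
		return "", fmt.Errorf("invalid JSON: %w", err)
	}
	out, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// sameJSON reports whether a and b are semantically identical JSON.
func sameJSON(a, b string) (bool, error) {
	na, err := normalizeJSON(a)
	if err != nil {
		return false, err
	}
	nb, err := normalizeJSON(b)
	if err != nil {
		return false, err
	}
	return na == nb, nil
}

func main() {
	same, err := sameJSON(
		`{"fips": true, "networking": {"networkType": "OVNKubernetes"}}`,
		`{"networking":{"networkType":"OVNKubernetes"},"fips":true}`,
	)
	fmt.Println(same, err)
}
```

Decoding into `any` turns every JSON number into a `float64`, so very large integers could lose precision in the round trip; that seems acceptable for a comparison helper but is worth keeping in mind.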

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between f0b43a5 and 2dffb8c.

📒 Files selected for processing (4)
  • cmd/agentbasedinstaller/client/main.go (2 hunks)
  • cmd/agentbasedinstaller/register.go (4 hunks)
  • cmd/agentbasedinstaller/register_test.go (1 hunks)
  • subsystem/agent_based_installer_client_test.go (2 hunks)

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 4, 2025
@openshift-ci-robot

@zaneb: This pull request references Jira Issue OCPBUGS-56913, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mhanss

@openshift-ci openshift-ci bot requested review from bfournie and mhanss November 4, 2025 08:34
@zaneb zaneb force-pushed the register-cluster-retry branch from 2dffb8c to 3ae2e77 Compare November 4, 2025 09:07
@openshift-ci-robot

@zaneb: This pull request references Jira Issue OCPBUGS-56913, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mhanss

@codecov

codecov bot commented Nov 4, 2025

Codecov Report

❌ Patch coverage is 71.11111% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.24%. Comparing base (3266b6d) to head (1e01d58).
⚠️ Report is 24 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/agentbasedinstaller/client/main.go | 0.00% | 18 Missing ⚠️ |
| cmd/agentbasedinstaller/register.go | 88.88% | 7 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8241      +/-   ##
==========================================
+ Coverage   43.22%   43.24%   +0.02%     
==========================================
  Files         404      404              
  Lines       69935    70371     +436     
==========================================
+ Hits        30226    30430     +204     
- Misses      37003    37220     +217     
- Partials     2706     2721      +15     

| Files with missing lines | Coverage Δ |
|---|---|
| cmd/agentbasedinstaller/register.go | 35.49% <88.88%> (+25.40%) ⬆️ |
| cmd/agentbasedinstaller/client/main.go | 0.00% <0.00%> (ø) |

... and 21 files with indirect coverage changes


zaneb added 3 commits November 5, 2025 00:51
Extract the installConfig override application logic into a separate,
reusable function ApplyInstallConfigOverrides(). This preserves existing
behavior where overrides are applied within RegisterCluster(), but makes
the logic testable and reusable.

Includes comprehensive unit tests covering:
- Applying overrides to cluster without overrides
- Idempotent behavior when overrides already applied
- Re-applying when overrides differ from manifest
- Error handling for API failures
- Handling clusters without override annotations
- Validation of manifest file errors
- Normalization of JSON with different whitespace
- Normalization of JSON with different key ordering
- Handling of empty strings
- Error handling for invalid JSON in new overrides
- Recovery from invalid JSON in existing cluster overrides
- Consistency of normalization output

Assisted-by: Claude Code

Make RegisterExtraManifests idempotent by checking for existing
manifests before attempting to create them. This prevents failures when
the registration process is retried (e.g., after a service restart).

Add comprehensive unit tests that verify:
- Creating new manifests when none exist
- Skipping manifests with identical content
- Returning error when content differs
- Full idempotency across multiple calls
- Proper error handling for API failures

This ensures safe retry of the registration process.

Assisted-by: Claude Code

Fix the bug where installConfig overrides and extra manifests are not
applied when the service restarts after finding an existing cluster.

Previously, the registerCluster() function would immediately return if
a cluster already existed, skipping the steps to apply installConfig
overrides and register extra manifests. This meant that if the service
crashed or was restarted after cluster registration but before these
steps completed, the configuration would be incomplete.

Now, registerCluster() unconditionally calls both
ApplyInstallConfigOverrides() and RegisterExtraManifests() after
obtaining the cluster (whether newly created or existing). Since both
functions are idempotent, this is safe to retry and ensures all
configuration steps complete successfully.

Add subsystem tests to verify:
- Retry of installConfig overrides on restart (idempotent)
- Application of missing overrides to existing cluster
- Retry of extra manifest registration (idempotent)

Assisted-by: Claude Code
@zaneb zaneb force-pushed the register-cluster-retry branch from 3ae2e77 to 1e01d58 Compare November 4, 2025 12:03

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
cmd/agentbasedinstaller/register_test.go (1)

33-33: Fix typo in variable name.

The variable aciWithOveride should be aciWithOverride (note the missing 'r' in "Override").

Apply this diff:

-	aciWithOveride string
+	aciWithOverride string

And:

-	aciWithOveride = `apiVersion: extensions.hive.openshift.io/v1beta1
+	aciWithOverride = `apiVersion: extensions.hive.openshift.io/v1beta1

Also update all references to this variable throughout the test file (lines 121, 138, 153, 169, 186, 242).

Also applies to: 62-62

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3ae2e77 and 1e01d58.

📒 Files selected for processing (5)
  • cmd/agentbasedinstaller/agentbasedinstaller_suite_test.go (1 hunks)
  • cmd/agentbasedinstaller/client/main.go (2 hunks)
  • cmd/agentbasedinstaller/register.go (5 hunks)
  • cmd/agentbasedinstaller/register_test.go (1 hunks)
  • subsystem/agent_based_installer_client_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • subsystem/agent_based_installer_client_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Red Hat Konflux / assisted-service-rhel9-acm-ds-main-on-pull-request
  • GitHub Check: Red Hat Konflux / assisted-service-saas-main-on-pull-request
🔇 Additional comments (13)
cmd/agentbasedinstaller/agentbasedinstaller_suite_test.go (1)

197-200: LGTM! Mock support for manifest listing.

The addition of the V2ListClusterManifestsParams case to the mock transport properly supports test scenarios that exercise manifest-related flows introduced in this PR.

cmd/agentbasedinstaller/client/main.go (3)

38-38: LGTM! Required import for cluster model.


126-142: LGTM! Critical fix for incomplete cluster registration.

The refactored control flow ensures that installConfig overrides and extra manifests are applied even when restarting after a partial failure. This directly addresses OCPBUGS-56913 where the client would exit early upon finding an existing cluster, leaving configuration incomplete.


144-162: LGTM! Idempotent application of overrides and manifests.

The sequential application of installConfig overrides and extra manifest registration ensures complete cluster configuration. Error handling is appropriate, and the updated cluster state is correctly propagated.

cmd/agentbasedinstaller/register_test.go (4)

118-200: LGTM! Comprehensive test coverage for installConfig overrides.

The test cases thoroughly validate idempotent behavior, error handling, and various override scenarios. The tests ensure that overrides are only applied when necessary and that errors are properly propagated.


202-329: LGTM! Robust edge case coverage and well-designed mock.

The tests cover important edge cases including missing overrides, invalid YAML, and invalid JSON in both existing and new configurations. The mock transport provides clean error injection for testing failure scenarios.


331-384: LGTM! Thorough validation of JSON normalization.

The tests ensure that the normalizeJSON helper correctly handles whitespace differences, key ordering, empty strings, and invalid JSON. This is critical for the idempotent behavior of installConfig override application.


386-667: LGTM! Excellent test coverage for manifest idempotency.

The tests comprehensively validate the idempotent behavior of RegisterExtraManifests, including:

  • Creating manifests when none exist
  • Skipping manifests with matching content
  • Erroring when content differs
  • Handling API failures during list, download, and create operations

The mock transport properly simulates the manifest lifecycle with realistic response handling.

cmd/agentbasedinstaller/register.go (5)

4-4: LGTM! Required imports for new functionality.

The bytes and encoding/json imports support the manifest content comparison and JSON normalization features.

Also applies to: 7-7


125-135: LGTM! Proper integration of installConfig override application.

The call to ApplyInstallConfigOverrides is correctly positioned after cluster registration, with appropriate error handling and state propagation.


137-198: LGTM! Well-designed idempotent override application.

The function correctly implements idempotent behavior by:

  • Using JSON normalization to detect semantic equivalence
  • Gracefully handling invalid JSON in existing cluster state
  • Properly validating new overrides before applying
  • Redacting sensitive information from logs
  • Refetching cluster state to ensure accurate return value

The error handling and logging are appropriate throughout.


264-328: LGTM! Correct idempotent manifest registration.

The enhanced implementation properly ensures idempotency by:

  • Listing existing manifests before attempting creation
  • Downloading and comparing content to detect matches
  • Skipping creation when content is identical
  • Erroring when content differs (preventing silent overwrite of user modifications)

The approach of downloading all existing manifests could be inefficient with a large number of manifests, but it's the only reliable way to ensure idempotency through content comparison.
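
The list, download, compare and create sequence described in this comment might look roughly like the sketch below. It is not the PR's implementation: `manifestStore` and its methods are hypothetical stand-ins for the manifest API calls, and content is compared byte-for-byte here rather than after any normalization.

```go
package main

import (
	"bytes"
	"fmt"
)

// manifestStore is a hypothetical in-memory stand-in for the service's
// list/download/create manifest operations.
type manifestStore struct {
	files   map[string][]byte
	creates int // number of create calls, to make idempotency observable
}

func (s *manifestStore) list() []string {
	names := make([]string, 0, len(s.files))
	for name := range s.files {
		names = append(names, name)
	}
	return names
}

func (s *manifestStore) download(name string) []byte { return s.files[name] }

func (s *manifestStore) create(name string, content []byte) {
	s.files[name] = content
	s.creates++
}

// registerManifests creates only the manifests that are missing. A manifest
// that already exists with identical content is skipped; one that exists
// with different content is an error, so nothing is silently overwritten.
func registerManifests(s *manifestStore, desired map[string][]byte) error {
	existing := make(map[string][]byte)
	for _, name := range s.list() {
		existing[name] = s.download(name)
	}
	for name, content := range desired {
		if current, ok := existing[name]; ok {
			if bytes.Equal(current, content) {
				continue // already registered, nothing to do
			}
			return fmt.Errorf("manifest %q already exists with different content", name)
		}
		s.create(name, content)
	}
	return nil
}

func main() {
	store := &manifestStore{files: map[string][]byte{}}
	desired := map[string][]byte{"extra.yaml": []byte("kind: ConfigMap\n")}

	fmt.Println(registerManifests(store, desired), store.creates)
	// A second call, as after a restart, creates nothing new.
	fmt.Println(registerManifests(store, desired), store.creates)

	// Same name with different content is reported as a conflict.
	conflict := map[string][]byte{"extra.yaml": []byte("kind: Secret\n")}
	fmt.Println(registerManifests(store, conflict))
}
```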


454-473: LGTM! Clean JSON normalization helper.

The function correctly normalizes JSON strings for semantic comparison by:

  • Handling empty strings appropriately
  • Parsing and re-marshaling to standardize formatting
  • Propagating errors for invalid JSON

This enables reliable detection of semantically identical JSON despite formatting differences.

@openshift-ci

openshift-ci bot commented Nov 4, 2025

@zaneb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-subsystem-kubeapi-aws | 1e01d58 | link | true | /test edge-subsystem-kubeapi-aws |
| ci/prow/okd-scos-e2e-aws-ovn | 1e01d58 | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

2 participants