-
Notifications
You must be signed in to change notification settings - Fork 244
OCPBUGS-56913: Retry incomplete cluster registration in ABI #8241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@zaneb: This pull request references Jira Issue OCPBUGS-56913, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughAdds idempotent application of AgentClusterInstall installConfig overrides and idempotent registration of extra manifests; adjusts client flow to continue when an existing cluster is found; and expands unit and subsystem tests to cover these behaviors and error cases. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
✨ Finishing touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: zaneb The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/jira refresh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cmd/agentbasedinstaller/register.go (1)
152-156: Normalize overrides before comparingRight now we compare the annotation string to
cluster.InstallConfigOverridesverbatim. The API often normalizes JSON (e.g., compacts whitespace or reorders keys), so semantically identical overrides can compare unequal. On a restart, that forces us to re-runV2UpdateClusterInstallConfigevery time, defeating the idempotency this PR is aiming for and causing unnecessary writes. Please canonicalize both values before comparing—e.g., unmarshal intomap[string]any(or usejson.Compacton both) and then compare—so we only post when the payload truly changes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
cmd/agentbasedinstaller/client/main.go(2 hunks)cmd/agentbasedinstaller/register.go(4 hunks)cmd/agentbasedinstaller/register_test.go(1 hunks)subsystem/agent_based_installer_client_test.go(2 hunks)
|
@zaneb: This pull request references Jira Issue OCPBUGS-56913, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
2dffb8c to
3ae2e77
Compare
|
@zaneb: This pull request references Jira Issue OCPBUGS-56913, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8241 +/- ##
==========================================
+ Coverage 43.22% 43.24% +0.02%
==========================================
Files 404 404
Lines 69935 70371 +436
==========================================
+ Hits 30226 30430 +204
- Misses 37003 37220 +217
- Partials 2706 2721 +15
🚀 New features to boost your workflow:
|
Extract the installConfig override application logic into a separate, reusable function ApplyInstallConfigOverrides(). This preserves existing behavior where overrides are applied within RegisterCluster(), but makes the logic testable and reusable. Includes comprehensive unit tests covering: - Applying overrides to cluster without overrides - Idempotent behavior when overrides already applied - Re-applying when overrides differ from manifest - Error handling for API failures - Handling clusters without override annotations - Validation of manifest file errors - Normalization of JSON with different whitespace - Normalization of JSON with different key ordering - Handling of empty strings - Error handling for invalid JSON in new overrides - Recovery from invalid JSON in existing cluster overrides - Consistency of normalization output Assisted-by: Claude Code
Make RegisterExtraManifests idempotent by checking for existing manifests before attempting to create them. This prevents failures when the registration process is retried (e.g., after a service restart). Add comprehensive unit tests that verify: - Creating new manifests when none exist - Skipping manifests with identical content - Returning error when content differs - Full idempotency across multiple calls - Proper error handling for API failures This ensures safe retry of the registration process. Assisted-by: Claude Code
Fix the bug where installConfig overrides and extra manifests are not applied when the service restarts after finding an existing cluster. Previously, the registerCluster() function would immediately return if a cluster already existed, skipping the steps to apply installConfig overrides and register extra manifests. This meant that if the service crashed or was restarted after cluster registration but before these steps completed, the configuration would be incomplete. Now, registerCluster() unconditionally calls both ApplyInstallConfigOverrides() and RegisterExtraManifests() after obtaining the cluster (whether newly created or existing). Since both functions are idempotent, this is safe to retry and ensures all configuration steps complete successfully. Add subsystem tests to verify: - Retry of installConfig overrides on restart (idempotent) - Application of missing overrides to existing cluster - Retry of extra manifest registration (idempotent) Assisted-by: Claude Code
3ae2e77 to
1e01d58
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cmd/agentbasedinstaller/register_test.go (1)
33-33: Fix typo in variable name.The variable
aciWithOverideshould beaciWithOverride(note the missing 'r' in "Override").Apply this diff:
- aciWithOveride string + aciWithOverride stringAnd:
- aciWithOveride = `apiVersion: extensions.hive.openshift.io/v1beta1 + aciWithOverride = `apiVersion: extensions.hive.openshift.io/v1beta1Also update all references to this variable throughout the test file (lines 121, 138, 153, 169, 186, 242).
Also applies to: 62-62
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
cmd/agentbasedinstaller/agentbasedinstaller_suite_test.go(1 hunks)cmd/agentbasedinstaller/client/main.go(2 hunks)cmd/agentbasedinstaller/register.go(5 hunks)cmd/agentbasedinstaller/register_test.go(1 hunks)subsystem/agent_based_installer_client_test.go(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- subsystem/agent_based_installer_client_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Red Hat Konflux / assisted-service-rhel9-acm-ds-main-on-pull-request
- GitHub Check: Red Hat Konflux / assisted-service-saas-main-on-pull-request
🔇 Additional comments (13)
cmd/agentbasedinstaller/agentbasedinstaller_suite_test.go (1)
197-200: LGTM! Mock support for manifest listing.The addition of the V2ListClusterManifestsParams case to the mock transport properly supports test scenarios that exercise manifest-related flows introduced in this PR.
cmd/agentbasedinstaller/client/main.go (3)
38-38: LGTM! Required import for cluster model.
126-142: LGTM! Critical fix for incomplete cluster registration.The refactored control flow ensures that installConfig overrides and extra manifests are applied even when restarting after a partial failure. This directly addresses OCPBUGS-56913 where the client would exit early upon finding an existing cluster, leaving configuration incomplete.
144-162: LGTM! Idempotent application of overrides and manifests.The sequential application of installConfig overrides and extra manifest registration ensures complete cluster configuration. Error handling is appropriate, and the updated cluster state is correctly propagated.
cmd/agentbasedinstaller/register_test.go (4)
118-200: LGTM! Comprehensive test coverage for installConfig overrides.The test cases thoroughly validate idempotent behavior, error handling, and various override scenarios. The tests ensure that overrides are only applied when necessary and that errors are properly propagated.
202-329: LGTM! Robust edge case coverage and well-designed mock.The tests cover important edge cases including missing overrides, invalid YAML, and invalid JSON in both existing and new configurations. The mock transport provides clean error injection for testing failure scenarios.
331-384: LGTM! Thorough validation of JSON normalization.The tests ensure that the normalizeJSON helper correctly handles whitespace differences, key ordering, empty strings, and invalid JSON. This is critical for the idempotent behavior of installConfig override application.
386-667: LGTM! Excellent test coverage for manifest idempotency.The tests comprehensively validate the idempotent behavior of RegisterExtraManifests, including:
- Creating manifests when none exist
- Skipping manifests with matching content
- Erroring when content differs
- Handling API failures during list, download, and create operations
The mock transport properly simulates the manifest lifecycle with realistic response handling.
cmd/agentbasedinstaller/register.go (5)
4-4: LGTM! Required imports for new functionality.The
bytesandencoding/jsonimports support the manifest content comparison and JSON normalization features.Also applies to: 7-7
125-135: LGTM! Proper integration of installConfig override application.The call to ApplyInstallConfigOverrides is correctly positioned after cluster registration, with appropriate error handling and state propagation.
137-198: LGTM! Well-designed idempotent override application.The function correctly implements idempotent behavior by:
- Using JSON normalization to detect semantic equivalence
- Gracefully handling invalid JSON in existing cluster state
- Properly validating new overrides before applying
- Redacting sensitive information from logs
- Refetching cluster state to ensure accurate return value
The error handling and logging are appropriate throughout.
264-328: LGTM! Correct idempotent manifest registration.The enhanced implementation properly ensures idempotency by:
- Listing existing manifests before attempting creation
- Downloading and comparing content to detect matches
- Skipping creation when content is identical
- Erroring when content differs (preventing silent overwrite of user modifications)
The approach of downloading all existing manifests could be inefficient with a large number of manifests, but it's the only reliable way to ensure idempotency through content comparison.
454-473: LGTM! Clean JSON normalization helper.The function correctly normalizes JSON strings for semantic comparison by:
- Handling empty strings appropriately
- Parsing and re-marshaling to standardize formatting
- Propagating errors for invalid JSON
This enables reliable detection of semantically identical JSON despite formatting differences.
|
@zaneb: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Fix a problem where a partial failure of creating the Cluster object in the agent-based installer client could result in an inconsistent cluster config.
Because the client exits on failure and relies on systemd to restart it, it effectively operates like a distributed system. Since the cluster creation has 3 steps - creating the Cluster object, applying the install-config overrides, and adding each additional manifest - we must retry idempotently if any of these steps fail. This was not happening previously: any failure after the first step would result in no retries, as the new instance of the client would see that the Cluster exists and not continue with the other operations. This could result in us progressing to install a cluster with only part of the configuration supplied by the user applied.
This change fixes that so that we always either eventually apply the full config as provided or never progress.
List all the issues related to this PR
OCPBUGS-56913
New Feature
Enhancement
Bug fix
Tests
Documentation
CI/CD
What environments does this code impact?
How was this code tested?
Checklist
docs, README, etc)Reviewers Checklist