[subcluster-placement-validation] Added code to validate more than 1 partial subclusters #4391
What this PR does / why we need it:
This PR adds a validation that rejects subcluster-enabled placements containing more than one partial subcluster. Scale-up and scale-down operations create at most one partial subcluster, but manually editing the placement can still produce more than one. This validation prevents that. Details of each validation and test scenario are given below:
1. New Unit Tests Added
File: src/cluster/placement/placement_test.go
Added new test cases to TestValidateSubclusteredPlacementEdgeCases covering all validation paths and edge cases.
Validation Logic
The validateSubclusteredPlacement function enforces several critical invariants for subclustered placements.
Validation Flow Diagram
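Key Definitions
- Full subcluster: a subcluster with exactly instancesPerSubCluster instances
- Partial subcluster: a subcluster with fewer than instancesPerSubCluster instances
- Leaving instance: an instance whose shards are all in Leaving state
- Leaving shard: a shard in state Leaving (being moved away from instance)
Detailed Test Case Documentation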
✅ Valid Placement Scenarios
Basic Valid Placements
- valid subclustered placement - single subcluster
- valid subclustered placement - multiple subclusters
- empty placement
- single instance placement
Leaving Instance/Shard Handling
- valid subclustered placement with leaving instances
- shard with leaving state ignored in validation
- all instances in subcluster are leaving
Partial Subcluster Scenarios
- incomplete subcluster - should not fail validation
- one full and one partial subcluster is valid
Boundary Conditions
- multiple isolation groups per shard
- instancesPerSubcluster equals 1 with single instance
✅ Valid Shard Movement Scenarios (Scale Up/Down)
- shards in transitionary state while moving to another subcluster
- valid shard movement from full to partial subcluster
- valid shard movement from partial to full subcluster
Key Insight: During scale operations, a shard can temporarily exist in two subclusters: one giving (with the shard in Leaving state) and one receiving (with the shard in Initializing state). This is valid only if at least one of the subclusters is partial.
❌ Invalid Placement Scenarios
Subcluster Instance Count Violations
- subcluster with more instances than instancesPerSubcluster: "invalid subcluster %d, expected at most %d instances, actual %d"
Validation code: lines 479-483.
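The instance-count rule can be illustrated with a minimal sketch. It is not the code at the lines referenced above: the `instance` type and the `checkSubclusterSizes` helper are hypothetical names introduced for illustration, and leaving-instance handling is deliberately left out. Only the error format is taken from the test expectations.

```go
package main

import "fmt"

// instance is a hypothetical stand-in for a placement instance; only the
// subcluster ID matters for this check.
type instance struct {
	id           string
	subClusterID uint32
}

// checkSubclusterSizes counts instances per subcluster and returns an error
// if any subcluster holds more than instancesPerSubCluster instances.
// On success it returns the per-subcluster instance counts.
func checkSubclusterSizes(instances []instance, instancesPerSubCluster int) (map[uint32]int, error) {
	counts := make(map[uint32]int)
	for _, inst := range instances {
		counts[inst.subClusterID]++
	}
	for id, n := range counts {
		if n > instancesPerSubCluster {
			return nil, fmt.Errorf(
				"invalid subcluster %d, expected at most %d instances, actual %d",
				id, instancesPerSubCluster, n)
		}
	}
	return counts, nil
}

func main() {
	// Four instances in subcluster 1 with instancesPerSubCluster = 3.
	instances := []instance{{"i1", 1}, {"i2", 1}, {"i3", 1}, {"i4", 1}}
	_, err := checkSubclusterSizes(instances, 3)
	fmt.Println(err)
	// prints: invalid subcluster 1, expected at most 3 instances, actual 4
}
```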
Multiple Partial Subclusters
- more than one partial subcluster: "invalid placement, more than one partial subcluster found: 2"
- three partial subclusters: "invalid placement, more than one partial subcluster found: 3"
Validation code: lines 489-491.
Rationale: At most one partial subcluster is allowed at any time. This ensures controlled scale-up/down operations where only one subcluster is being built up or torn down.
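As a rough sketch of this rule (not the PR's actual code), a check could count the subclusters that hold fewer than instancesPerSubCluster instances and reject the placement when that count exceeds one. The `checkPartialSubclusters` helper and its map-of-sizes input are assumptions made for illustration.

```go
package main

import "fmt"

// checkPartialSubclusters takes the number of instances in each subcluster
// (keyed by subcluster ID; a hypothetical input shape) and returns an error
// if more than one subcluster is partial, i.e. has fewer than
// instancesPerSubCluster instances.
func checkPartialSubclusters(sizes map[uint32]int, instancesPerSubCluster int) error {
	partial := 0
	for _, n := range sizes {
		if n < instancesPerSubCluster {
			partial++
		}
	}
	if partial > 1 {
		return fmt.Errorf(
			"invalid placement, more than one partial subcluster found: %d", partial)
	}
	return nil
}

func main() {
	// One full and one partial subcluster: accepted.
	fmt.Println(checkPartialSubclusters(map[uint32]int{1: 3, 2: 2}, 3))
	// One full and two partial subclusters: rejected.
	fmt.Println(checkPartialSubclusters(map[uint32]int{1: 3, 2: 2, 3: 1}, 3))
}
```

The first call prints `<nil>`; the second prints the "more than one partial subcluster found: 2" error.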
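Shard Distribution Violations
- shards in transitionary state - belongs to > 2 subclusters: "invalid shard %d, expected at most 2 subclusters (only during shard movement), actual %d"
- shards are shared among multiple complete subclusters: "invalid shard %d, expected subcluster id %d, actual %d"
Validation code: lines 499-520.
Rationale: A shard may span two subclusters only while it is moving, and only if at least one of them is partial; a shard shared between complete subclusters, or spread over more than two subclusters, is rejected.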
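The shard distribution rule can be sketched as follows. This is an illustration under stated assumptions, not the code at the referenced lines: `checkShardSubclusters`, its inputs, and the choice of which subcluster IDs fill the "expected" and "actual" slots of the second error are all hypothetical. The sketch also assumes the subcluster IDs passed in are distinct.

```go
package main

import "fmt"

// checkShardSubclusters validates the distinct subclusters that hold a shard.
// partial reports which subcluster IDs are partial (hypothetical input shape).
func checkShardSubclusters(shardID uint32, subclusters []uint32, partial map[uint32]bool) error {
	switch {
	case len(subclusters) > 2:
		// A shard can span at most two subclusters, and only while moving.
		return fmt.Errorf(
			"invalid shard %d, expected at most 2 subclusters (only during shard movement), actual %d",
			shardID, len(subclusters))
	case len(subclusters) == 2 && !partial[subclusters[0]] && !partial[subclusters[1]]:
		// Both subclusters are complete, so this cannot be a scale operation.
		return fmt.Errorf(
			"invalid shard %d, expected subcluster id %d, actual %d",
			shardID, subclusters[0], subclusters[1])
	}
	return nil
}

func main() {
	partial := map[uint32]bool{3: true} // subcluster 3 is partial

	fmt.Println(checkShardSubclusters(7, []uint32{1, 3}, partial))    // moving to a partial subcluster
	fmt.Println(checkShardSubclusters(7, []uint32{1, 2}, partial))    // shared by two complete subclusters
	fmt.Println(checkShardSubclusters(7, []uint32{1, 2, 3}, partial)) // more than two subclusters
}
```

Only the first call returns nil; the other two return the second and first error formats respectively.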
Isolation Group Violations
- shard with wrong isolation group count: "invalid shard %d, expected %d isolation groups, actual %d"
- shard with insufficient isolation groups: "invalid shard %d, expected 3 isolation groups, actual 2"
Validation code: lines 526-529.
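A minimal sketch of the isolation-group check, assuming the caller has already collected the isolation group of every instance that holds a replica of the shard. The `checkIsolationGroups` helper and its input shape are hypothetical; only the error format comes from the test expectations.

```go
package main

import "fmt"

// checkIsolationGroups returns an error unless the replicas of a shard sit in
// exactly replicaFactor distinct isolation groups. groups lists the isolation
// group of each instance holding the shard (hypothetical input shape).
func checkIsolationGroups(shardID uint32, groups []string, replicaFactor int) error {
	distinct := make(map[string]struct{}, len(groups))
	for _, g := range groups {
		distinct[g] = struct{}{}
	}
	if len(distinct) != replicaFactor {
		return fmt.Errorf(
			"invalid shard %d, expected %d isolation groups, actual %d",
			shardID, replicaFactor, len(distinct))
	}
	return nil
}

func main() {
	// Three replicas, but two of them share isolation group "g1".
	fmt.Println(checkIsolationGroups(5, []string{"g1", "g1", "g2"}, 3))
	// prints: invalid shard 5, expected 3 isolation groups, actual 2
}
```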
Rationale: Each shard replica must be in a unique isolation group for fault tolerance. If replicaFactor=3, the shard must exist in exactly 3 different isolation groups.
Instance Configuration Violations
- instance with uninitialized subcluster ID (checked in the Validate() function)
Note: This is validated in the parent Validate() function before validateSubclusteredPlacement is called.
Summary of Validation Rules
A placement is rejected if any of the following holds:
- instances > instancesPerSubCluster
- partialSubclusters > 1
- subclusters > 2
- isolationGroups ≠ replicaFactor
Shard Movement Rules During Scale Operations
Valid Shard Movement
Invalid Shard Movement