-
Notifications
You must be signed in to change notification settings - Fork 40
Automatically update replica count for az compatibility #1394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatically update replica count for az compatibility #1394
Conversation
Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
.exitCode(INVALID_PARAMETER_CODE) | ||
.errorMessage("Invalid parameter: " + pe.getMessage()) | ||
.build(); | ||
.exitCode(INVALID_PARAMETER_CODE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, this is my IDE being out of sync with the usual settings. I'll go through and clean up the indents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a PR once upon a time for an intellij formatter [1], there is a VS code in the project too. Maybe that could help. 🤷
// Check whether any indices failed with an incompatibleReplicaCount | ||
boolean incompatibleReplicaCountSeen = indexResults.getIndexes().stream().anyMatch(result -> result.wasFatal() && result.getFailureType().equals(CreationResult.CreationFailureType.INCOMPATIBLE_REPLICA_COUNT_FAILURE)); | ||
if (incompatibleReplicaCountSeen) { | ||
if (presumedClusterDimensionality >= arguments.minNumberOfReplicas + MAX_REPLICA_ADJUSTMENT_LOOPS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given cluster.routing.allocation.awareness.force.zone.values can be arbritraily high, can we parse out the response and skip straight to the next-higher valid value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we definitely could. I went with the approach that we defined in the ticket refinement, but this would also be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for revisiting this, couple of thought
List<CreationResult> creationResults; | ||
if (skipCreation.test(index.getName())) { | ||
log.atInfo() | ||
for (SnapshotRepo.Index index : repoDataProvider.getIndicesInSnapshot(snapshotName)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it shifted from a stream -> for loop, is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whitespace only, lets revert?
} | ||
presumedClusterDimensionality++; | ||
log.warn("Incompatible replica count seen for the cluster dimensionality. Retrying with an assumed cluster dimensionality of {}", presumedClusterDimensionality); | ||
transformer = selectTransformer(clusters, presumedClusterDimensionality); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of spooky - we theoretically 'switch' which transformer is being used min-flight based on behavior. Can we replumb how this value get passed around to avoid this high level swap?
ALREADY_EXISTS(false, "already exists"), | ||
UNABLE_TO_TRANSFORM_FAILURE(true, "failed to transform to the target version"), | ||
TARGET_CLUSTER_FAILURE(true, "failed on target cluster"), | ||
INCOMPATIBLE_REPLICA_COUNT_FAILURE(true, "failed due to incompatible replica count for awareness attribute count"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we include the CLI value to add to resolve this for our users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revisiting after finishing the review, maybe this doesn't get sent to users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will only get sent to users if we can't find a working replica count after 4 increments. By the CLI value to add
-- what do you mean? The right replica count to set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test case doesn't seem to catch this case, but also this seems highly unlikely since AWS AZ are going to be a number between 2-6.
this.containerVersion = version; | ||
} | ||
|
||
public SearchClusterContainer(final ContainerVersion version, Map<String, String> supplementaryEnvVariables) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is replacing the other constructor implementation causing code duplicate, lets instead have the previous version call the new constructor with an empty map.
for instance), but the type is different (either `validation_exception` or `illegal_argument_exception`). For this reason, | ||
we're matching based on a regex against the message instead of also checking the type. | ||
**/ | ||
public Optional<String> containsAwarenessAttributeException() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the message need to be parsed and returned by this function, that would aid the user if we displayed it to them? If not, this function could be reduced to a return a boolean
value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should revisit exactly what we're doing with it down the line, but the reason I kept the whole thing is because it looks like Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3];
and that number at the end shows the number of awareness attributes, which could be very useful for a user trying to troubleshoot/understand this behavior.
result.failureType(CreationFailureType.ALREADY_EXISTS); | ||
} | ||
} catch (InvalidResponse invalidResponse) { | ||
var potentialAwarenessAttributeException = invalidResponse.containsAwarenessAttributeException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming is a little messy, not sure if this was caught in a bulk rename.
var potentialAwarenessAttributeException = invalidResponse.containsAwarenessAttributeException(); | |
var hasInvalidReplicaCountForZoneSize = invalidResponse.containsAwarenessAttributeException(); |
private static Stream<Arguments> scenarios() { | ||
return Stream.of( | ||
Arguments.of(1), | ||
Arguments.of(2), | ||
Arguments.of(3) | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a test coverage benefit to running this more than once, seems like one test that detect this scenario would be good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason to do it is primarily that these match the scenarios customers may have and the behavior--at least for 1 AZ vs more-than-one is different. I'm open to removing the 2 AZ case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 AZ doesn't seem like an interesting case, because any number will match, no?
} | ||
} | ||
|
||
void verifyIndexesExistOnTargetCluster(List<String> indexNames) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like the new INCOMPATIBLE_REPLICA_COUNT_FAILURE is encountered during this test, is that scenario possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for the number of zones + max number of loops we're testing here. I could set up a test with 5+ zones that would cause it to be emitted, but I didn't because 1/ that's going to be a very slow test, and 2/ while a possible situation for a self-managed service cluster, I don't think it's possible in AOS (and unlikely no matter what)
public class ReplicaCountWithAZsTest extends BaseMigrationTest{ | ||
|
||
private static final SearchClusterContainer.ContainerVersion SOURCE_VERSION = SearchClusterContainer.ES_V7_10_2; | ||
private static final SearchClusterContainer.ContainerVersion TARGET_VERSION = SearchClusterContainer.OS_LATEST; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its strange that this is unused, maybe the zoned search cluster is a static class for only this test case
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1394 +/- ##
============================
============================
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Description
This updates how the metadata migration tool sets replica counts in cases where the cluster in question (usually, but not necessarily in the AWS managed service) has enforced balancing across awareness attributes. Usually, awareness attributes means availability zones, though it could be other factors. See this page for more information on how it works in opensearch: https://opensearch.org/docs/latest/tuning-your-cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness
In this PR, we take an approach to updating replica counts where we begin with the value from the metadata command arguments. If we see an incompatible replica count error, we increment the presumed dimensionality used by the transformer (which sets the minimum replica count for each index to dimensionality - 1) and try migrating the indices again. We repeat this up to 4 times, if necessary, but exit as soon as we no longer see an incompatible replica count error.
Issues Resolved
https://opensearch.atlassian.net/browse/MIGRATIONS-2185
Testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.