Skip to content

Conversation

mikaylathompson
Copy link
Collaborator

@mikaylathompson mikaylathompson commented Mar 27, 2025

Description

This updates how the metadata migration tool sets replica counts in cases where the cluster in question (usually, but not necessarily in the AWS managed service) has enforced balancing across awareness attributes. Usually, awareness attributes means availability zones, though it could be other factors. See this page for more information on how it works in opensearch: https://opensearch.org/docs/latest/tuning-your-cluster/#advanced-step-6-configure-shard-allocation-awareness-or-forced-awareness

In this PR, we take an approach to updating replica counts where we begin with the value from the metadata command arguments. If we see an incompatible replica count error, we increment the presumed dimensionality used by the transformer (which sets the minimum replica count for each index to dimensionality - 1) and try migrating the indices again. We repeat this up to 4 times, if necessary, but exit as soon as we no longer see an incompatible replica count error.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-2185

Testing

Check List

  • New functionality includes testing
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
Signed-off-by: Mikayla Thompson <[email protected]>
.exitCode(INVALID_PARAMETER_CODE)
.errorMessage("Invalid parameter: " + pe.getMessage())
.build();
.exitCode(INVALID_PARAMETER_CODE)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, this is my IDE being out of sync with the usual settings. I'll go through and clean up the indents.

Copy link
Member

@peternied peternied Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a PR once upon a time for an intellij formatter [1], there is a VS code in the project too. Maybe that could help. 🤷

// Check whether any indices failed with an incompatibleReplicaCount
boolean incompatibleReplicaCountSeen = indexResults.getIndexes().stream().anyMatch(result -> result.wasFatal() && result.getFailureType().equals(CreationResult.CreationFailureType.INCOMPATIBLE_REPLICA_COUNT_FAILURE));
if (incompatibleReplicaCountSeen) {
if (presumedClusterDimensionality >= arguments.minNumberOfReplicas + MAX_REPLICA_ADJUSTMENT_LOOPS) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given cluster.routing.allocation.awareness.force.zone.values can be arbritraily high, can we parse out the response and skip straight to the next-higher valid value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we definitely could. I went with the approach that we defined in the ticket refinement, but this would also be possible.

Copy link
Member

@peternied peternied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for revisiting this, couple of thought

List<CreationResult> creationResults;
if (skipCreation.test(index.getName())) {
log.atInfo()
for (SnapshotRepo.Index index : repoDataProvider.getIndicesInSnapshot(snapshotName)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it shifted from a stream -> for loop, is this needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace only, lets revert?

}
presumedClusterDimensionality++;
log.warn("Incompatible replica count seen for the cluster dimensionality. Retrying with an assumed cluster dimensionality of {}", presumedClusterDimensionality);
transformer = selectTransformer(clusters, presumedClusterDimensionality);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of spooky - we theoretically 'switch' which transformer is being used min-flight based on behavior. Can we replumb how this value get passed around to avoid this high level swap?

ALREADY_EXISTS(false, "already exists"),
UNABLE_TO_TRANSFORM_FAILURE(true, "failed to transform to the target version"),
TARGET_CLUSTER_FAILURE(true, "failed on target cluster"),
INCOMPATIBLE_REPLICA_COUNT_FAILURE(true, "failed due to incompatible replica count for awareness attribute count"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include the CLI value to add to resolve this for our users?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisiting after finishing the review, maybe this doesn't get sent to users?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will only get sent to users if we can't find a working replica count after 4 increments. By the CLI value to add -- what do you mean? The right replica count to set?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case doesn't seem to catch this case, but also this seems highly unlikely since AWS AZ are going to be a number between 2-6.

this.containerVersion = version;
}

public SearchClusterContainer(final ContainerVersion version, Map<String, String> supplementaryEnvVariables) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is replacing the other constructor implementation causing code duplicate, lets instead have the previous version call the new constructor with an empty map.

for instance), but the type is different (either `validation_exception` or `illegal_argument_exception`). For this reason,
we're matching based on a regex against the message instead of also checking the type.
**/
public Optional<String> containsAwarenessAttributeException() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the message need to be parsed and returned by this function, that would aid the user if we displayed it to them? If not, this function could be reduced to a return a boolean value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should revisit exactly what we're doing with it down the line, but the reason I kept the whole thing is because it looks like Validation Failed: 1: expected total copies needs to be a multiple of total awareness attributes [3]; and that number at the end shows the number of awareness attributes, which could be very useful for a user trying to troubleshoot/understand this behavior.

result.failureType(CreationFailureType.ALREADY_EXISTS);
}
} catch (InvalidResponse invalidResponse) {
var potentialAwarenessAttributeException = invalidResponse.containsAwarenessAttributeException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming is a little messy, not sure if this was caught in a bulk rename.

Suggested change
var potentialAwarenessAttributeException = invalidResponse.containsAwarenessAttributeException();
var hasInvalidReplicaCountForZoneSize = invalidResponse.containsAwarenessAttributeException();

Comment on lines +26 to +32
private static Stream<Arguments> scenarios() {
return Stream.of(
Arguments.of(1),
Arguments.of(2),
Arguments.of(3)
);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a test coverage benefit to running this more than once, seems like one test that detect this scenario would be good enough?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason to do it is primarily that these match the scenarios customers may have and the behavior--at least for 1 AZ vs more-than-one is different. I'm open to removing the 2 AZ case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 AZ doesn't seem like an interesting case, because any number will match, no?

}
}

void verifyIndexesExistOnTargetCluster(List<String> indexNames) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like the new INCOMPATIBLE_REPLICA_COUNT_FAILURE is encountered during this test, is that scenario possible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for the number of zones + max number of loops we're testing here. I could set up a test with 5+ zones that would cause it to be emitted, but I didn't because 1/ that's going to be a very slow test, and 2/ while a possible situation for a self-managed service cluster, I don't think it's possible in AOS (and unlikely no matter what)

public class ReplicaCountWithAZsTest extends BaseMigrationTest{

private static final SearchClusterContainer.ContainerVersion SOURCE_VERSION = SearchClusterContainer.ES_V7_10_2;
private static final SearchClusterContainer.ContainerVersion TARGET_VERSION = SearchClusterContainer.OS_LATEST;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its strange that this is unused, maybe the zoned search cluster is a static class for only this test case

Copy link

codecov bot commented Apr 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (d6fd172) to head (8b535f2).
Report is 16 commits behind head on main.

Additional details and impacted files
@@     Coverage Diff      @@
##   main   #1394   +/-   ##
============================
============================

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants