Addition of optional parameter allow_red_cluster to allow ISM to run on a red OpenSearch cluster #1189
base: main
Conversation
…un on a red cluster Signed-off-by: Sanjay Kumar <[email protected]>
Signed-off-by: Sanjay Kumar <[email protected]>
Hi, requesting feedback from the maintainers :)
Hi all, I was trying to write a test for the above feature and am facing issues simulating a red cluster. I tried to bring down a data node to make the cluster go red, but since the integTest clusters are immutable, that was not possible. Is there any other approach I can use to simulate the red cluster?
You can specify an allocation policy for an index; an example is sketched below. It will create an index whose shards cannot be allocated, which turns the index (and cluster health) red.
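A minimal sketch of that approach, assuming the standard low-level REST client available in an integ test (client() from the REST test base class); the index name sim-red matches the one used later in this PR's test, and box_type is just an arbitrary node attribute that no node in the test cluster has:

import org.opensearch.client.Request

// Create an index whose shards may only be allocated to nodes with a
// node attribute no node has; the primary stays unassigned, so the
// index (and cluster health) turns red.
val request = Request("PUT", "/sim-red")
request.setJsonEntity(
    """
    {
      "settings": {
        "index.routing.allocation.require.box_type": "does-not-exist",
        "index.number_of_replicas": 0
      }
    }
    """.trimIndent()
)
client().performRequest(request)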
Signed-off-by: Sanjay Kumar <[email protected]>
Signed-off-by: Sanjay Kumar <[email protected]>
@@ -250,11 +250,6 @@ object ManagedIndexRunner :
@Suppress("ReturnCount", "ComplexMethod", "LongMethod", "ComplexCondition", "NestedBlockDepth")
private suspend fun runManagedIndexConfig(managedIndexConfig: ManagedIndexConfig, jobContext: JobExecutionContext) {
    logger.debug("Run job for index ${managedIndexConfig.index}")
    // doing a check of local cluster health as we do not want to overload cluster manager node with potentially a lot of calls
We shouldn't move this red cluster check down, because it avoids making one or more calls below that could potentially worsen a red cluster.
Instead, we would need to rely on the ManagedIndexConfig, which we have directly here. It holds a snapshot of the policy; only when the policy has allow_red_cluster would we bypass this check.
The problem is that ManagedIndexConfig doesn't know which state or action is running; only the metadata knows. So we won't be able to bypass the check only when a specific state or action is running. To do that, more complex logic would be needed to update ManagedIndexConfig with the required info...
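A minimal sketch of what that early check could look like, assuming a hypothetical allowRedCluster flag on each State in the policy snapshot (the flag does not exist yet; the name is illustrative):

// Rely only on the policy snapshot already present on ManagedIndexConfig,
// so no extra cluster calls are made while the cluster is red.
// "allowRedCluster" is the hypothetical per-state flag this PR would add.
val policyAllowsRedCluster = managedIndexConfig.policy.states.any { it.allowRedCluster }
if (clusterIsRed() && !policyAllowsRedCluster) {
    logger.debug("Skipping current execution of ${managedIndexConfig.index} because of red cluster health")
    return
}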
Hi @bowenlan-amzn,
The calls we're making to the cluster are mostly GET operations, to fetch state etc., so I'm not sure how they would worsen a red cluster. I can still try to avoid them, but I'm not sure of the right approach.
This is my understanding of your suggestion; please correct where required:
- Move the check back to its original location.
- Parse the policy to figure out whether "allow_red_cluster" is present anywhere in it. If it is not, return as we do in the existing code.
- Else, allow the run to proceed further.
(At this stage, we don't know for which state this is going to be executed, as we cannot run any GET calls.)
Challenges:
- The allow_red_cluster flag has been added at the state level, i.e. some states in a policy can have it enabled and some might not.
- After we have allowed it to bypass the initial policy check, we again require a check at the state/action level for whether the parameter is set to true or false, in order to run or skip that state/action.
That means, for a state that didn't have allow_red_cluster set to true, we would still execute those 1-2 GET calls before skipping it, which again could potentially worsen the cluster.
Questions:
- How would we be able to get the value of the "allow_red_cluster" parameter for each state before we even get the metadata? Can you throw some light on how to implement this logic in ManagedIndexConfig?
- If we still need to place the check at the top level only, we could move the parameter to the policy level itself instead of the state level (as that can be fetched from the ManagedIndexConfig).
That would mean running all states/actions (if the flag is true), or running nothing (if the flag is false).
But with this, the concern arises that any ISM action that doesn't support a red cluster, or would harm a red cluster, would also be allowed to run on it.
> The calls we're making to the cluster are mostly GET operations, to fetch state etc., so I'm not sure how they would worsen a red cluster.

Right, these are GET calls, but consider that this small change will affect all the ISM jobs, which can number 10k+ in a cluster. We should still keep the original logic: don't do anything if the cluster is red.
> That would mean running all states/actions (if the flag is true), or running nothing (if the flag is false).
> But with this, the concern arises that any ISM action that doesn't support a red cluster, or would harm a red cluster, would also be allowed to run on it.
We don't need new logic in ManagedIndexConfig. ManagedIndexConfig has the policy in it:
Line 38 in dbd2bc2
val policy: Policy,
I'm thinking of a slightly hacky solution: maybe have something like allow_red_cluster_after: 3d, then compare the current time with jobEnabledTime, which seems to be the time when the job is created. That way we can calculate when to start allowing the red cluster; before then, we still block execution on a red cluster.
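A rough sketch of that comparison, assuming allow_red_cluster_after is parsed from the policy into a Duration (field names are illustrative, not the final implementation):

import java.time.Duration
import java.time.Instant

// Hypothetical value parsed from "allow_red_cluster_after: 3d"
val allowRedClusterAfter: Duration = Duration.ofDays(3)

// Block execution on a red cluster until the configured window after
// jobEnabledTime has elapsed.
val allowRedClusterFrom: Instant? = managedIndexConfig.jobEnabledTime?.plus(allowRedClusterAfter)
if (clusterIsRed() && (allowRedClusterFrom == null || Instant.now().isBefore(allowRedClusterFrom))) {
    return // still inside the blocking window; skip this run
}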
Hi @bowenlan-amzn,
I am guessing the idea being suggested here is that we give the user a time window to decide, such that if the cluster continues to remain in the red state for that long, then ISM would be allowed to run.
A few follow-up questions:
- Let's say the policy triggered the job on day 1 and the cluster state goes red after 10 days. The job would immediately run on the red cluster, as the difference between jobEnabledTime and the current time would already have exceeded 3 days. In this case, the red cluster would be bypassed regardless of the 3d duration set in the parameter, and it would not wait in the red state for 3d. Is this understanding correct?
- Is there any way to retrieve how long the cluster state has been red? Just thinking out loud on whether we could instead check the duration for which the cluster state has been red and compare that with allow_red_cluster_after: 3d.
Hi @bowenlan-amzn
I checked a bit on this; sharing my observations.
jobEnabledTime is the time when the ISM job for the index gets created. It does not get updated as the running state/action changes, and remains the same until the job completes for an index (i.e. when the policy is marked complete).
Let's say a policy is created that deletes indices once they reach age > 5d, and allow_red_cluster_after: 3d is also set.
- On day 1, the job starts and jobEnabledTime gets set to this time.
- On day 4, the cluster goes red.
- On day 5, more than 3d have already passed since jobEnabledTime, so the red cluster check would get bypassed.
So if the intention was to make the job wait in the red state for the configured time (3d), comparing with jobEnabledTime does not work as expected.
We could, however, make use of the action startTime from the metadata.
For example, on day 5 the delete action would get started, but it would be skipped because the cluster is red. This action's start time could be compared with the current time, and once the difference exceeds 3d, the action could be allowed to proceed.
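A possible shape of that lookup, as a hedged sketch: it assumes ActionMetaData carries the running action's start time as epoch milliseconds, and falls back to "now" when no action has started yet (so the 3d window never appears elapsed in that case):

import java.time.Instant

// Resolve the running action's start time from the managed index metadata.
fun getActionStartTime(metadata: ManagedIndexMetaData): Instant {
    val startTimeMillis = metadata.actionMetaData?.startTime ?: return Instant.now()
    return Instant.ofEpochMilli(startTimeMillis)
}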
Any thoughts on this...?
Hi @bowenlan-amzn
Following up on my previous comment, I have tried this approach:

-- Job starts --
if cluster health is red:
    if allow_red_cluster_after is enabled and (current time - action start time) > allow_red_cluster_after:
        proceed
    else:
        return, skip execution

-- At state execution --
if cluster is red:
    if action is delete:
        proceed
    else:
        return, skip execution

This basically means that if health is red and the action is stuck, the job waits for the configured duration (3d) in that situation and only then allows further execution if the flag is set.
At the action execution, it additionally rechecks that execution is allowed only for the delete action.
Code snippets for reference:
val indexMetadata = client.getManagedIndexMetadata(managedIndexConfig.indexUuid)
val actionStartTime = indexMetadata.first?.let { getActionStartTime(it) }

if (clusterIsRed()) {
    // A null actionStartTime means no action has started yet; keep blocking in that case.
    if (allowDeleteOnRedEnabled && actionStartTime != null &&
        Instant.now().isAfter(actionStartTime.plusSeconds(allowRedClusterAfter.seconds))
    ) {
        logger.info("Allowing ISM to run after waiting for more than $allowRedClusterAfter in red state")
    } else {
        logger.info("Skipping current execution of ${managedIndexConfig.index} because of red cluster health")
        return
    }
}
// .....
// At the state execution phase
val action: Action? = state?.getActionToExecute(managedIndexMetaData, indexMetadataProvider)
if (clusterIsRed()) {
    // If execution has reached this point in the red state, the allow_red_cluster_after duration has already passed.
    if (action is DeleteAction) {
        logger.info("Allowing only DeleteAction during red status")
    } else {
        logger.debug("Skipping current execution of ${managedIndexConfig.index} because of red cluster health")
        return
    }
}
With this approach, no other action or transition happens in the red state.
If an index had already transitioned and was almost ready to execute the delete action when the cluster became red, only that index undergoes the delete action; the rest remain stuck until cluster health improves.
Please share your thoughts on this.
Hi @bowenlan-amzn
Any thoughts on my previous comment? 🙂
val managedIndexConfig = getExistingManagedIndexConfig(indexName)

val endpoint = "sim-red"
Would it make sense to not use a different index, but just use val indexName = "test-index-2" to simulate the red cluster situation?
@skumarp7
Instead of allowing all ISM actions to bypass a red cluster, can we enable this specifically for actions that potentially move towards reducing cluster redness? For instance, deleting an index moves towards reducing the red state.
To keep some actions from worsening the cluster, we probably need scoping; some, like create index, rollover, etc., will always add to the red state. So let's not even add any means to allow those actions on a red cluster.
There are cases where shard assignments are blocked for specific reasons, but handling those specific scenarios and evaluating which actions to allow may be quite complex to achieve.
To summarize, I am aligned with allowing execution of ISM actions that can either improve the cluster state or are neutral, but definitely not actions that can make it worse. One way to express that scoping is sketched below.
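A hedged sketch of that scoping, as an allow-list of action types considered safe on a red cluster (names are illustrative; it assumes the action's type string is accessible on the Action class):

// Only actions that reduce, or at least do not add to, red state may
// bypass the red-cluster guard. Initially only "delete" would qualify.
val redClusterSafeActionTypes = setOf("delete")

fun isAllowedOnRedCluster(action: Action?): Boolean =
    action != null && action.type in redClusterSafeActionTypes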
Issue #, if available:
#1127
Description of changes:
This PR adds a provision for ISM to run on a red OpenSearch cluster.
Behaviour before:
The current implementation of the ManagedIndexRunner job prohibits any calls to the cluster when the cluster is red.
index-management/src/main/kotlin/org/opensearch/indexmanagement/indexstatemanagement/ManagedIndexRunner.kt
Line 254 in a049fdb
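The guard at that line is, roughly, the following (a paraphrased sketch, not the exact source):

// Existing behaviour: skip the whole job run whenever local cluster
// health is red, before making any further calls.
if (clusterIsRed()) {
    logger.debug("Skipping current execution of ${managedIndexConfig.index} because of red cluster health")
    return
}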
If the cluster goes to a red state due to a resource crunch, ISM cannot delete any indices because of the above constraint.
Consider, for example, that each index is 1 GB in size and the persistent volume of the OpenSearch cluster is 10 GB.
ISM on a green/yellow OpenSearch cluster: [screenshot]
ISM on a red OpenSearch cluster: [screenshot]
Behaviour after the changes:
The user has the option to set "allow_red_cluster" for each state; all operations in that state can run on a red cluster if "allow_red_cluster" is set to true. This parameter is optional.
If the user sets the parameter to "true", the ISM job is allowed to run that particular state on a red cluster; any transitions and actions under that state will be executed.
By default, the value of "allow_red_cluster" is false, so the current behaviour is unchanged if the user doesn't want ISM to run on a red cluster. An example policy is sketched below.
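As an illustration, a policy using the new parameter might look like the following sketch (field placement follows the state-level design described above; the scenario mirrors the 5-day delete example discussed in this thread):

// Hypothetical ISM policy, written as a Kotlin raw string the way integ
// tests usually define policies: the "delete" state opts in to running
// on a red cluster via the new state-level "allow_red_cluster" flag.
val policyJson = """
{
  "policy": {
    "description": "Delete indices older than 5 days, even on a red cluster",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "5d" } }
        ]
      },
      {
        "name": "delete",
        "allow_red_cluster": true,
        "actions": [{ "delete": {} }],
        "transitions": []
      }
    ]
  }
}
""".trimIndent()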
ISM on a green/yellow OpenSearch cluster: [screenshot]
ISM on a red OpenSearch cluster with the optional parameter: [screenshot]
Checklist:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.