
Conversation

nagarajg17 (Contributor) commented Sep 14, 2025

Description


  • Category (Enhancement, New feature, Bug fix, Test fix, Refactoring, Maintenance, Documentation): Enhancement

  • Why these changes are required?

    • Security initialization is not time bound
    • Fatal errors are silently caught without being thrown
    • Indexing activity can get stuck because there is no timeout on an overloaded cluster
  • What is the old behavior before changes and new behavior after changes?

Fatal errors, even if thrown, do not stop or fail the ES process; the node stays healthy with no security module initialized. This is because the security init process runs in a separate thread, so even if we throw fatal errors there instead of silently catching and ignoring them as we do today, we cannot fail the loading of the security plugin from that thread. However, throwing the error at least makes it clear from the logs why initialization failed.

Adding timeouts and making the infinite loops time bound (5 minutes) makes the initialization process more predictable and reliable and increases its visibility. Improving the health check API, adding a watcher for it, and wiring in an auto cut-off and auto-remediation mechanism (ES process restart) when it fails will help resolve this class of issue. A minimal sketch of the time-bound retry idea is shown below.
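
For illustration only, a minimal sketch (not the actual patch) of how a previously unbounded retry loop can be given a deadline; the class and constant names, the 5-minute bound, and the 3-second interval are assumptions for the example:

    import java.time.Duration;
    import java.time.Instant;

    public final class BoundedRetry {

        private static final Duration INIT_TIMEOUT = Duration.ofMinutes(5);
        private static final Duration RETRY_INTERVAL = Duration.ofSeconds(3);

        // Retries the supplied action until it succeeds or the deadline passes,
        // instead of looping forever.
        public static void retryUntilDeadline(Runnable action) throws InterruptedException {
            final Instant deadline = Instant.now().plus(INIT_TIMEOUT);
            while (true) {
                try {
                    action.run();
                    return; // success
                } catch (Exception e) {
                    if (Instant.now().isAfter(deadline)) {
                        // Surface the failure instead of silently swallowing it.
                        throw new IllegalStateException("Security initialization timed out after " + INIT_TIMEOUT, e);
                    }
                    Thread.sleep(RETRY_INTERVAL.toMillis());
                }
            }
        }
    }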

Issues Resolved

#5603

Is this a backport? If so, please add backport PR # and/or commits #, and remove backport-failed label from the original PR.

Do these changes introduce new permission(s) to be displayed in the static dropdown on the front-end? If so, please open a draft PR in the security dashboards plugin and link the draft PR here

Testing

Tested on a local cluster: security initialization is successful. Also induced timeout errors and verified that the error stack is present in the logs.

Check List

  • New functionality includes testing
  • New functionality has been documented
  • New Roles/Permissions have a corresponding security dashboards plugin PR
  • API changes companion pull request created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Sep 14, 2025

Codecov Report

❌ Patch coverage is 25.92593% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.88%. Comparing base (259103a) to head (c0401fe).

Files with missing lines                                 Patch %   Lines
...ecurity/configuration/ConfigurationRepository.java    22.72%    15 Missing and 2 partials ⚠️
.../org/opensearch/security/support/ConfigHelper.java    25.00%    3 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5625      +/-   ##
==========================================
- Coverage   72.95%   72.88%   -0.07%     
==========================================
  Files         408      408              
  Lines       25386    25399      +13     
  Branches     3862     3863       +1     
==========================================
- Hits        18520    18512       -8     
- Misses       4991     5011      +20     
- Partials     1875     1876       +1     
Files with missing lines                                 Coverage Δ
...g/opensearch/security/support/ConfigConstants.java    95.45% <100.00%> (+0.21%) ⬆️
.../org/opensearch/security/support/ConfigHelper.java    83.33% <25.00%> (-4.91%) ⬇️
...ecurity/configuration/ConfigurationRepository.java    76.59% <22.72%> (-4.00%) ⬇️

... and 8 files with indirect coverage changes


nagarajg17 force-pushed the security-init-fixes branch 2 times, most recently from 6951689 to 2b611cf on September 16, 2025 15:50
shikharj05 (Collaborator)

Thanks, can you add tests to verify the behavior?

    break;
} catch (Exception e) {
-   LOGGER.debug("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
+   LOGGER.error("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));


Thanks, @nagarajg17, for the improvement.

Please keep the logging format consistent when logging the error, i.e.,

LOGGER.error("Unable to load configuration due to", e);

Collaborator

We are here in a retry loop, which is potentially executed every 3 seconds. Thus, we might log an error message every 3 seconds. Are we sure this won't spam the logs in some cases?

Contributor Author

The problem we see is that since this is logged at debug level, there are no debug logs to look at when we investigate initialization failures, which makes it hard to debug. Since it is in the catch block, do you see a case where it would get logged every 3 seconds? Ideally we don't expect the code flow to reach here, right?

Collaborator

I cannot really say if we expect control flow to reach this place. I guess this needs to be found out by testing.

    .actionGet();
} catch (Exception e) {
-   LOGGER.debug("Caught again a {} but we just try again ...", e.toString());
+   LOGGER.error("Caught again a {} but we just try again ...", e.toString());


Please keep the logging format consistent when logging the error, i.e.,

 LOGGER.error("Caught again but we just try again ...", e);

Collaborator

We are here in a retry loop

        while (response == null || response.isTimedOut() || response.getStatus() == ClusterHealthStatus.RED) {

That means that potentially we will log error messages repeatedly. Are we sure this won't be spamming the logs with bogus error messages in "good" cases?

Contributor Author

Should we use warn level? Since it's an infinite while loop, how do you suggest we get visibility when the code is stuck in this block?

Collaborator

Maybe: only start logging errors after being in this loop for more than 2 minutes (a minimal sketch of this idea follows after this comment)? I think one has to experiment a bit (with big clusters) to find a good value for this.

If you had issues with stuck threads: Did you collect thread dumps? That is usually a good way to see what's going on.
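
For illustration, a sketch of the escalate-after-a-quiet-period idea mentioned above; the helper class name, the 2-minute threshold, and the message text are assumptions, not part of the actual patch:

    import java.time.Duration;
    import java.time.Instant;

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public final class EscalatingRetryLogger {

        private static final Logger LOGGER = LogManager.getLogger(EscalatingRetryLogger.class);
        private static final Duration QUIET_PERIOD = Duration.ofMinutes(2);

        // Log quietly at debug while the retry loop is young; escalate to error once it has
        // been spinning longer than the quiet period.
        public static void logRetryFailure(Instant loopStart, Exception e) {
            Duration elapsed = Duration.between(loopStart, Instant.now());
            if (elapsed.compareTo(QUIET_PERIOD) > 0) {
                LOGGER.error("Still unable to load configuration after {}", elapsed, e);
            } else {
                LOGGER.debug("Unable to load configuration, retrying ...", e);
            }
        }
    }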


public class ConfigurationRepository implements ClusterStateListener, IndexEventListener {
    private static final Logger LOGGER = LogManager.getLogger(ConfigurationRepository.class);
    private static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;


It would be good to move these constants into src/main/java/org/opensearch/security/support/ConfigConstants.java (see the small sketch below).
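
Purely illustrative: a trimmed-down sketch of what that move could look like; whether the constant keeps this exact name or placement is an open assumption, and the existing ConfigConstants content is omitted here.

    public class ConfigConstants {
        // Retry interval for the security initialization loop, centralized here.
        public static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;
    }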

final IndexRequest indexRequest = new IndexRequest(index).id(configType)
    .opType(OpType.CREATE)
    .setRefreshPolicy(RefreshPolicy.IMMEDIATE)
+   .timeout(DEFAULT_TIMEOUT)


What happens when a cluster is large and overloaded, taking more than 30 seconds for indexing? Shouldn't the timeout value be configurable like a normal indexing API?

Contributor Author

What is the timeout for the normal indexing API?

    .actionGet();
} catch (Exception e) {
-   LOGGER.debug("Caught a {} but we just try again ...", e.toString());
+   LOGGER.error("Caught a {} but we just try again ...", e.toString());


Please keep the logging format consistent when logging the error, i.e.,

LOGGER.error("Caught exception but we just try again ...", e);

nibix (Collaborator) left a comment

Thank you for the work!

On a high level, could you explain the approach? It is mostly just making the init procedure in case of an absent security index time bound, correct?

Additional note: These changes only apply for the cases that the security index does not exist yet, i.e. a fresh cluster or a cluster where security is enabled at some point.

From the description of #5603, it was not exactly clear to me that this was only about this case. Are we also going to look at the case that the index does already exist or is this not a concern here?

} catch (Exception e) {
-   LOGGER.error("Cannot apply default config (this is maybe not an error!)", e);
+   LOGGER.error("Cannot apply default config", e);
+   throw e;
Collaborator

Could you elaborate on why it is ok to abort the process here by throwing an exception?

Contributor Author

The error is due to a failure in security index creation or default docs indexing (excluding the doc-already-exists error). Depending on the failure point, the config may be partially indexed or not indexed at all, so there is no real benefit in letting the code proceed when indexing itself is incomplete. Worse, it can leave the system in a partially initialized state where customer API calls receive 4xx/5xx errors, which isn't the desired behaviour.

Collaborator

From my experience, during the early node startup phase, OpenSearch can sometimes throw exceptions you do not see otherwise (something like "cluster state not initialized" or similar).

I cannot give you any concrete examples, but I would recommend testing this thoroughly to see whether it might trip over such exceptions.

Contributor Author

Got it. We can bake this change for a week and see if there are any unexpected regressions?

Collaborator

With "bake" do you mean comitting it to the repo and see if regressions show up in the nightlies of this or other repos? IMHO, that would be too little exposure to get an reliable impression. Or did you think of something else?


if (!dynamicConfigFactory.isInitialized()) {
    LOGGER.error("Node '{}' failed to initialize", clusterService.localNode().getName());
    throw new IllegalStateException(String.format("Node '%s' failed to initialize", clusterService.localNode().getName()));
Collaborator

This will be caught and logged in line 311:

        } catch (Exception e) {
            LOGGER.error("Unexpected exception while initializing node " + e, e);
        }

Thus, basically you will get the log message "Node '{}' failed to initialize" twice in a row.

Contributor Author

Ack, will remove line 303 where we are logging the first time (roughly as sketched below).
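
For illustration, a sketch of the assumed shape of that fix; the helper class and method names are made up for the example, and the single log line is left to the caller's existing catch block:

    final class InitCheckSketch {
        // Throw without logging here; the caller's catch block already logs
        // "Unexpected exception while initializing node ...", so the failure is reported once.
        static void ensureInitialized(boolean initialized, String nodeName) {
            if (!initialized) {
                throw new IllegalStateException(String.format("Node '%s' failed to initialize", nodeName));
            }
        }
    }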

public static final String OPENSEARCH_RESOURCE_SHARING_ENABLED = "plugins.security.experimental.resource_sharing.enabled";
public static final boolean OPENSEARCH_RESOURCE_SHARING_ENABLED_DEFAULT = false;

public static final TimeValue DEFAULT_TIMEOUT = TimeValue.timeValueSeconds(30);
Collaborator

I think we should not have a general-purpose global default timeout.

Timeout handling is tricky, thus we should look on a case-by-case basis at what timeouts are suitable. Initialization processes might need a much longer timeout than, for example, API calls triggered by users.

Contributor Author

Agree, I can remove this. The timeout currently used is for index creation and default docs indexing; do you see any scenario where that takes more than 30 seconds? What is your recommendation for an ideal timeout value here?

Collaborator

In the loop inside waitForSecurityIndexToBeAtLeastYellow() it does not matter so much IMHO, as it is retrying indefinitely. For the IndexRequest, one might consider increasing it to something more than the default of 1 minute, as reaching the timeout will be a fatal failure (a minimal sketch follows below). Yet, this is all only an "IMHO"; experience has to be collected by testing.
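
For illustration, a sketch of explicitly raising the indexing timeout above the 1-minute ReplicationRequest default; the 2-minute value and the helper class are assumptions for the example, not the actual patch:

    import org.opensearch.action.DocWriteRequest.OpType;
    import org.opensearch.action.index.IndexRequest;
    import org.opensearch.action.support.WriteRequest.RefreshPolicy;
    import org.opensearch.common.unit.TimeValue;

    final class IndexTimeoutSketch {
        // Build the security config index request with an explicit, longer timeout so an
        // overloaded cluster produces a visible failure instead of an indefinite hang.
        static IndexRequest defaultConfigRequest(String index, String configType) {
            return new IndexRequest(index).id(configType)
                .opType(OpType.CREATE)
                .setRefreshPolicy(RefreshPolicy.IMMEDIATE)
                .timeout(TimeValue.timeValueMinutes(2));
        }
    }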

response = client.admin()
    .cluster()
-   .health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus())
+   .health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus().timeout(DEFAULT_TIMEOUT))
Collaborator

ClusterHealthRequest and CreateIndexRequest already use 30s as a default value for timeouts.

Could you explain the rationale for setting the timeout value explicitly?

Contributor Author

We have seen a good number of cases where these operations get stuck, like below; in this case, during indexing of the 'rolesmapping' doc. Are we sure it is using the default value of 30 seconds? We have seen many such tickets.

3603:[2025-06-09T14:32:01,165][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'config' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/config.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4031:[2025-06-09T14:32:01,620][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'roles' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/roles.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4038:[2025-06-09T14:32:01,759][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'rolesmapping' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/roles_mapping.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false


4030:[2025-06-09T14:32:01,620][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Index .opendistro_security already contains doc with id config, skipping update.
4037:[2025-06-09T14:32:01,759][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Index .opendistro_security already contains doc with id roles, skipping update

Collaborator

The default timeout value is here:

https://github.com/opensearch-project/OpenSearch/blob/6fdb010ef91b075f4060a6d87da9beae681c2c37/server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthRequest.java#L65

I am not sure if I understand the logs correctly. Are these from two different nodes? Nothing follows after these logs?


Collaborator

Yes, 30 seconds is the default for ClusterHealthRequest. For indexing documents, the default is 1 minute:

https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/support/replication/ReplicationRequest.java


nagarajg17 (Contributor Author)

@nibix

> On a high level, could you explain the approach? It is mostly just making the init procedure in case of an absent security index time bound, correct?

Yes, making the init procedure time bound.

> Additional note: These changes only apply for the cases that the security index does not exist yet, i.e. a fresh cluster or a cluster where security is enabled at some point.

Yes, for a new cluster or when security is enabled at some point.

> From the description of #5603, it was not exactly clear to me that this was only about this case. Are we also going to look at the case that the index does already exist or is this not a concern here?

installDefaultConfig will still be true for a non-fresh cluster. When, for example, the ES process is restarted on a node, it goes through the same flow, but since the security index and docs are already present, it's a no-op.

nibix (Collaborator) commented Sep 26, 2025

> installDefaultConfig will still be true for a non-fresh cluster.

I guess this is the case for the managed service? From a general POV, this cannot be assumed IMHO.
