
Conversation

nagarajg17 (Contributor) commented Sep 14, 2025

Description


  • Category (Enhancement, New feature, Bug fix, Test fix, Refactoring, Maintenance, Documentation): Enhancement

  • Why these changes are required?

    • Security initialization is not time bound
    • Fatal errors are silently caught without being thrown
    • Indexing activity can get stuck because there is no timeout on an overloaded cluster
  • What is the old behavior before changes and new behavior after changes?

Fatal errors, even if thrown, do not stop or fail the ES process; the node stays healthy with no security module initialized. This is because the security init process runs in a separate thread, so even if we throw fatal errors there instead of silently catching and ignoring them as we do today, we cannot fail the loading of the security plugin from that thread. However, throwing the error at least makes it clear from the logs why initialization failed.

Adding timeouts and making the infinite loops time bound (5 minutes) makes the initialization process more predictable and reliable and increases its visibility. Improving the health check API, adding a watcher for it, and wiring in an auto cut-off and auto-remediation mechanism (ES process restart) when it fails will help resolve this class of issue. A minimal sketch of the time-bound retry idea is shown below.
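
For illustration only, a minimal sketch (not the actual patch) of how a previously unbounded retry loop can be given a deadline; the class and constant names, the 5-minute bound, and the 3-second interval are assumptions for the example:

    import java.time.Duration;
    import java.time.Instant;

    public final class BoundedRetry {

        private static final Duration INIT_TIMEOUT = Duration.ofMinutes(5);
        private static final Duration RETRY_INTERVAL = Duration.ofSeconds(3);

        // Retries the supplied action until it succeeds or the deadline passes,
        // instead of looping forever.
        public static void retryUntilDeadline(Runnable action) throws InterruptedException {
            final Instant deadline = Instant.now().plus(INIT_TIMEOUT);
            while (true) {
                try {
                    action.run();
                    return; // success
                } catch (Exception e) {
                    if (Instant.now().isAfter(deadline)) {
                        // Surface the failure instead of silently swallowing it.
                        throw new IllegalStateException("Security initialization timed out after " + INIT_TIMEOUT, e);
                    }
                    Thread.sleep(RETRY_INTERVAL.toMillis());
                }
            }
        }
    }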

Issues Resolved

#5603

Is this a backport? If so, please add backport PR # and/or commits #, and remove backport-failed label from the original PR.

Do these changes introduce new permission(s) to be displayed in the static dropdown on the front-end? If so, please open a draft PR in the security dashboards plugin and link the draft PR here

Testing

Tested on a local cluster: security initialization is successful. Also induced timeout errors and verified that the error stack is present in the logs.

Check List

  • New functionality includes testing
  • New functionality has been documented
  • New Roles/Permissions have a corresponding security dashboards plugin PR
  • API changes companion pull request created
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Sep 14, 2025

Codecov Report

❌ Patch coverage is 25.92593% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.88%. Comparing base (259103a) to head (c0401fe).

Files with missing lines                                 Patch %   Lines
...ecurity/configuration/ConfigurationRepository.java    22.72%    15 Missing and 2 partials ⚠️
.../org/opensearch/security/support/ConfigHelper.java    25.00%    3 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5625      +/-   ##
==========================================
- Coverage   72.95%   72.88%   -0.07%     
==========================================
  Files         408      408              
  Lines       25386    25399      +13     
  Branches     3862     3863       +1     
==========================================
- Hits        18520    18512       -8     
- Misses       4991     5011      +20     
- Partials     1875     1876       +1     
Files with missing lines                                 Coverage Δ
...g/opensearch/security/support/ConfigConstants.java    95.45% <100.00%> (+0.21%) ⬆️
.../org/opensearch/security/support/ConfigHelper.java    83.33% <25.00%> (-4.91%) ⬇️
...ecurity/configuration/ConfigurationRepository.java    76.59% <22.72%> (-4.00%) ⬇️

... and 8 files with indirect coverage changes


nagarajg17 force-pushed the security-init-fixes branch 2 times, most recently from 6951689 to 2b611cf on September 16, 2025 15:50
shikharj05 (Collaborator)

Thanks, can you add tests to verify the behavior?

    break;
} catch (Exception e) {
-   LOGGER.debug("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
+   LOGGER.error("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));


Thanks, @nagarajg17, for the improvement.

Please keep the logging format consistent when logging the error, i.e.,

LOGGER.error("Unable to load configuration due to", e);

Collaborator

We are here in a retry loop, which is potentially executed every 3 seconds. Thus, we might log an error message every 3 seconds. Are we sure this won't spam the logs in some cases?

Contributor Author

The problem we see is that since this is logged at debug level, there are no debug logs to look at when we investigate initialization failures, which makes it hard to debug. Since it is in the catch block, do you see a case where it would get logged every 3 seconds? Ideally we don't expect the code flow to reach here, right?

Collaborator

I cannot really say if we expect control flow to reach this place. I guess this needs to be found out by testing.

    .actionGet();
} catch (Exception e) {
-   LOGGER.debug("Caught again a {} but we just try again ...", e.toString());
+   LOGGER.error("Caught again a {} but we just try again ...", e.toString());


Please keep the logging format consistent when logging the error, i.e.,

 LOGGER.error("Caught again but we just try again ...", e);

Collaborator

We are here in a retry loop

        while (response == null || response.isTimedOut() || response.getStatus() == ClusterHealthStatus.RED) {

That means that potentially we will log error messages repeatedly. Are we sure this won't be spamming the logs with bogus error messages in "good" cases?

Contributor Author

Should we use warn level? Since it's an infinite while loop, how do you suggest we get visibility when the code is stuck in this block?

Collaborator

Maybe: only start logging errors after being in this loop for more than 2 minutes (a minimal sketch of this idea follows after this comment)? I think one has to experiment a bit (with big clusters) to find a good value for this.

If you had issues with stuck threads: Did you collect thread dumps? That is usually a good way to see what's going on.
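
For illustration, a sketch of the escalate-after-a-quiet-period idea mentioned above; the helper class name, the 2-minute threshold, and the message text are assumptions, not part of the actual patch:

    import java.time.Duration;
    import java.time.Instant;

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public final class EscalatingRetryLogger {

        private static final Logger LOGGER = LogManager.getLogger(EscalatingRetryLogger.class);
        private static final Duration QUIET_PERIOD = Duration.ofMinutes(2);

        // Log quietly at debug while the retry loop is young; escalate to error once it has
        // been spinning longer than the quiet period.
        public static void logRetryFailure(Instant loopStart, Exception e) {
            Duration elapsed = Duration.between(loopStart, Instant.now());
            if (elapsed.compareTo(QUIET_PERIOD) > 0) {
                LOGGER.error("Still unable to load configuration after {}", elapsed, e);
            } else {
                LOGGER.debug("Unable to load configuration, retrying ...", e);
            }
        }
    }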


public class ConfigurationRepository implements ClusterStateListener, IndexEventListener {
    private static final Logger LOGGER = LogManager.getLogger(ConfigurationRepository.class);
    private static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;


It would be good to move these constants into src/main/java/org/opensearch/security/support/ConfigConstants.java (see the small sketch below).
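
Purely illustrative: a trimmed-down sketch of what that move could look like; whether the constant keeps this exact name or placement is an open assumption, and the existing ConfigConstants content is omitted here.

    public class ConfigConstants {
        // Retry interval for the security initialization loop, centralized here.
        public static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;
    }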

final IndexRequest indexRequest = new IndexRequest(index).id(configType)
    .opType(OpType.CREATE)
    .setRefreshPolicy(RefreshPolicy.IMMEDIATE)
+   .timeout(DEFAULT_TIMEOUT)


What happens when a cluster is large and overloaded, taking more than 30 seconds for indexing? Shouldn't the timeout value be configurable like a normal indexing API?

Contributor Author

What is the timeout for the normal indexing API?

    .actionGet();
} catch (Exception e) {
-   LOGGER.debug("Caught a {} but we just try again ...", e.toString());
+   LOGGER.error("Caught a {} but we just try again ...", e.toString());


Please keep the logging format consistent when logging the error, i.e.,

LOGGER.error("Caught exception but we just try again ...", e);

nibix (Collaborator) left a comment

Thank you for the work!

On a high level, could you explain the approach? It is mostly just making the init procedure in case of an absent security index time bound, correct?

Additional note: These changes only apply for the cases that the security index does not exist yet, i.e. a fresh cluster or a cluster where security is enabled at some point.

From the description of #5603, it was not exactly clear to me that this was only about this case. Are we also going to look at the case that the index does already exist or is this not a concern here?

} catch (Exception e) {
-   LOGGER.error("Cannot apply default config (this is maybe not an error!)", e);
+   LOGGER.error("Cannot apply default config", e);
+   throw e;
Collaborator

Could you elaborate on why it is ok to abort the process here by throwing an exception?

Contributor Author

The error is due to a failure in security index creation or default docs indexing (excluding the doc-already-exists error). Depending on the failure point, the config may be partially indexed or not indexed at all, so there is no real benefit in letting the code proceed when indexing itself is incomplete. Worse, it can leave the system in a partially initialized state where customer API calls receive 4xx/5xx errors, which isn't the desired behaviour.

Collaborator

From my experience, during the early node startup phase, OpenSearch can sometimes throw exceptions you do not see otherwise (something like "cluster state not initialized" or similar).

I cannot give you any concrete examples, but I would recommend testing this thoroughly to see whether it might trip over such exceptions.

Contributor Author

Got it. We can bake this change for a week and see if there are any unexpected regressions?

Collaborator

With "bake" do you mean comitting it to the repo and see if regressions show up in the nightlies of this or other repos? IMHO, that would be too little exposure to get an reliable impression. Or did you think of something else?


if (!dynamicConfigFactory.isInitialized()) {
    LOGGER.error("Node '{}' failed to initialize", clusterService.localNode().getName());
    throw new IllegalStateException(String.format("Node '%s' failed to initialize", clusterService.localNode().getName()));
Collaborator

This will be caught and logged in line 311:

        } catch (Exception e) {
            LOGGER.error("Unexpected exception while initializing node " + e, e);
        }

Thus, basically you will get the log message "Node '{}' failed to initialize" twice in a row.

Contributor Author

Ack, will remove line 303 where we are logging the first time (roughly as sketched below).
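
For illustration, a sketch of the assumed shape of that fix; the helper class and method names are made up for the example, and the single log line is left to the caller's existing catch block:

    final class InitCheckSketch {
        // Throw without logging here; the caller's catch block already logs
        // "Unexpected exception while initializing node ...", so the failure is reported once.
        static void ensureInitialized(boolean initialized, String nodeName) {
            if (!initialized) {
                throw new IllegalStateException(String.format("Node '%s' failed to initialize", nodeName));
            }
        }
    }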

public static final String OPENSEARCH_RESOURCE_SHARING_ENABLED = "plugins.security.experimental.resource_sharing.enabled";
public static final boolean OPENSEARCH_RESOURCE_SHARING_ENABLED_DEFAULT = false;

public static final TimeValue DEFAULT_TIMEOUT = TimeValue.timeValueSeconds(30);
Collaborator

I think we should not have a general-purpose global default timeout.

Timeout handling is tricky, thus we should look on a case-by-case basis at what timeouts are suitable. Initialization processes might need a much longer timeout than, for example, API calls triggered by users.

Contributor Author

Agree, I can remove this. The timeout currently used is for index creation and default docs indexing; do you see any scenario where that takes more than 30 seconds? What is your recommendation for an ideal timeout value here?

Collaborator

In the loop inside waitForSecurityIndexToBeAtLeastYellow() it does not matter so much IMHO, as it is retrying indefinitely. For the IndexRequest, one might consider increasing it to something more than the default of 1 minute, as reaching the timeout will be a fatal failure (a minimal sketch follows below). Yet, this is all only an "IMHO"; experience has to be collected by testing.
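
For illustration, a sketch of explicitly raising the indexing timeout above the 1-minute ReplicationRequest default; the 2-minute value and the helper class are assumptions for the example, not the actual patch:

    import org.opensearch.action.DocWriteRequest.OpType;
    import org.opensearch.action.index.IndexRequest;
    import org.opensearch.action.support.WriteRequest.RefreshPolicy;
    import org.opensearch.common.unit.TimeValue;

    final class IndexTimeoutSketch {
        // Build the security config index request with an explicit, longer timeout so an
        // overloaded cluster produces a visible failure instead of an indefinite hang.
        static IndexRequest defaultConfigRequest(String index, String configType) {
            return new IndexRequest(index).id(configType)
                .opType(OpType.CREATE)
                .setRefreshPolicy(RefreshPolicy.IMMEDIATE)
                .timeout(TimeValue.timeValueMinutes(2));
        }
    }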

response = client.admin()
    .cluster()
-   .health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus())
+   .health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus().timeout(DEFAULT_TIMEOUT))
Collaborator

ClusterHealthRequest and CreateIndexRequest already use 30s as a default value for timeouts.

Could you explain the rationale for setting the timeout value explicitly?

Contributor Author

We have seen a good number of cases where these operations get stuck, like below; in this case, during indexing of the 'rolesmapping' doc. Are we sure it is using the default value of 30 seconds? We have seen many such tickets.

3603:[2025-06-09T14:32:01,165][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'config' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/config.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4031:[2025-06-09T14:32:01,620][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'roles' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/roles.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4038:[2025-06-09T14:32:01,759][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'rolesmapping' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/roles_mapping.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false


4030:[2025-06-09T14:32:01,620][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Index .opendistro_security already contains doc with id config, skipping update.
4037:[2025-06-09T14:32:01,759][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Index .opendistro_security already contains doc with id roles, skipping update

Collaborator

The default timeout value is here:

https://github.com/opensearch-project/OpenSearch/blob/6fdb010ef91b075f4060a6d87da9beae681c2c37/server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthRequest.java#L65

I am not sure if I understand the logs correctly. Are these from two different nodes? Nothing follows after these logs?


Collaborator

Yes, 30 seconds is the default for ClusterHealthRequest. For indexing documents, the default is 1 minute:

https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/support/replication/ReplicationRequest.java


nagarajg17 (Contributor Author)

@nibix

> On a high level, could you explain the approach? It is mostly just making the init procedure in case of an absent security index time bound, correct?

Yes, making the init procedure time bound.

> Additional note: These changes only apply for the cases that the security index does not exist yet, i.e. a fresh cluster or a cluster where security is enabled at some point.

Yes, for a new cluster or when security is enabled at some point.

> From the description of #5603, it was not exactly clear to me that this was only about this case. Are we also going to look at the case that the index does already exist or is this not a concern here?

installDefaultConfig will still be true for a non-fresh cluster. When, for example, the ES process is restarted on a node, it goes through the same flow, but since the security index and docs are already present, it's a no-op.

nibix (Collaborator) commented Sep 26, 2025

> installDefaultConfig will still be true for a non-fresh cluster.

I guess this is the case for the managed service? From a general POV, this cannot be assumed IMHO.
