Security init improvements #5625
base: main
Conversation
Codecov Report ❌
Additional details and impacted files:
@@ Coverage Diff @@
## main #5625 +/- ##
==========================================
- Coverage 72.95% 72.88% -0.07%
==========================================
Files 408 408
Lines 25386 25399 +13
Branches 3862 3863 +1
==========================================
- Hits 18520 18512 -8
- Misses 4991 5011 +20
- Partials 1875 1876 +1
Force-pushed from 6951689 to 2b611cf
Signed-off-by: Nagaraj G <[email protected]>
Force-pushed from 2b611cf to f33bd8e
Thanks, can you add tests to verify the behavior?
break;
} catch (Exception e) {
-    LOGGER.debug("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
+    LOGGER.error("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
Thanks, @nagarajg17, for the improvement.
Please keep the logging format consistent when logging errors, i.e.,
LOGGER.error("Unable to load configuration due to", e);
We are here in a retry loop, which is potentially executed every 3 seconds. Thus, we might log an error message every 3 seconds. Are we sure this won't spam the logs in some cases?
The problem we see is that since this is logged at debug level, there won't be any debug logs available when we try to investigate initialization failures, which makes debugging hard. Since it is in the catch block, do you see it getting logged every 3 seconds? Ideally we don't expect the code flow to reach here, right?
I cannot really say if we expect control flow to reach this place. I guess this needs to be found out by testing.
.actionGet();
} catch (Exception e) {
-    LOGGER.debug("Caught again a {} but we just try again ...", e.toString());
+    LOGGER.error("Caught again a {} but we just try again ...", e.toString());
Please keep the logging format consistent when logging errors, i.e.,
LOGGER.error("Caught again but we just try again ...", e);
We are here in a retry loop
while (response == null || response.isTimedOut() || response.getStatus() == ClusterHealthStatus.RED) {
That means that potentially we will log error messages repeatedly. Are we sure this won't be spamming the logs with bogus error messages in "good" cases?
Should we use the warn level? Since it's an infinite while loop, how do you suggest we get visibility if the code is stuck in this block?
Maybe: Only start logging errors after being in this loop for more than 2 minutes? I think one has to experiment a bit (with big clusters) to find a good value for this.
If you had issues with stuck threads: Did you collect thread dumps? That is usually a good way to see what's going on.
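For illustration, here is a rough sketch of that grace-period idea; the 2-minute threshold, the class/interface names and the probe abstraction are placeholders, not a proposal for the final code:

import java.util.concurrent.TimeUnit;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Sketch only: log at debug while the retry loop is "young", escalate to error once we have
// been stuck longer than an (assumed) grace period. The probe stands in for the cluster
// health call inside waitForSecurityIndexToBeAtLeastYellow().
final class GracePeriodLoggingSketch {
    private static final Logger LOGGER = LogManager.getLogger(GracePeriodLoggingSketch.class);
    private static final long ERROR_LOG_GRACE_PERIOD_MS = TimeUnit.MINUTES.toMillis(2); // value to be tuned
    private static final long RETRY_INTERVAL_MS = 3000;

    interface HealthProbe {
        boolean securityIndexAtLeastYellow() throws Exception;
    }

    static void waitUntilYellow(HealthProbe probe) throws InterruptedException {
        final long start = System.nanoTime();
        boolean yellow = false;
        while (!yellow) {
            try {
                yellow = probe.securityIndexAtLeastYellow();
            } catch (Exception e) {
                final long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                if (elapsedMs > ERROR_LOG_GRACE_PERIOD_MS) {
                    // only now is it likely a real problem worth an error-level entry
                    LOGGER.error("Still waiting for the security index after {} ms", elapsedMs, e);
                } else {
                    LOGGER.debug("Caught a {} but we just try again ...", e.toString());
                }
            }
            if (!yellow) {
                Thread.sleep(RETRY_INTERVAL_MS);
            }
        }
    }
}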
public class ConfigurationRepository implements ClusterStateListener, IndexEventListener {
    private static final Logger LOGGER = LogManager.getLogger(ConfigurationRepository.class);
    private static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;
It would be good to move these constants into src/main/java/org/opensearch/security/support/ConfigConstants.java.
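Just as a sketch of what that could look like (hypothetical excerpt; the existing members of ConfigConstants are omitted and the final constant name is of course up to you):

// Hypothetical placement: the retry interval hoisted into the shared constants class,
// so ConfigurationRepository references it instead of defining its own copy.
public class ConfigConstants {
    public static final int SECURITY_INITIALIZATION_RETRY_INTERVAL_MS = 3000;
}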
final IndexRequest indexRequest = new IndexRequest(index).id(configType)
    .opType(OpType.CREATE)
    .setRefreshPolicy(RefreshPolicy.IMMEDIATE)
    .timeout(DEFAULT_TIMEOUT)
What happens when a cluster is large and overloaded, taking more than 30 seconds for indexing? Shouldn't the timeout value be configurable like a normal indexing API?
What is the timeout for the normal indexing API?
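If it does end up being made configurable, here is a minimal sketch of a node-scoped setting; the key, default and class name are assumptions, a real setting would still have to be registered by the plugin, and the TimeValue import path can vary between OpenSearch versions:

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Settings;
import org.opensearch.common.unit.TimeValue; // path may differ across OpenSearch versions

public final class SecurityInitSettingsSketch {
    // Hypothetical setting key; the 30-second default mirrors the PR's DEFAULT_TIMEOUT.
    public static final Setting<TimeValue> INIT_INDEX_TIMEOUT = Setting.timeSetting(
        "plugins.security.init.index_timeout",
        TimeValue.timeValueSeconds(30),
        Setting.Property.NodeScope
    );

    public static TimeValue initIndexTimeout(Settings settings) {
        return INIT_INDEX_TIMEOUT.get(settings);
    }

    private SecurityInitSettingsSketch() {}
}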
.actionGet();
} catch (Exception e) {
-    LOGGER.debug("Caught a {} but we just try again ...", e.toString());
+    LOGGER.error("Caught a {} but we just try again ...", e.toString());
Please keep the logging format consistent when logging errors, i.e.,
LOGGER.error("Caught exception but we just try again ...", e);
Thank you for the work!
On a high level, could you explain the approach? It is mostly just making the init procedure time bound in the case of an absent security index, correct?
Additional note: These changes only apply to the cases where the security index does not exist yet, i.e. a fresh cluster or a cluster where security is enabled at some point.
From the description of #5603, it was not exactly clear to me that this was only about this case. Are we also going to look at the case where the index already exists, or is this not a concern here?
} catch (Exception e) {
-    LOGGER.error("Cannot apply default config (this is maybe not an error!)", e);
+    LOGGER.error("Cannot apply default config", e);
+    throw e;
Could you elaborate on why it is ok to abort the process here by throwing an exception?
The error is due to a failure in security index creation or in indexing the default documents (excluding the 'doc already exists' error). Depending on the failure point, the configuration might be partially indexed or not indexed at all. There is no benefit in allowing the code to proceed when indexing itself is incomplete; worse, it can leave the system in a partially initialized state, and customer API calls can receive 4xx/5xx errors, which isn't desired behaviour.
From my experience, during the early node startup phase OpenSearch can sometimes throw exceptions you do not see otherwise (something like 'cluster state not initialized' or similar).
I cannot give you any concrete examples, but I would recommend testing thoroughly whether this might trip over such exceptions.
Got it. We can bake this change for a week and see if there are any unexpected regressions?
With "bake" do you mean comitting it to the repo and see if regressions show up in the nightlies of this or other repos? IMHO, that would be too little exposure to get an reliable impression. Or did you think of something else?
if (!dynamicConfigFactory.isInitialized()) {
    LOGGER.error("Node '{}' failed to initialize", clusterService.localNode().getName());
    throw new IllegalStateException(String.format("Node '%s' failed to initialize", clusterService.localNode().getName()));
This will be caught and logged in line 311:
} catch (Exception e) {
LOGGER.error("Unexpected exception while initializing node " + e, e);
}
Thus, basically you will get the log message "Node '{}' failed to initialize" twice in a row.
Ack, will remove line 303 where we are logging it the first time.
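For reference, a sketch of that follow-up, reusing the identifiers from the diff above; only the throw remains, since the outer catch block already logs the failure:

// Sketch: no LOGGER.error here anymore; the surrounding catch block
// ("Unexpected exception while initializing node ...") logs it once.
if (!dynamicConfigFactory.isInitialized()) {
    throw new IllegalStateException(
        String.format("Node '%s' failed to initialize", clusterService.localNode().getName())
    );
}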
public static final String OPENSEARCH_RESOURCE_SHARING_ENABLED = "plugins.security.experimental.resource_sharing.enabled";
public static final boolean OPENSEARCH_RESOURCE_SHARING_ENABLED_DEFAULT = false;

+    public static final TimeValue DEFAULT_TIMEOUT = TimeValue.timeValueSeconds(30);
I think we should not have a general-purpose global default timeout.
Timeout handling is tricky, so we should look on a case-by-case basis at what timeouts are suitable. Initialization processes might have a much longer timeout than, for example, API calls triggered by users.
Agreed, I can remove this. The timeout is currently used for index creation and indexing the default documents; do you see any scenario taking more than 30 seconds? What is your recommendation for an ideal timeout value here?
In the loop inside waitForSecurityIndexToBeAtLeastYellow() it does not matter so much IMHO, as it is retrying indefinitely. For the IndexRequest, one might consider increasing it to something more than the default of 1 minute, since reaching the timeout will be a fatal failure. Yet, this is all only an "IMHO"; experience has to be collected by testing.
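To make the case-by-case idea concrete, a sketch with two separate constants instead of one shared DEFAULT_TIMEOUT; the names and values are illustrative guesses that would need to be validated by testing:

import org.opensearch.common.unit.TimeValue; // path may differ across OpenSearch versions

final class InitTimeoutSketch {
    // Cluster-health polling inside waitForSecurityIndexToBeAtLeastYellow(): the surrounding
    // loop retries anyway, so a shorter per-request timeout is tolerable here.
    static final TimeValue CLUSTER_HEALTH_TIMEOUT = TimeValue.timeValueSeconds(30);

    // Indexing the default security documents: hitting this timeout is fatal for initialization,
    // so something above the 1-minute indexing default may be safer.
    static final TimeValue DEFAULT_CONFIG_INDEX_TIMEOUT = TimeValue.timeValueMinutes(2);

    private InitTimeoutSketch() {}
}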
response = client.admin()
    .cluster()
-    .health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus())
+    .health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus().timeout(DEFAULT_TIMEOUT))
ClusterHealthRequest and CreateIndexRequest already use 30s as a default value for timeouts.
Could you explain the rationale for setting the timeout value explicitly?
We have seen a good number of cases where these operations get stuck, like below, in this case during indexing of the 'rolesmapping' doc. Are we sure it has a default value of 30 seconds? We have seen many such tickets.
3603:[2025-06-09T14:32:01,165][INFO ][o.o.s.s.ConfigHelper ] [bd87625b46a927da71043d943ba615a0] Will update 'config' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/config.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4031:[2025-06-09T14:32:01,620][INFO ][o.o.s.s.ConfigHelper ] [bd87625b46a927da71043d943ba615a0] Will update 'roles' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/roles.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4038:[2025-06-09T14:32:01,759][INFO ][o.o.s.s.ConfigHelper ] [bd87625b46a927da71043d943ba615a0] Will update 'rolesmapping' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/roles_mapping.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4030:[2025-06-09T14:32:01,620][INFO ][o.o.s.s.ConfigHelper ] [bd87625b46a927da71043d943ba615a0] Index .opendistro_security already contains doc with id config, skipping update.
4037:[2025-06-09T14:32:01,759][INFO ][o.o.s.s.ConfigHelper ] [bd87625b46a927da71043d943ba615a0] Index .opendistro_security already contains doc with id roles, skipping update
The default timeout value is here:
I am not sure if I understand the logs correctly. Are these from two different nodes? Nothing follows after these logs?
No, the logs are from the same domain. Node id: bd87625b46a927da71043d943ba615a0
Is this for ClusterHealthRequest only?
https://github.com/opensearch-project/OpenSearch/blob/6fdb010ef91b075f4060a6d87da9beae681c2c37/server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthRequest.java#L65
Yes, 30 seconds is the default for ClusterHealthRequest. For indexing documents, the default is 1 minute.
Yes, it makes the init procedure time bound.
Yes, this is for a new cluster or a cluster where security is enabled at some point.
I guess this is the case for the managed service? From a general POV, this cannot be assumed IMHO.
Description
Category (Enhancement, New feature, Bug fix, Test fix, Refactoring, Maintenance, Documentation): Enhancement
Why are these changes required?
What is the old behavior before the changes and the new behavior after the changes?
Fatal errors, even if thrown, don't stop or fail the ES process; the node stays healthy with no security module initialized. This is because the security init process runs in a separate thread, so even if we throw fatal errors here instead of silently catching and ignoring them as we do currently, we can't fail the loading of the security plugin. However, throwing the error at least helps in understanding from the logs why initialization failed.
Adding timeouts and making the infinite loops time bound (5 minutes) makes the initialization process more predictable and reliable and increases its visibility. Improving the health check API, adding a watcher for it, and triggering an automatic cut-over and auto-remediation mechanism (ES process restart) when it fails will help in resolving this issue.
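A minimal sketch of the time-bounded retry idea described above; the 5-minute deadline, the Callable-based step and the class name are illustrative, only the 3-second retry interval comes from the diff:

import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

final class TimeBoundedInitSketch {
    private static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;                    // from the PR diff
    private static final long INITIALIZATION_DEADLINE_MS = TimeUnit.MINUTES.toMillis(5); // assumed bound

    // Retries an initialization step until it reports success or the deadline expires.
    // Throwing on expiry surfaces the failure in the init thread's logs instead of looping forever.
    static void runTimeBounded(Callable<Boolean> initStep) throws Exception {
        final long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(INITIALIZATION_DEADLINE_MS);
        while (System.nanoTime() < deadlineNanos) {
            try {
                if (Boolean.TRUE.equals(initStep.call())) {
                    return;
                }
            } catch (Exception e) {
                // swallow and retry until the deadline; the final failure is surfaced below
            }
            Thread.sleep(INITIALIZATION_RETRY_INTERVAL_MS);
        }
        throw new IllegalStateException("Security initialization did not complete within the deadline");
    }
}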
Issues Resolved
#5603
Is this a backport? If so, please add the backport PR # and/or commit #, and remove the backport-failed label from the original PR.
Do these changes introduce new permission(s) to be displayed in the static dropdown on the front-end? If so, please open a draft PR in the security dashboards plugin and link the draft PR here.
Testing
Tested in a local cluster that security initialization is successful; also induced timeout errors and verified that the error stack is present in the logs.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.