ConfigurationRepository.java
@@ -96,11 +96,15 @@
import org.opensearch.threadpool.ThreadPool;
import org.opensearch.transport.client.Client;

import static org.opensearch.security.support.ConfigConstants.DEFAULT_TIMEOUT;
import static org.opensearch.security.support.ConfigConstants.SECURITY_ALLOW_DEFAULT_INIT_USE_CLUSTER_STATE;
import static org.opensearch.security.support.SnapshotRestoreHelper.isSecurityIndexRestoredFromSnapshot;

public class ConfigurationRepository implements ClusterStateListener, IndexEventListener {
private static final Logger LOGGER = LogManager.getLogger(ConfigurationRepository.class);
private static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;

[Review comment] It would be good to move these constants into src/main/java/org/opensearch/security/support/ConfigConstants.java.
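
A sketch of what that move might look like; the constant names and values come from this diff, only the placement in ConfigConstants is illustrative:

    // In src/main/java/org/opensearch/security/support/ConfigConstants.java
    public static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;
    public static final int INITIALIZATION_MAX_DURATION_MS = 5 * 60 * 1000;
    // 300000 / 3000 = 100 retry attempts over at most 5 minutes
    public static final int MAX_RETRIES = INITIALIZATION_MAX_DURATION_MS / INITIALIZATION_RETRY_INTERVAL_MS;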

private static final int INITIALIZATION_MAX_DURATION_MS = 5 * 60 * 1000;
private static final int MAX_RETRIES = INITIALIZATION_MAX_DURATION_MS / INITIALIZATION_RETRY_INTERVAL_MS;

private final String securityIndex;
private final Client client;
@@ -273,26 +277,33 @@ private void initalizeClusterConfiguration(final boolean installDefaultConfig) {
LOGGER.error("{} does not exist", confFile.getAbsolutePath());
}
} catch (Exception e) {
LOGGER.error("Cannot apply default config (this is maybe not an error!)", e);
LOGGER.error("Cannot apply default config", e);
throw e;
[Collaborator] Could you elaborate on why it is OK to abort the process here by throwing an exception?

[Contributor Author] The error is due to a failure in security index creation or default document indexing (excluding the doc-already-exists error), so the configuration might be partially indexed or not indexed at all, depending on the failure point. There is no benefit in letting the code proceed when indexing itself is incomplete; worse, it can leave the system in a partially initialized state, and customer API calls can then receive 4xx/5xx errors, which isn't desired behaviour.

[Collaborator] From my experience, during the early node startup phase OpenSearch can sometimes throw exceptions you do not see otherwise (something like "cluster state not initialized" or similar). I cannot give you any concrete examples, but I would recommend testing thoroughly whether this might trip over such exceptions.

[Contributor Author] Got it. We can bake this change for a week and see if there is any unexpected regression?

[Collaborator] With "bake", do you mean committing it to the repo and seeing if regressions show up in the nightlies of this or other repos? IMHO, that would be too little exposure to get a reliable impression. Or did you have something else in mind?

}
}

while (!dynamicConfigFactory.isInitialized()) {
for (int retryCount = 0; retryCount < MAX_RETRIES && !dynamicConfigFactory.isInitialized(); retryCount++) {
try {
LOGGER.debug("Try to load config ...");
LOGGER.info("Try to load config ...");
reloadConfiguration(CType.values(), true);
break;
} catch (Exception e) {
LOGGER.debug("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
LOGGER.error("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));

[Review comment] Thanks, @nagarajg17, for the improvement. Please keep the logging format consistent when logging errors, i.e.:

LOGGER.error("Unable to load configuration due to", e);

[Collaborator] We are here in a retry loop, which is potentially executed every 3 seconds. Thus, we might log an error message every 3 seconds. Are we sure this won't spam the logs in some cases?

[Contributor Author] The problem we see is that, since these are debug logs, there is no log output to investigate initialization failures with, which makes them hard to debug. Since it is in the catch block, do you see it getting logged every 3 seconds? Ideally we don't expect the code flow to come here, right?

[Collaborator] I cannot really say if we expect control flow to reach this place. I guess this needs to be found out by testing.
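
For reference, a self-contained sketch of the bounded-retry pattern this hunk introduces; helper names like loadConfiguration() and initialized() are illustrative, not the plugin's actual methods:

    import java.util.concurrent.TimeUnit;

    // Retry every RETRY_INTERVAL_MS, giving up after MAX_RETRIES attempts,
    // i.e. 300000 / 3000 = 100 attempts over at most ~5 minutes.
    final int RETRY_INTERVAL_MS = 3000;
    final int MAX_RETRIES = (5 * 60 * 1000) / RETRY_INTERVAL_MS;

    for (int attempt = 0; attempt < MAX_RETRIES && !initialized(); attempt++) {
        try {
            loadConfiguration(); // may fail during early node startup
            break;
        } catch (Exception e) {
            LOGGER.error("Unable to load configuration", e); // full stack trace, as requested above
            try {
                TimeUnit.MILLISECONDS.sleep(RETRY_INTERVAL_MS);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // restore interrupt status, stop retrying
                break;
            }
        }
    }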

try {
TimeUnit.MILLISECONDS.sleep(3000);
TimeUnit.MILLISECONDS.sleep(INITIALIZATION_RETRY_INTERVAL_MS);
} catch (InterruptedException e1) {
Thread.currentThread().interrupt();
LOGGER.debug("Thread was interrupted so we cancel initialization");
LOGGER.error("Thread was interrupted so we cancel initialization");
break;
}
}
}

if (!dynamicConfigFactory.isInitialized()) {
LOGGER.error("Node '{}' failed to initialize", clusterService.localNode().getName());
throw new IllegalStateException(String.format("Node '%s' failed to initialize", clusterService.localNode().getName()));
[Collaborator] This will be caught and logged in line 311:

        } catch (Exception e) {
            LOGGER.error("Unexpected exception while initializing node " + e, e);
        }

Thus, you will basically get the log message "Node '{}' failed to initialize" twice in a row.

[Contributor Author] Ack, will remove line 303 where we log it the first time.

}

setupAuditConfigurationIfAny(cl.isAuditConfigDocPresentInIndex());
LOGGER.info("Node '{}' initialized", clusterService.localNode().getName());

@@ -325,7 +336,8 @@ private void setupAuditConfigurationIfAny(final boolean auditConfigDocPresent) {
private boolean createSecurityIndexIfAbsent() {
try {
final Map<String, Object> indexSettings = ImmutableMap.of("index.number_of_shards", 1, "index.auto_expand_replicas", "0-all");
final CreateIndexRequest createIndexRequest = new CreateIndexRequest(securityIndex).settings(indexSettings);
final CreateIndexRequest createIndexRequest = new CreateIndexRequest(securityIndex).timeout(DEFAULT_TIMEOUT)
.settings(indexSettings);
final boolean ok = client.admin().indices().create(createIndexRequest).actionGet().isAcknowledged();
LOGGER.info("Index {} created?: {}", securityIndex, ok);
return ok;
@@ -341,14 +353,14 @@ private void waitForSecurityIndexToBeAtLeastYellow() {
try {
response = client.admin()
.cluster()
.health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus())
.health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus().timeout(DEFAULT_TIMEOUT))
[Collaborator] ClusterHealthRequest and CreateIndexRequest already use 30s as the default value for timeouts. Could you explain the rationale for setting the timeout value explicitly?

[Contributor Author] We have seen a good number of cases where these operations get stuck, like the logs below; in this case during indexing of the 'rolesmapping' doc. Are we sure it uses a default value of 30 seconds? We have seen many such tickets.

3603:[2025-06-09T14:32:01,165][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'config' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/config.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4031:[2025-06-09T14:32:01,620][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'roles' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/roles.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false
4038:[2025-06-09T14:32:01,759][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Will update 'rolesmapping' with /apollo/_env/OS_2_19AMI-ES2-p001-d0db_OS_219/opensearch-security/roles_mapping.yml and populate it with empty doc if file missing and populateEmptyIfFileMissing=false

4030:[2025-06-09T14:32:01,620][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Index .opendistro_security already contains doc with id config, skipping update.
4037:[2025-06-09T14:32:01,759][INFO ][o.o.s.s.ConfigHelper     ] [bd87625b46a927da71043d943ba615a0] Index .opendistro_security already contains doc with id roles, skipping update

[Collaborator] The default timeout value is here:

https://github.com/opensearch-project/OpenSearch/blob/6fdb010ef91b075f4060a6d87da9beae681c2c37/server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthRequest.java#L65

I am not sure if I understand the logs correctly. Are these from two different nodes? Does nothing follow after these logs?

[Collaborator] Yes, 30 seconds is the default for ClusterHealthRequest. For indexing documents, the default is 1 minute:

https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/support/replication/ReplicationRequest.java

.actionGet();
} catch (Exception e) {
LOGGER.debug("Caught a {} but we just try again ...", e.toString());
LOGGER.error("Caught a {} but we just try again ...", e.toString());

[Review comment] Please keep the logging format consistent when logging errors, i.e.:

LOGGER.error("Caught exception but we just try again ...", e);

}

while (response == null || response.isTimedOut() || response.getStatus() == ClusterHealthStatus.RED) {
LOGGER.debug(
LOGGER.error(
"index '{}' not healthy yet, we try again ... (Reason: {})",
securityIndex,
response == null ? "no response" : (response.isTimedOut() ? "timeout" : "other, maybe red cluster")
@@ -360,9 +372,12 @@ private void waitForSecurityIndexToBeAtLeastYellow() {
Thread.currentThread().interrupt();
}
try {
response = client.admin().cluster().health(new ClusterHealthRequest(securityIndex).waitForYellowStatus()).actionGet();
response = client.admin()
.cluster()
.health(new ClusterHealthRequest(securityIndex).waitForYellowStatus().timeout(DEFAULT_TIMEOUT))
.actionGet();
} catch (Exception e) {
LOGGER.debug("Caught again a {} but we just try again ...", e.toString());
LOGGER.error("Caught again a {} but we just try again ...", e.toString());

[Review comment] Please keep the logging format consistent when logging errors, i.e.:

LOGGER.error("Caught again but we just try again ...", e);

[Collaborator] We are here in a retry loop:

        while (response == null || response.isTimedOut() || response.getStatus() == ClusterHealthStatus.RED) {

That means we will potentially log error messages repeatedly. Are we sure this won't spam the logs with bogus error messages in "good" cases?

[Contributor Author] Should we use the warn level instead? Since it is an infinite while loop, how do you suggest we get visibility when code is stuck in this block?

[Collaborator] Maybe only start logging errors after being in this loop for more than 2 minutes? I think one has to experiment a bit (with big clusters) to find a good value for this.

If you had issues with stuck threads: did you collect thread dumps? That is usually a good way to see what's going on.
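
A sketch of the escalation idea discussed here: keep the message at debug while the wait is young, and switch to error only once the loop has been stuck past a threshold. The 2-minute value and variable names are illustrative and would need tuning:

    // Escalate the log level only after the health wait has been stuck for a while.
    final long ESCALATE_AFTER_MS = 2 * 60 * 1000; // starting point only, tune on big clusters
    final long startMs = System.currentTimeMillis();

    while (response == null || response.isTimedOut() || response.getStatus() == ClusterHealthStatus.RED) {
        final long waitedMs = System.currentTimeMillis() - startMs;
        if (waitedMs > ESCALATE_AFTER_MS) {
            LOGGER.error("index '{}' still not healthy after {} ms, we try again ...", securityIndex, waitedMs);
        } else {
            LOGGER.debug("index '{}' not healthy yet, we try again ...", securityIndex);
        }
        // ... sleep, then re-issue the ClusterHealthRequest as in the surrounding method ...
    }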

}
}
}
src/main/java/org/opensearch/security/support/ConfigConstants.java
@@ -36,6 +36,7 @@
import com.google.common.collect.ImmutableSet;

import org.opensearch.common.settings.Settings;
import org.opensearch.common.unit.TimeValue;
import org.opensearch.security.auditlog.impl.AuditCategory;

import com.password4j.types.Hmac;
@@ -426,6 +427,8 @@ public enum RolesMappingResolution {
public static final String OPENSEARCH_RESOURCE_SHARING_ENABLED = "plugins.security.experimental.resource_sharing.enabled";
public static final boolean OPENSEARCH_RESOURCE_SHARING_ENABLED_DEFAULT = false;

public static final TimeValue DEFAULT_TIMEOUT = TimeValue.timeValueSeconds(30);
[Collaborator] I think we should not have a general-purpose global default timeout.

Timeout handling is tricky, so we should look on a case-by-case basis at what timeouts are suitable. Initialization processes might have a much longer timeout than, for example, API calls triggered by users.

[Contributor Author] Agree, I can remove this. The timeout is currently used for index creation and indexing of the default docs; do you see any scenario taking more than 30 seconds? What is your recommendation for an ideal timeout value here?

[Collaborator] In the loop inside waitForSecurityIndexToBeAtLeastYellow() it does not matter so much IMHO, as it retries indefinitely. For the IndexRequest, one might consider increasing it to something more than the default of 1 minute, as reaching the timeout would be a fatal failure. Yet this is all only an "IMHO"; experience has to be collected by testing.
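
A sketch of the case-by-case alternative suggested here: separate, purpose-named timeout constants instead of one shared DEFAULT_TIMEOUT. The names and values are illustrative only:

    // Index creation is quick in the normal case; 30s matches the transport default.
    public static final TimeValue SECURITY_INDEX_CREATE_TIMEOUT = TimeValue.timeValueSeconds(30);
    // Reaching the timeout while indexing the default config docs is fatal,
    // so give it headroom beyond the 1-minute ReplicationRequest default.
    public static final TimeValue SECURITY_CONFIG_INDEXING_TIMEOUT = TimeValue.timeValueMinutes(2);
    // The yellow-status poll retries indefinitely, so a short per-request timeout is fine.
    public static final TimeValue SECURITY_HEALTH_POLL_TIMEOUT = TimeValue.timeValueSeconds(30);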


public static Set<String> getSettingAsSet(
final Settings settings,
final String key,
ConfigHelper.java
@@ -56,6 +56,7 @@
import org.opensearch.transport.client.Client;

import static org.opensearch.core.xcontent.DeprecationHandler.THROW_UNSUPPORTED_OPERATION;
import static org.opensearch.security.support.ConfigConstants.DEFAULT_TIMEOUT;

@Deprecated
public class ConfigHelper {
@@ -95,6 +96,7 @@ public static void uploadFile(
final IndexRequest indexRequest = new IndexRequest(index).id(configType)
.opType(OpType.CREATE)
.setRefreshPolicy(RefreshPolicy.IMMEDIATE)
.timeout(DEFAULT_TIMEOUT)

[Review comment] What happens when a cluster is large and overloaded, taking more than 30 seconds for indexing? Shouldn't the timeout value be configurable, like for a normal indexing API?

[Contributor Author] What is the timeout for the normal indexing API?

.source(configType, readXContent(reader, XContentType.YAML));
final String res = tc.index(indexRequest).actionGet().getId();
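
One way the timeout could be made configurable, sketched with OpenSearch's Setting API; the setting key and default are made up for illustration and are not part of this PR:

    import org.opensearch.common.settings.Setting;
    import org.opensearch.common.unit.TimeValue;

    // Hypothetical node-scoped setting so operators can raise the timeout on
    // large or overloaded clusters without a code change.
    public static final Setting<TimeValue> SECURITY_CONFIG_UPLOAD_TIMEOUT = Setting.timeSetting(
        "plugins.security.config_upload.timeout", // illustrative key
        TimeValue.timeValueSeconds(30),           // illustrative default
        Setting.Property.NodeScope
    );

    // Usage: indexRequest.timeout(SECURITY_CONFIG_UPLOAD_TIMEOUT.get(settings));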
