Security init improvements #5625
File: ConfigurationRepository.java
@@ -96,11 +96,15 @@
 import org.opensearch.threadpool.ThreadPool;
 import org.opensearch.transport.client.Client;

+import static org.opensearch.security.support.ConfigConstants.DEFAULT_TIMEOUT;
 import static org.opensearch.security.support.ConfigConstants.SECURITY_ALLOW_DEFAULT_INIT_USE_CLUSTER_STATE;
 import static org.opensearch.security.support.SnapshotRestoreHelper.isSecurityIndexRestoredFromSnapshot;

 public class ConfigurationRepository implements ClusterStateListener, IndexEventListener {
     private static final Logger LOGGER = LogManager.getLogger(ConfigurationRepository.class);
+    private static final int INITIALIZATION_RETRY_INTERVAL_MS = 3000;
+    private static final int INITIALIZATION_MAX_DURATION_MS = 5 * 60 * 1000;
+    private static final int MAX_RETRIES = INITIALIZATION_MAX_DURATION_MS / INITIALIZATION_RETRY_INTERVAL_MS;

     private final String securityIndex;
     private final Client client;
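With these constants the initialization retry loop sleeps 3 seconds between attempts and gives up after at most 5 minutes, i.e., MAX_RETRIES works out to 300,000 / 3,000 = 100 attempts.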
@@ -273,26 +277,33 @@ private void initalizeClusterConfiguration(final boolean installDefaultConfig) {
                     LOGGER.error("{} does not exist", confFile.getAbsolutePath());
                 }
             } catch (Exception e) {
-                LOGGER.error("Cannot apply default config (this is maybe not an error!)", e);
+                LOGGER.error("Cannot apply default config", e);
+                throw e;
Review thread:
- Could you elaborate on why it is OK to abort the process here by throwing an exception?
- The error comes from a failure to create the security index or to index the default documents (excluding the "document already exists" error). Depending on where it fails, the configuration may be partially indexed or not indexed at all, so there is no benefit in letting the code continue when indexing is incomplete. Worse, it can leave the system in a partially initialized state, and customer API calls can then receive 4xx/5xx errors, which is not the desired behaviour.
- From my experience, during the early node startup phase OpenSearch can sometimes throw exceptions you do not see otherwise (something like "cluster state not initialized" or similar). I cannot give you any concrete examples, but I would recommend testing thoroughly whether this change trips over such exceptions.
- Got it. We can bake this change for a week and see if there is any unexpected regression?
- With "bake", do you mean committing it to the repo and seeing whether regressions show up in the nightlies of this or other repos? IMHO, that would be too little exposure to get a reliable impression. Or did you think of something else?
             }
         }

-        while (!dynamicConfigFactory.isInitialized()) {
+        for (int retryCount = 0; retryCount < MAX_RETRIES && !dynamicConfigFactory.isInitialized(); retryCount++) {
             try {
-                LOGGER.debug("Try to load config ...");
+                LOGGER.info("Try to load config ...");
                 reloadConfiguration(CType.values(), true);
                 break;
             } catch (Exception e) {
-                LOGGER.debug("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
+                LOGGER.error("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
Review comment: Thanks, @nagarajg17, for the improvement. Please keep the logging format consistent when logging errors.

Review thread:
- We are here in a retry loop that is potentially executed every 3 seconds, so we might log an error message every 3 seconds. Are we sure this won't spam the logs in some cases?
- The problem we see is that, since this is logged at debug level, there are no debug logs available when we investigate initialization failures, which makes debugging hard. Since it is in the catch block, do you really see it getting logged every 3 seconds? Ideally we don't expect the code flow to reach here, right?
- I cannot really say whether we expect control flow to reach this place. I guess this needs to be found out by testing.
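One way to reconcile visibility with log volume, sketched here purely as an illustration (the threshold is an arbitrary assumption, not part of this PR), is to keep early retries at debug level and escalate to error only after the loop has failed repeatedly, reusing `retryCount` from the new for-loop inside the existing catch block:

```java
// Sketch only: escalate the log level after repeated failures rather than
// logging an error on every 3-second retry. The threshold of 20 attempts
// (roughly one minute at the 3-second interval) is an arbitrary example value.
final int escalateAfterAttempts = 20;
if (retryCount < escalateAfterAttempts) {
    LOGGER.debug("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
} else {
    LOGGER.error("Unable to load configuration due to {}", String.valueOf(ExceptionUtils.getRootCause(e)));
}
```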
                 try {
-                    TimeUnit.MILLISECONDS.sleep(3000);
+                    TimeUnit.MILLISECONDS.sleep(INITIALIZATION_RETRY_INTERVAL_MS);
                 } catch (InterruptedException e1) {
                     Thread.currentThread().interrupt();
-                    LOGGER.debug("Thread was interrupted so we cancel initialization");
+                    LOGGER.error("Thread was interrupted so we cancel initialization");
                     break;
                 }
             }
         }

+        if (!dynamicConfigFactory.isInitialized()) {
+            LOGGER.error("Node '{}' failed to initialize", clusterService.localNode().getName());
+            throw new IllegalStateException(String.format("Node '%s' failed to initialize", clusterService.localNode().getName()));
Review thread:
- This will be caught and logged again in line 311, so you will basically get the same log message twice.
- Ack, will remove line 303 where we are logging the first time.
+        }

         setupAuditConfigurationIfAny(cl.isAuditConfigDocPresentInIndex());
         LOGGER.info("Node '{}' initialized", clusterService.localNode().getName());
@@ -325,7 +336,8 @@ private void setupAuditConfigurationIfAny(final boolean auditConfigDocPresent) {
     private boolean createSecurityIndexIfAbsent() {
         try {
             final Map<String, Object> indexSettings = ImmutableMap.of("index.number_of_shards", 1, "index.auto_expand_replicas", "0-all");
-            final CreateIndexRequest createIndexRequest = new CreateIndexRequest(securityIndex).settings(indexSettings);
+            final CreateIndexRequest createIndexRequest = new CreateIndexRequest(securityIndex).timeout(DEFAULT_TIMEOUT)
+                .settings(indexSettings);
             final boolean ok = client.admin().indices().create(createIndexRequest).actionGet().isAcknowledged();
             LOGGER.info("Index {} created?: {}", securityIndex, ok);
             return ok;
@@ -341,14 +353,14 @@ private void waitForSecurityIndexToBeAtLeastYellow() {
         try {
             response = client.admin()
                 .cluster()
-                .health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus())
+                .health(new ClusterHealthRequest(securityIndex).waitForActiveShards(1).waitForYellowStatus().timeout(DEFAULT_TIMEOUT))
Review thread:
- ClusterHealthRequest and CreateIndexRequest already use 30s as a default value for timeouts. Could you explain the rationale for setting the timeout value explicitly?
- We have seen a good number of times where these operations get stuck, like below; in this case during indexing of …
- The default timeout value is here: … I am not sure I understand the logs correctly. Are these from two different nodes? Does nothing follow after these logs?
- No, the logs are from the same domain (node id …). Is this for ClusterHealthRequest only?
- Yes, 30 seconds is the default for ClusterHealthRequest. For indexing documents, the default is 1 minute.
                 .actionGet();
         } catch (Exception e) {
-            LOGGER.debug("Caught a {} but we just try again ...", e.toString());
+            LOGGER.error("Caught a {} but we just try again ...", e.toString());
Review comment: Please keep the logging format consistent when logging errors.
         }

         while (response == null || response.isTimedOut() || response.getStatus() == ClusterHealthStatus.RED) {
-            LOGGER.debug(
+            LOGGER.error(
                 "index '{}' not healthy yet, we try again ... (Reason: {})",
                 securityIndex,
                 response == null ? "no response" : (response.isTimedOut() ? "timeout" : "other, maybe red cluster")
@@ -360,9 +372,12 @@ private void waitForSecurityIndexToBeAtLeastYellow() {
                 Thread.currentThread().interrupt();
             }
             try {
-                response = client.admin().cluster().health(new ClusterHealthRequest(securityIndex).waitForYellowStatus()).actionGet();
+                response = client.admin()
+                    .cluster()
+                    .health(new ClusterHealthRequest(securityIndex).waitForYellowStatus().timeout(DEFAULT_TIMEOUT))
+                    .actionGet();
             } catch (Exception e) {
-                LOGGER.debug("Caught again a {} but we just try again ...", e.toString());
+                LOGGER.error("Caught again a {} but we just try again ...", e.toString());
Review comment: Please keep the logging format consistent when logging errors.

Review thread:
- We are here in a retry loop, which means we will potentially log error messages repeatedly. Are we sure this won't spam the logs with bogus error messages in "good" cases?
- Should we use warn level instead? Since it is an infinite while loop, how do you suggest we get visibility if the code is stuck in this block?
- Maybe only start logging errors after being in this loop for more than 2 minutes? I think one has to experiment a bit (with big clusters) to find a good value for this. If you had issues with stuck threads: did you collect thread dumps? That is usually a good way to see what's going on.
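As a rough illustration of the "only start logging errors after some time" idea, and not code from this PR (the 2-minute value and the structure are assumptions), the health-check loop could track its start time and stay at debug level until a grace period has passed:

```java
// Sketch only: suppress error-level logging during an initial grace period,
// per the suggestion above. The 2-minute value is an example, not a final choice.
final long startNanos = System.nanoTime();
final long gracePeriodNanos = TimeUnit.MINUTES.toNanos(2);

while (response == null || response.isTimedOut() || response.getStatus() == ClusterHealthStatus.RED) {
    final String reason = response == null ? "no response" : (response.isTimedOut() ? "timeout" : "other, maybe red cluster");
    if (System.nanoTime() - startNanos > gracePeriodNanos) {
        LOGGER.error("index '{}' not healthy yet, we try again ... (Reason: {})", securityIndex, reason);
    } else {
        LOGGER.debug("index '{}' not healthy yet, we try again ... (Reason: {})", securityIndex, reason);
    }
    // ... existing sleep, retry, and health-request logic stays unchanged ...
}
```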
             }
         }
     }
File: ConfigConstants.java
@@ -36,6 +36,7 @@
 import com.google.common.collect.ImmutableSet;

 import org.opensearch.common.settings.Settings;
+import org.opensearch.common.unit.TimeValue;
 import org.opensearch.security.auditlog.impl.AuditCategory;

 import com.password4j.types.Hmac;
@@ -426,6 +427,8 @@ public enum RolesMappingResolution {
     public static final String OPENSEARCH_RESOURCE_SHARING_ENABLED = "plugins.security.experimental.resource_sharing.enabled";
     public static final boolean OPENSEARCH_RESOURCE_SHARING_ENABLED_DEFAULT = false;

+    public static final TimeValue DEFAULT_TIMEOUT = TimeValue.timeValueSeconds(30);
Review thread:
- I think we should not have a general-purpose global default timeout. Timeout handling is tricky, so we should look on a case-by-case basis at which timeouts are suitable. Initialization processes might warrant a much longer timeout than, for example, API calls triggered by users.
- Agree, I can remove this. The timeout is currently used for index creation and indexing the default docs; do you see any scenario taking more than 30 seconds? What is your recommendation for an ideal timeout value here?
- In the loop inside …
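A minimal sketch of the per-case approach described above (the constant names and values below are illustrative assumptions, not code from this PR): each call site gets its own named timeout, so initialization-time operations can be tuned independently of user-facing calls.

```java
// Sketch only: hypothetical per-operation timeouts instead of one global DEFAULT_TIMEOUT.
public static final TimeValue SECURITY_INDEX_CREATE_TIMEOUT = TimeValue.timeValueSeconds(30);
public static final TimeValue SECURITY_INDEX_HEALTH_CHECK_TIMEOUT = TimeValue.timeValueSeconds(30);
public static final TimeValue SECURITY_DEFAULT_CONFIG_INDEX_TIMEOUT = TimeValue.timeValueMinutes(1);
```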
     public static Set<String> getSettingAsSet(
         final Settings settings,
         final String key,
File: ConfigHelper.java
@@ -56,6 +56,7 @@
 import org.opensearch.transport.client.Client;

 import static org.opensearch.core.xcontent.DeprecationHandler.THROW_UNSUPPORTED_OPERATION;
+import static org.opensearch.security.support.ConfigConstants.DEFAULT_TIMEOUT;

 @Deprecated
 public class ConfigHelper {
@@ -95,6 +96,7 @@ public static void uploadFile(
         final IndexRequest indexRequest = new IndexRequest(index).id(configType)
             .opType(OpType.CREATE)
             .setRefreshPolicy(RefreshPolicy.IMMEDIATE)
+            .timeout(DEFAULT_TIMEOUT)
Review thread:
- What happens when a cluster is large and overloaded and indexing takes more than 30 seconds? Shouldn't the timeout value be configurable, like for a normal indexing API?
- What is the timeout for the normal indexing API?
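If the timeout were made configurable, one possible shape would be to read it from node settings with a fallback default. This is only a sketch: the setting key is hypothetical, and it assumes a Settings instance is available at this point in uploadFile.

```java
// Sketch only: "plugins.security.config_update.timeout" is a hypothetical setting key,
// and `settings` is assumed to be an org.opensearch.common.settings.Settings instance.
final TimeValue configIndexTimeout = settings.getAsTime(
    "plugins.security.config_update.timeout",
    TimeValue.timeValueSeconds(30)
);
final IndexRequest indexRequest = new IndexRequest(index).id(configType)
    .opType(OpType.CREATE)
    .setRefreshPolicy(RefreshPolicy.IMMEDIATE)
    .timeout(configIndexTimeout)
    .source(configType, readXContent(reader, XContentType.YAML));
```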
             .source(configType, readXContent(reader, XContentType.YAML));
         final String res = tc.index(indexRequest).actionGet().getId();
Review comment: It would be good to move these constants into src/main/java/org/opensearch/security/support/ConfigConstants.java.