@nielsbauman
Contributor

As of #133954, we clone indices before performing the force-merge step in the searchable_snapshot action. On slow CI servers, 10 seconds for the index to go through the whole searchable_snapshot action isn't enough, so we bump the timeout to 20 seconds.

I looked at the logs of a few test failures, and ILM was clearly still progressing when the test timed out. I didn't identify any particular step that was taking extraordinarily long; there were always just a few steps that took a bit longer. I would love to make these tests faster rather than bumping the timeout, but the searchable_snapshot action is simply one of the largest ILM actions and ILM itself isn't particularly fast.

That being said, if a timeout of 20 seconds proves to be insufficient (i.e. test failures come back), I do think it's worth looking at ways to reduce the runtime of the tests before we increase the timeout further.

Closes #137149
Closes #137151
Closes #137152
Closes #137153
Closes #137156
Closes #137166
Closes #137167
Closes #137192

@nielsbauman nielsbauman added >test Issues or PRs that are addressing/adding tests :Data Management/ILM+SLM Index and Snapshot lifecycle management auto-backport Automatically create backport pull requests when merged branch:9.2 labels Nov 3, 2025
@elasticsearchmachine elasticsearchmachine added v9.3.0 Team:Data Management Meta label for data/management team v9.2.1 labels Nov 3, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

Member

@PeteGillinElastic PeteGillinElastic left a comment

Thanks as always.

  assertOK(client().performRequest(restoreSnapshot));

- assertThat(indexExists(searchableSnapMountedIndexName), is(true));
+ awaitIndexExists(searchableSnapMountedIndexName);
Member

Just checking, this one doesn't need the extended timeout?

Contributor Author

Nope, this is just waiting for the index to be restored after the _restore API from a few lines before. That should definitely not take more than 10 seconds. Thanks for checking!
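For readers unfamiliar with this kind of wait, here is a minimal sketch of a polling helper with a caller-supplied timeout. It only illustrates the trade-off discussed in this PR (a default timeout for quick operations, a longer one for slow multi-step work). `AwaitSketch`, `await`, and `readyAfter` are hypothetical names invented for this example; this is not the Elasticsearch test framework's `awaitIndexExists` implementation.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a polling wait; not the actual Elasticsearch test helper. */
public final class AwaitSketch {

    private AwaitSketch() {}

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met in time, false otherwise.
     */
    public static boolean await(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            try {
                // Sleep for the poll interval, but never past the deadline.
                long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
                Thread.sleep(Math.min(pollInterval.toMillis(), remainingMillis));
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Simulates a slow step: the returned condition becomes true once the delay has passed. */
    public static BooleanSupplier readyAfter(Duration delay) {
        long start = System.nanoTime();
        return () -> System.nanoTime() - start >= delay.toNanos();
    }

    public static void main(String[] args) {
        Duration poll = Duration.ofMillis(20);
        // A step that needs ~300 ms does not finish within a 100 ms timeout...
        System.out.println(await(readyAfter(Duration.ofMillis(300)), Duration.ofMillis(100), poll));
        // ...but does finish when the timeout is raised to 2 s.
        System.out.println(await(readyAfter(Duration.ofMillis(300)), Duration.ofSeconds(2), poll));
    }
}
```

A longer timeout costs nothing when the condition is met early, since the loop returns as soon as the check passes; it only delays the failure report when the condition is never met.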

  Map<String, Phase> phases = new HashMap<>();
  phases.put("cold", new Phase("cold", TimeValue.ZERO, coldActions));
- phases.put("delete", new Phase("delete", TimeValue.timeValueMillis(10000), Map.of(DeleteAction.NAME, WITH_SNAPSHOT_DELETE)));
+ phases.put("delete", new Phase("delete", TimeValue.ZERO, Map.of(DeleteAction.NAME, WITH_SNAPSHOT_DELETE)));
Contributor Author

By the way, for the record, I changed this value from 10s to 0s because there is no point in waiting 10 seconds before we delete the searchable snapshot index; we can just delete it immediately without making this test flakier or less valuable.

@nielsbauman nielsbauman enabled auto-merge (squash) November 3, 2025 12:44
@nielsbauman nielsbauman disabled auto-merge November 3, 2025 12:44
@nielsbauman nielsbauman enabled auto-merge (squash) November 3, 2025 12:44
@nielsbauman nielsbauman merged commit 60b89a8 into elastic:main Nov 3, 2025
34 of 35 checks passed
@nielsbauman nielsbauman deleted the fix-searchable-snapshot-tests branch November 3, 2025 14:35
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
9.2 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 137514

@nielsbauman
Contributor Author

💚 All backports created successfully

Status Branch Result
9.2

Questions ?

Please refer to the Backport tool documentation

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Nov 3, 2025
(cherry picked from commit 60b89a8)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this pull request Nov 3, 2025
…7524)

(cherry picked from commit 60b89a8)

# Conflicts:
#	muted-tests.yml