Fix inner hits + aggregations concurrency bug #128036
Open
benchaplin wants to merge 38 commits into elastic:main from benchaplin:inner_hits_aggs_concurrency_bug
+151 −1
Conversation
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @benchaplin, I've created a changelog YAML for you.
…tionTests testSimpleWithCranky elastic#128024
It's possible for another component to request a S3 client after the node has started to shut down, and today the `S3Service` will dutifully attempt to create a fresh client instance even if it is closed. Such clients will then leak, resulting in test failures. With this commit we refuse to create new S3 clients once the service has started to shut down.
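As an illustration of the pattern this commit describes, here is a minimal sketch in plain Java. The names (`ClientService`, `client`) are hypothetical stand-ins for the real `S3Service`, whose details differ: once the service has started to shut down, requests for new clients fail fast instead of creating instances that would leak.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for S3Service: after close() has been called,
// client(...) refuses to build fresh clients rather than leaking them.
final class ClientService implements AutoCloseable {
    private final Map<String, AutoCloseable> clients = new HashMap<>();
    private boolean closed = false;

    synchronized AutoCloseable client(String name) {
        if (closed) {
            // Fail fast: a client created now would never be released.
            throw new IllegalStateException("service is closed");
        }
        return clients.computeIfAbsent(name, n -> newClient());
    }

    private AutoCloseable newClient() {
        return () -> {}; // stand-in for constructing a real SDK client
    }

    @Override
    public synchronized void close() {
        closed = true;
        for (AutoCloseable client : clients.values()) {
            try {
                client.close();
            } catch (Exception e) {
                // a real implementation would log and continue
            }
        }
        clients.clear();
    }
}
```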
This changes the default value of both the `data_streams.auto_sharding.increase_shards.load_metric` and `data_streams.auto_sharding.decrease_shards.load_metric` cluster settings from `PEAK` to `ALL_TIME`. This setting has been applied via config for several weeks now. The approach taken to updating the tests was to swap the values given for the all-time and peak loads in all the stats objects provided as input to the tests, and to swap the enum values in the couple of places they appear.
…up-join.MvJoinKeyOnTheLookupIndex ASYNC} elastic#128030
Fixes the final assertion in testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores() that wasn't addressed with the fix in PR 127615 for issue 127286. Closes elastic#127787
Co-authored-by: David Turner <[email protected]>
Today there are various mechanisms to prevent writes to readonly repositories, but they are scattered across the snapshot codebase and do not obviously prevent writes in all possible circumstances; it'd be easy to add a new operation on a repository that does not check the readonly flag in quite the right way. This commit adds much tighter checks which cannot be circumvented, as sketched below:
- Do not allow an update of the root `index-N` blob to start if the repository is marked as readonly in the cluster state.
- Conversely, do not allow the readonly flag to be set if an update of the root `index-N` blob is in progress.
- Establish the invariant that we never create a `SnapshotsInProgress$Entry`, `SnapshotDeletionsInProgress$Entry`, or `RepositoryCleanupInProgress$Entry` if the repository is marked as readonly in the cluster state.
Closes elastic#93575
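In the actual change these invariants are enforced through cluster state updates; purely as an illustration, a hypothetical guard class can capture the two-way exclusion between the readonly flag and root `index-N` updates:

```java
// Hypothetical illustration of the invariant above: a root index-N update
// cannot begin while the repository is readonly, and the readonly flag
// cannot be set while such an update is in progress.
final class ReadonlyGuard {
    private boolean readonly = false;
    private int rootUpdatesInProgress = 0;

    // Refuse to start an index-N update on a readonly repository.
    synchronized void beginRootUpdate() {
        if (readonly) {
            throw new IllegalStateException("repository is readonly");
        }
        rootUpdatesInProgress++;
    }

    synchronized void endRootUpdate() {
        assert rootUpdatesInProgress > 0;
        rootUpdatesInProgress--;
    }

    // Conversely, refuse to mark the repository readonly mid-update.
    synchronized void markReadonly() {
        if (rootUpdatesInProgress > 0) {
            throw new IllegalStateException("root index-N update in progress");
        }
        readonly = true;
    }
}
```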
Co-authored-by: Elastic Machine <[email protected]>
…er() (elastic#127998) Relaxes a log expectation assertion from expecting exactly numShards running shard snapshots to expecting between 1 and numShards, since it is possible for some of the shard snapshot statuses to already be in stage=PAUSED. Closes elastic#127690
…)" (elastic#128005) (elastic#128051) This reverts commit 8a17a5e. reapplying ivf format, but with a fix.
After updating Develocity to 2015.1.3, we can also update the corresponding plugin.
In Serverless tests, we sometimes hit rounding errors because even single-node tests are executed on 3 nodes there. Rounding makes this test deterministic.
…rentUserAndGroupWithoutBindMounting elastic#128044
This entitlement is required, but only if validating the metadata endpoint against `https://login.microsoft.com/`, which isn't something we can do in a test. Kind of an SDK bug: we should be using an existing event loop rather than spawning threads randomly like this.
…stic#128003) (elastic#128052)
* Reapply "Adds unused lower level ivf knn query (elastic#127852)" (elastic#128003) This reverts commit 648d74b.
* Fixing tests
javanna reviewed May 19, 2025
assertTrue(source1.containsKey("message")); | ||
assertTrue(source1.containsKey("reviewers")); | ||
}); | ||
} |
This is a great reproducer! Should we have one for the parent/child sub context too? Sounds like that may be affected by the same bug?
Labels: >bug, :Search Foundations/Search, Team:Search Foundations, v8.19.0, v9.1.0
Resolves #122419.
There's a concurrency bug that occurs when doing aggregations on inner hits. It can result in one of three exceptions:
java.lang.IllegalStateException: Error retrieving path
java.lang.NullPointerException: Cannot invoke "java.util.Map.get(Object)" because "this.preloadedStoredFieldValues" is null
java.lang.AssertionError: invalid decRef call: already closed
The underlying issue is that `InnerHitSubContext` is not thread safe, yet instances are shared across leaf slice search threads during an aggregation. Specifically, the race condition occurs when the `InnerHitSubContext.rootId` and `InnerHitSubContext.rootSource` fields are set and accessed concurrently by multiple threads.

The tests I've added to `TopHitsIT` reproduce the issue. If you paste those tests into main and run them a few times, you should see one of the exceptions.

I've solved this by forking the `InnerHitSubContext` instances, similar to what was done in #106990. `SearchExecutionContext` is at times accessed from `InnerHitSubContext`, so I also had to make sure the forked `SearchExecutionContext` was used in those cases.
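To make the race and the shape of the fix concrete, here is a self-contained sketch in plain Java. `SubContext`, `forkForSlice`, and `loadRootId` are hypothetical stand-ins for `InnerHitSubContext`, the forking introduced in this PR, and the stored-fields lookup; the real change also threads a forked `SearchExecutionContext` through the fork, which this sketch omits.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for InnerHitSubContext. The lazily-populated
// rootId field is only safe if a single thread owns the instance.
class SubContext {
    private String rootId; // shared across slice threads -> data race

    String rootId(int docId) {
        if (rootId == null) {           // two threads can both observe null,
            rootId = loadRootId(docId); // then overwrite each other's value
        }
        return rootId;
    }

    // The fix: hand each concurrent leaf slice its own copy, so the mutable
    // state is confined to one thread (the real fork also carries a forked
    // SearchExecutionContext, since the sub-context sometimes reads from it).
    SubContext forkForSlice() {
        return new SubContext();
    }

    private String loadRootId(int docId) {
        return "root-" + docId; // stand-in for fetching the root document id
    }

    public static void main(String[] args) {
        SubContext shared = new SubContext();
        ExecutorService slices = Executors.newFixedThreadPool(4);
        for (int slice = 0; slice < 4; slice++) {
            SubContext perSlice = shared.forkForSlice(); // one fork per slice
            slices.submit(() -> perSlice.rootId(42));
        }
        slices.shutdown();
    }
}
```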