Skip to content

Fix inner hits + aggregations concurrency bug #128036

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 38 commits into
base: main
Choose a base branch
from

Conversation

benchaplin
Copy link
Contributor

Resolves #122419.

There's a concurrency bug that occurs when doing aggregations on inner hits. It can result in one of three exceptions:

  • java.lang.IllegalStateException: Error retrieving path
  • java.lang.NullPointerException: Cannot invoke "java.util.Map.get(Object)" because "this.preloadedStoredFieldValues" is null
  • java.lang.AssertionError: invalid decRef call: already closed

The underlying issue is that InnerHitSubContext is not thread safe, yet instances are shared across leaf slice search threads during an aggregation. Specifically, the race condition occurs when InnerHitSubContext.rootId & InnerHitSubContext.rootSource fields are set and accessed concurrently by multiple threads.

The tests I've added to TopHitsIT reproduce the issue. If you paste those tests into main and run them a few times you should see one of the exceptions.

I've solved this by forking the InnerHitSubContext instances, similar to what was done here #106990. SearchExecutionContext is at times accessed from InnerHitSubContext, so I also had to make sure the forked SearchExecutionContext was used in those cases.

@benchaplin benchaplin requested a review from javanna May 12, 2025 19:52
@benchaplin benchaplin added >bug Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v8.19.0 v9.1.0 labels May 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Copy link
Collaborator

Hi @benchaplin, I've created a changelog YAML for you.

elasticsearchmachine and others added 2 commits May 14, 2025 08:50
It's possible for another component to request a S3 client after the
node has started to shut down, and today the `S3Service` will dutifully
attempt to create a fresh client instance even if it is closed. Such
clients will then leak, resulting in test failures.

With this commit we refuse to create new S3 clients once the service has
started to shut down.
PeteGillinElastic and others added 16 commits May 14, 2025 08:50
This changes the default value of both the
`data_streams.auto_sharding.increase_shards.load_metric` and
`data_streams.auto_sharding.decrease_shards.load_metric` cluster
settings from `PEAK` to `ALL_TIME`. This setting has been applied via
config for several weeks now.

The approach taken to updating the tests was to swap the values given for the all-time and peak loads in all the stats objects provided as input to the tests, and to swap the enum values in the couple of places they appear.
Fixes the final assertion in
testRestoreSnapshotAllocationDoesNotExceedWatermarkWithMultipleRestores()
that wasn't addressed with the fix in PR 127615 for issue 127286.

Closes elastic#127787

Co-authored-by: David Turner <[email protected]>
Today there are various mechanisms to prevent writes to readonly
repositories, but they are scattered across the snapshot codebase and do
not obviously prevent writes in all possible circumstances; it'd be easy
to add a new operation on a repository that does not check the readonly
flag in quite the right way.

This commit adds much tighter checks which cannot be circumvented:

- Do not allow to start an update of the root `index-N` blob if the
  repository is marked as readonly in the cluster state.

- Conversely, do not allow the readonly flag to be set if an update of
  the root `index-N` blob is in progress.

- Establish the invariant that we never create a
  `SnapshotsInProgress$Entry`, `SnapshotDeletionsInProgress$Entry`, or
  `RepositoryCleanupInProgress$Entry` if the repository is marked as
  readonly in the cluster state.

Closes elastic#93575
…er() (elastic#127998)

Relaxes a log expectation assertion from an exact test
of numShards running snapshots to 1-numShards, since it
is possible for some of the shard snapshot statuses to
already be in stage=PAUSED.

Closes elastic#127690
After updating Develocity to 2015.1.3 we can also update the according plugin
In Serverless tests, we sometimes hit rounding errors because even
single node tests are executed on 3 nodes there. Rounding makes this
test deterministic.
This entitlement is required, but only if validating the metadata
endpoint against `https://login.microsoft.com/` which isn't something we
can do in a test. Kind of a SDK bug, we should be using an existing
event loop rather than spawning threads randomly like this.
…stic#128003) (elastic#128052)

* Reapply "Adds unused lower level ivf knn query (elastic#127852)" (elastic#128003)

This reverts commit 648d74b.

* Fixing tests
assertTrue(source1.containsKey("message"));
assertTrue(source1.containsKey("reviewers"));
});
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a great reproducer! Should we have one for the parent child sub context too? Sounds like that may be affected by the same bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IllegalStateException during search: "Error retrieving path ..."