@anandpatel9998 anandpatel9998 commented Oct 4, 2025

Description

The current cardinality aggregator logic selects DirectCollector over OrdinalsCollector when the relative memory overhead of OrdinalsCollector (compared to DirectCollector) is higher. Because of this relative memory comparison, DirectCollector is selected for high-cardinality aggregation queries, even though DirectCollector is slower than OrdinalsCollector. This default selection leads to higher search latency even when the OpenSearch process has enough available memory to use OrdinalsCollector for faster query performance.

There is no way to determine the memory requirement of a nested aggregation up front, because buckets are created dynamically as we traverse the matching document IDs. To overcome this limitation, this change creates a hybrid collector that starts with OrdinalsCollector and switches to DirectCollector if the memory usage of OrdinalsCollector grows beyond a certain threshold. When the hybrid collector switches from OrdinalsCollector to DirectCollector, it reuses the aggregation data already computed by OrdinalsCollector, so the result does not have to be rebuilt with DirectCollector.
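
As a rough illustration of the switching idea described above (not the code in this PR), the sketch below starts in an ordinals-style phase backed by a bit set, tracks an approximate memory figure, and on crossing a threshold replays the ordinals seen so far into a direct-style phase. Everything in it is an assumption made for illustration: the class name `HybridCollectorSketch`, the `thresholdBytes` parameter, the byte estimate, and the `HashSet` standing in for the hash-based sketch that a real direct collector would feed.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: hypothetical names, not the classes touched by this PR. */
final class HybridCollectorSketch {
    private final String[] ordToValue;   // stand-in for a segment's ordinal -> term lookup
    private final long thresholdBytes;   // hypothetical memory budget for the ordinals phase
    private BitSet visitedOrds = new BitSet();                 // ordinals phase state; null after the switch
    private final Set<String> directValues = new HashSet<>();  // stand-in for the direct phase state

    HybridCollectorSketch(String[] ordToValue, long thresholdBytes) {
        this.ordToValue = ordToValue;
        this.thresholdBytes = thresholdBytes;
    }

    /** Records one ordinal, switching phases if the ordinals phase outgrows its budget. */
    void collect(int ord) {
        if (visitedOrds == null) {
            // Already switched: resolve the value and record it directly.
            directValues.add(ordToValue[ord]);
            return;
        }
        visitedOrds.set(ord);
        // Rough estimate: bytes needed to cover bits up to the highest ordinal seen.
        long estimatedBytes = (visitedOrds.length() + 7) / 8;
        if (estimatedBytes > thresholdBytes) {
            switchToDirect();
        }
    }

    /** Replays ordinals collected so far into the direct phase, so nothing is recomputed. */
    private void switchToDirect() {
        for (int ord = visitedOrds.nextSetBit(0); ord >= 0; ord = visitedOrds.nextSetBit(ord + 1)) {
            directValues.add(ordToValue[ord]);
        }
        visitedOrds = null; // release the ordinals-phase memory
    }

    boolean usingOrdinals() {
        return visitedOrds != null;
    }

    long cardinality() {
        return visitedOrds != null ? visitedOrds.cardinality() : directValues.size();
    }
}
```

The point of the replay step is that values seen before the switch are carried over rather than collected again; the exact count here is only a property of the stand-in set, not of the real aggregator.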

Signed-off-by: Anand Pravinbhai Patel [email protected]

Related Issues

Resolves #19260

Check List

  • [ Done ] Functionality includes testing.
  • [ Not Applicable ] API changes companion pull request created, if applicable.
  • [ Is it required ? ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

github-actions bot commented Oct 4, 2025

❌ Gradle check result for a2f5dd7: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Oct 4, 2025

❌ Gradle check result for 41a9e69: FAILURE

Contributor

github-actions bot commented Oct 5, 2025

❌ Gradle check result for c142ac4: null

Contributor

github-actions bot commented Oct 6, 2025

❌ Gradle check result for 88989f3: FAILURE

Contributor

github-actions bot commented Oct 6, 2025

❌ Gradle check result for c142ac4: FAILURE

Contributor

github-actions bot commented Oct 6, 2025

❌ Gradle check result for 06ce5c3: null

Contributor

github-actions bot commented Oct 7, 2025

❌ Gradle check result for fc328a2: FAILURE

@anandpatel9998
Author

Thanks for the suggestion @owaiskazi19

I am wondering whether that will help, since the test may still fail if one node is running without the latest commit's changes. Can you help me understand how the mixed-cluster tests execute?

@owaiskazi19
Member

owaiskazi19 commented Oct 7, 2025

Mixed-cluster tests run against mixed-version clusters to ensure that newer versions interoperate correctly with older nodes. The :qa:mixed-cluster task spins up a test cluster composed of different versions (old and new nodes), and the tests then validate behavior across upgrades or during rolling restarts.
There is also a blog post on the BWC framework: https://opensearch.org/blog/bwc-testing-for-opensearch/
You can also try conditional matching:

- is_one_of: 
    profile.shards.0.aggregations.0.debug.ordinals_collectors_used: [0, 1]

Contributor

github-actions bot commented Oct 7, 2025

❕ Gradle check result for 4ee0fd1: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Oct 7, 2025

Codecov Report

❌ Patch coverage is 87.27273% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.06%. Comparing base (39b7a59) to head (a34c044).
⚠️ Report is 11 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ch/aggregations/metrics/CardinalityAggregator.java | 92.68% | 1 Missing and 2 partials ⚠️ |
| ...va/org/opensearch/search/DefaultSearchContext.java | 83.33% | 2 Missing ⚠️ |
| .../org/opensearch/search/internal/SearchContext.java | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19524      +/-   ##
============================================
+ Coverage     73.00%   73.06%   +0.05%     
+ Complexity    70534    70522      -12     
============================================
  Files          5719     5719              
  Lines        323260   323310      +50     
  Branches      46816    46818       +2     
============================================
+ Hits         235993   236217     +224     
+ Misses        68224    67995     -229     
- Partials      19043    19098      +55     

@anandpatel9998
Author

Thanks @owaiskazi19 for your suggestions. Adding a skip filter fixed the mixed-cluster tests.

Contributor

github-actions bot commented Oct 8, 2025

❕ Gradle check result for 522a92b: UNSTABLE

Contributor

github-actions bot commented Oct 8, 2025

❌ Gradle check result for 6375b70: null

Contributor

github-actions bot commented Oct 8, 2025

❌ Gradle check result for b666de2: FAILURE

Contributor

github-actions bot commented Oct 8, 2025

❌ Gradle check result for e9e7fe0: FAILURE

Contributor

github-actions bot commented Oct 9, 2025

❌ Gradle check result for 5848513: FAILURE

Contributor

github-actions bot commented Oct 9, 2025

✅ Gradle check result for a34c044: SUCCESS

Contributor

❌ Gradle check result for 871ff0a: FAILURE

bits = new BitArray(maxOrd, bigArrays);
visitedOrds.set(bucketOrd, bits);
// Update memory usage when new BitArray is created
currentMemoryUsage += memoryOverhead(maxOrd);
Contributor

@rishabhmaurya rishabhmaurya Oct 11, 2025

Can we maintain a flag here that is set when the memory limit is breached, and use it in the hybrid collector check?
Maybe switch the active collector here too? That may avoid an additional check on each collect call.

Another way could be to throw a special exception here, catch it in the hybrid collector, and switch the collector to direct.
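
One possible reading of the flag-and-switch suggestion, as a minimal sketch rather than a proposal for the actual classes: the ordinals side detects the breach at allocation time, records a flag, and invokes a callback that swaps the active collector, so the hybrid's `collect` only delegates. All names (`SwitchOnBreachSketch`, `OrdinalsSide`, `DirectSide`, `onLimitBreached`, `bytesPerBucket`) are hypothetical, and the handoff of already collected data is left out to keep the switching mechanism visible.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: every name here is hypothetical and does not mirror the PR's classes. */
final class SwitchOnBreachSketch {
    interface BucketCollector {
        void collect(int bucketOrd);
    }

    /** Ordinals-style side: memory grows only when a new bucket is first seen. */
    static final class OrdinalsSide implements BucketCollector {
        private final long limitBytes;
        private final long bytesPerBucket;
        private final Runnable onLimitBreached;
        private final Set<Integer> allocated = new HashSet<>();
        private long usedBytes;
        private boolean limitBreached;

        OrdinalsSide(long limitBytes, long bytesPerBucket, Runnable onLimitBreached) {
            this.limitBytes = limitBytes;
            this.bytesPerBucket = bytesPerBucket;
            this.onLimitBreached = onLimitBreached;
        }

        @Override
        public void collect(int bucketOrd) {
            if (allocated.add(bucketOrd)) {
                // The limit is checked only on allocation, not on every collect call.
                usedBytes += bytesPerBucket;
                if (!limitBreached && usedBytes > limitBytes) {
                    limitBreached = true;
                    onLimitBreached.run();
                }
            }
        }
    }

    /** Direct-style side: here it only counts calls, as a placeholder. */
    static final class DirectSide implements BucketCollector {
        private int collected;

        @Override
        public void collect(int bucketOrd) {
            collected++;
        }
    }

    private final DirectSide direct = new DirectSide();
    private final OrdinalsSide ordinals;
    private BucketCollector active;

    SwitchOnBreachSketch(long limitBytes, long bytesPerBucket) {
        // The callback swaps the active collector the moment the limit is breached.
        this.ordinals = new OrdinalsSide(limitBytes, bytesPerBucket, () -> active = direct);
        this.active = ordinals;
    }

    /** No memory check here: the hybrid simply delegates to whichever side is active. */
    void collect(int bucketOrd) {
        active.collect(bucketOrd);
    }

    boolean usingDirect() {
        return active == direct;
    }

    int directCollected() {
        return direct.collected;
    }
}
```

The exception-based alternative would move the swap into a catch block in the hybrid instead of a callback; either way the per-call path stays free of memory arithmetic.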

return 0;
}

private boolean evaluateCardinalityAggregationHybridCollectorEnabled() {
Contributor

Any reason for placing these methods in SearchContext?

@rishabhmaurya
Contributor

rishabhmaurya commented Oct 11, 2025

@anandpatel9998 I have added a couple of comments; the changes mostly look good. Thanks for working on it.
Did you happen to run a benchmark in big5 against cases where the hybrid collector comes into action? If not, we should add a query that hits this code path and compare its performance with what DirectCollector would have given.

If we are able to show decent gains, this change calls for a blog post. Terms aggs with cardinality are a pain point for a lot of users.
