-
Notifications
You must be signed in to change notification settings - Fork 1.1k
LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader #179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
That is really odd that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is close! I left a bunch of small comments. The performance is odd but I don't think that should block this improvement.
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
/* | ||
If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal | ||
*/ | ||
if (values == null || ordinals[i] >= leafReaderMaxDoc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, don't we need to compare ordinals[i] >= leafReaderMaxDoc + leafReaderDocBase
instead? I.e. the ordinal is in the global IndexReader
docid space, but the leafReaderMaxDoc
is just for this leaf. How come tests are not failing due to this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a big miss. Thanks for catching this.
Here is why the test was passing:
When the ordinals were all in the first leaf, the ordinal was < leafReaderMaxDoc and the wrong code worked correctly.
When the ordinals started spilling into the second leaf, the wrong code would recalculate the leaf each time and then find the correct Label
. Thus were not actually making use of the increasing values of the ordinals (and were recalculating the leafContext
each time).
This would go on for all other leaves as well.
I'll rerun benchmarks
Hey @mikemccand , my apologies for leaving this PR hanging! The unexpected benchmark results threw me off.. I've updated the PR to fix the bug you had previously discovered and updated the PR to bring it on par with mainline. I re-ran benchmarks after fixing the
The gain in taxonomy based facets is about 2%. I ran it multiple times locally and it gave around 2% gain in those runs as well. SSDV facets still show a 3% gain which is unexplained. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the iteration @gautamworah96 -- we are getting closer!
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
Show resolved
Hide resolved
...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
Show resolved
Hide resolved
contention). Modified the test case to be super random (threads, iterations etc). Modified TaxonomyReader with a getBulkPath method that is overridden in DirTaxonomyReader. Improved documentation, improved method signatures to use the newer (...) style syntax, fixed the CHANGES entry
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks super close! I left some small comments.
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
Outdated
Show resolved
Hide resolved
...e/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- I left tiny polish comments -- almost done! Thanks @gautamworah96!
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
Show resolved
Hide resolved
/* | ||
If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null | ||
*/ | ||
if (values == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's sort of weird to do this check per-segment, since we can know up front, based on indexed created version, whether it's stored fields or doc values? I.e. this is not a segment by segment decision. But let's do it in follow-on issue. I think this one is ready!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged :D
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyCombined.java
Show resolved
Hide resolved
...ne/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gautamworah96 -- looks great! I'll try to merge soon! Then we should watch nightly Lucene benchmarks and see if facets got a bit faster.
Merged! @gautamworah96 I think we should backport this one to 8.x? It is just adding a faster API. |
Resolved Conflicts: gradle/validation/validate-source-patterns.gradle solr/core/src/java/org/apache/solr/core/CoreContainer.java
This change relies on the older repo having the "LUCENE-9450 Switch taxonomy index to use BinaryDocValues" change. We had initially decided that we will not backport it to 8x because it was a major change to the index type/format. Thus this change also can't be backported :/ |
Ahh, yes, sorry! We cannot backport this change then. So this is 9.0 only optimization. |
Description
In LUCENE-9450 we switched the Taxonomy index from Stored Fields to BinaryDocValues. In the resulting implementation of the getPath code, we create a new BinaryDocValues's values instance for each ordinal.
It may happen that we may traverse over the same nodes over and over again if the getPath API is called multiple times for ordinals in the same segment/with the same readerIndex.
This PR takes advantage of that fact by sorting ordinals and then trying to find out if some of the ordinals are present in the same segment/have the same readerIndex (by trying to advanceExact to the correct position and not failing) thereby allowing us to reuse the previous BinaryDocValues object.
This PR was originally proposed in the older
lucene-solr
repo here. I discontinued it because we could not find a good "real" use case for it.This new PR uses the
getBulkPath
API insideIntTaxonomyFacetCounts
andFloatTaxonomyFacetCounts
classes. In thegetTopChildren
method within these classes, we usually collect all thetop-n
ordinals first and then translate them intoFacetLabels
one by one using an iterative API. This call has been replaced with thegetBulkPath
method in this PR.Solution
Steps:
Tests
getPath
results from ordinals with the bulkFacetLabels
returned by the getBulkPath APIBenchmarks:
The stock
luceneutil
taxonomy tasks callgetTopChildren(10, ..)
for testing the performance of taxonomy facet counting classes. SinceFastTaxonomyFacetCounts
inherits fromIntTaxonomyFacetCounts
(and uses the samegetTopChildren
logic) the user should be able to reap the benefits of the bulk API directly.luceneutil
that tests the top 10 children:Since, the previous results did not show any gain in the
Browse*TaxoFacets
tasks, I realized that maybe it was because the ordinalCache with a stock size of 4000 had all ordinals cached which led to the bulk API behaving the same as the iterativegetPath
call.The next benchmark tested the
getTopChildren
call with 10,000 childrenluceneutil
that tests the top 10000 children:I surprisingly see some gain in
BrowseDayOfYearSSDVFacets
andBrowseMonthSSDVFacets
tasks but I think that is noise (because taxonomy facets have a different impl. as compared toSortedSetDocValuesFacets
).Edit: Did an A/A test and
BrowseDayOfYearSSDVFacets
does not show any gain. So theBrowseDayOfYearSSDVFacets
4.1% perf improvement is unexplained..Browse*TaxoFacets
tasks don't show gain.Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.