Use bloom filters to collect large DFs #25009
base: master
12f2fdd to 35e139b
ef14d7f to 454c630
Ignore the red tests for now, cleaning them up.
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomColumnarFilter.java
core/trino-main/src/main/java/io/trino/sql/planner/DynamicFilterDomain.java
{
    return new DynamicFilterDomain(Domain.all(type));
}
feels like it should just be none
core/trino-main/src/main/java/io/trino/sql/planner/BloomFilterWithRange.java
// returned mask sets 3 bits based on portions of given hash
// Extract 38th to 43rd bits
All these bit numbers are sheer magic. Please describe the structure of the bloom filter in a Javadoc at the top of the class.
Some ASCII art with the bits marked would be appreciated.
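As a rough sketch of the kind of documentation being requested: the snippet under review shows only the last term of the mask, `(hashCode >> 33) & 63`. The other two shift amounts below (21 and 27), the class name `BloomMaskSketch`, and the layout diagram are assumptions made for illustration, not code from the PR. They are chosen so that, counting bits from the most significant end, the first slice matches the "38th to 43rd bits" comment. Each 6-bit slice of the hash picks one of the 64 positions in a `long`, so the resulting mask has at most three bits set.

```java
final class BloomMaskSketch
{
    // Assumed layout of the 64-bit hash, bit 63 on the left, bit 0 on the right.
    // Only the ">> 33" slice is confirmed by the PR snippet; the rest is illustrative.
    //
    //  63          39 38     33 32     27 26     21 20           0
    // +--------------+---------+---------+---------+--------------+
    // | not in mask  | slice 3 | slice 2 | slice 1 | not in mask  |
    // +--------------+---------+---------+---------+--------------+
    //
    // Each slice is 6 bits wide, i.e. a value in [0, 63] naming one bit of the mask.

    private BloomMaskSketch() {}

    // Builds a 64-bit mask with up to 3 bits set (fewer if slices collide).
    static long bloomMask(long hashCode)
    {
        return (1L << ((hashCode >> 21) & 63))      // slice 1: bits 21..26 (assumed)
                | (1L << ((hashCode >> 27) & 63))   // slice 2: bits 27..32 (assumed)
                | (1L << ((hashCode >> 33) & 63));  // slice 3: bits 33..38 (from the PR snippet)
    }

    public static void main(String[] args)
    {
        // Slices 1, 2 and 3 hold the values 1, 2 and 3 respectively.
        long hash = (1L << 21) | (2L << 27) | (3L << 33);
        System.out.println(Long.toBinaryString(bloomMask(hash)));
    }
}
```

A value is then considered "possibly present" when every bit of its mask is set in the block it maps to, that is `(block & mask) == mask`.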
            | (1L << ((hashCode >> 33) & 63));
}

private static int getBloomFilterSize(int valuesCount)
Please add a plain-English explanation of what this does.
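The excerpt shows only the signature of `getBloomFilterSize`, so its actual body is unknown here. As a hedged illustration of what a sizing helper of this shape commonly does, the sketch below assumes a fixed budget of 8 bits per value and rounds the number of 64-bit words up to a power of two so that a block can be selected with a bit mask instead of a modulo. The class name, the constant, and the `blockIndex` helper are all hypothetical.

```java
final class BloomSizeSketch
{
    // Assumption for illustration only: the PR's real bits-per-value budget is not shown.
    private static final int ASSUMED_BITS_PER_VALUE = 8;

    private BloomSizeSketch() {}

    // Returns the number of 64-bit words to allocate for the expected number of
    // distinct values, rounded up to a power of two.
    static int bloomFilterSize(int valuesCount)
    {
        if (valuesCount < 0) {
            throw new IllegalArgumentException("valuesCount is negative");
        }
        long bits = (long) Math.max(valuesCount, 1) * ASSUMED_BITS_PER_VALUE;
        long words = (bits + 63) / 64; // ceiling division
        long rounded = Long.highestOneBit(words);
        if (rounded < words) {
            rounded <<= 1;
        }
        return Math.toIntExact(rounded);
    }

    // With a power-of-two size, the low bits of the hash select the word.
    static int blockIndex(long hashCode, int size)
    {
        return (int) (hashCode & (size - 1));
    }

    public static void main(String[] args)
    {
        int size = bloomFilterSize(1000);
        System.out.println(size + " words, block for hash 12345 = " + blockIndex(12345L, size));
    }
}
```

The power-of-two rounding trades some memory for a cheaper index computation on the probe path, which is a common choice when the filter is evaluated once per row.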
import static java.util.Objects.requireNonNull;
import static java.util.stream.Collectors.toMap;

public class DynamicFilterTupleDomain<T>
Would it be feasible to reuse more code from TupleDomain
instead of copy-pasting it?
Skimmed and it looks fine to my untrained 👁️
454c630 to 97c0284
d3af94e to a9f1a72
200e239 to 04d8e93
09bee49 to 53cdc3b
6c32f54 to 216c41b
e000b4a to 6cf30bd
BenchmarkDynamicPageFilter.filterPages

| (filterSize) | (inputDataSet) | (inputNullChance) | (nonNullsSelectivity) | (nullsAllowed) | Mode | Cnt | Score using fastutil set | Score using bloom filter |
|---|---|---|---|---|---|---|---|---|
| 1000 | INT64_RANDOM | 0.05 | 0.2 | false | thrpt | 20 | 30.282 ± 0.792 ops/s | 65.017 ± 0.566 ops/s |
| 10000 | INT64_RANDOM | 0.05 | 0.2 | false | thrpt | 20 | 33.799 ± 0.511 ops/s | 63.218 ± 1.783 ops/s |
| 100000 | INT64_RANDOM | 0.05 | 0.2 | false | thrpt | 20 | 29.464 ± 0.469 ops/s | 63.482 ± 1.626 ops/s |
| 1000000 | INT64_RANDOM | 0.05 | 0.2 | false | thrpt | 20 | 18.854 ± 0.558 ops/s | 63.690 ± 1.662 ops/s |

BenchmarkDynamicFilterSourceOperator.dynamicFilterCollect, maxDistinctValuesCount = 600572

| Collection type | (positionsPerPage) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| Hash set | 4096 | avgt | 45 | 39.950 | ± 0.281 | ms/op |
| Bloom filter | 4096 | avgt | 45 | 10.297 | ± 0.065 | ms/op |
| Min-max | 4096 | avgt | 45 | 5.845 | ± 0.038 | ms/op |
| no-op | 4096 | avgt | 45 | 0.075 | ± 0.001 | ms/op |

BenchmarkDynamicFilterSourceOperator.dynamicFilterCollect, maxDistinctValuesCount = 6001215

| Collection type | (positionsPerPage) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| Hash set | 4096 | avgt | 45 | 590.042 | ± 22.009 | ms/op |
| Bloom filter | 4096 | avgt | 45 | 98.025 | ± 0.750 | ms/op |
| Min-max | 4096 | avgt | 45 | 61.982 | ± 6.330 | ms/op |
| no-op | 4096 | avgt | 45 | 0.092 | ± 0.001 | ms/op |
6cf30bd to db6464b
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino Slack.
Description
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: