Skip to content

Use bloom filters to collect large DFs #25009

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Conversation

raunaqmorarka
Copy link
Member

@raunaqmorarka raunaqmorarka commented Feb 13, 2025

Description

BenchmarkDynamicPageFilter.filterPages
    (filterSize)  (inputDataSet)  (inputNullChance)  (nonNullsSelectivity)  (nullsAllowed)   Mode  Cnt  Score using fastutil set Score using bloom filter
            1000    INT64_RANDOM               0.05                    0.2           false  thrpt   20  30.282 ± 0.792  ops/s    65.017 ± 0.566  ops/s
           10000    INT64_RANDOM               0.05                    0.2           false  thrpt   20  33.799 ± 0.511  ops/s    63.218 ± 1.783  ops/s
          100000    INT64_RANDOM               0.05                    0.2           false  thrpt   20  29.464 ± 0.469  ops/s    63.482 ± 1.626  ops/s
         1000000    INT64_RANDOM               0.05                    0.2           false  thrpt   20  18.854 ± 0.558  ops/s    63.690 ± 1.662  ops/s

    BenchmarkDynamicFilterSourceOperator.dynamicFilterCollect, maxDistinctValuesCount = 600572
    Collection type    (positionsPerPage)  Mode  Cnt   Score   Error  Units
    Hash set                         4096  avgt   45  39.950 ± 0.281  ms/op
    Bloom filter                     4096  avgt   45  10.297 ± 0.065  ms/op
    Min-max                          4096  avgt   45   5.845 ± 0.038  ms/op
    no-op                            4096  avgt   45   0.075 ± 0.001  ms/op

    BenchmarkDynamicFilterSourceOperator.dynamicFilterCollect, maxDistinctValuesCount = 6001215
    Collection type    (positionsPerPage)  Mode  Cnt    Score   Error   Units
    Hash set                         4096  avgt   45  590.042 ± 22.009  ms/op
    Bloom filter                     4096  avgt   45   98.025 ±  0.750  ms/op
    Min-max                          4096  avgt   45   61.982 ±  6.330  ms/op
    no-op                            4096  avgt   45    0.092 ±  0.001  ms/op

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@raunaqmorarka
Copy link
Member Author

OSS iceberg partitioned sf1k bloom.pdf
OSS iceberg unpartitioned sf1k bloom.pdf

Screenshot 2025-02-18 at 9 10 45 PM Screenshot 2025-02-18 at 9 11 45 PM

@raunaqmorarka
Copy link
Member Author

Ignore the red tests for now, cleaning them up

Comment on lines 252 to 265
{
return new DynamicFilterDomain(Domain.all(type));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like it should just be none

Comment on lines 284 to 281
// returned mask sets 3 bits based on portions of given hash
// Extract 38th to 43rd bits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this bit numbers is sheer magic. Please provide a structure of bloom filter in javadoc at the top of the class.
Some ascii art with bit marking would be appreciated.

| (1L << ((hashCode >> 33) & 63));
}

private static int getBloomFilterSize(int valuesCount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

english explanation what it does

import static java.util.Objects.requireNonNull;
import static java.util.stream.Collectors.toMap;

public class DynamicFilterTupleDomain<T>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be feasible to fight for reusing more code from TupleDomain instead of copy pasting?

Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimmed and it looks fine to my untrained 👁️

@github-actions github-actions bot added the memory Memory connector label Feb 26, 2025
@raunaqmorarka raunaqmorarka force-pushed the df-bloom-final branch 2 times, most recently from d3af94e to a9f1a72 Compare February 26, 2025 12:30
@raunaqmorarka raunaqmorarka force-pushed the df-bloom-final branch 3 times, most recently from 200e239 to 04d8e93 Compare February 27, 2025 11:04
@github-actions github-actions bot added the mongodb MongoDB connector label Feb 27, 2025
@raunaqmorarka raunaqmorarka force-pushed the df-bloom-final branch 5 times, most recently from 09bee49 to 53cdc3b Compare March 6, 2025 11:34
@raunaqmorarka raunaqmorarka force-pushed the df-bloom-final branch 10 times, most recently from 6c32f54 to 216c41b Compare March 19, 2025 07:45
@raunaqmorarka raunaqmorarka force-pushed the df-bloom-final branch 5 times, most recently from e000b4a to 6cf30bd Compare March 25, 2025 12:43
BenchmarkDynamicPageFilter.filterPages
(filterSize)  (inputDataSet)  (inputNullChance)  (nonNullsSelectivity)  (nullsAllowed)   Mode  Cnt  Score using fastutil set Score using bloom filter
        1000    INT64_RANDOM               0.05                    0.2           false  thrpt   20  30.282 ± 0.792  ops/s    65.017 ± 0.566  ops/s
       10000    INT64_RANDOM               0.05                    0.2           false  thrpt   20  33.799 ± 0.511  ops/s    63.218 ± 1.783  ops/s
      100000    INT64_RANDOM               0.05                    0.2           false  thrpt   20  29.464 ± 0.469  ops/s    63.482 ± 1.626  ops/s
     1000000    INT64_RANDOM               0.05                    0.2           false  thrpt   20  18.854 ± 0.558  ops/s    63.690 ± 1.662  ops/s

BenchmarkDynamicFilterSourceOperator.dynamicFilterCollect, maxDistinctValuesCount = 600572
Collection type    (positionsPerPage)  Mode  Cnt   Score   Error  Units
Hash set                         4096  avgt   45  39.950 ± 0.281  ms/op
Bloom filter                     4096  avgt   45  10.297 ± 0.065  ms/op
Min-max                          4096  avgt   45   5.845 ± 0.038  ms/op
no-op                            4096  avgt   45   0.075 ± 0.001  ms/op

BenchmarkDynamicFilterSourceOperator.dynamicFilterCollect, maxDistinctValuesCount = 6001215
Collection type    (positionsPerPage)  Mode  Cnt    Score   Error   Units
Hash set                         4096  avgt   45  590.042 ± 22.009  ms/op
Bloom filter                     4096  avgt   45   98.025 ±  0.750  ms/op
Min-max                          4096  avgt   45   61.982 ±  6.330  ms/op
no-op                            4096  avgt   45    0.092 ±  0.001  ms/op
Copy link

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Apr 16, 2025
@raunaqmorarka raunaqmorarka added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Apr 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed docs hive Hive connector kudu memory Memory connector mongodb MongoDB connector performance stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.
Development

Successfully merging this pull request may close these issues.

3 participants