OR condition does not leverage all parquet metadata (metrics, dictionary, bloom filter) causing inefficient queries #10029

Closed as not planned
@cccs-jc

Description

Apache Iceberg version

1.4.3

Query engine

Spark

Please describe the bug 🐞

I'm testing a table of flow data with a schema of SRC_IP long, DST_IP long

I did a thorough investigation and there really seems to be a problem with how bloom filters are used.

Here I have done a search for a random IP value. I run each query 10 times to get a reliable average execution time.

I use a 512MB target file size and vary the row group size from 128MB down to 16MB. I also ran one test with files that have no bloom filters in them. All files are zordered by SRC and DST IP.

The where clauses are SRC_IP=val, DST_IP=val, SRC_IP=val AND DST_IP=val, SRC_IP=val OR DST_IP=val

('R:128MB', 'SRC', 4.070408272743225)
('R:64MB', 'SRC', 3.479648399353027)
('R:32MB', 'SRC', 7.69552972316742)
('R:16MB', 'SRC', 12.549865365028381)
('R:128MB no bloom', 'SRC', 17.634950709342956)
('R:128MB', 'DST', 5.119180655479431)
('R:64MB', 'DST', 5.0318292617797855)
('R:32MB', 'DST', 8.04975097179413)
('R:16MB', 'DST', 16.09592936038971)
('R:128MB no bloom', 'DST', 46.66901330947876)
('R:128MB', 'AND', 2.262153959274292)
('R:64MB', 'AND', 2.3894467115402223)
('R:32MB', 'AND', 4.230756330490112)
('R:16MB', 'AND', 8.178192615509033)
('R:128MB no bloom', 'AND', 6.790379118919373)
('R:128MB', 'OR', 72.06906585693359)
('R:64MB', 'OR', 34.00628011226654)
('R:32MB', 'OR', 25.374402904510497)
('R:16MB', 'OR', 24.86290364265442)
('R:128MB no bloom', 'OR', 101.38280701637268)

As you can see, with 128MB row groups a query on SRC takes on average 4.07 seconds and DST takes 5.11 seconds, but an OR takes on average 72 seconds.
This contrasts with the no-bloom scenario, where the same queries take 17 seconds, 46 seconds, and, for the OR, 101 seconds.

Looking at the Iceberg code, I suspect the issue is as follows:

There are 3 types of expression evaluators:
ParquetMetricsRowGroupFilter
ParquetDictionaryRowGroupFilter
ParquetBloomRowGroupFilter

The ReadConf applies these 3 evaluators like so:

boolean shouldRead =
          filter == null
              || (statsFilter.shouldRead(typeWithIds, rowGroup)
                  && dictFilter.shouldRead(
                      typeWithIds, rowGroup, reader.getDictionaryReader(rowGroup))
                  && bloomFilter.shouldRead(
                      typeWithIds, rowGroup, reader.getBloomFilterDataReader(rowGroup)));

If any of the 3 evaluators says it's not possible for a particular value to be found in a file's rowgroup, then that rowgroup will be skipped.
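To make the combination rule concrete, here is a minimal standalone sketch (with hypothetical variable names, not actual Iceberg code) of how ReadConf's conjunction behaves: each evaluator answers "I cannot rule this rowgroup out" with true, and a single definitive false from any of them skips the rowgroup.

```java
// Sketch of ReadConf's pruning logic: a rowgroup is read only if ALL three
// independent evaluators fail to rule it out. Variable names are hypothetical.
public class RowGroupPruningSketch {
    public static void main(String[] args) {
        boolean statsShouldRead = true;  // min/max metrics are inconclusive
        boolean dictShouldRead = true;   // dictionary is inconclusive (or absent)
        boolean bloomShouldRead = false; // bloom filter says the value is absent

        // Any single "false" is enough to skip the rowgroup.
        boolean shouldRead = statsShouldRead && dictShouldRead && bloomShouldRead;
        System.out.println(shouldRead); // prints "false": rowgroup is skipped
    }
}
```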

Let's say the where clause is SRC_IP=1 OR DST_IP=1

And let's suppose both columns are dictionary encoded.

The dictFilter will determine that the value 1 is not in the dictionary of SRC_IP and not in the dictionary of DST_IP. So it returns shouldRead=False.

However, suppose SRC_IP is dictionary encoded and DST_IP is not (it uses a bloom filter instead).

When the dictFilter evaluates DST_IP=1, there is no dictionary for DST_IP to consult, so it can't rule the value out, and the whole OR expression evaluates to shouldRead=True.

Conversely, when the bloomFilter tests SRC_IP=1 and finds there is no bloom filter on SRC_IP (it is dictionary encoded), it also returns shouldRead=True because it too can't rule the value out.

The results are combined by ReadConf into shouldRead=True, even though the dictionary of SRC_IP together with the bloom filter of DST_IP would have allowed us to skip the rowgroup. But since the evaluation is done independently, neither evaluator can make that decision.

To prove my hypothesis I created a new dataset where I set write.parquet.dict-size-bytes very low to make sure dictionary encoding is not used. Since the columns are not dictionary encoded, they use a bloom filter.

The results here show that when both SRC_IP and DST_IP columns are using bloom it is fast to evaluate the OR condition. 10 seconds compared to 379 seconds.

('R:128MB no dict', 'SRC', 8.311380386352539)
('R:128MB', 'DST', 0.7780213356018066)
('R:128MB no dict', 'DST', 0.8464977741241455)
('R:128MB', 'AND', 0.5989606380462646)
('R:128MB no dict', 'AND', 0.7306745052337646)
('R:128MB', 'OR', 379.92671513557434)
('R:128MB no dict', 'OR', 10.23964548110962)

I propose that the ParquetDictionaryRowGroupFilter and ParquetBloomRowGroupFilter be combined into a single evaluator that tests every column. Each column would be tested using either its dictionary or its bloom filter, whichever is available.
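A minimal sketch of what such a combined evaluator could look like (all names, types, and the map-based lookup are hypothetical simplifications, not the Iceberg visitor API): for each referenced column, consult the strongest structure available, then evaluate the predicate tree once over those per-column answers.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of the proposed combined evaluator: per column, prefer
// the dictionary (exact membership) and fall back to the bloom filter
// ("might contain"); with neither, the value cannot be ruled out.
public class CombinedRowGroupFilterSketch {
    static boolean canMatch(String column, long value,
                            Map<String, Set<Long>> dictionaries,
                            Map<String, Predicate<Long>> blooms) {
        if (dictionaries.containsKey(column)) {
            return dictionaries.get(column).contains(value); // exact answer
        }
        if (blooms.containsKey(column)) {
            return blooms.get(column).test(value); // "might contain" answer
        }
        return true; // no structure: cannot rule the value out
    }

    public static void main(String[] args) {
        // SRC_IP is dictionary encoded; DST_IP has a bloom that rules out 1.
        Map<String, Set<Long>> dicts = Map.of("SRC_IP", Set.of(5L, 9L));
        Map<String, Predicate<Long>> blooms = Map.of("DST_IP", v -> false);

        // SRC_IP = 1 OR DST_IP = 1, evaluated once with per-column evidence:
        boolean shouldRead = canMatch("SRC_IP", 1L, dicts, blooms)
            || canMatch("DST_IP", 1L, dicts, blooms);
        System.out.println(shouldRead); // prints "false": rowgroup is skipped
    }
}
```

Because the single evaluator sees the whole OR at once, it can combine the dictionary's answer for SRC_IP with the bloom's answer for DST_IP, which is exactly what the two independent evaluators cannot do today.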

@huaxingao @RussellSpitzer

Metadata

Assignees: No one assigned

Labels: bug (Something isn't working), stale