Skip to content

Unexpected collisions in checksum aggregate function over maps #24878

Open
@mbasmanova

Description

@mbasmanova

Checksum algorithm doesn't distinguish between map keys and values. Checksum({1: 10, 2: 20}) is the same as checksum({10: 1, 20: 2}).

presto:di> select map(array[1, 2], array[10, 20]);
    _col0
--------------
 {1=10, 2=20}

Query 20250406_205840_36247_w9a6u, FINISHED, 1 node
Splits: 5 total, 5 done (100.00%)
[Latency: client-side: 161ms, server-side: 56ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:di> select checksum(map(array[1, 2], array[10, 20]));
          _col0
-------------------------
 75 f8 a5 51 6e 26 d1 6c

presto:di> select map(array[10, 20], array[1, 2]);
    _col0
--------------
 {20=2, 10=1}

presto:di> select checksum(map(array[10, 20], array[1, 2]));
          _col0
-------------------------
 75 f8 a5 51 6e 26 d1 6c

This is because MapType.hash uses XOR to combine hashes of the key and value.

com/facebook/presto/common/type/MapType.java

    @Override
    public long hash(Block block, int position)
    {
        Block mapBlock = getObject(block, position);
        long result = 0;

        for (int i = 0; i < mapBlock.getPositionCount(); i += 2) {
            result += hashPosition(keyType, mapBlock, i) ^ hashPosition(valueType, mapBlock, i + 1);
        }
        return result;
    }

A more robust implementation would be the same as ArrayType.hash

    @Override
    public long hash(Block block, int position)
    {
        Block array = getObject(block, position);
        long hash = 0;
        for (int i = 0; i < array.getPositionCount(); i++) {
            hash = 31 * hash + hashPosition(elementType, array, i);
        }
        return hash;
    }

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    Status

    🆕 Unprioritized

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions