Open
Description
Checksum algorithm doesn't distinguish between map keys and values. Checksum({1: 10, 2: 20}) is the same as checksum({10: 1, 20: 2}).
presto:di> select map(array[1, 2], array[10, 20]);
_col0
--------------
{1=10, 2=20}
Query 20250406_205840_36247_w9a6u, FINISHED, 1 node
Splits: 5 total, 5 done (100.00%)
[Latency: client-side: 161ms, server-side: 56ms] [0 rows, 0B] [0 rows/s, 0B/s]
presto:di> select checksum(map(array[1, 2], array[10, 20]));
_col0
-------------------------
75 f8 a5 51 6e 26 d1 6c
presto:di> select map(array[10, 20], array[1, 2]);
_col0
--------------
{20=2, 10=1}
presto:di> select checksum(map(array[10, 20], array[1, 2]));
_col0
-------------------------
75 f8 a5 51 6e 26 d1 6c
This is because MapType.hash uses XOR to combine hashes of the key and value.
com/facebook/presto/common/type/MapType.java
@Override
public long hash(Block block, int position)
{
Block mapBlock = getObject(block, position);
long result = 0;
for (int i = 0; i < mapBlock.getPositionCount(); i += 2) {
result += hashPosition(keyType, mapBlock, i) ^ hashPosition(valueType, mapBlock, i + 1);
}
return result;
}
A more robust implementation would be the same as ArrayType.hash
@Override
public long hash(Block block, int position)
{
Block array = getObject(block, position);
long hash = 0;
for (int i = 0; i < array.getPositionCount(); i++) {
hash = 31 * hash + hashPosition(elementType, array, i);
}
return hash;
}
Metadata
Metadata
Assignees
Type
Projects
Status
🆕 Unprioritized
Activity
mbasmanova commentedon Apr 6, 2025
CC: @kaikalur @amitkdutta @aditi-pandit @pedroerp
[-]Too many collisions in checksum aggregate function over maps[/-][+]Unexpected collisions in checksum aggregate function over maps[/+]amitkdutta commentedon Apr 7, 2025
CC: @rschlussel