
Identity hash function by default on GCC #73

@stgatilov

I ran the following code, built with GCC in a Linux VM:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <tsl/robin_set.h>

int main() {
    for (uint64_t size = 1 << 20; size <= 1 << 25; size <<= 1) {
        for (int mode = 0; mode < 2; mode++) {
            tsl::robin_set<uint64_t> x;
        
            uint64_t startTime = clock();
            for (uint64_t i = 0; i < size; i++)
                x.insert(i * (mode == 0 ? 0xDEADBEEF : 1 << 20));
            uint64_t endTime = clock();

            double deltaTime = double(endTime - startTime) / CLOCKS_PER_SEC;
            uint64_t sum = 0;
            for (uint64_t val : x)
                sum += val;
        
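            // uint64_t is unsigned long on 64-bit Linux, so cast to match %llu.
            printf("N = %llu %c:    time = %0.3lf    chk = %llu\n", (unsigned long long)size, (mode ? 'B' : 'M'), deltaTime, (unsigned long long)sum);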
        }
    }
    return 0;
}

and I got:

N = 1048576 M:    time = 0.043    chk = 6257894696195457024
N = 1048576 B:    time = 8.280    chk = 576460202547609600
N = 2097152 M:    time = 0.115    chk = 6588752116096958464
N = 2097152 B:    time = 19.110    chk = 2305841909702066176
N = 4194304 M:    time = 0.169    chk = 7916099200727646208
N = 4194304 B:    time = 35.500    chk = 9223369837831520256
N = 8388608 M:    time = 0.354    chk = 13233322349299761152
Killed

So I guess inserting integers divisible by 2^20 takes quadratic time.
Moreover, trying to insert 16M values gets the process killed.
The most likely cause is that std::hash<uint64_t>()(x) == x on GCC.
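
The suspected mechanism can be illustrated with a minimal sketch. It assumes the hash is the identity and that the table reduces the hash to a slot with a power-of-two bit mask (which appears to be TSL's default growth policy). The helper `distinct_slots` is hypothetical and not part of TSL; it only counts how many different slots a sequence of keys would land in.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical helper: counts how many distinct slots the keys
// 0 * stride, 1 * stride, ..., (count - 1) * stride occupy in a table of
// table_size slots (assumed to be a power of two), when the hash is the
// identity and the reduction is a bit mask.
std::size_t distinct_slots(std::uint64_t stride, std::uint64_t count,
                           std::uint64_t table_size) {
    std::unordered_set<std::uint64_t> slots;
    for (std::uint64_t i = 0; i < count; ++i) {
        std::uint64_t hash = i * stride;        // identity hash: hash(x) == x
        slots.insert(hash & (table_size - 1));  // power-of-two reduction
    }
    return slots.size();
}
```

With a table of 2^20 slots, `distinct_slots(1ull << 20, 1024, 1ull << 20)` gives 1, because every multiple of 2^20 has its low 20 bits equal to zero, while `distinct_slots(0xDEADBEEF, 1024, 1ull << 20)` gives 1024, because multiplying by an odd constant keeps the low bits distinct.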

Note that I used default settings and got no warnings!
An awful default hash function is a rather critical issue for people who don't know much about hashing (and who would probably do worse trying to implement their own hash function or hash table). And given that the TSL interface is very STL-like, I think that's the audience it is targeted at.


A proper hash function usually contains three parts:

  1. Combining: getting one integer out of many values/tuples/sequences/etc.
  2. Finalizing: applying a transformation to the combined value so that the result still has good statistical properties after step 3.
  3. Reduction: reducing the domain from something like whole range of uint64_t to an index in hash table.
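
The three parts above can be sketched as three small functions. This is only an illustration under stated assumptions: the combiner follows the common boost-style `hash_combine` pattern, the finalizer is the `fmix64` step from MurmurHash3, and the reduction assumes a power-of-two bucket count. The function names are hypothetical and none of this is taken from TSL or any STL implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Step 1 (combining): fold a new value into a running 64-bit seed.
// Boost-style hash_combine pattern; any reasonable combiner would do here.
inline std::uint64_t combine(std::uint64_t seed, std::uint64_t value) {
    return seed ^ (value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Step 2 (finalizing): the fmix64 mixer from MurmurHash3. It spreads the
// information in every input bit over all output bits.
inline std::uint64_t finalize(std::uint64_t h) {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}

// Step 3 (reduction): map the 64-bit hash to a bucket index.
// Assumes bucket_count is a power of two, so a bit mask is enough.
inline std::size_t reduce(std::uint64_t h, std::size_t bucket_count) {
    return static_cast<std::size_t>(h & (bucket_count - 1));
}
```

A pair `(a, b)` would then be placed at `reduce(finalize(combine(combine(0, a), b)), bucket_count)`. Skipping `finalize` leaves the low bits of the index entirely determined by the low bits of the combined value.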

As usual, the C++ standard is not precise enough, and STL implementations do not behave the same across platforms.
On MSVC, std::hash performs steps 1 and 2, while std::unordered_set only does step 3.
On GCC, std::hash only performs step 1, while std::unordered_set does steps 2 and 3.
This means that if you use std::hash directly, you should run your own hash finalizer. The TSL hash table only does step 3 but uses std::hash, so on GCC the crucial step 2 is missing.
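
One possible workaround on the user side is to wrap std::hash in a functor that adds the missing finalizer. The sketch below is an assumption, not something TSL provides: `FinalizedHash` and `slots_used` are hypothetical names, the mixer is the output function of splitmix64, and a 64-bit `std::size_t` is assumed. `slots_used` is only there to check how many slots a hash would occupy under a power-of-two mask.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

// Hypothetical functor: std::hash followed by a finalizer (step 2).
template <class Key>
struct FinalizedHash {
    std::size_t operator()(const Key& key) const {
        std::uint64_t h = std::hash<Key>{}(key);  // identity for integers on GCC
        // Mixing steps of splitmix64, used here as the finalizer.
        h ^= h >> 30;
        h *= 0xbf58476d1ce4e5b9ULL;
        h ^= h >> 27;
        h *= 0x94d049bb133111ebULL;
        h ^= h >> 31;
        return static_cast<std::size_t>(h);
    }
};

// Hypothetical helper: number of distinct slots hit by the keys i * stride
// (i = 0 .. count - 1) when Hash is reduced with a power-of-two mask.
template <class Hash>
std::size_t slots_used(std::uint64_t stride, std::uint64_t count,
                       std::uint64_t table_size) {
    Hash hasher;
    std::unordered_set<std::uint64_t> slots;
    for (std::uint64_t i = 0; i < count; ++i)
        slots.insert(hasher(i * stride) & (table_size - 1));
    return slots.size();
}
```

Since the hash is the second template parameter of tsl::robin_set, the functor could be used as `tsl::robin_set<uint64_t, FinalizedHash<uint64_t>> x;` in the test program above. With the finalizer in place, multiples of 2^20 should spread over many slots instead of all landing in the same one.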
