TimescaleDB Toolkit provides an implementation of the Hyperloglog estimator for COUNT DISTINCT
approximations of any type that has a hash function.
Timescale's HyperLogLog is implemented as an aggregate function in PostgreSQL. They do not support moving-aggregate mode, and are not ordered-set aggregates. It is restricted to values that have an extended hash function. They are partializable and are good candidates for continuous aggregation.
hyperloglog(
size INTEGER,
value AnyElement¹
) RETURNS Hyperloglog
¹The type must have an extended (64bit) hash function.
This will construct and return a Hyperloglog with at least the specified number of buckets over the given values.
Name | Type | Description |
---|---|---|
buckets |
INTEGER |
Number of buckets in the digest. Will be rounded up to the next power of 2, must be between 16 and 2^18. Increasing this will usually provide more accurate at the expense of more storage. |
value |
AnyElement |
Column to count the distinct elements of. |
Column | Type | Description |
---|---|---|
hyperloglog |
Hyperloglog |
A hyperloglog object which may be passed to other hyperloglog APIs. |
For this examples assume we have a table 'samples' with a column 'weights' holding DOUBLE PRECISION
values. The following will simply return a digest over that column
SELECT hyperloglog(64, weights) FROM samples;
It may be more useful to build a view from the aggregate that we can later pass to other tdigest functions.
CREATE VIEW digest AS SELECT hyperloglog(64, data) FROM samples;
rollup(
log hyperloglog
) RETURNS Hyperloglog
Returns a Hyperloglog by aggregating over the union of the input elements.
Name | Type | Description |
---|---|---|
log |
Hyperloglog |
Column of Hyperloglogs to be unioned. |
Column | Type | Description |
---|---|---|
hyperloglog |
Hyperloglog |
A hyperloglog containing the count of the union of the input Hyperloglogs. |
SELECT distinct_count(rollup(logs))
FROM (
(SELECT hyperloglog(32, v::text) logs FROM generate_series(1, 100) v)
UNION ALL
(SELECT hyperloglog(32, v::text) FROM generate_series(50, 150) v)
) hll;
count
-------
153
distinct_count(hyperloglog Hyperloglog) RETURNS BIGINT
Get the number of distinct values from a hyperloglog.
Name | Type | Description |
---|---|---|
hyperloglog |
Hyperloglog |
The hyperloglog to extract the count from. |
Column | Type | Description |
---|---|---|
distinct_count |
BIGINT |
The number of distinct elements counted by the hyperloglog. |
SELECT distinct_count(hyperloglog(64, data))
FROM generate_series(1, 100) data
distinct_count
----------------
104
stderror(hyperloglog Hyperloglog) RETURNS DOUBLE PRECISION
Returns an estimate of the relative stderror of the hyperloglog based on the hyperloglog error formula. Approximate result are:
precision ┃ registers ┃ error ┃ bytes
━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━╋━━━━━━━━
4 ┃ 16 ┃ 0.2600 ┃ 12
5 ┃ 32 ┃ 0.1838 ┃ 24
6 ┃ 64 ┃ 0.1300 ┃ 48
7 ┃ 128 ┃ 0.0919 ┃ 96
8 ┃ 256 ┃ 0.0650 ┃ 192
9 ┃ 512 ┃ 0.0460 ┃ 384
10 ┃ 1024 ┃ 0.0325 ┃ 768
11 ┃ 2048 ┃ 0.0230 ┃ 1536
12 ┃ 4096 ┃ 0.0163 ┃ 3072
13 ┃ 8192 ┃ 0.0115 ┃ 6144
14 ┃ 16384 ┃ 0.0081 ┃ 12288
15 ┃ 32768 ┃ 0.0057 ┃ 24576
16 ┃ 65536 ┃ 0.0041 ┃ 49152
17 ┃ 131072 ┃ 0.0029 ┃ 98304
18 ┃ 262144 ┃ 0.0020 ┃ 196608
Name | Type | Description |
---|---|---|
hyperloglog |
Hyperloglog |
The hyperloglog to extract the count from. |
Column | Type | Description |
---|---|---|
stderror |
BIGINT |
The number of distinct elements counted by the hyperloglog. |
SELECT stderror(hyperloglog(64, data))
FROM generate_series(1, 100) data
stderror
----------
0.13