These benchmarks use the `query_stats` table from the pganalyze staging environment (which doesn't contain any customer data). This data is collected from `pg_stat_statements` every minute and then sent to pganalyze every 10 minutes.
The first two benchmarks incrementally improve the compression ratio by changing the data model. Then the `comparison.rs` benchmark compares the resulting data model with different compression methods.

Size is listed in megabytes, and times are listed in seconds.
Re-grouping the data from 10 minute buckets to 24 hour buckets improves the compression ratio and read/write time. Hopefully the time spent on small buckets can be improved in future upstream changes to pco.
The ideal bucket size will depend on your workload. A larger bucket results in better compression, but means more unwanted data has to be loaded and discarded at read time.
| | Size | Write time | Read time | Average group size |
|---|---|---|---|---|
| 1 day bucket (pco) | 217 | 18.5 | 2.1 | 28,433 |
| 10 minute bucket (pco) | 318 | 31.5 | 8.2 | 214 |
| 10 minute bucket (Postgres arrays) | 485 | | | 214 |
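As a rough illustration of the re-grouping step, here is a minimal sketch of how a row's 1 day bucket key could be derived from its collection timestamp. The names are hypothetical and this is not the benchmark's actual code:

```rust
// Hypothetical helper: every row collected within the same UTC day shares a
// bucket, so its per-query values all land in the same compressed group.
const DAY_SECS: u64 = 24 * 60 * 60;

fn day_bucket_start(collected_at_epoch_secs: u64) -> u64 {
    // Truncate the epoch timestamp down to the start of its day.
    collected_at_epoch_secs / DAY_SECS * DAY_SECS
}
```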
Rounding the `total_time` and `io_time` float values to varying levels of precision can significantly improve the compression ratio. Converting the floats into integers that are multiplied by 10^N at write time to preserve the desired fractional precision further improves the compression ratio.
Reducing the float precision to 2 decimal points reduces the size by 29% (217 MB -> 155 MB). Then using an integer representation further reduces the size by 31% (155 MB -> 107 MB). Combined, that's a 51% improvement.
| | Size | Write time | Read time |
|---|---|---|---|
| `bucket_size.rs` baseline: full precision | 217 | 18.5 | 2.1 |
| rounded to 0 decimals | 107 | 17.4 | 2.1 |
| rounded to 1 decimal | 131 | 18.6 | 2.2 |
| rounded to 2 decimals | 155 | 19.3 | 2.2 |
| multiplied by 1 and cast to integer | 89 | 15.9 | 2.1 |
| multiplied by 10 and cast to integer | 97 | 16.1 | 2.1 |
| multiplied by 100 and cast to integer | 107 | 16.5 | 2.1 |
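A minimal sketch of the scaled-integer conversion described above: scale by 10^N and round at write time, divide at read time. The helper names are illustrative, not the benchmark's actual code:

```rust
/// Scale a float by 10^decimals and round to an integer at write time,
/// preserving `decimals` fractional digits.
fn encode_scaled(value: f64, decimals: u32) -> i64 {
    (value * 10f64.powi(decimals as i32)).round() as i64
}

/// Undo the scaling at read time.
fn decode_scaled(encoded: i64, decimals: u32) -> f64 {
    encoded as f64 / 10f64.powi(decimals as i32)
}

// e.g. encode_scaled(12.3456, 2) == 1235 and decode_scaled(1235, 2) == 12.35
```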
With the optimized data model in place, this benchmark compares the performance of pco, pco_store, and Postgres array types.
| | Size | Write time | Read time | Compression method |
|---|---|---|---|---|
| pco | 107 | 15.6 | 1.9 | pco |
| pco_store | 107 | 17.3 | 2.0 | pco |
| Postgres arrays | 207 | 84.0 | 10.7 | Postgres pglz |
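For reference, compressing a single column with pco directly looks roughly like this. This is a minimal sketch using pco's standalone `simpler_compress`/`simple_decompress` helpers, not the benchmark harness itself, and the data is made up:

```rust
use pco::standalone::{simple_decompress, simpler_compress};
use pco::DEFAULT_COMPRESSION_LEVEL;

fn main() {
    // Stand-in for one bucket's worth of a bigint column such as `calls`.
    let calls: Vec<i64> = (0..28_433).map(|i| i % 500).collect();

    // Compress the column into a single byte blob at write time...
    let compressed = simpler_compress(&calls, DEFAULT_COMPRESSION_LEVEL).expect("compress");
    // ...and decompress the whole group again at read time.
    let recovered: Vec<i64> = simple_decompress(&compressed).expect("decompress");

    assert_eq!(calls, recovered);
    println!("{} values -> {} bytes", recovered.len(), compressed.len());
}
```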
There are other numeric compression libraries available in Rust, but they've been disqualified from this comparison because of their limited data type support. In tests not included in this repo, none of these crates outperformed pco even when using the limited set of data types available.
- stream-vbyte: doesn't support `i64` or floats
- bitpacking: doesn't support `i64` or floats
- tsz-compress: doesn't support floats
The standard library `SystemTime` is used despite chrono's more feature-complete API because adding durations to a timestamp (in `decompress`) is noticeably slower when using chrono.
TODO: write this benchmark
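As a sketch of the kind of arithmetic involved, adding per-row second offsets to a `SystemTime` looks like this. The function is illustrative only, not the crate's `decompress` implementation:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Rebuild per-row timestamps from a bucket start plus per-row second offsets.
fn timestamps(bucket_start_secs: u64, offsets_secs: &[u64]) -> Vec<SystemTime> {
    let start = UNIX_EPOCH + Duration::from_secs(bucket_start_secs);
    offsets_secs
        .iter()
        .map(|&offset| start + Duration::from_secs(offset))
        .collect()
}
```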
First install git-lfs, then build the `query_stats` table from the compressed backup file:

```bash
pg_restore -c -d postgres benches/query_stats.db
```
Then run the benchmarks. The table sizes can be seen with this query:
```sql
ANALYZE;

SELECT name,
       pg_size_pretty(sum(total_bytes)) AS total,
       pg_size_pretty(sum(table_bytes)) AS table,
       pg_size_pretty(sum(toast_bytes)) AS toast,
       pg_size_pretty(sum(index_bytes)) AS index,
       sum(reltuples::int) AS rows
FROM (
  SELECT *, total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes
  FROM (
    SELECT relname AS name,
           pg_total_relation_size(c.oid) AS total_bytes,
           pg_indexes_size(c.oid) AS index_bytes,
           pg_total_relation_size(reltoastrelid) AS toast_bytes,
           reltuples
    FROM pg_class c
    LEFT JOIN pg_namespace n ON n.oid = relnamespace
    WHERE relkind = 'r' AND nspname = 'public'
  ) _
) _
GROUP BY name ORDER BY name;
```
To regenerate `benches/query_stats.db` from a source database, first add a `bigint` identity column to `postgres_roles` so role IDs can be referenced as bigints:

```sql
ALTER TABLE postgres_roles DROP CONSTRAINT postgres_roles_pkey;
ALTER TABLE postgres_roles ADD COLUMN id_bigint bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY;
CREATE INDEX CONCURRENTLY ON postgres_roles USING btree (id);
```
Then create and populate the `query_stats` table:

```sql
CREATE TABLE query_stats (
    database_id bigint NOT NULL,
    start_at timestamptz NOT NULL,
    end_at timestamptz NOT NULL,
    collected_at timestamptz[] NOT NULL,
    collected_secs bigint[] NOT NULL,
    fingerprint bigint[] NOT NULL,
    postgres_role_id bigint[] NOT NULL,
    calls bigint[] NOT NULL,
    rows bigint[] NOT NULL,
    total_time double precision[] NOT NULL,
    io_time double precision[] NOT NULL,
    shared_blks_hit bigint[] NOT NULL,
    shared_blks_read bigint[] NOT NULL
);

CREATE INDEX ON query_stats USING btree (database_id);
CREATE INDEX ON query_stats USING btree (end_at, start_at);

INSERT INTO query_stats
SELECT database_id,
       min_collected_at,
       (SELECT max(c) FROM unnest(collected_at) c),
       collected_at,
       collected_interval_secs,
       fingerprint,
       (SELECT array_agg(id_bigint) FROM unnest(postgres_role_id) p, postgres_roles WHERE id = p),
       calls,
       rows,
       total_time,
       (SELECT array_agg(r + w) FROM unnest(blk_read_time, blk_write_time) _(r, w)),
       shared_blks_hit,
       shared_blks_read
FROM query_stats_packed_35d;
```
And then run:
```bash
pg_dump -Z7 -Fc -O --table query_stats SOURCE_DB_NAME > benches/query_stats.db
```