
Improve accuracy of compression stats #7901

Draft · wants to merge 1 commit into base: main
Conversation

@erimatnor (Contributor) commented on Apr 2, 2025

This change improves the accuracy of the compression chunk size stats. These stats are computed at compression time, but can be inaccurate for a number of reasons.

First, the "after" compression size only included the size of the compressed relation. While the non-compressed relation is, in most cases, empty after compression, it still occupies some space. In other cases, data might be left in the non-compressed relation after compression, or it might contain garbage. In particular, Hypercore TAM retains indexes on the non-compressed relation, so the non-compressed relation's size should be included in the post-compression size.
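To make the accounting concrete, here is a minimal plain-C sketch of the idea. The names (`RelSize`, `after_compression_size`, the byte counts) are illustrative assumptions, not TimescaleDB's actual internals: the point is simply that the "after" size sums heap, TOAST, and index sizes of both the compressed and the non-compressed relation.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of a relation's on-disk footprint. */
typedef struct RelSize
{
	int64_t heap_bytes;  /* main heap size */
	int64_t toast_bytes; /* TOAST size */
	int64_t index_bytes; /* total size of all indexes */
} RelSize;

static int64_t
total_size(const RelSize *rel)
{
	return rel->heap_bytes + rel->toast_bytes + rel->index_bytes;
}

/*
 * "After" size as described above: include both the compressed relation and
 * the (usually small, but not free) non-compressed relation, whose indexes
 * Hypercore TAM keeps around.
 */
static int64_t
after_compression_size(const RelSize *compressed, const RelSize *noncompressed)
{
	return total_size(compressed) + total_size(noncompressed);
}

int
main(void)
{
	RelSize compressed = { .heap_bytes = 4 << 20, .toast_bytes = 1 << 20, .index_bytes = 512 << 10 };
	RelSize noncompressed = { .heap_bytes = 8 << 10, .toast_bytes = 0, .index_bytes = 256 << 10 };

	printf("after size: %lld bytes\n",
		   (long long) after_compression_size(&compressed, &noncompressed));
	return 0;
}
```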

Second, segmentwise compression didn't update the compression stats at all. As a consequence, the stats can become out of date after backfills and/or deletes that increase or decrease the amount of data. An extreme case occurs when Hypercore TAM is set as the default on a hypertable: new chunks are technically "compressed" by default, but empty, and all inserts are akin to backfill. In that case, no stats are created at all.

However, updating compression stats on segmentwise recompression is challenging because it is not possible to distinguish backfilled tuples (which increase the size) from tuples that were decompressed due to updates (which don't). Therefore, this change currently only updates the stats when the compressed relation is empty.
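A small sketch of that rule, again with hypothetical names (`CompressionStats`, `maybe_update_stats`) rather than the extension's real API: the stats are refreshed only when the compressed relation started out empty, since in that case the segmentwise recompression is effectively a full compression.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stats record; the real one lives in the TimescaleDB catalog. */
typedef struct CompressionStats
{
	int64_t before_total_bytes; /* size before compression */
	int64_t after_total_bytes;  /* size after compression (both relations) */
} CompressionStats;

/*
 * Backfilled tuples cannot be told apart from tuples decompressed by UPDATEs,
 * so only refresh the stats when the compressed relation was empty before the
 * recompression started.
 */
static bool
maybe_update_stats(bool compressed_rel_was_empty,
				   const CompressionStats *fresh, CompressionStats *stored)
{
	if (!compressed_rel_was_empty)
		return false; /* ambiguous case: keep the existing stats */

	*stored = *fresh;
	return true;
}

int
main(void)
{
	CompressionStats stored = { 0, 0 };
	CompressionStats fresh = { .before_total_bytes = 100 << 20, .after_total_bytes = 10 << 20 };

	if (maybe_update_stats(true, &fresh, &stored))
		printf("stats refreshed: before=%lld after=%lld\n",
			   (long long) stored.before_total_bytes,
			   (long long) stored.after_total_bytes);
	return 0;
}
```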

Fixes: #7713

Successfully merging this pull request may close these issues.

[Bug]: Invalid hypertable_compression_stats output until chunks are recompressed