Background
Currently we recommend running a compact job once per day. In short, it iterates through existing bucket_data and bucket_parameters, and removes duplicates.
We've gone through multiple iterations of how exactly this is implemented, but currently:
bucket_data is compacted by iterating through each bucket in bucket_state, checking whether it meets the compact threshold requirements, then compacting the data.
bucket_parameters is compacted by iterating through each lookup value, then removing duplicates.
In the past, we've had a variation where we only iterated through buckets in bucket_state with more than 10 new operations since the previous compact. An issue here was that you could have large buckets that continuously get new operations, and then the compact job repeatedly re-compacts the same bucket.
A further issue is that the compact job is not resumable: if it fails in the middle, you need to restart the process from scratch. It is also not feasible to compact more frequently without adding a lot of overhead.
The effects of these issues are often not visible until you reach high data volumes. We do, however, have cases of hundreds of millions of buckets on a single instance, where these issues become quite prevalent.
Proposal
The goal is to reduce the base overhead of both compact processes to make it possible to run them incrementally: it should be possible to run, say, once per hour instead of once per day, without significantly more overhead.
bucket_data
For bucket_data, we split the process into two steps:
1. Compute and persist the list of buckets to compact since the last run.
2. Iterate through the list of buckets and compact them.
For step 1, we utilize the index on bucket_state.last_op: This is already being used to track the buckets updated after any checkpoint. We can therefore keep track of the last checkpoint for which we computed the list of buckets to compact, and then use this to find any new buckets that we may need to compact. This gives us flexibility in what info we use to decide whether or not to compact a bucket, without needing to change indexes. We then store these buckets in a new collection, for example buckets_to_compact.
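Step 1 could be sketched as follows, using in-memory stand-ins for the collections. The document shapes, field names, and threshold below are illustrative assumptions, not the actual schema; in MongoDB the filter would be an indexed range query on bucket_state.last_op, and the queue writes would be upserts into buckets_to_compact.

```typescript
// Illustrative document shapes; the real bucket_state schema differs.
interface BucketState {
  bucket: string;
  last_op: number;  // op_id of the most recent operation in the bucket
  op_count: number; // operations since the last compact (illustrative)
}

interface BucketToCompact {
  bucket: string;
  queued_at_op: number;
}

const COMPACT_THRESHOLD = 10; // hypothetical minimum new ops before compacting

// Step 1: find buckets updated since the last processed checkpoint and
// persist them to a buckets_to_compact queue (in-memory stand-in here).
// Returns the new high-water mark for the next incremental run.
function queueBucketsToCompact(
  bucketState: BucketState[],
  lastProcessedOp: number,
  queue: Map<string, BucketToCompact>
): number {
  // In MongoDB this would be an indexed range query on bucket_state.last_op.
  const candidates = bucketState.filter((b) => b.last_op > lastProcessedOp);
  for (const b of candidates) {
    if (b.op_count >= COMPACT_THRESHOLD) {
      // Upsert: re-queuing an already-queued bucket only refreshes the
      // op_id, so repeated incremental runs stay cheap.
      queue.set(b.bucket, { bucket: b.bucket, queued_at_op: b.last_op });
    }
  }
  return candidates.reduce((max, b) => Math.max(max, b.last_op), lastProcessedOp);
}
```

Because the threshold check happens here rather than in an index, the criteria for "should compact" can change without any schema migration.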
Step 2 is then simple:
1. Iterate through buckets_to_compact, in any order.
2. Compact each bucket, and remove it from buckets_to_compact.
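Step 2 is essentially a drain loop over the queue. In this in-memory sketch (names are illustrative), an entry is only deleted after its bucket is compacted, which is what makes the job resumable: a crashed run can simply be restarted and picks up the remaining entries.

```typescript
// Drains the compact queue. Deleting an entry only after its bucket is
// compacted means a failed run can be restarted without losing progress.
function drainCompactQueue(
  queue: Map<string, { bucket: string }>,
  compactBucket: (bucket: string) => void
): string[] {
  const compacted: string[] = [];
  for (const [bucket] of queue) {
    compactBucket(bucket); // compact first...
    queue.delete(bucket);  // ...then remove from the queue
    compacted.push(bucket);
  }
  return compacted;
}
```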
This split approach means we can keep minimal state in bucket_state, keeping the replication write path clean. Any compact-specific state tracking can happen in the separate buckets_to_compact collection.
We could also do a form of compact progress tracking using the buckets_to_compact collection.
In the future, we could extend this to allow multiple concurrent compact processes: All we'd need additionally is some form of "locking" on the buckets_to_compact collection.
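One way the "locking" could work: each compactor atomically claims a queue entry with an expiring lease before working on it. In MongoDB this maps naturally onto a findOneAndUpdate whose filter excludes live leases; the sketch below models that with an in-memory array, and all names and the lease duration are assumptions.

```typescript
interface QueueEntry {
  bucket: string;
  lease_owner: string | null;
  lease_expires: number; // epoch millis; 0 = never claimed
}

const LEASE_MS = 60_000; // hypothetical lease duration

// Claim one unclaimed (or lease-expired) entry for this worker. A real
// implementation would use an atomic findOneAndUpdate so that two workers
// can never claim the same bucket concurrently; the expiry makes leases
// from crashed workers reclaimable.
function claimBucket(
  queue: QueueEntry[],
  worker: string,
  now: number
): QueueEntry | null {
  const entry = queue.find(
    (e) => e.lease_owner === null || e.lease_expires <= now
  );
  if (entry === undefined) return null;
  entry.lease_owner = worker;
  entry.lease_expires = now + LEASE_MS;
  return entry;
}
```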
bucket_parameters
The bucket_parameters documents are already indexed by op_id sequence, correlated with checkpoints. The compact process can iterate through all the documents added since the last run, and compact on those (lookup, key) values only.
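The incremental bucket_parameters pass could be sketched like this: collect the (lookup, key) pairs touched since the last run's op_id watermark, then for each touched pair keep only the document with the highest op_id. The document shape is illustrative, not the actual schema.

```typescript
interface ParameterDoc {
  op_id: number;
  lookup: string;
  key: string;
}

// Given all parameter docs plus the op_id watermark of the previous run,
// return the op_ids of superseded documents that can be deleted. Only
// (lookup, key) pairs with at least one new document are considered, so
// the work scales with recent writes rather than total data size.
function findSupersededParameterDocs(
  docs: ParameterDoc[],
  lastCompactedOp: number
): number[] {
  const pairKey = (d: ParameterDoc) => `${d.lookup}\u0000${d.key}`;
  // Pairs touched since the last run.
  const touched = new Set(
    docs.filter((d) => d.op_id > lastCompactedOp).map(pairKey)
  );
  // Latest op_id per touched pair.
  const latest = new Map<string, number>();
  for (const d of docs) {
    const k = pairKey(d);
    if (!touched.has(k)) continue;
    latest.set(k, Math.max(latest.get(k) ?? 0, d.op_id));
  }
  // Everything older than the latest entry for its pair is a duplicate.
  return docs
    .filter((d) => touched.has(pairKey(d)) && d.op_id < (latest.get(pairKey(d)) ?? 0))
    .map((d) => d.op_id);
}
```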
Splitting compact
We should add new CLI options to compact either only bucket_data, or only bucket_parameters. Splitting this will help make it more robust: At the moment, we have cases of bucket_data compact never finishing, which then means bucket_parameters is never compacted, causing performance issues. By splitting these, we can ensure each gets a chance to compact.
The default option for the compact command should remain to compact both.
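In the CLI layer, the split could look something like this. The flag names (--only-bucket-data, --only-parameters) are placeholders for illustration, not the actual option names:

```typescript
// Illustrative argument parsing for a split compact command.
// Flag names are placeholders; the real CLI options may differ.
interface CompactOptions {
  compactBucketData: boolean;
  compactBucketParameters: boolean;
}

function parseCompactArgs(argv: string[]): CompactOptions {
  const onlyData = argv.includes("--only-bucket-data");
  const onlyParams = argv.includes("--only-parameters");
  if (onlyData && onlyParams) {
    throw new Error(
      "Specify at most one of --only-bucket-data / --only-parameters"
    );
  }
  return {
    // Default: compact both, matching the current behavior.
    compactBucketData: !onlyParams,
    compactBucketParameters: !onlyData,
  };
}
```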
Postgres storage
On Postgres storage, we currently only compact bucket_data. We should extend this to also compact bucket_parameters.
MongoDB Storage: Read from secondaries
The compact job can produce a high read load on the storage database. We can make the readPreference for compact jobs configurable, potentially defaulting to readPreference: secondaryPreferred.
Right now it may be possible to set readPreference in the connection URI, but it would make more sense to configure it for compacting specifically.
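As a sketch, such a setting could live on the compact job's own options rather than on the shared connection. The option name and default below are assumptions, not an existing API; the mode strings are the standard MongoDB read preference modes.

```typescript
// Hypothetical compact-specific options; the name and default are
// illustrative, not an existing config key.
interface CompactJobOptions {
  // Read preference used only for compact reads, independent of the
  // connection URI. Secondary reads keep load off the primary.
  readPreference?:
    | "primary"
    | "primaryPreferred"
    | "secondary"
    | "secondaryPreferred"
    | "nearest";
}

const defaults: Required<CompactJobOptions> = {
  readPreference: "secondaryPreferred",
};

function resolveCompactOptions(
  opts: CompactJobOptions = {}
): Required<CompactJobOptions> {
  return { ...defaults, ...opts };
}
```

Scoping the option to the compact job means replication and sync reads keep their existing read preference even when compacting is pushed to secondaries.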