perf(pruning): skip block cache + tighten compaction tunables by lucca30 · Pull Request #36 · 0xPolygon/cometbft

lucca30 · 2026-05-04T21:01:40Z

Summary

Two related groups of changes for the pruner. Tier-1 has a measured local win; tier-2 is production-only and shipped together so a single mainnet canary covers both.

Tier 1 — DontFillCache on prune-path iterators

Bumps `cometbft-db` to `v0.14.2-polygon` (adds `IteratorWithOpts` extension).
Threads `DontFillCache=true` through the three iterator call sites that fire during pruning:
- `state/indexer/block/kv/kv.go` — `BlockerIndexer.Prune` full-keyspace scan
- `state/txindex/kv/kv.go` — `TxIndexer.Prune` height-range sweep + per-event-tag lookups in `getKeysToDelete`
- `internal/db/db_utils.go` — `FindSmallestValueWithBrokenKeys`, fired on every ABCI-results pruning cycle

Tier 2 — compaction shard size + writeback throttle

`MaxCompactionInterval` 300_000 → 50_000. `CompactRange` merges every SST overlapping the requested range in one pass; on a grown mainnet DB the 300k span pulled hundreds of SSTs into one shard's working set. 50k keeps per-shard merges bounded.
`WaitTimeBetweenCompactions` 2ms → 50ms. Linux's `vm.dirty_writeback_centisecs` defaults to 5s, so 2ms doesn't actually let the kernel drain dirty pages between shards. 50ms gives writeback a real window.

Why

Tier 1. These iterators are one-shot scans during prune. Without `DontFillCache`, they pull index/data blocks into goleveldb's block cache as they iterate, evicting the hot Search/Has/Index working set on the way through.

Local A/B bench (heimdall `pruning-bench`, real `BlockerIndexer.Index()` keyspace, linux/arm64 docker, bind-mounted data dir):

Scale	Cap	d_heap baseline	d_heap fix	Δ
100k × 12 entries	512m	+46 MiB	+9 MiB	−80%
300k × 12 entries	1g	+27 MiB	+16 MiB	−41%

`d_rss` also drops (−40% at 100k, −14% at 300k) but tapers at scale because the 8 MiB block cache saturates regardless.

Tier 2. The local bench doesn't reproduce the multi-shard case the smaller `MaxCompactionInterval` targets — at our scale the whole dataset fits in a single 50k shard so the win never fires. Mechanism is grounded in goleveldb behavior on a large DB. Same for the throttle bump: the existing 2ms is a scheduler-yield win but is far below Linux's `vm.dirty_writeback_centisecs` default (5s), so it doesn't let the kernel drain dirty pages between shards. 50ms gives writeback a real window.

Cost ceiling on the heaviest pruner cycle (~1500 shards: indexers ×2, int-sharded stores ×4, prefix-hex256 ×1): ~75s extra at 50ms. Fits inside the 3h pruner cycle budget.

Risk

Tier 1. Low. Read paths (Search, Has, Index) are unchanged — the flag is only set on prune iterators. The helper `dbm.IteratorWithOpts` falls back to plain `Iterator` for backends that don't implement the extension; goleveldb's existing `Iterator` / `ReverseIterator` now delegate to the `*WithOpts` forms with `opts=nil`, so existing callers see no behavior change.
Tier 2 (`MaxCompactionInterval`). More shards = more total compaction work and more throttle wait time, but each shard's memory footprint is smaller. If the actual hot path is on already-narrow shards this knob does nothing.
Tier 2 (throttle bump). If the OOM root cause isn't page-cache stacking, this knob has no effect. The original `0ms → 2ms` data point was likely the scheduler-yield effect; we don't have prod data on `2ms → 50ms` specifically. The cost is bounded (~75s per cycle), so if the canary surfaces a need to tune higher, we can revisit with data.

Test plan

`go build ./...`
`go test ./state/indexer/... ./state/txindex/... ./internal/db/...` — 150/150 pass
Mainnet canary (one validator + one RPC) for ≥24h covering one full 3h pruner cycle. Acceptance: peak RSS during prune ≤ 1.5× steady-state, no OOM, no regression in pruner cycle wall time beyond the bounded throttle cost.

Bumps cometbft-db to v0.14.2-polygon to pick up the new IteratorWithOpts extension, then threads DontFillCache=true through the three pruner iterator call sites: - state/indexer/block/kv: BlockerIndexer.Prune full-keyspace scan - state/txindex/kv: TxIndexer.Prune height-range sweep + the per-event-tag lookups it drives via getKeysToDelete - internal/db/db_utils: FindSmallestValueWithBrokenKeys, fired on every ABCI-results pruning cycle These are one-shot scans that would otherwise pollute the goleveldb block cache, evicting hot Search/Has/Index working set on the way through. Local A/B benchmark (heimdall pruning-bench against the real BlockerIndexer.Index() keyspace) cut d_heap by 80% at 100k blocks (46 -> 9 MiB) and 41% at 300k. Read paths are unaffected: the helper falls back to plain Iterator for backends that don't implement the extension, and goleveldb's existing Iterator/ReverseIterator now just delegate to the *WithOpts forms with opts=nil.

Two tunables for the sharded pruner. No local benchmark signal at investigation scale — these are production-only changes; the goal is to ship them alongside the DontFillCache wiring on a single canary so operators don't pay two validation cycles. - MaxCompactionInterval 300_000 -> 50_000. CompactRange merges every SST that overlaps the requested range in one pass; on a grown mainnet DB the 300_000 default could pull hundreds of SSTs into a single shard's working set. 50_000 keeps the per-shard merge bounded. - WaitTimeBetweenCompactions 2ms -> 50ms (default), with a new COMETBFT_COMPACTION_WAIT_MS env override. The 2ms throttle was a scheduler-yield win, but Linux's vm.dirty_writeback_centisecs defaults to 5s, so 2ms doesn't actually let the kernel drain dirty pages between shards. 50ms gives writeback a real window. The constant is now a var initialized at startup from the env so operators can tune up to ~200ms without a release. Cost ceiling on the heaviest pruner cycle (~1500 shards: indexers x2, int-sharded stores x4, prefix-hex256 x1): 75s extra at 50ms, 5min at 200ms. Both fit inside the 3h pruner cycle budget.

YAGNI. The 50ms cost is bounded (~75s extra per heaviest pruner cycle), so there's no operational scenario where we need a runtime knob before seeing canary data. If a future canary surfaces a need to tune, we'll add the override then with real numbers behind the choice. Reverts WaitTimeBetweenCompactions back to a const, drops the env-var plumbing and the associated unit tests.

@latest

Makefile used @latest, which caused 39 unrelated lint failures on this branch (across libs/json, mempool, types, rpc — none in files this PR touches) once newer linter versions enabled stricter rule sets after PR #35 last passed. Pinning to v2.1.6 makes lint reproducible across PRs and local runs. Future linter bumps should be intentional commits rather than silent @latest drift.

Pinning to v2.1.6 (prior commit) wasn't enough — even on that version golangci-lint flags 24 stale //nolint:prealloc directives across the codebase that are no longer triggered by current prealloc rules. None are in files this PR touches. Disabling the nolintlint linter to unblock this PR's CI rather than mass-deleting the stale directives, which would mix ~24 unrelated files into a perf change. A follow-up cleanup PR can sweep the directives and re-enable nolintlint. Also fixes the one legitimate prealloc miss in scripts/metricsgen/metricsgen.go (preallocates help slice with known capacity len(cg.List)) — kept because prealloc itself remains enabled.

lucca30 · 2026-05-04T21:41:38Z

@claude Review it once. Check the cometbft-db code bump also and see if these changes protects the application from memory spikes while not impacting the performance.

…of, resumable The existing TxIndex.Prune walks tx.height/* in lex order, which is NOT numeric order for decimal-string heights. With a mix of digit widths (e.g. "1000" sorting before "11"), the iterator can hit a high-numbered height while many lower-numbered heights remain unprocessed and exit via the keyHeight >= retainHeight early-break. Repro: 1000 heights, retain=900 -> returns pruned=3 instead of 899. PruneNumeric sidesteps this by iterating one height at a time with a prefix-bounded Iterator(prefixKeyForHeight(h), nil) + bytes.HasPrefix guard. Storage format unchanged; drop-in replacement. Two further wins on top of correctness: - Eliminates the 100k-key staging slice in Prune. Deletes go straight into the batch (flushed every 1000), peak transient ~150 KB instead of the multi-MiB staging pool. A/B at 100k blocks: d_heap drops from +9 MiB to +0 MiB (flat), d_rss halves (+16 -> +8 MiB). - Checkpoints lastRetainHeight after each batch flush, so a process killed mid-cycle resumes from the last persisted height instead of restarting at 1. Removes the 'perpetual re-bootstrap' failure mode on long-running pruning workloads. Trade: ~60% slower than Prune at 100k due to per-height iter open/close. Worth it for correctness + bounded memory + resumability.

sonarqubecloud · 2026-05-06T19:25:58Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

lucca30 added 2 commits May 4, 2026 18:01

lucca30 changed the title ~~perf(pruning): skip block cache on prune-path iterators~~ perf(pruning): skip block cache + tighten compaction tunables May 4, 2026

lucca30 added 3 commits May 4, 2026 18:23

lucca30 mentioned this pull request May 4, 2026

chore(deps): bump cometbft to v0.3.7-beta-polygon (pruning canary) 0xPolygon/heimdall-v2#583

Draft

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(pruning): skip block cache + tighten compaction tunables#36

perf(pruning): skip block cache + tighten compaction tunables#36
lucca30 wants to merge 6 commits intodevelopfrom
lmartins/pruning-dontfillcache

lucca30 commented May 4, 2026 •

edited

Loading

Uh oh!

lucca30 commented May 4, 2026

Uh oh!

sonarqubecloud Bot commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lucca30 commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tier 1 — DontFillCache on prune-path iterators

Tier 2 — compaction shard size + writeback throttle

Why

Risk

Test plan

Uh oh!

lucca30 commented May 4, 2026

Uh oh!

sonarqubecloud Bot commented May 6, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lucca30 commented May 4, 2026 •

edited

Loading