fix: dedup chunks at 10-minute date_bin granularity #64
One HOUR of the largest project's day partition exceeds Arrow's i32
string-offset limit ('Offset overflow error: 2222394106' in prod) and
its materialization OOM-killed the 19:30Z container from a 34GB
baseline. 10-minute bins match the flush-bucket granularity: ~110k rows
/ 1-2GB strings per chunk for that project — under both the offset
limit and a sane transient peak. Probe lists only dup-containing bins,
so steady-state sweeps stay probe-only.
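
The offset overflow quoted above follows from simple arithmetic: Arrow's regular string arrays index their value buffer with i32 offsets, so one array can address at most i32::MAX bytes. A minimal sketch of that check, assuming the number in the error message is the byte offset that failed to fit (the function names are hypothetical, not from the PR):

```rust
/// Largest byte offset an i32-offset string array can address.
const MAX_I32_OFFSET: u64 = i32::MAX as u64; // 2_147_483_647

/// True if `total_bytes` of string data fits behind i32 offsets.
fn fits_i32_offsets(total_bytes: u64) -> bool {
    total_bytes <= MAX_I32_OFFSET
}

/// Bytes past the i32 limit, or 0 if the data fits.
fn overflow_bytes(total_bytes: u64) -> u64 {
    total_bytes.saturating_sub(MAX_I32_OFFSET)
}
```

With the value from the prod error, `fits_i32_offsets(2_222_394_106)` is false and `overflow_bytes(2_222_394_106)` is 74,910,459, so the hour chunk was roughly 75MB over the limit, while a 1-2GB chunk stays under it.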
Code Review
Overview
This is a focused, well-motivated hotfix that changes dedup chunking from 1-hour to 10-minute bins to address two distinct production failure modes: the Arrow i32 string-offset overflow and the container OOM kill.
The change is minimal (+6/-2) and directly targets the root cause. The logic is correct: 10-minute bins align with the flush-bucket granularity, so no duplicate group can ever straddle a bin boundary.
Correctness
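
The alignment argument can be illustrated by flooring timestamps to 10-minute bins. This is a sketch rather than the PR's code: it assumes Unix-second timestamps, an epoch origin for the bins, and flush buckets that use the same 10-minute grid. `BIN_SECS`, `bin_start`, and `same_bin` are hypothetical names.

```rust
/// Width of one dedup bin in seconds (10 minutes).
const BIN_SECS: i64 = 10 * 60;

/// Floor a Unix timestamp (seconds) to the start of its 10-minute bin,
/// similar in spirit to date_bin with an epoch origin.
/// rem_euclid keeps the floor correct for negative timestamps too.
fn bin_start(ts_secs: i64) -> i64 {
    ts_secs - ts_secs.rem_euclid(BIN_SECS)
}

/// True if two timestamps land in the same 10-minute bin.
fn same_bin(a: i64, b: i64) -> bool {
    bin_start(a) == bin_start(b)
}
```

Under those assumptions, any two rows written into the same flush bucket satisfy `same_bin`, so a duplicate group confined to one flush bucket is always processed within a single chunk.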
Issues
Minor: stale variable name
let mut hours = Vec::new(); // now holds 10-minute bin timestamps
Renaming to
Minor: stale prose comment (pre-existing, right time to update it): This block comment (around line 3195) still says "hour" after the bin granularity changed. Should read "bin" or "10-minute bin".
Test Coverage
The existing
These are lower-priority given the prod failure is deterministic and well-understood, but worth a follow-up.
Performance
Summary
The fix is correct and safe to merge. Two small follow-ups worth tracking:
🤖 Generated with Claude Code
Follow-up to #56/#61. The hour-chunked dedup still fails on the largest project: one hour holds >2.1GB of string data (Offset overflow error: 2222394106, from Arrow i32 offsets) and its materialization OOM-killed the 19:30Z container from a 34GB baseline. Binning by date_bin(INTERVAL '10 minutes', …), the flush-bucket granularity, caps a chunk at ~110k rows / 1-2GB strings for that project, under both the offset limit and a sane memory peak. The probe lists only duplicate-containing bins, so steady-state sweeps remain probe-only.
dedup_compaction_test: 2/2 green.
🤖 Generated with Claude Code
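
The probe idea can be sketched in memory. This is an illustration only, not the PR's query: it assumes rows reduce to a Unix-second timestamp plus a dedup key, and `bins_with_duplicates` is a hypothetical name. It returns only the bins where some key occurs more than once, so a sweep with no duplicates has nothing to materialize.

```rust
use std::collections::{BTreeSet, HashMap};

/// Width of one dedup bin in seconds (10 minutes).
const BIN_SECS: i64 = 10 * 60;

/// Floor a Unix timestamp (seconds) to the start of its 10-minute bin.
fn bin_start(ts_secs: i64) -> i64 {
    ts_secs - ts_secs.rem_euclid(BIN_SECS)
}

/// Return the start of every bin in which at least one key appears
/// more than once, in ascending order.
fn bins_with_duplicates(rows: &[(i64, &str)]) -> Vec<i64> {
    // Count occurrences of each (bin, key) pair.
    let mut counts: HashMap<(i64, &str), usize> = HashMap::new();
    for &(ts, key) in rows {
        *counts.entry((bin_start(ts), key)).or_insert(0) += 1;
    }
    // Keep only bins holding a repeated key; BTreeSet dedups and sorts.
    let dup_bins: BTreeSet<i64> = counts
        .into_iter()
        .filter(|&(_, n)| n > 1)
        .map(|((bin, _), _)| bin)
        .collect();
    dup_bins.into_iter().collect()
}
```

In this sketch an empty result corresponds to the probe-only steady state; the dedup work itself would run once per returned bin.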