Skip to content

fix: dedup chunks at 10-minute date_bin granularity#64

Merged
tonyalaribe merged 1 commit into
masterfrom
fix/dedup-10min-chunks
Jun 11, 2026
Merged

fix: dedup chunks at 10-minute date_bin granularity#64
tonyalaribe merged 1 commit into
masterfrom
fix/dedup-10min-chunks

Conversation

@tonyalaribe

Copy link
Copy Markdown
Contributor

Follow-up to #56/#61. The hour-chunked dedup still fails on the largest project: one hour holds >2.1GB of string data (Offset overflow error: 2222394106 — Arrow i32 offsets) and its materialization OOM-killed the 19:30Z container from a 34GB baseline. Binning by date_bin(INTERVAL '10 minutes', …) — the flush-bucket granularity — caps a chunk at ~110k rows / 1-2GB strings for that project, under both the offset limit and a sane memory peak. The probe lists only duplicate-containing bins, so steady-state sweeps remain probe-only. dedup_compaction_test 2/2 green.

🤖 Generated with Claude Code

One HOUR of the largest project's day partition exceeds Arrow's i32
string-offset limit ('Offset overflow error: 2222394106' in prod) and
its materialization OOM-killed the 19:30Z container from a 34GB
baseline. 10-minute bins match the flush-bucket granularity: ~110k rows
/ 1-2GB strings per chunk for that project — under both the offset
limit and a sane transient peak. Probe lists only dup-containing bins,
so steady-state sweeps stay probe-only.
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Code Review

Overview

This is a focused, well-motivated hotfix that changes dedup chunking from 1-hour to 10-minute bins to address two distinct production failure modes:

  1. Arrow i32 offset overflow — one hour of the largest project exceeds 2.1 GB of string data (observed: 2,222,394,106 bytes > INT32_MAX).
  2. OOM kill — materializing a full hour during the rewrite was allocating tens of GB transiently.

The change is minimal (+6/-2) and directly targets the root cause. The logic is correct: 10-minute bins align with the flush-bucket granularity, so no duplicate group can ever straddle a bin boundary.


Correctness

  • date_bin alignment is correct. date_bin(INTERVAL '10 minutes', ts, TIMESTAMP '1970-01-01T00:00:00') bins to :00/:10/:20/:30/:40/:50 of every hour because 10 minutes divides evenly into 60. This is consistent with the flush-bucket granularity described in the PR.
  • Sealed-chunk guard remains correct. sealed_before = now - 2h with the check if end > sealed_before { skip } — a 10-minute bin is only rewritten once it has been closed for at least 2 hours. The 2-hour margin was designed around a ~70-minute buffer window, which still holds here.
  • Probe and rewrite are consistent. The probe groups by 10-minute bins, and the rewrite filter uses the same 10-minute [start, start+10min) window. No alignment mismatch.

Issues

Minor — stale variable name hours (pre-existing, but newly misleading after this change):

let mut hours = Vec::new();  // now holds 10-minute bin timestamps

Renaming to bins would prevent future confusion when someone reads the loop.

Minor — stale prose comment (pre-existing, right time to update it):

// Rewriting an hour that late data may still flush into races
// replace_where against the append...

This block comment (around line 3195) still says "hour" after the bin granularity changed. Should read "bin" or "10-minute bin".


Test Coverage

The existing dedup_compaction_collapses_cross_flush_duplicates test pins the timestamp 3 hours in the past — well outside the 2-hour sealed window — so the 10-minute vs. 1-hour distinction doesn't exercise any new code path. The test validates that dedup works at all with the new SQL, which is valuable, but it doesn't cover:

  • Bin boundary alignment: rows at :09:59 and :10:00 landing in different bins and being rewritten independently.
  • The sealed-chunk filter at the new granularity: a bin from 1h55m ago should be skipped; a bin from 2h05m ago should be rewritten.

These are lower-priority given the prod failure is deterministic and well-understood, but worth a follow-up.


Performance

  • The probe aggregation cost is unchanged (same GROUP BY, different bin expression).
  • The rewrite now processes up to 6× more bins, but each is 6× smaller — net materialization cost is similar, and peak transient allocation per chunk drops from ~1–2 GB to ~110–200 MB, well within both the Arrow i32 limit and any reasonable container memory budget.
  • Probe-only bins (no duplicates) are still skipped, so steady-state overhead is unaffected.

Summary

The fix is correct and safe to merge. Two small follow-ups worth tracking:

  1. Rename hoursbins in dedup_partition to match the new semantics.
  2. Update the "Rewriting an hour..." prose comment to say "bin".
  3. (Optional) Add a targeted test for bin boundary behavior and the sealed-chunk filter at 10-minute granularity.

🤖 Generated with Claude Code

@tonyalaribe tonyalaribe merged commit 26ea7a4 into master Jun 11, 2026
10 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant