Commit 9c0d499
seg: SamplingFactor>1 produced only 1 sample (erigontech#21806)
## Problem
`Compressor.AddWord` decided when a superstring window was full from
`len(c.superstring)`. But skipped (non-sampled) windows never append to
that buffer, so once `SamplingFactor>1` reached the first skipped window
the buffer stopped growing, the overflow branch never fired again, the
window counter froze, and **the pattern dictionary was built from the
first `superstringLimit` (16MB) of the file only.**
This affects every `SamplingFactor>1` user: `seg.DefaultCfg` and
`BlockCompressCfg` both use `SamplingFactor=4`, so tx/bodies/headers
segments have been built from first-16MB dictionaries.
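
The freeze can be reproduced in a toy model. The sketch below is not the erigon code: `countSamplesOld`, `makeWords`, the 100-byte `limit`, and the rule "window i is sampled when i % SamplingFactor == 0" are all assumptions made for illustration. It only mirrors the described behaviour, where fullness is judged from the buffer length and skipped windows never append.

```go
package main

import "fmt"

// limit is a toy stand-in for superstringLimit (16MB in the description).
const limit = 100

// countSamplesOld is a hypothetical model of the old check. It returns how
// many sampled buffers would be handed off for the given input.
func countSamplesOld(words [][]byte, samplingFactor int) int {
	var buf []byte // stands in for c.superstring
	window, samples := 0, 0
	for _, w := range words {
		// Old check: window fullness is derived from the buffer length.
		if len(buf)+len(w) > limit {
			if window%samplingFactor == 0 {
				samples++
			}
			window++
			buf = buf[:0]
		}
		// Skipped windows never append, so len(buf) stays 0 there and
		// the overflow check above can never fire again.
		if window%samplingFactor == 0 {
			buf = append(buf, w...)
		}
	}
	if window%samplingFactor == 0 {
		samples++ // final partial window
	}
	return samples
}

// makeWords builds n zero-filled words of the given size.
func makeWords(n, size int) [][]byte {
	words := make([][]byte, n)
	for i := range words {
		words[i] = make([]byte, size)
	}
	return words
}

func main() {
	words := makeWords(90, 10) // 900 bytes, i.e. nine 100-byte windows
	fmt.Println(countSamplesOld(words, 1)) // prints 9
	fmt.Println(countSamplesOld(words, 4)) // prints 1
}
```

In this model a correct every-4th-window sample of nine windows would hand off three buffers (windows 0, 4 and 8); the old check hands off only the first.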
## Fix
Track scanned bytes per window in `scannedBytes`, advanced on **every**
word regardless of sampling, and use it (not `len(c.superstring)`) for
the overflow check. Window boundaries now advance through skipped
windows, so `SamplingFactor` honestly samples every Nth window across
the whole file.
While here, the windowing logic moved into a small `advanceScan` helper
so the byte accounting can't drift from the rollover it feeds, and the
send condition is now `len(superstring) > 0` (non-empty ⟺ sampled): this
stops pushing empty buffers to the workers and only fetches a fresh pool
buffer when one is actually handed off.
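
A minimal sketch of the accounting described above, under stated assumptions: the `scanner` type, its fields, the 100-byte `limit`, and the "window i is sampled when i % SamplingFactor == 0" rule are hypothetical stand-ins, and only the names `scannedBytes` and `advanceScan` come from the description. The real `Compressor` differs in detail.

```go
package main

import "fmt"

// limit is a toy stand-in for superstringLimit (16MB in the description).
const limit = 100

// scanner is a hypothetical, stripped-down stand-in for Compressor.
type scanner struct {
	samplingFactor int
	scannedBytes   int    // bytes seen in the current window, sampled or not
	window         int    // index of the current window
	buf            []byte // sampled bytes only; stays empty in skipped windows
	sent           int    // buffers handed off to the workers
}

// advanceScan accounts for n scanned bytes, rolls the window over when the
// limit would be exceeded, and reports whether the current window is sampled.
func (s *scanner) advanceScan(n int) bool {
	if s.scannedBytes+n > limit {
		s.flush()
		s.window++
		s.scannedBytes = 0
	}
	s.scannedBytes += n
	return s.window%s.samplingFactor == 0
}

// flush hands off the buffer only when it is non-empty, which in this model
// is the case exactly when the window was sampled.
func (s *scanner) flush() {
	if len(s.buf) > 0 {
		s.sent++
		s.buf = s.buf[:0] // real code would take a fresh pool buffer here
	}
}

// addWord appends the word only in sampled windows.
func (s *scanner) addWord(w []byte) {
	if s.advanceScan(len(w)) {
		s.buf = append(s.buf, w...)
	}
}

// scan feeds n words of the given size and returns (windows, buffers sent).
func scan(n, size, samplingFactor int) (int, int) {
	s := &scanner{samplingFactor: samplingFactor}
	for i := 0; i < n; i++ {
		s.addWord(make([]byte, size))
	}
	s.flush() // final partial window
	return s.window + 1, s.sent
}

func main() {
	for _, sf := range []int{1, 4} {
		windows, sent := scan(90, 10, sf)
		fmt.Printf("SF=%d windows=%d sent=%d\n", sf, windows, sent)
	}
	// prints:
	// SF=1 windows=9 sent=9
	// SF=4 windows=9 sent=3
}
```

Because the rollover lives in the same helper that advances the byte count, the window count comes out the same for both sampling factors, and empty buffers from skipped windows are never sent.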
## Scope / impact
- **Affected:** `seg.DefaultCfg`, `BlockCompressCfg`
(`SamplingFactor=4`) — tx/bodies/headers segments now build from a true
25% whole-file sample (slower build, smaller output; offset by erigontech#21625's
matcher speedups).
- **Not affected:** `DomainCompressCfg` / `HistoryCompressCfg`
(`SamplingFactor=1`) — for `SamplingFactor=1` the new accounting is
provably identical to the old, so domain/history snapshots stay
**byte-identical** (verified by the unchanged checksum tests).
Addresses the core bug in erigontech#21628. Per-config `SamplingFactor` re-tuning
(the issue's other open thread) is intentionally left out of this PR.
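
The `SamplingFactor=1` equivalence can be illustrated with a toy comparison. The `rollovers` function and its inputs are hypothetical and do not reproduce the real byte encoding; the sketch only shows the reasoning: when every window is sampled, every word is appended, so the buffer length and the scanned-byte count are equal at every step and both checks fire at the same words.

```go
package main

import "fmt"

// rollovers returns the word indices at which a window boundary fires.
// useScanned selects the new check (scanned bytes) over the old one
// (buffer length). Hypothetical model with the sampling factor fixed at 1.
func rollovers(sizes []int, limit int, useScanned bool) []int {
	var out []int
	bufLen, scanned := 0, 0
	for i, n := range sizes {
		fill := bufLen
		if useScanned {
			fill = scanned
		}
		if fill+n > limit {
			out = append(out, i)
			bufLen, scanned = 0, 0
		}
		scanned += n // advanced on every word
		bufLen += n  // advanced because the window is sampled (always, at SF=1)
	}
	return out
}

func main() {
	sizes := []int{30, 50, 40, 10, 90, 20, 70, 5, 60}
	fmt.Println(rollovers(sizes, 100, false)) // prints [2 4 5 8]
	fmt.Println(rollovers(sizes, 100, true))  // prints [2 4 5 8]
}
```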
## Testing
- New `TestCompressSamplingCoversWholeFile` pins the invariant "the
number of windows a file splits into must not depend on
`SamplingFactor`" — fails on the old code (SF=1 → 90 windows, SF=4 →
stuck at 1), passes after the fix.
- Full `db/seg` suite passes under `-race`; existing checksum-asserting
tests unchanged (single-window output byte-identical).

1 parent 3fd3193 commit 9c0d499
2 files changed
Lines changed: 61 additions & 10 deletions