Commit 23ae265
tokenize: size window + levanter batch from parquet row groups (#5158)
* size zephyr window and levanter cache `batch_size` from parquet
row-group metadata so each unit of work aligns with ~half a row group
end-to-end
* probe first parquet file's footer via `_avg_parquet_row_group_rows`,
then set `window = min(avg_rows_per_rg // 2, 64)` and `batch_size =
avg_rows_per_rg // 2`
* halving gives zephyr headroom to pipeline two windows per row group
and caps per-worker peak memory
* non-parquet inputs keep the old defaults (`window=64`, `batch_size`
from `config.levanter_batch_size`)
* caller-supplied `config.levanter_batch_size` still wins over the
row-group-derived default
* extract `_MAX_WINDOW_SIZE = 64` constant [^1]
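
The sizing rule in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `avg_row_group_rows` and `pick_sizes` are hypothetical names, the footer counts are passed in as plain integers (with pyarrow they would come from `pyarrow.parquet.ParquetFile(path).metadata.num_rows` and `.num_row_groups`), and the clamp to at least 1 is an added assumption for tiny row groups that the commit message does not mention.

```python
from typing import Optional, Tuple

_MAX_WINDOW_SIZE = 64  # cap named in the commit message


def avg_row_group_rows(num_rows: int, num_row_groups: int) -> Optional[int]:
    """Average rows per row group, or None if the footer lists no row groups.

    Hypothetical stand-in for the footer probe; takes the two counts directly
    so the sketch has no parquet dependency.
    """
    if num_row_groups <= 0:
        return None
    return num_rows // num_row_groups


def pick_sizes(
    avg_rows_per_rg: Optional[int],
    configured_batch_size: Optional[int],
) -> Tuple[int, Optional[int]]:
    """Return (window, batch_size) following the rule described above.

    avg_rows_per_rg is None for non-parquet inputs, which keep the old
    defaults: window=64 and whatever batch size the caller configured.
    """
    if avg_rows_per_rg is None:
        return _MAX_WINDOW_SIZE, configured_batch_size

    # Half a row group per unit of work. The max(1, ...) clamp is an
    # assumption of this sketch, not something the commit states.
    half = max(1, avg_rows_per_rg // 2)
    window = min(half, _MAX_WINDOW_SIZE)

    # A caller-supplied batch size wins over the row-group-derived default.
    batch_size = configured_batch_size if configured_batch_size is not None else half
    return window, batch_size


if __name__ == "__main__":
    avg = avg_row_group_rows(num_rows=1000, num_row_groups=10)
    print(pick_sizes(avg, None))
    print(pick_sizes(1000, 256))
    print(pick_sizes(None, 128))
```

In this sketch the window is capped at 64 but the derived batch size is not, matching the two formulas in the message.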
CC: @rjpower
[^1]: rationale for the 64 cap lives in #2829 (comment)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 30f6b6c commit 23ae265
1 file changed
Lines changed: 42 additions & 4 deletions
| Original file lines | New file lines | Change |
|---|---|---|
| (none) | 22 | 1 line added |
| (none) | 48-63 | 16 lines added |
| (none) | 416-438 | 23 lines added |
| 406-408 | 446 | 3 lines removed, 1 added |
| 414 | 452 | 1 line removed, 1 added |