
[zephyr/tokenize] Use bulk list-objects for file sizes, delete filescan job#4658

Open
rjpower wants to merge 4 commits into main from zephyr-fast-filescan

Conversation

@rjpower
Collaborator

@rjpower rjpower commented Apr 11, 2026

What

Replace per-file fsspec_size() stat RPCs with fsspec.glob(detail=True), which returns file sizes from the same list-objects API call that discovers files. Zero extra RPCs.

This eliminates the tokenize-filescan Zephyr job (#4341) that launched 32 distributed workers just to stat files one at a time. On nemotron hq_actual (2,755 files, 1 TB), glob(detail=True) takes ~2s — the same cost as a plain glob.

Zephyr changes

  • Dataset.from_files() now stores a lazy GlobSource op instead of eagerly globbing at construction time
  • compute_plan() resolves GlobSourceFileEntry objects (path + size) via resolve_glob()
  • InputFileSpec gains a size field populated from the bulk listing
  • _compute_file_pushdown now takes list[FileEntry] instead of list[str]
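
A minimal sketch of what the resolution step could look like (the `FileEntry` and `resolve_glob` names come from this PR, but the bodies here are illustrative; only the fsspec calls are real API):

```python
from dataclasses import dataclass

import fsspec


@dataclass(frozen=True)
class FileEntry:
    """Path plus size, both taken from a single list-objects response."""
    path: str
    size: int


def resolve_glob(pattern: str) -> list[FileEntry]:
    # url_to_fs picks the right filesystem implementation from the URL
    # scheme (gs://, s3://, hf://, or local). glob(detail=True) returns
    # {path: info_dict}, and info_dict already carries "size" and "type",
    # so no per-file stat calls are needed.
    fs, path = fsspec.core.url_to_fs(pattern)
    listing = fs.glob(path, detail=True)
    return [
        FileEntry(path=p, size=info["size"])
        for p, info in sorted(listing.items())
        if info.get("type") == "file"
    ]
```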

Tokenize changes

  • Delete the tokenize-filescan Zephyr job and all its machinery (fsspec_size, batched stat workers)
  • Add _glob_with_sizes() — takes a list of patterns, returns [{"filename", "size"}] using detail=True
  • Add _expand_tokenize_paths() — directory → recursive extension globs
  • Both TokenizeConfig and HfTokenizeConfig paths now go through the same glob → bundle flow
  • Remove dead InputName/ExecutorStep early-return branch

Benchmark

nemotron hq_actual (2,755 files, 1 TB):
  glob(detail=True): ~2s  (same cost as plain glob — sizes are free)
  Previous: 32-worker Zephyr job doing individual stats

detail=True works identically for gs://, hf://, s3://, and local filesystems.
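
The shape of the change can be illustrated side by side (a sketch: `sizes_per_file` mimics the deleted per-file stat path, not the actual removed code, and fsspec dispatches on the URL scheme in both cases):

```python
import fsspec


def sizes_per_file(paths: list[str]) -> dict[str, int]:
    # Old shape: one info() stat RPC per file, which is what the
    # 32-worker filescan job existed to parallelize.
    out: dict[str, int] = {}
    for p in paths:
        fs, path = fsspec.core.url_to_fs(p)
        out[path] = fs.info(path)["size"]
    return out


def sizes_bulk(pattern: str) -> dict[str, int]:
    # New shape: a single glob(detail=True) call; the list-objects
    # response already contains every file's size.
    fs, path = fsspec.core.url_to_fs(pattern)
    return {
        p: info["size"]
        for p, info in fs.glob(path, detail=True).items()
        if info.get("type") == "file"
    }
```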

Part of #4411, part of #4587

…-file stats

Replace per-file fsspec stat RPCs with fsspec glob(detail=True), which
returns file sizes from the same list-objects API call that discovers
files — zero extra RPCs needed.

Zephyr changes:
- Dataset.from_files() now stores a lazy GlobSource instead of eagerly
  globbing. Resolution happens at plan time in compute_plan().
- resolve_glob() uses glob(detail=True) to produce FileEntry objects
  (path + size) from the bulk listing response.
- InputFileSpec gains a size field, populated from the bulk listing.

Tokenize changes:
- Delete the tokenize-filescan Zephyr job (introduced in #4341) that
  launched 32 distributed workers just to stat files one at a time.
- Replace it with _glob_with_sizes() which gets sizes for free from
  the same glob call. Works for gs://, hf://, s3://, and local paths.
- Remove dead InputName/ExecutorStep early-return branch from
  _get_filepaths_to_tokenize.

Benchmark on nemotron hq_actual (2,755 files, 1 TB):
  glob(detail=True): ~2s (same cost as plain glob)
  Previous approach: 32-worker Zephyr job doing individual stats

Part of #4411, part of #4587
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 11, 2026
@claude
Contributor

claude bot commented Apr 11, 2026

Code Review in Progress

  • Check if PR needs review
  • Collect relevant CLAUDE.md / AGENTS.md files
  • Summarize PR changes
  • Run 4 parallel review agents
  • Validate each found issue
  • Post review (inline comments or summary)


@rjpower rjpower requested a review from ravwojdyla April 11, 2026 17:42
@rjpower rjpower enabled auto-merge (squash) April 11, 2026 17:42

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98604d2e02


@rjpower
Collaborator Author

rjpower commented Apr 12, 2026

@claude strip provenance.json from the validation/test sets as well, as per PR feedback

@claude
Contributor

claude bot commented Apr 12, 2026

Claude finished @rjpower's task in 2m 43s


Done — provenance.json filter applied to validation files

  • Read context and PR feedback
  • Read lib/marin/src/marin/processing/tokenize/tokenize.py
  • Locate provenance.json filtering logic (line 339 — train only)
  • Apply same filter to validation files (covers HF test split too, since HfTokenizeConfig loads data_files["test"] into validation_patterns)
  • Pre-commit: clean (ruff/black/pyrefly all ok)
  • Tests: uv run pytest tests/processing/tokenize/test_tokenize.py — 20 passed, 1 deselected (the @pytest.mark.slow integration test)
  • Commit and push (cbd0aee4b)

Change

train_file_stats = _glob_with_sizes(train_patterns)
train_file_stats = [f for f in train_file_stats if "provenance.json" not in f["filename"]]
validation_file_stats = _glob_with_sizes(validation_patterns)
validation_file_stats = [f for f in validation_file_stats if "provenance.json" not in f["filename"]]  # new

One-line mirror of the existing train filter in lib/marin/src/marin/processing/tokenize/tokenize.py. Since HfTokenizeConfig routes the test split through validation_patterns (line 333), this also covers HF test splits.

The bulk glob flow previously stripped provenance.json only from train
files, so directory-based validation inputs could include metadata JSON
in the tokenization stream. Apply the same filter to validation files.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
