
[zephyr/tokenize] Use bulk list-objects for file sizes, delete filescan job#4658

Open
rjpower wants to merge 4 commits into main from zephyr-fast-filescan

Conversation

@rjpower
Collaborator

@rjpower rjpower commented Apr 11, 2026

What

Replace per-file fsspec_size() stat RPCs with fsspec.glob(detail=True), which returns file sizes from the same list-objects API call that discovers files. Zero extra RPCs.

This eliminates the tokenize-filescan Zephyr job (#4341) that launched 32 distributed workers just to stat files one at a time. On nemotron hq_actual (2,755 files, 1 TB), glob(detail=True) takes ~2s — the same cost as a plain glob.

Zephyr changes

  • Dataset.from_files() now stores a lazy GlobSource op instead of eagerly globbing at construction time
  • compute_plan() resolves GlobSourceFileEntry objects (path + size) via resolve_glob()
  • InputFileSpec gains a size field populated from the bulk listing
  • _compute_file_pushdown now takes list[FileEntry] instead of list[str]
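
A minimal sketch of what the resolution step could look like (the `FileEntry` and `resolve_glob` names come from this PR, but the bodies here are illustrative; only the fsspec calls are real API):

```python
from dataclasses import dataclass

import fsspec


@dataclass(frozen=True)
class FileEntry:
    """Path plus size, both taken from a single list-objects response."""
    path: str
    size: int


def resolve_glob(pattern: str) -> list[FileEntry]:
    # url_to_fs picks the right filesystem implementation from the URL
    # scheme (gs://, s3://, hf://, or local). glob(detail=True) returns
    # {path: info_dict}, and info_dict already carries "size" and "type",
    # so no per-file stat calls are needed.
    fs, path = fsspec.core.url_to_fs(pattern)
    listing = fs.glob(path, detail=True)
    return [
        FileEntry(path=p, size=info["size"])
        for p, info in sorted(listing.items())
        if info.get("type") == "file"
    ]
```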

Tokenize changes

  • Delete the tokenize-filescan Zephyr job and all its machinery (fsspec_size, batched stat workers)
  • Add _glob_with_sizes() — takes a list of patterns, returns [{"filename", "size"}] using detail=True
  • Add _expand_tokenize_paths() — directory → recursive extension globs
  • Both TokenizeConfig and HfTokenizeConfig paths now go through the same glob → bundle flow
  • Remove dead InputName/ExecutorStep early-return branch

Benchmark

nemotron hq_actual (2,755 files, 1 TB):
  glob(detail=True): ~2s  (same cost as plain glob — sizes are free)
  Previous: 32-worker Zephyr job doing individual stats

detail=True works identically for gs://, hf://, s3://, and local filesystems.
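
The shape of the change can be illustrated side by side (a sketch: `sizes_per_file` mimics the deleted per-file stat path, not the actual removed code, and fsspec dispatches on the URL scheme in both cases):

```python
import fsspec


def sizes_per_file(paths: list[str]) -> dict[str, int]:
    # Old shape: one info() stat RPC per file, which is what the
    # 32-worker filescan job existed to parallelize.
    out: dict[str, int] = {}
    for p in paths:
        fs, path = fsspec.core.url_to_fs(p)
        out[path] = fs.info(path)["size"]
    return out


def sizes_bulk(pattern: str) -> dict[str, int]:
    # New shape: a single glob(detail=True) call; the list-objects
    # response already contains every file's size.
    fs, path = fsspec.core.url_to_fs(pattern)
    return {
        p: info["size"]
        for p, info in fs.glob(path, detail=True).items()
        if info.get("type") == "file"
    }
```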

Part of #4411, part of #4587

…-file stats

Replace per-file fsspec stat RPCs with fsspec glob(detail=True), which
returns file sizes from the same list-objects API call that discovers
files — zero extra RPCs needed.

Zephyr changes:
- Dataset.from_files() now stores a lazy GlobSource instead of eagerly
  globbing. Resolution happens at plan time in compute_plan().
- resolve_glob() uses glob(detail=True) to produce FileEntry objects
  (path + size) from the bulk listing response.
- InputFileSpec gains a size field, populated from the bulk listing.

Tokenize changes:
- Delete the tokenize-filescan Zephyr job (introduced in #4341) that
  launched 32 distributed workers just to stat files one at a time.
- Replace it with _glob_with_sizes() which gets sizes for free from
  the same glob call. Works for gs://, hf://, s3://, and local paths.
- Remove dead InputName/ExecutorStep early-return branch from
  _get_filepaths_to_tokenize.

Benchmark on nemotron hq_actual (2,755 files, 1 TB):
  glob(detail=True): ~2s (same cost as plain glob)
  Previous approach: 32-worker Zephyr job doing individual stats

Part of #4411, part of #4587
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 11, 2026
@claude
Contributor

claude bot commented Apr 11, 2026

Code Review in Progress

  • Check if PR needs review
  • Collect relevant CLAUDE.md / AGENTS.md files
  • Summarize PR changes
  • Run 4 parallel review agents
  • Validate each found issue
  • Post review (inline comments or summary)


@rjpower rjpower requested a review from ravwojdyla April 11, 2026 17:42
@rjpower rjpower enabled auto-merge (squash) April 11, 2026 17:42

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98604d2e02


@rjpower
Collaborator Author

rjpower commented Apr 12, 2026

@claude strip provenance.json from the validation/test sets as well, as per PR feedback

@claude
Contributor

claude bot commented Apr 12, 2026

Claude finished @rjpower's task in 2m 43s


Done — provenance.json filter applied to validation files

  • Read context and PR feedback
  • Read lib/marin/src/marin/processing/tokenize/tokenize.py
  • Locate provenance.json filtering logic (line 339 — train only)
  • Apply same filter to validation files (covers HF test split too, since HfTokenizeConfig loads data_files["test"] into validation_patterns)
  • Pre-commit: clean (ruff/black/pyrefly all ok)
  • Tests: uv run pytest tests/processing/tokenize/test_tokenize.py — 20 passed, 1 deselected (the @pytest.mark.slow integration test)
  • Commit and push (cbd0aee4b)

Change

train_file_stats = _glob_with_sizes(train_patterns)
train_file_stats = [f for f in train_file_stats if "provenance.json" not in f["filename"]]
validation_file_stats = _glob_with_sizes(validation_patterns)
validation_file_stats = [f for f in validation_file_stats if "provenance.json" not in f["filename"]]  # new

One-line mirror of the existing train filter in lib/marin/src/marin/processing/tokenize/tokenize.py. Since HfTokenizeConfig routes the test split through validation_patterns (line 333), this also covers HF test splits.

The bulk glob flow previously stripped provenance.json only from train
files, so directory-based validation inputs could include metadata JSON
in the tokenization stream. Apply the same filter to validation files.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
