[zephyr/tokenize] Use bulk list-objects for file sizes, delete filescan job#4658
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 98604d2e02
@claude strip provenance.json from the validation/test sets as well, as per PR feedback
Claude finished @rjpower's task in 2m 43s. Done: provenance.json filter applied to validation files.
Change:

```python
train_file_stats = _glob_with_sizes(train_patterns)
train_file_stats = [f for f in train_file_stats if "provenance.json" not in f["filename"]]
validation_file_stats = _glob_with_sizes(validation_patterns)
validation_file_stats = [f for f in validation_file_stats if "provenance.json" not in f["filename"]]  # new
```

One-line mirror of the existing train filter.
The bulk glob flow previously stripped provenance.json only from train files, so directory-based validation inputs could include metadata JSON in the tokenization stream. Apply the same filter to validation files. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
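For context, here is a minimal sketch of what a helper like `_glob_with_sizes` could look like. The helper name and return shape (`[{"filename", "size"}]`) come from the PR; the body below is an assumption built on fsspec's `glob(detail=True)`, not the actual implementation:

```python
import fsspec


def glob_with_sizes(patterns: list[str]) -> list[dict]:
    """Return [{"filename", "size"}] for all matches, using one bulk
    listing call per pattern via glob(detail=True)."""
    results = []
    for pattern in patterns:
        fs, _, _ = fsspec.get_fs_token_paths(pattern)
        # detail=True returns {path: info}; "size" comes from the same
        # listing response that discovered the path, so no extra stat RPCs.
        for path, info in fs.glob(pattern, detail=True).items():
            if info.get("type") == "file":
                results.append({"filename": path, "size": info["size"]})
    return results
```

Since fsspec dispatches on the URL scheme, the same code path covers `gs://`, `s3://`, `hf://`, and local paths.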

What
Replace per-file `fsspec_size()` stat RPCs with `fsspec.glob(detail=True)`, which returns file sizes from the same list-objects API call that discovers files. Zero extra RPCs.

This eliminates the `tokenize-filescan` Zephyr job (#4341) that launched 32 distributed workers just to stat files one at a time. On nemotron hq_actual (2,755 files, 1 TB), `glob(detail=True)` takes ~2s, the same cost as a plain glob.

Zephyr changes
- `Dataset.from_files()` now stores a lazy `GlobSource` op instead of eagerly globbing at construction time
- `compute_plan()` resolves `GlobSource` → `FileEntry` objects (path + size) via `resolve_glob()`
- `InputFileSpec` gains a `size` field populated from the bulk listing
- `_compute_file_pushdown` now takes `list[FileEntry]` instead of `list[str]`
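The lazy-glob flow on the Zephyr side can be sketched roughly as follows. `GlobSource`, `FileEntry`, and `resolve_glob` are names from the PR, but their definitions here are assumptions for illustration:

```python
from dataclasses import dataclass

import fsspec


@dataclass(frozen=True)
class FileEntry:
    path: str
    size: int


@dataclass(frozen=True)
class GlobSource:
    """Stored by Dataset.from_files(); not expanded until plan time."""
    pattern: str


def resolve_glob(source: GlobSource) -> list[FileEntry]:
    # Called at plan time: one bulk listing yields paths and sizes together.
    fs, _, _ = fsspec.get_fs_token_paths(source.pattern)
    return [
        FileEntry(path=p, size=info["size"])
        for p, info in sorted(fs.glob(source.pattern, detail=True).items())
        if info.get("type") == "file"
    ]
```

Deferring the glob to plan time means constructing a `Dataset` stays cheap, and the listing happens exactly once, where the sizes are actually consumed.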
Tokenize changes

- Delete the `tokenize-filescan` Zephyr job and all its machinery (`fsspec_size`, batched stat workers)
- Add `_glob_with_sizes()`, which takes a list of patterns and returns `[{"filename", "size"}]` using `detail=True`
- Add `_expand_tokenize_paths()`, which expands a directory into recursive extension globs
- `TokenizeConfig` and `HfTokenizeConfig` paths now go through the same glob → bundle flow
- Remove the dead `InputName`/`ExecutorStep` early-return branch from `_get_filepaths_to_tokenize`
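As an illustration, the directory → recursive extension glob expansion in `_expand_tokenize_paths()` might look like the sketch below. The extension list and the no-glob-chars heuristic are assumptions, not the PR's actual code:

```python
# Hypothetical list of supported data extensions.
_DATA_EXTENSIONS = (".jsonl", ".jsonl.gz", ".jsonl.zst", ".parquet")


def expand_tokenize_paths(path: str) -> list[str]:
    """Expand a bare directory into one recursive glob per extension;
    pass explicit globs and file paths through unchanged."""
    if any(ch in path for ch in "*?[") or path.endswith(_DATA_EXTENSIONS):
        return [path]
    # String concatenation rather than os.path.join, so gs:// and s3://
    # URLs keep forward slashes on every platform.
    return [path.rstrip("/") + f"/**/*{ext}" for ext in _DATA_EXTENSIONS]
```

Each expanded pattern then feeds straight into the bulk glob, so a directory input still costs only a handful of listing calls.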
Benchmark

`detail=True` works identically for `gs://`, `hf://`, `s3://`, and local filesystems.

Part of #4411, part of #4587