## Problem

`download_hf()` in `lib/marin/src/marin/download/huggingface/download_hf.py` uses `HfFileSystem.find()` to enumerate files when `hf_urls_glob` is not specified (line 278):

```python
if not cfg.hf_urls_glob:
    files = hf_fs.find(hf_source_path, revision=cfg.revision)
```

On large HuggingFace repos, `hf_fs.find()` can return a truncated file list without raising an error. The download then completes "successfully" on the partial list, writes a success status, and downstream tokenization runs on incomplete data.
## Observed impact

- `nvidia/Nemotron-CC-v2` (8971 files): `find()` returned 8330 files; 641 files were silently dropped, and tokenization ran on 93% of the data.
- `HuggingFaceFW/finetranslations` (7149 files, 532 language dirs): `find()` returned files for only 138–224 languages depending on the run, yet the download was marked as succeeded each time.

Both cases were caught only by manually comparing file counts against the HF API.
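That manual comparison can be automated with a small helper. This is a sketch, not code from the repo: `missing_files` is a hypothetical name, and it ignores hidden bookkeeping files the same way the proposed fix filters them.

```python
def missing_files(found, expected):
    """Return paths present in `expected` but absent from `found`.

    Both arguments are iterables of repo-relative paths. Hidden files
    (e.g. ".gitattributes") are ignored on both sides, since the two
    listing APIs may disagree on whether to report them.
    """
    def is_data_file(path):
        return not path.startswith(".")

    return sorted(
        set(filter(is_data_file, expected)) - set(filter(is_data_file, found))
    )

# Example:
# missing_files(["data/a.parquet"],
#               ["data/a.parquet", "data/b.parquet", ".gitattributes"])
# returns ["data/b.parquet"]
```

Feeding it the `find()` output and the `HfApi.list_repo_files()` output would name the dropped files, not just count them.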
## Workaround

Passing `hf_urls_glob=["**/*.parquet"]` (or per-subset globs) makes the code path use `hf_fs.glob()` instead of `hf_fs.find()`, which appears to handle pagination correctly.
## Proposed fix

Add a validation step after `hf_fs.find()` that cross-checks the file count against `HfApi.list_repo_files()`:

```python
from huggingface_hub import HfApi

if not cfg.hf_urls_glob:
    files = hf_fs.find(hf_source_path, revision=cfg.revision)
    # Cross-check against HfApi, which handles pagination correctly.
    api = HfApi()
    all_files = api.list_repo_files(
        cfg.hf_dataset_id, repo_type="dataset", revision=cfg.revision
    )
    expected_count = len([f for f in all_files if not f.startswith(".")])
    if len(files) < expected_count:
        raise RuntimeError(
            f"HfFileSystem.find() returned {len(files)} files but "
            f"HfApi.list_repo_files() found {expected_count}. This is likely a "
            f"pagination bug in HfFileSystem. Use hf_urls_glob to work around "
            f"this issue."
        )
```

Alternatively, replace `hf_fs.find()` with `HfApi.list_repo_files()` entirely.
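A minimal sketch of that alternative (hedged: `enumerate_repo_files` and `data_files` are hypothetical helpers, not existing repo code; the import is kept lazy so the pure filtering step works without `huggingface_hub` installed):

```python
def data_files(paths):
    """Drop hidden bookkeeping entries such as ".gitattributes"."""
    return [p for p in paths if not p.startswith(".")]


def enumerate_repo_files(dataset_id, revision=None):
    """Enumerate data files via HfApi.list_repo_files(), which paginates
    correctly, instead of relying on HfFileSystem.find()."""
    from huggingface_hub import HfApi  # lazy: only needed for the network call

    api = HfApi()
    return data_files(
        api.list_repo_files(dataset_id, repo_type="dataset", revision=revision)
    )
```

Note that `list_repo_files()` returns repo-relative paths, while `hf_fs.find()` returns paths prefixed with the filesystem root, so callers downstream of `download_hf()` would need the prefix reattached.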