🤖 HfFileSystem.find() silently truncates file lists on large repos, causing incomplete downloads #4170

@Helw150

🤖

Problem

download_hf() in lib/marin/src/marin/download/huggingface/download_hf.py uses HfFileSystem.find() to enumerate files when hf_urls_glob is not specified (line 278):

if not cfg.hf_urls_glob:
    files = hf_fs.find(hf_source_path, revision=cfg.revision)

On large HuggingFace repos, hf_fs.find() returns a truncated file list without raising an error. The download then completes "successfully" on the partial list, writes a success status, and downstream tokenization runs on incomplete data.

Observed impact

  • nvidia/Nemotron-CC-v2 (8971 files): find() returned 8330 files, silently dropping 641. Tokenization ran on about 93% of the files.
  • HuggingFaceFW/finetranslations (7149 files, 532 language dirs): find() returned files for only 138-224 languages depending on the run. Download marked as succeeded each time.

Both cases were caught only by manually comparing file counts against the HF API.
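
The manual comparison described above could be scripted roughly as follows. This is a minimal sketch, not code from the repository: find_missing and check_repo are hypothetical names, and the assumption that HfFileSystem.find() returns paths of the form "datasets/<repo_id>/<path>" should be verified against the installed huggingface_hub version.

```python
from typing import Iterable, List, Optional


def find_missing(found: Iterable[str], expected: Iterable[str], prefix: str = "") -> List[str]:
    """Return entries of `expected` that are absent from `found`.

    `prefix` is stripped from each entry of `found` before comparing, so that
    filesystem-style paths can be matched against repo-relative paths.
    """
    normalized = {p[len(prefix):] if prefix and p.startswith(prefix) else p for p in found}
    return sorted(set(expected) - normalized)


def check_repo(repo_id: str, revision: Optional[str] = None) -> List[str]:
    """List dataset files that HfApi reports but HfFileSystem.find() omits."""
    # Imported lazily so find_missing() stays usable without the dependency.
    from huggingface_hub import HfApi, HfFileSystem

    # Assumption: find() returns "datasets/<repo_id>/<path>" style paths.
    prefix = f"datasets/{repo_id}/"
    found = HfFileSystem().find(prefix.rstrip("/"), revision=revision)
    expected = HfApi().list_repo_files(repo_id, repo_type="dataset", revision=revision)
    return find_missing(found, expected, prefix)
```

Calling check_repo("nvidia/Nemotron-CC-v2") would then return the dropped paths rather than only a count, which makes it easier to see whether whole directories are missing.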

Workaround

Passing hf_urls_glob=["**/*.parquet"] (or per-subset globs) uses hf_fs.glob() instead of hf_fs.find(), which appears to handle pagination correctly.
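
For illustration, the glob-based enumeration might look like the sketch below. How download_hf() actually combines hf_source_path with each glob is not shown in this issue, so expand_globs and enumerate_with_glob are hypothetical helpers and the simple path join is an assumption.

```python
from typing import Iterable, List, Optional


def expand_globs(source_path: str, globs: Iterable[str]) -> List[str]:
    """Join each glob pattern onto the source path (hypothetical helper)."""
    base = source_path.rstrip("/")
    return [f"{base}/{g.lstrip('/')}" for g in globs]


def enumerate_with_glob(source_path: str, globs: Iterable[str], revision: Optional[str] = None) -> List[str]:
    """Collect files matching any of the globs via HfFileSystem.glob()."""
    # Imported lazily so expand_globs() can be used without the dependency.
    from huggingface_hub import HfFileSystem

    hf_fs = HfFileSystem()
    files = set()
    for pattern in expand_globs(source_path, globs):
        # With the default detail=False, glob() returns a list of path strings.
        files.update(hf_fs.glob(pattern, revision=revision))
    return sorted(files)
```

Because the claim that glob() paginates correctly is only an observation, it may still be worth comparing the result against HfApi.list_repo_files() before marking a download as successful.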

Proposed fix

Add a validation step after hf_fs.find() that cross-checks the file count against HfApi.list_repo_files():

if not cfg.hf_urls_glob:
    files = hf_fs.find(hf_source_path, revision=cfg.revision)
    # Cross-check against HfApi, which handles pagination correctly
    from huggingface_hub import HfApi
    api = HfApi()
    all_files = api.list_repo_files(cfg.hf_dataset_id, repo_type="dataset", revision=cfg.revision)
    # Ignore top-level dotfiles such as .gitattributes
    expected_count = sum(1 for f in all_files if not f.startswith("."))
    if len(files) < expected_count:
        raise RuntimeError(
            f"HfFileSystem.find() returned {len(files)} files but HfApi.list_repo_files() "
            f"found {expected_count}. This is likely a pagination bug in HfFileSystem. "
            "Use hf_urls_glob to work around this issue."
        )

Alternatively, replace hf_fs.find() with HfApi.list_repo_files() entirely.
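
A rough sketch of that alternative is below. The names to_fs_paths and list_dataset_files are hypothetical, the "datasets/<repo_id>/<path>" path layout is an assumption about what the rest of download_hf() expects, and the dotfile filter simply mirrors the one in the check above.

```python
from typing import Iterable, List, Optional


def to_fs_paths(repo_id: str, repo_files: Iterable[str], subdir: str = "") -> List[str]:
    """Map repo-relative paths to HfFileSystem-style dataset paths.

    Skips paths starting with "." (e.g. .gitattributes) and, when `subdir`
    is given, keeps only entries underneath it.
    """
    wanted = subdir.strip("/")
    selected = []
    for rel in repo_files:
        if rel.startswith("."):
            continue
        if wanted and not rel.startswith(wanted + "/"):
            continue
        selected.append(f"datasets/{repo_id}/{rel}")
    return selected


def list_dataset_files(repo_id: str, revision: Optional[str] = None, subdir: str = "") -> List[str]:
    """Enumerate dataset files through HfApi instead of HfFileSystem.find()."""
    # Imported lazily so to_fs_paths() can be used without the dependency.
    from huggingface_hub import HfApi

    repo_files = HfApi().list_repo_files(repo_id, repo_type="dataset", revision=revision)
    return to_fs_paths(repo_id, repo_files, subdir)
```

The returned paths carry no revision, so any later read through hf_fs would still need cfg.revision passed explicitly.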
