## Problem

`download_hf()` in `lib/marin/src/marin/download/huggingface/download_hf.py` uses `HfFileSystem.find()` to enumerate files when `hf_urls_glob` is not specified (line 278):

```python
if not cfg.hf_urls_glob:
    files = hf_fs.find(hf_source_path, revision=cfg.revision)
```

On large HuggingFace repos, `hf_fs.find()` can return a truncated file list without raising an error. The download then completes "successfully" on the partial list, writes a success status, and downstream tokenization runs on incomplete data.
## Observed impact

- `nvidia/Nemotron-CC-v2` (8971 files): `find()` returned 8330 files; 641 files were silently dropped, and tokenization ran on 93% of the data.
- `HuggingFaceFW/finetranslations` (7149 files, 532 language dirs): `find()` returned files for only 138–224 languages depending on the run, yet the download was marked as succeeded each time.

Both cases were caught only by manually comparing file counts against the HF API.
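That manual comparison can be automated with a small helper. This is a sketch, not code from the repo: `missing_files` is a hypothetical name, and it ignores hidden bookkeeping files the same way the proposed fix filters them.

```python
def missing_files(found, expected):
    """Return paths present in `expected` but absent from `found`.

    Both arguments are iterables of repo-relative paths. Hidden files
    (e.g. ".gitattributes") are ignored on both sides, since the two
    listing APIs may disagree on whether to report them.
    """
    def is_data_file(path):
        return not path.startswith(".")

    return sorted(
        set(filter(is_data_file, expected)) - set(filter(is_data_file, found))
    )

# Example:
# missing_files(["data/a.parquet"],
#               ["data/a.parquet", "data/b.parquet", ".gitattributes"])
# returns ["data/b.parquet"]
```

Feeding it the `find()` output and the `HfApi.list_repo_files()` output would name the dropped files, not just count them.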
## Workaround

Passing `hf_urls_glob=["**/*.parquet"]` (or per-subset globs) makes the code path use `hf_fs.glob()` instead of `hf_fs.find()`, which appears to handle pagination correctly.
## Proposed fix

Add a validation step after `hf_fs.find()` that cross-checks the file count against `HfApi.list_repo_files()`:

```python
from huggingface_hub import HfApi

if not cfg.hf_urls_glob:
    files = hf_fs.find(hf_source_path, revision=cfg.revision)
    # Cross-check against HfApi, which handles pagination correctly.
    api = HfApi()
    all_files = api.list_repo_files(
        cfg.hf_dataset_id, repo_type="dataset", revision=cfg.revision
    )
    expected_count = len([f for f in all_files if not f.startswith(".")])
    if len(files) < expected_count:
        raise RuntimeError(
            f"HfFileSystem.find() returned {len(files)} files but "
            f"HfApi.list_repo_files() found {expected_count}. This is likely a "
            f"pagination bug in HfFileSystem. Use hf_urls_glob to work around "
            f"this issue."
        )
```

Alternatively, replace `hf_fs.find()` with `HfApi.list_repo_files()` entirely.
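A minimal sketch of that alternative (hedged: `enumerate_repo_files` and `data_files` are hypothetical helpers, not existing repo code; the import is kept lazy so the pure filtering step works without `huggingface_hub` installed):

```python
def data_files(paths):
    """Drop hidden bookkeeping entries such as ".gitattributes"."""
    return [p for p in paths if not p.startswith(".")]


def enumerate_repo_files(dataset_id, revision=None):
    """Enumerate data files via HfApi.list_repo_files(), which paginates
    correctly, instead of relying on HfFileSystem.find()."""
    from huggingface_hub import HfApi  # lazy: only needed for the network call

    api = HfApi()
    return data_files(
        api.list_repo_files(dataset_id, repo_type="dataset", revision=revision)
    )
```

Note that `list_repo_files()` returns repo-relative paths, while `hf_fs.find()` returns paths prefixed with the filesystem root, so callers downstream of `download_hf()` would need the prefix reattached.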