[download] Replace HfFileSystem.find() with HfApi.list_repo_files() to fix silent truncation #4171
claude[bot] wants to merge 3 commits into main
…o fix silent truncation

HfFileSystem.find() silently truncates file lists on large HuggingFace repos due to a pagination bug, causing incomplete downloads. Replace it with HfApi.list_repo_files(), which handles pagination correctly. Bucket paths still use find() since they are not standard repos.

Fixes #4170

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e25d56f8e5
downstream code (hf_fs.info, hf_fs.open, _relative_path_in_source) works unchanged.
"""
api = HfApi(token=os.environ.get("HF_TOKEN", False))
repo_type = REPO_TYPE_PREFIX_TO_API_TYPE.get(cfg.hf_repo_type_prefix, "model")
Normalize repo-type prefix before HfApi lookup
_list_repo_files maps cfg.hf_repo_type_prefix by exact key, so values like "datasets/" or "spaces/" fall through to the default "model". This is a regression from the previous hf_fs.find path-based behavior: configs that include the trailing slash (which is still documented in DownloadConfig) will now call list_repo_files(..., repo_type="model") for dataset/space repos and fail to list files (typically 404/not found). Strip trailing / (or otherwise normalize) before the lookup to preserve existing inputs.
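The normalization suggested above could look roughly like the following. This is a minimal sketch, not the PR's code: the contents of `REPO_TYPE_PREFIX_TO_API_TYPE` are assumed for illustration (the real mapping is not shown in this thread), and `resolve_repo_type` is a hypothetical helper name.

```python
from typing import Optional

# Assumed contents, for illustration only; the real mapping in the PR may differ.
REPO_TYPE_PREFIX_TO_API_TYPE = {
    "": "model",
    "models": "model",
    "datasets": "dataset",
    "spaces": "space",
}


def resolve_repo_type(prefix: Optional[str]) -> str:
    """Map a configured prefix such as "datasets/" to an HfApi repo_type string.

    Hypothetical helper: strips whitespace and surrounding slashes before the
    lookup so that "datasets" and "datasets/" resolve to the same repo type.
    """
    normalized = (prefix or "").strip().strip("/")
    # Unknown prefixes keep the existing default of "model".
    return REPO_TYPE_PREFIX_TO_API_TYPE.get(normalized, "model")
```

With this shape, a config that still carries the trailing slash resolves to the same repo type as one without it, so the lookup no longer falls through to the default for dataset and space repos.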
@claude I'm not sure this is a good root cause fix. We don't want to use list_repo_files as the way to access these things, but we want to cross-reference the file count from hf_fs.find with list_repo_files.
Claude finished @Helw150's task in 3m 9s.
Refactored to cross-reference approach per feedback
What changed: Per @Helw150's feedback, I refactored the approach so that
Key changes:
Testing: All 10 tests in |
… of replacing it

Keep hf_fs.find() as the primary file listing method but cross-reference the file count against HfApi.list_repo_files() to detect silent truncation from the pagination bug. Raises RuntimeError if counts don't match.

Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
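The cross-reference check described in this commit can be sketched as below. It is an illustration under assumptions, not the PR's implementation: `verify_file_count` is a hypothetical name, and the convention that an expected count of `None` skips the check is inferred from the later commit that patches `_get_expected_file_count` to return `None`.

```python
from typing import Optional, Sequence


def verify_file_count(
    found_paths: Sequence[str],
    expected_count: Optional[int],
    repo_id: str,
) -> None:
    """Raise RuntimeError if the primary listing disagrees with the reference count.

    found_paths: paths returned by the primary listing (e.g. hf_fs.find(...)).
    expected_count: number of files reported by a second source
        (e.g. len(HfApi().list_repo_files(repo_id, repo_type=...))),
        or None when no reference count is available (assumed convention).
    """
    if expected_count is None:
        # Nothing to compare against; skip the check.
        return
    if len(found_paths) != expected_count:
        raise RuntimeError(
            f"File listing for {repo_id} looks truncated: "
            f"found {len(found_paths)} files, expected {expected_count}."
        )
```

For the case reported in the PR description, a listing of 8330 paths checked against an expected count of 8971 would raise instead of proceeding with an incomplete download. The sketch assumes both listings cover the same scope (the whole repo); comparing a subdirectory listing against a repo-wide count would need the reference list filtered first.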
@claude these mocked tests aren't super useful, so you can remove them
Claude finished @Helw150's task in 2m 56s.
Removing mocked tests
Removed the 3 mocked tests and cleaned up the existing Testing: |
Per review feedback, remove the three mocked tests (test_get_expected_file_count_returns_count, test_download_hf_cross_references_find_with_list_repo_files, test_download_hf_raises_on_truncated_find) as they aren't useful. Existing tests now patch _get_expected_file_count to return None instead of mocking HfApi.

Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This pull request has been inactive for 23 days and is marked as stale.
This pull request has been automatically closed due to inactivity.
HfFileSystem.find() silently truncates file lists on large HuggingFace repos due to a pagination bug, causing incomplete downloads (e.g. nvidia/Nemotron-CC-v2 returned 8330 of 8971 files). This replaces find() with HfApi.list_repo_files(), which handles pagination correctly. Bucket paths still use find() since they are not standard HF repos. Adds tests verifying that the new code path is used and that it produces correct HfFileSystem-format paths.
Fixes #4170