Skip to content

[download] Replace HfFileSystem.find() with HfApi.list_repo_files() to fix silent truncation#4171

Closed
claude[bot] wants to merge 3 commits intomainfrom
agent/20260326-fix-4170
Closed

[download] Replace HfFileSystem.find() with HfApi.list_repo_files() to fix silent truncation#4171
claude[bot] wants to merge 3 commits intomainfrom
agent/20260326-fix-4170

Conversation

@claude
Copy link
Copy Markdown
Contributor

@claude claude Bot commented Mar 26, 2026

HfFileSystem.find() silently truncates file lists on large HuggingFace repos due to a pagination bug, causing incomplete downloads (e.g. nvidia/Nemotron-CC-v2 returned 8330/8971 files). This replaces find() with HfApi.list_repo_files() which handles pagination correctly. Bucket paths still use find() since they are not standard HF repos. Adds tests verifying the new code path is used and produces correct HfFileSystem-format paths.

Fixes #4170

…o fix silent truncation

HfFileSystem.find() silently truncates file lists on large HuggingFace repos
due to a pagination bug, causing incomplete downloads. Replace it with
HfApi.list_repo_files() which handles pagination correctly. Bucket paths
still use find() since they are not standard repos.

Fixes #4170

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e25d56f8e5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

downstream code (hf_fs.info, hf_fs.open, _relative_path_in_source) works unchanged.
"""
api = HfApi(token=os.environ.get("HF_TOKEN", False))
repo_type = REPO_TYPE_PREFIX_TO_API_TYPE.get(cfg.hf_repo_type_prefix, "model")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize repo-type prefix before HfApi lookup

_list_repo_files maps cfg.hf_repo_type_prefix by exact key, so values like "datasets/" or "spaces/" fall through to the default "model". This is a regression from the previous hf_fs.find path-based behavior: configs that include the trailing slash (which is still documented in DownloadConfig) will now call list_repo_files(..., repo_type="model") for dataset/space repos and fail to list files (typically 404/not found). Strip trailing / (or otherwise normalize) before the lookup to preserve existing inputs.

Useful? React with 👍 / 👎.

@Helw150
Copy link
Copy Markdown
Member

Helw150 commented Mar 26, 2026

@claude I'm not sure this is a good root cause fix. We don't want to use list_repo_files as the way to access these things, but we want to cross-reference the file count from hf_fs.find with list_repo_files.

@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Mar 26, 2026

Claude finished @Helw150's task in 3m 9s —— View job


Refactored to cross-reference approach per feedback

  • Gather context — read current PR changes and understand the code
  • Implement new approach: keep hf_fs.find() as primary, cross-reference count with list_repo_files()
  • Update tests to match new behavior
  • Run pre-commit and tests
  • Push changes

What changed:

Per @Helw150's feedback, I refactored the approach so that hf_fs.find() remains the primary file listing method. Instead of replacing it with list_repo_files(), we now cross-reference the file count:

  1. hf_fs.find() is called as before to get the file list
  2. HfApi.list_repo_files() is called to get the expected count
  3. If the counts don't match, a RuntimeError is raised with a clear message about the pagination bug

Key changes:

  • _list_repo_files()_get_expected_file_count() — now returns just the count instead of a full file list
  • download_hf() calls hf_fs.find() first, then validates the count against _get_expected_file_count() for non-bucket paths
  • Added test_download_hf_raises_on_truncated_find — verifies the error is raised when counts mismatch
  • Updated existing tests to match the new flow

Testing: All 10 tests in tests/download/test_huggingface.py pass. Pre-commit checks pass.
·

… of replacing it

Keep hf_fs.find() as the primary file listing method but cross-reference
the file count against HfApi.list_repo_files() to detect silent truncation
from the pagination bug. Raises RuntimeError if counts don't match.

Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Helw150
Copy link
Copy Markdown
Member

Helw150 commented Mar 26, 2026

@claude these mocked tests aren't super useful, so you can remove them

@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Mar 26, 2026

Claude finished @Helw150's task in 2m 56s —— View job


Removing mocked tests

  • Read current test file to identify mocked tests
  • Remove mocked tests (test_get_expected_file_count_returns_count, test_download_hf_cross_references_find_with_list_repo_files, test_download_hf_raises_on_truncated_find)
  • Clean up existing tests to patch _get_expected_file_count instead of mocking HfApi
  • Run pre-commit and tests
  • Commit and push

Removed the 3 mocked tests and cleaned up the existing download_hf tests to patch _get_expected_file_countNone (skipping the cross-reference check) instead of mocking HfApi. All 7 remaining tests pass. Pre-commit checks pass.

Testing: uv run pytest tests/download/test_huggingface.py -v — 7 passed.

View job

Per review feedback, remove the three mocked tests (test_get_expected_file_count_returns_count,
test_download_hf_cross_references_find_with_list_repo_files, test_download_hf_raises_on_truncated_find)
as they aren't useful. Existing tests now patch _get_expected_file_count to return None instead of
mocking HfApi.

Co-authored-by: William Held <Helw150@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

@github-actions github-actions Bot added the stale label Apr 19, 2026
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been automatically closed due to inactivity.
If you would like to continue working on this, please reopen it or create a new PR.

@github-actions github-actions Bot closed this Apr 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🤖 HfFileSystem.find() silently truncates file lists on large repos, causing incomplete downloads

1 participant