feat: pass pre-computed SHA-256 to hf_xet upload #3876

Open
XciD wants to merge 3 commits into main from feat/pass-sha256-to-xet

XciD (Member) commented Mar 3, 2026

Summary

Pass the SHA-256 hashes already computed during CommitOperationAdd.__post_init__() (via UploadInfo.from_path()) to hf_xet.upload_files() through its new sha256s keyword parameter.

Context

Double computation today

For repo commits, huggingface_hub computes SHA-256 on every file for LFS batch negotiation, then hf_xet recomputes it internally because upload_files() doesn't accept pre-computed hashes:

CommitOperationAdd.__post_init__()
  → UploadInfo.from_path()
    → sha_fileobj()          ← SHA-256 #1

upload_files(paths, ...)     ← sha256s not passed
  → SingleFileCleaner
    → ShaGenerator::Generate ← SHA-256 #2 (same bytes, same result)
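
To make the redundancy concrete, here is a minimal sketch that hashes the same bytes twice, using only the standard library. The helper `sha256_stream` is a hypothetical stand-in for the two hashing steps in the diagram above; it is not the code from huggingface_hub or hf_xet.

```python
import hashlib
import io


def sha256_stream(fileobj, chunk_size=1024 * 1024):
    """Hash a binary stream in fixed-size chunks and return the raw digest."""
    hasher = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.digest()


payload = b"example file contents" * 1000

# Pass 1: stands in for the hash taken while preparing the commit.
first = sha256_stream(io.BytesIO(payload))
# Pass 2: stands in for the hash the uploader repeats over the same bytes.
second = sha256_stream(io.BytesIO(payload))

# Both passes yield the same digest, so the second pass adds cost but no information.
same_digest = first == second
```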

Performance impact

On instances without SHA-NI (e.g. AWS m5.xlarge), SHA-256 runs at ~280-310 MB/s in software and accounts for 70-80% of the upload pipeline CPU time. This PR eliminates the redundant second computation on the repo commit path.
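
For readers who want to check the throughput figure on their own hardware, a rough measurement can be sketched with `hashlib`. This is an illustrative micro-benchmark, not the method used to obtain the numbers above; `sha256_throughput_mb_s` is a hypothetical helper, and Python's `hashlib` may use hardware acceleration where the CPU and its backend support it, so results will vary.

```python
import hashlib
import time


def sha256_throughput_mb_s(total_mb=64, chunk_mb=1):
    """Hash total_mb MiB of zero bytes and return the observed MiB/s."""
    chunk = bytes(chunk_mb * 1024 * 1024)
    hasher = hashlib.sha256()
    start = time.perf_counter()
    for _ in range(total_mb // chunk_mb):
        hasher.update(chunk)
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


throughput = sha256_throughput_mb_s()
print(f"SHA-256: {throughput:.0f} MiB/s")
```

As a back-of-the-envelope check, at about 300 MB/s each hashing pass over a 10 GB file takes roughly 33 seconds, which is the cost the second pass adds.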

Scope

Only the repo commit path (_upload_xet_files in _commit_api.py) is changed, where UploadInfo.sha256 is already available.

The bucket path (hf_api.py:_batch_bucket_files) does not compute SHA-256 upfront, so it is not changed here.

Depends on: huggingface/xet-core#678


bot-ci-comment (bot) commented Mar 3, 2026

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copilot AI (Contributor) left a comment

Pull request overview

This PR eliminates a redundant SHA-256 computation in the Xet repo-commit upload path. During a commit, CommitOperationAdd.__post_init__() already computes a SHA-256 hash for every file (via UploadInfo.from_path()). Previously, hf_xet.upload_files() would compute that same hash a second time internally. This PR passes the already-computed hashes to upload_files() via its new sha256s keyword parameter, halving SHA-256 work on the repo-commit path.

Changes:

  • Builds all_sha256s as a list of hex-encoded SHA-256 strings derived from op.upload_info.sha256 for all path-based upload operations.
  • Passes sha256s=all_sha256s as a new keyword argument to hf_xet.upload_files().
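
The two changes can be sketched as follows. This is a minimal illustration under stated assumptions: `FakeUploadInfo`, `FakeAddOperation`, and `fake_upload_files` are hypothetical stand-ins, and only the names `all_sha256s`, `op.upload_info.sha256`, and the `sha256s` keyword come from the PR. It assumes the stored digest is raw bytes that are hex-encoded before being passed on.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FakeUploadInfo:
    sha256: bytes  # raw digest; assumed to mirror UploadInfo.sha256


@dataclass
class FakeAddOperation:
    path_or_fileobj: str
    upload_info: FakeUploadInfo


def fake_upload_files(paths, *, sha256s=None):
    """Hypothetical uploader stand-in: returns (path, digest) pairs it received."""
    if sha256s is not None and len(sha256s) != len(paths):
        raise ValueError("sha256s must align one-to-one with paths")
    return list(zip(paths, sha256s or [None] * len(paths)))


ops = [
    FakeAddOperation("a.bin", FakeUploadInfo(hashlib.sha256(b"a").digest())),
    FakeAddOperation("b.bin", FakeUploadInfo(hashlib.sha256(b"b").digest())),
]

# Build both lists from the same iteration order so index i refers to the same file.
all_paths = [op.path_or_fileobj for op in ops]
all_sha256s = [op.upload_info.sha256.hex() for op in ops]

received = fake_upload_files(all_paths, sha256s=all_sha256s)
```

Keeping the paths and hashes in the same order matters, because a positional mismatch would attach a digest to the wrong file.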

Wauplin (Contributor) left a comment

Very nice finding! Looks good, let's just wait for huggingface/xet-core#678 to be merged and shipped so we can bump hf_xet in the dependencies

XciD changed the title from "Pass pre-computed SHA-256 to hf_xet upload" to "feat: pass pre-computed SHA-256 to hf_xet upload" on Mar 3, 2026
XciD added a commit to huggingface/xet-core that referenced this pull request Mar 3, 2026
## Summary

- Add optional `sha256s` keyword parameter to the Python-exposed
`upload_files()` function
- Forward it to `data_client::upload_async()` which already supports it

## Context

### Double computation today

`huggingface_hub` computes SHA-256 on every file during
`CommitOperationAdd.__post_init__()` for LFS batch negotiation, then
`hf_xet` recomputes it internally because `upload_files()` doesn't
accept pre-computed hashes.

### Performance impact

This change eliminates the redundant computation entirely.

### Backward compatibility

- `sha256s` is a keyword-only parameter with default `None` — no change
for existing callers
- `data_client::upload_async()` already accepts `sha256s:
Option<Vec<String>>` since day one
- When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue`
and skips internal recomputation
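
The backward-compatibility behaviour described above can be modelled in a few lines of Python. This is a sketch only: `upload_blobs` is a hypothetical function, not the real `upload_files()` binding, and the boolean flag merely imitates the choice between a provided value and an internally generated one.

```python
import hashlib
from typing import List, Optional, Tuple


def upload_blobs(
    blobs: List[bytes], *, sha256s: Optional[List[str]] = None
) -> List[Tuple[str, bool]]:
    """Return (hex digest, was_recomputed) for each blob."""
    results = []
    for index, blob in enumerate(blobs):
        if sha256s is not None:
            # Provided value: trust the caller's digest and skip hashing.
            results.append((sha256s[index], False))
        else:
            # No value provided: fall back to computing the digest here.
            results.append((hashlib.sha256(blob).hexdigest(), True))
    return results


blobs = [b"alpha", b"beta"]

# Existing callers omit the keyword and keep the old behaviour.
legacy = upload_blobs(blobs)

# New callers pass digests they already hold.
precomputed = [hashlib.sha256(b).hexdigest() for b in blobs]
fast = upload_blobs(blobs, sha256s=precomputed)
```

Because the parameter is keyword-only and defaults to `None`, calls written before the change continue to work unmodified.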

Companion PR: huggingface/huggingface_hub#3876