The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull request overview
This PR eliminates a redundant SHA-256 computation in the Xet repo-commit upload path. During a commit, CommitOperationAdd.__post_init__() already computes a SHA-256 hash for every file (via UploadInfo.from_path()). Previously, hf_xet.upload_files() would compute that same hash a second time internally. This PR passes the already-computed hashes to upload_files() via its new sha256s keyword parameter, halving SHA-256 work on the repo-commit path.
Changes:
- Builds `all_sha256s` as a list of hex-encoded SHA-256 strings derived from `op.upload_info.sha256` for all path-based upload operations.
- Passes `sha256s=all_sha256s` as a new keyword argument to `hf_xet.upload_files()`.
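For readers skimming the thread, the two changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the diff: `UploadInfo` and `AddOp` are simplified stand-ins defined here, `collect_upload_args` is a hypothetical helper, and the sketch assumes `upload_info.sha256` holds the raw digest bytes. The actual `hf_xet.upload_files()` call is left as a comment because its full signature is not shown in this conversation.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UploadInfo:
    """Simplified stand-in: only the digest is modelled (assumed raw bytes)."""
    sha256: bytes


@dataclass
class AddOp:
    """Hypothetical stand-in for a path-based CommitOperationAdd."""
    path: str
    upload_info: UploadInfo


def collect_upload_args(ops: List[AddOp]) -> Tuple[List[str], List[str]]:
    """Return parallel lists: file paths and hex-encoded SHA-256 strings."""
    all_paths = [op.path for op in ops]
    # Reuse the digest computed when the operation was created; no re-hashing.
    all_sha256s = [op.upload_info.sha256.hex() for op in ops]
    return all_paths, all_sha256s


# Digests are computed once, when each operation is built.
ops = [
    AddOp("a.bin", UploadInfo(hashlib.sha256(b"alpha").digest())),
    AddOp("b.bin", UploadInfo(hashlib.sha256(b"beta").digest())),
]
all_paths, all_sha256s = collect_upload_args(ops)

# The upload call would then receive the hashes by keyword, roughly:
#   upload_files(all_paths, ..., sha256s=all_sha256s)
```

Keeping the two lists parallel (same order, same length) is what lets the receiver match each hash to its file.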
Wauplin left a comment
Very nice finding! Looks good, let's just wait for huggingface/xet-core#678 to be merged and shipped so we can bump hf_xet in the dependencies
## Summary

- Add optional `sha256s` keyword parameter to the Python-exposed `upload_files()` function
- Forward it to `data_client::upload_async()` which already supports it

## Context

### Double computation today

`huggingface_hub` computes SHA-256 on every file during `CommitOperationAdd.__post_init__()` for LFS batch negotiation, then `hf_xet` recomputes it internally because `upload_files()` doesn't accept pre-computed hashes.

### Performance impact

This change eliminates the redundant computation entirely.

### Backward compatibility

- `sha256s` is a keyword-only parameter with default `None`: no change for existing callers
- `data_client::upload_async()` already accepts `sha256s: Option<Vec<String>>` since day one
- When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue` and skips internal recomputation

Companion PR: huggingface/huggingface_hub#3876
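The backward-compatibility argument above rests on a keyword-only parameter that defaults to `None`. The pure-Python sketch below mimics that behaviour under stated assumptions: `upload_files_sketch` is a hypothetical function (the real implementation lives in Rust), it takes in-memory bytes instead of file paths, and the length check is an assumption of this sketch rather than documented behaviour.

```python
import hashlib
from typing import List, Optional, Sequence, Tuple


def upload_files_sketch(
    contents: Sequence[bytes],
    *,
    sha256s: Optional[Sequence[str]] = None,
) -> List[Tuple[str, bool]]:
    """Return (hex digest, was_recomputed) for each file.

    When sha256s is given, the provided value is used as-is and hashing is
    skipped; when it is None, the digest is computed internally as before.
    """
    if sha256s is not None and len(sha256s) != len(contents):
        # Assumption of this sketch: one hash per file is required.
        raise ValueError("sha256s must have one entry per file")

    results = []
    for i, data in enumerate(contents):
        if sha256s is not None:
            results.append((sha256s[i], False))  # provided value, no hashing
        else:
            results.append((hashlib.sha256(data).hexdigest(), True))
    return results


files = [b"alpha", b"beta"]

# Existing callers: no keyword passed, hashes are computed internally.
legacy = upload_files_sketch(files)

# New callers: pre-computed hashes are passed through and reused.
precomputed = [hashlib.sha256(f).hexdigest() for f in files]
fast = upload_files_sketch(files, sha256s=precomputed)
```

Both call styles yield the same digests; only the second avoids hashing inside the function.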
Summary
Pass the SHA-256 hashes already computed during `CommitOperationAdd.__post_init__()` (via `UploadInfo.from_path()`) to `hf_xet.upload_files()` via the new `sha256s` keyword parameter.
Context
Double computation today
For repo commits, `huggingface_hub` computes SHA-256 on every file for LFS batch negotiation, then `hf_xet` recomputes it internally because `upload_files()` doesn't accept pre-computed hashes.
Performance impact
On instances without SHA-NI (e.g. AWS m5.xlarge), SHA-256 runs at ~280-310 MB/s in software and accounts for 70-80% of the upload pipeline CPU time. This change eliminates the redundant second pass.
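To put the throughput figure in perspective, here is a back-of-the-envelope calculation. The 10 GB file size is a hypothetical example, 300 MB/s is one point picked from the quoted ~280-310 MB/s range, and MB is taken as 10^6 bytes; none of these are measurements from this PR.

```python
def sha256_seconds(size_bytes: int, throughput_mb_s: float, passes: int = 1) -> float:
    """Estimated CPU seconds spent hashing, assuming a constant throughput."""
    return passes * size_bytes / (throughput_mb_s * 1_000_000)


size = 10 * 1_000_000_000  # hypothetical 10 GB file
throughput = 300.0         # MB/s, within the range quoted above

before = sha256_seconds(size, throughput, passes=2)  # hashed twice
after = sha256_seconds(size, throughput, passes=1)   # hashed once
saved = before - after

print(f"before: {before:.1f}s, after: {after:.1f}s, saved: {saved:.1f}s")
```

Under these assumptions, roughly half a minute of hashing is saved per 10 GB uploaded on such hardware.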
Scope
Only the repo commit path (`_upload_xet_files` in `_commit_api.py`) is changed, where `UploadInfo.sha256` is already available.
The bucket path (`hf_api.py:_batch_bucket_files`) does not compute SHA-256 upfront, so it is not changed here.
Depends on: huggingface/xet-core#678