feat(data): introduce streaming client #642

Open
kszucs wants to merge 14 commits into huggingface:main from kszucs:download_bytes

kszucs commented Feb 9, 2026

Several query engines use the DataFusion/Rust ecosystem and ultimately depend on Apache OpenDAL. We also use OpenDAL in the dataset viewer.

To support a seamless hf:// experience throughout the ecosystem, I added the missing features to OpenDAL's Hugging Face backend. One of them is Xet download/upload support, hence this PR.

OpenDAL has specific API requirements for streaming uploads and downloads, so I created a XetClient with corresponding XetWriter and XetReader types that stream downloads and uploads using the existing Xet machinery and utilities.
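
To illustrate the write-then-close shape that a streaming upload needs, here is a minimal in-memory sketch. It is not the XetWriter from this PR: `SketchWriter` and `UploadSummary` are hypothetical names, and fixed-size chunking is a simplification chosen only to keep the example short. The point is that the caller never states the total size up front; it only becomes known at `close`.

```rust
/// Summary returned once a streamed upload is finished (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct UploadSummary {
    pub total_bytes: u64,
    pub chunk_count: usize,
}

/// Hypothetical streaming writer: buffers incoming bytes and cuts fixed-size chunks.
pub struct SketchWriter {
    chunk_size: usize,
    pending: Vec<u8>,
    chunks: Vec<Vec<u8>>,
    total_bytes: u64,
}

impl SketchWriter {
    pub fn new(chunk_size: usize) -> Self {
        assert!(chunk_size > 0, "chunk_size must be non-zero");
        Self {
            chunk_size,
            pending: Vec::new(),
            chunks: Vec::new(),
            total_bytes: 0,
        }
    }

    /// Accept the next piece of data; the total size is not known here.
    pub fn write(&mut self, data: &[u8]) {
        self.total_bytes += data.len() as u64;
        self.pending.extend_from_slice(data);
        // Cut off as many full chunks as the buffer currently holds.
        while self.pending.len() >= self.chunk_size {
            let rest = self.pending.split_off(self.chunk_size);
            let full = std::mem::replace(&mut self.pending, rest);
            self.chunks.push(full);
        }
    }

    /// Flush the trailing partial chunk and report what was written.
    pub fn close(mut self) -> (UploadSummary, Vec<Vec<u8>>) {
        if !self.pending.is_empty() {
            let tail = std::mem::take(&mut self.pending);
            self.chunks.push(tail);
        }
        let summary = UploadSummary {
            total_bytes: self.total_bytes,
            chunk_count: self.chunks.len(),
        };
        (summary, self.chunks)
    }
}
```

A caller would invoke `write` once per incoming buffer and `close` exactly once at the end; only the summary returned by `close` carries the final size.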

This changeset is actually used by apache/opendal#7185

@kszucs changed the title from "feat(data): add download_bytes_async to data client" to "feat(data): introduce streaming client" Feb 11, 2026
@rajatarya rajatarya requested review from hoytak and seanses and removed request for hoytak and seanses February 11, 2026 16:46
@rajatarya
Collaborator

Hey @kszucs: @seanses is currently working on a new session-based interface that will simplify making a Rust xet crate. We would want streaming support to be added through that interface. Can you work with him to align the efforts?

@kszucs kszucs marked this pull request as ready for review February 11, 2026 18:17
/// * `hash` - The Xet hash of the file. This is a Merkle hash string.
/// * `file_size` - The size of the file.
/// * `sha256` - The SHA256 hash of the file.
pub fn with_sha256(hash: String, file_size: u64, sha256: String) -> Self {

Member Author

In order to commit to the hub I need the calculated sha256 hash.
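
A minimal sketch of how an optional SHA-256 can sit next to the Xet hash, assuming the file-info type stores it as an `Option`. `FileInfoSketch` and its accessors are hypothetical names for illustration; only the `with_sha256` signature is taken from the diff above.

```rust
/// Hypothetical stand-in for the file-info type touched by this diff.
#[derive(Debug, Clone, PartialEq)]
pub struct FileInfoSketch {
    hash: String,
    file_size: u64,
    // Absent unless the caller computed it, e.g. for a later hub commit.
    sha256: Option<String>,
}

impl FileInfoSketch {
    /// Construct without a SHA-256 (assumed pre-existing behaviour).
    pub fn new(hash: String, file_size: u64) -> Self {
        Self { hash, file_size, sha256: None }
    }

    /// Construct with a known SHA-256, mirroring the signature in the diff.
    pub fn with_sha256(hash: String, file_size: u64, sha256: String) -> Self {
        Self { hash, file_size, sha256: Some(sha256) }
    }

    pub fn hash(&self) -> &str {
        &self.hash
    }

    pub fn file_size(&self) -> u64 {
        self.file_size
    }

    /// Returns the SHA-256 if one was supplied.
    pub fn sha256(&self) -> Option<&str> {
        self.sha256.as_deref()
    }
}
```

Keeping the field optional lets callers that never commit to the hub skip the extra hash entirely.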

  self: &Arc<Self>,
  file_name: Option<Arc<str>>,
- size: u64,
+ size: Option<u64>,

Member Author

When writing through OpenDAL, the size is not known ahead of time, so to avoid collecting the data first I had to implement this workaround. Ideally we could turn off progress tracking, but it seems its API is tightly coupled with the general upload process.

Collaborator

Turning it off is done by simply using the NoOp version, so it's possible to do. We should definitely take that situation into account.

Member Author

Great, thanks for the hint! Trying it out.

Member Author

I took a closer look. While I can turn off the actual reporting by passing the no-op version, CompletionTracker still runs and verifies the upload (in dev mode), so I ended up making total_bytes optional, which signals that the total is not known before uploading.

Collaborator

Right. I had to handle the known/unknown case in the download progress tracking in a parallel PR. There, if the total is not known, it is explicitly updated as more data is streamed. The issue with making it optional is that reporting functions elsewhere rely heavily on the ratio, which would be problematic. Let me quickly put up a PR to add the same feature to the upload tracking.
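
The idea described above can be sketched as follows. This is not the tracker from the codebase: `ProgressSketch` and its methods are hypothetical, and the clamping in `complete` is an assumption made to keep the example safe. The total stays a plain number so that ratio-based reporting keeps working; when it is unknown up front, it grows as bytes are read from the stream.

```rust
/// Hypothetical progress tracker whose total may be unknown up front.
pub struct ProgressSketch {
    completed: u64,
    total: u64,
    total_known: bool,
}

impl ProgressSketch {
    /// The total size is known before the transfer starts.
    pub fn with_known_total(total: u64) -> Self {
        Self { completed: 0, total, total_known: true }
    }

    /// The total is unknown; it will be discovered while streaming.
    pub fn with_unknown_total() -> Self {
        Self { completed: 0, total: 0, total_known: false }
    }

    /// Record bytes read from the input stream.
    /// Only grows the total when it was not known in advance.
    pub fn register(&mut self, bytes: u64) {
        if !self.total_known {
            self.total += bytes;
        }
    }

    /// Record bytes that finished uploading.
    /// Clamped so that `completed` never exceeds `total`.
    pub fn complete(&mut self, bytes: u64) {
        self.completed = self.completed.saturating_add(bytes).min(self.total);
    }

    /// Fraction done in [0.0, 1.0]; 0.0 while nothing has been registered.
    pub fn ratio(&self) -> f64 {
        if self.total == 0 {
            0.0
        } else {
            self.completed as f64 / self.total as f64
        }
    }
}
```

With an unknown total the ratio can move backwards when new data is registered, which is the trade-off for keeping it defined at every point.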

Collaborator

I put up a PR for this at #651. It should cover your use case and preserve the invariants the reporting UX needs in both cases.

@kszucs kszucs force-pushed the download_bytes branch 2 times, most recently from f7631e4 to 6e36baa Compare February 13, 2026 09:15
Member Author

kszucs commented Feb 13, 2026

In the meantime, I tree-shook it into a single, more easily publishable crate at https://github.com/kszucs/subxet using https://github.com/kszucs/cargo-subset.

3 participants