feat(data): introduce streaming client#642
Conversation
dc9b5e3 to
b6341e1
Compare
744b564 to
a4a02cf
Compare
| /// * `hash` - The Xet hash of the file. This is a Merkle hash string. | ||
| /// * `file_size` - The size of the file. | ||
| /// * `sha256` - The SHA256 hash of the file. | ||
| pub fn with_sha256(hash: String, file_size: u64, sha256: String) -> Self { |
There was a problem hiding this comment.
In order to commit to the hub I need the calculated sha256 hash.
| self: &Arc<Self>, | ||
| file_name: Option<Arc<str>>, | ||
| size: u64, | ||
| size: Option<u64>, |
There was a problem hiding this comment.
When writing using opendal the size is not known ahead of time, so in order to avoid collecting the data I had to implement this workaround. Ideally we could turn of progress tracking but seems like its API is more coupled with the general upload process.
There was a problem hiding this comment.
Turning it off is done by simply using the NoOp version, so it's possible to do. We should definitely take that situation into account.
There was a problem hiding this comment.
Great, thanks for the hint! Trying it out.
There was a problem hiding this comment.
Took a closer look. While I can turn off the actual reporting by passing noop, CompletionTracker is still running and verifying (in dev mode) the upload so I ended up making total_bytes optional essentially signaling that total_bytes is not known before uploading.
There was a problem hiding this comment.
Right. I had to handle the known/unknown case in the download progress tracking in a parallel PR. There it explicitly tries to update the total as more data is streamed if it's not known. The issue with making it optional is that reporting functions elsewhere use the ratio heavily, which would be problematic. Let me put up a PR quick to add this same feature to the upload tracking.
There was a problem hiding this comment.
Put a PR up for this at #651. This should satisfy your use case and satisfy the needed invariants for the reporting UX in both cases.
f7631e4 to
6e36baa
Compare
Add download_bytes_async for streaming in-memory downloads, making the download API symmetric with upload_bytes_async. A single FileDownloader is shared across all files, with each file getting its own background task writing into a bounded mpsc channel that provides backpressure. The exposed API returns a stream of bytes, which can be consumed by the caller.
This reverts commit 66af7d0.
…ing progress tracking
6e36baa to
cc27189
Compare
|
In the meantime I tree-shaked it into a more easily publishable single crate at https://github.com/kszucs/subxet using https://github.com/kszucs/cargo-subset. |
Several query engines use the datfusion/rust ecosystem eventually depending on Apache OpenDAL. We also use opendal in the dataset viewer.
In order to support seamless hf:// experience throughout the ecosystem I added the missing features to opendal's huggingface backend, one particular is Xet downlod/upload support hence this PR.
Opendal has a specific API requirements for streaming upload and download so I created a
XetClientand correspondingXetWriterandXetReaderto stream downloads/uploads using the existing xet machinery and utilities.This changeset is actually used by apache/opendal#7185