feat(python): hf:// URL scheme for reading Vortex files from the Hugging Face Hub#8351
Closed
joseph-isaacs wants to merge 2 commits into
…ing Face Hub
Support HfFileSystem-style hf:// URLs in vx.open() and
vortex.store.from_url(), plus a vortex.hf module with an open()
helper that accepts an explicit access token:
vxf = vx.open("hf://datasets/org/name[@rev]/data/train.vortex")
URLs are translated to the Hub's resolve endpoint and read with
ranged HTTP requests, so lazy scans (projection, predicate pushdown,
row indices) only download the bytes they need. HF_ENDPOINT is
honored for mirrors and tests, and tokens for gated or private
repositories are resolved from HF_TOKEN (and friends) or the
huggingface_hub token cache file.
Tests run against a local Range-supporting stand-in for the Hub's
resolve endpoint (test/hub_server.py, exposed as the local_hub
fixture), covering URL parsing, token resolution and auth, revision
pinning, and that projected scans download strictly less than the
file using only ranged reads.
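To illustrate the idea behind such a test server, here is a minimal sketch of a Range-supporting HTTP stand-in and a ranged read against it, using only the standard library. It is not the PR's `test/hub_server.py`: `RangeHandler`, `serve`, `fetch_range`, and the in-memory `PAYLOAD` are hypothetical, and the handler serves the same bytes for every path.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PAYLOAD = bytes(range(256)) * 4  # 1024 bytes standing in for a Vortex file


def parse_range(header: str, size: int) -> tuple[int, int]:
    """Parse a single 'bytes=a-b', 'bytes=a-' or 'bytes=-n' range (inclusive)."""
    start_s, _, end_s = header[len("bytes="):].partition("-")
    if start_s == "":  # suffix form: the last n bytes
        return max(size - int(end_s), 0), size - 1
    end = min(int(end_s), size - 1) if end_s else size - 1
    return int(start_s), end


class RangeHandler(BaseHTTPRequestHandler):
    bytes_sent = 0  # total body bytes served, so a test can assert on laziness

    def do_GET(self):
        header = self.headers.get("Range")
        if header and header.startswith("bytes="):
            start, end = parse_range(header, len(PAYLOAD))
            body, status = PAYLOAD[start:end + 1], 206
        else:
            start, end = 0, len(PAYLOAD) - 1
            body, status = PAYLOAD, 200
        # Count before responding so the client never observes a stale total.
        type(self).bytes_sent += len(body)
        self.send_response(status)
        if status == 206:
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(PAYLOAD)}")
        self.send_header("Accept-Ranges", "bytes")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def serve() -> ThreadingHTTPServer:
    server = ThreadingHTTPServer(("127.0.0.1", 0), RangeHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_range(url: str, spec: str) -> tuple[int, bytes]:
    """GET `url` with a Range header such as '16-31' or '-8'."""
    request = urllib.request.Request(url, headers={"Range": f"bytes={spec}"})
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # bypass env proxies
    with opener.open(request) as response:
        return response.status, response.read()


if __name__ == "__main__":
    server = serve()
    url = f"http://127.0.0.1:{server.server_address[1]}/datasets/org/name/resolve/main/data/train.vortex"
    status, body = fetch_range(url, "16-31")
    print(status, len(body), "of", len(PAYLOAD), "bytes downloaded")
    server.shutdown()
```

A laziness check in this style compares the server-side byte counter with the file size after a scan, rather than trusting the client's own accounting.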
Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 215.3 ns | 244.4 ns | -11.93% |
| ⚡ | Simulation | encode_varbin[(1000, 4)] | 159.4 µs | 142.4 µs | +11.95% |
| ⚡ | Simulation | encode_varbin[(1000, 32)] | 164.6 µs | 148.1 µs | +11.2% |
Comparing claude/vortex-hf-url-scheme-kib4lk (b649eea) with develop (3d7bbfb)
Add a docs page for the vortex.hf module (fixing the unresolved
:mod:`vortex.hf` cross-references that failed the Sphinx build with
warnings-as-errors) including an end-to-end example of converting a
Parquet shard to Vortex, publishing it to a Hub dataset repository,
and lazily reading it back over hf://.
Also resolve all basedpyright warnings in the new code, since CI
treats warnings as failures.
Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Summary
Adds support for reading Vortex files directly from the Hugging Face Hub via `hf://` URLs, following the `HfFileSystem` URL convention:

- `vortex/hf/_resolve.py`: parses `hf://[datasets/|spaces/]namespace/name[@revision]/path` URLs and translates them to the Hub's `resolve` endpoint, served as ranged HTTP reads through `HTTPStore`. Honors `HF_ENDPOINT` overrides (local mirrors get `allow_http` automatically). Tokens for gated/private repos resolve from `HF_TOKEN` / `HUGGING_FACE_HUB_TOKEN` or the `huggingface_hub` token cache file, and attach as an `authorization` header.
- `vortex/hf/__init__.py`: the public API, `vortex.hf.open()` (accepts an explicit `token` and HTTP client config), plus `HFLocation`, `resolve_url`, `http_store`, `store_and_path`, `token`, `endpoint`.
- `vortex/file.py`: `vx.open("hf://...")` routes through the resolver.
- `vortex/store/__init__.py`: `vortex.store.from_url("hf://...")` returns an `HTTPStore` rooted at the resolve URL.

Because the Hub's `resolve` endpoint supports HTTP range requests, Vortex's lazy scans (projection, predicate pushdown, row indices) work against Hub-hosted files without downloading them: in local measurements a projected single-column scan of a 200k-row shard downloads ~15% of the file.

Unnamespaced repo ids (e.g. `hf://datasets/squad/...`) are rejected: they are ambiguous without a Hub API call, so repo ids must be fully qualified `namespace/name`.

A follow-up PR stacked on this one adds a Hugging Face `datasets` builder and a torch-compatible map-style dataset.

Test plan

- `test/test_hf.py` (23 tests) runs against a local Range-supporting stand-in for the Hub `resolve` endpoint (`test/hub_server.py`, exposed as the shared `local_hub` fixture); no network or HF account needed. Covers URL parsing, token resolution and 401→token→success auth, revision pinning, and laziness (a projected scan must download strictly less than the file, using only ranged reads).
- `vortex-python` suite: 152 passed, 2 skipped, 1 xfailed.
- `basedpyright vortex-python`: 0 errors.
- `ruff check` + `ruff format --check`: clean.

https://claude.ai/code/session_01Q7vfwXk1FwrcgaDS8sHYth
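The token resolution order mentioned in the summary (an explicit token, then `HF_TOKEN` / `HUGGING_FACE_HUB_TOKEN`, then the `huggingface_hub` token cache file) might look roughly like the sketch below. The function names are hypothetical and this is not the PR's implementation. The cache location (`HF_TOKEN_PATH`, falling back to `$HF_HOME/token` under `~/.cache/huggingface`) and the `Bearer` scheme are assumptions based on `huggingface_hub` conventions; the PR text only says the token is attached as an `authorization` header.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN")  # checked in this order


def token_cache_path(env: Mapping[str, str]) -> Path:
    # Assumption: mirrors huggingface_hub's defaults for the cached token file.
    explicit = env.get("HF_TOKEN_PATH")
    if explicit:
        return Path(explicit)
    hf_home = env.get("HF_HOME")
    root = Path(hf_home) if hf_home else Path.home() / ".cache" / "huggingface"
    return root / "token"


def resolve_token(explicit: Optional[str] = None, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first available token, or None for anonymous access."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    for var in TOKEN_ENV_VARS:
        value = env.get(var, "").strip()
        if value:
            return value
    try:
        cached = token_cache_path(env).read_text().strip()
    except OSError:  # a missing or unreadable cache file means "no token"
        return None
    return cached or None


def auth_headers(token: Optional[str]) -> dict[str, str]:
    # Assumption: the Hub expects a Bearer token in the authorization header.
    return {"authorization": f"Bearer {token}"} if token else {}


if __name__ == "__main__":
    # Avoid printing the token itself; only report whether one was found.
    print("token found" if resolve_token() else "anonymous access")
```

Passing the environment as a mapping keeps the lookup testable without mutating `os.environ`, which fits a test suite that needs neither network access nor an HF account.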
Generated by Claude Code