feat(python): hf:// URL scheme for reading Vortex files from the Hugging Face Hub #8351

Closed
joseph-isaacs wants to merge 2 commits into develop from claude/vortex-hf-url-scheme-kib4lk

@joseph-isaacs

Summary

Adds support for reading Vortex files directly from the Hugging Face Hub via hf:// URLs, following the HfFileSystem URL convention:

import vortex as vx

vxf = vx.open("hf://datasets/my-org/my-dataset/data/train.vortex")
vxf.scan(["score"], expr=...)  # lazy: only downloads the bytes it needs
  • vortex/hf/_resolve.py: parses hf://[datasets/|spaces/]namespace/name[@revision]/path URLs and translates them to the Hub's resolve endpoint, served as ranged HTTP reads through HTTPStore. Honors HF_ENDPOINT overrides (local mirrors get allow_http automatically). Tokens for gated/private repos are resolved from HF_TOKEN/HUGGING_FACE_HUB_TOKEN or the huggingface_hub token cache file and attached as an authorization header.
  • vortex/hf/__init__.py: public API. vortex.hf.open() accepts an explicit token and HTTP client config; the module also exports HFLocation, resolve_url, http_store, store_and_path, token, and endpoint.
  • vortex/file.py: vx.open("hf://...") routes through the resolver.
  • vortex/store/__init__.py: vortex.store.from_url("hf://...") returns an HTTPStore rooted at the resolve URL.
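
To make the URL translation concrete, here is a minimal sketch of the parsing step. It is not the code in vortex/hf/_resolve.py: `Location`, `parse_hf_url`, and `to_resolve_url` are hypothetical names, and the sketch assumes the Hub's usual `{endpoint}/[datasets/|spaces/]{repo_id}/resolve/{revision}/{path}` layout, a default revision of `main`, and revisions that contain no unescaped `/`.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from urllib.parse import quote

DEFAULT_ENDPOINT = "https://huggingface.co"
# URL prefix -> repo type; model repos carry no prefix.
REPO_TYPES = {"": "model", "datasets": "dataset", "spaces": "space"}


@dataclass(frozen=True)
class Location:  # hypothetical stand-in, not the PR's HFLocation
    prefix: str  # "", "datasets" or "spaces"
    repo_id: str  # always "namespace/name"
    revision: str
    path: str

    @property
    def repo_type(self) -> str:
        return REPO_TYPES[self.prefix]


def parse_hf_url(url: str) -> Location:
    """Split hf://[datasets/|spaces/]namespace/name[@revision]/path."""
    if not url.startswith("hf://"):
        raise ValueError(f"not an hf:// URL: {url!r}")
    parts = url[len("hf://"):].split("/")
    prefix = ""
    if parts[0] in ("datasets", "spaces"):
        prefix, parts = parts[0], parts[1:]
    # Need at least namespace, name and one path segment, none of them empty.
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"expected namespace/name/path in {url!r}")
    namespace, name_and_rev, *rest = parts
    if "@" in namespace:
        # e.g. hf://datasets/squad@main/...: the repo id has no namespace.
        raise ValueError(f"repo id must be namespace/name in {url!r}")
    name, _, revision = name_and_rev.partition("@")
    if not name:
        raise ValueError(f"empty repo name in {url!r}")
    return Location(prefix, f"{namespace}/{name}", revision or "main", "/".join(rest))


def to_resolve_url(loc: Location, endpoint: str | None = None) -> str:
    """Build the resolve-endpoint URL, honoring HF_ENDPOINT when set."""
    base = (endpoint or os.environ.get("HF_ENDPOINT") or DEFAULT_ENDPOINT).rstrip("/")
    prefix = f"{loc.prefix}/" if loc.prefix else ""
    revision = quote(loc.revision, safe="")
    return f"{base}/{prefix}{loc.repo_id}/resolve/{revision}/{quote(loc.path)}"
```

Note that a purely syntactic parser like this one can only catch an unnamespaced repo id when the URL is too short or the `@revision` lands on the first segment; how the PR detects the remaining cases is not shown here.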

Because the Hub's resolve endpoint supports HTTP range requests, Vortex's lazy scans (projection, predicate pushdown, row indices) work against Hub-hosted files without downloading them: in local measurements a projected single-column scan of a 200k-row shard downloads ~15% of the file.
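
For readers unfamiliar with ranged reads, the mechanism can be sketched with the standard library alone. This is an illustration, not the HTTPStore code path: `RangeHandler`, `read_range`, and `demo` are hypothetical names, the server is a toy that only understands a single well-formed `bytes=start-end` range, and the 16 KiB payload stands in for a hosted file.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PAYLOAD = bytes(range(256)) * 64  # 16 KiB stand-in for a hosted file


class RangeHandler(BaseHTTPRequestHandler):
    """Serve PAYLOAD, honoring a single `bytes=start-end` Range header."""

    def do_GET(self):
        header = self.headers.get("Range")
        last_byte = len(PAYLOAD) - 1
        if header is None:
            start, end, status = 0, last_byte, 200
        else:
            # Toy parser: assumes a well-formed `bytes=start-[end]` value.
            first, _, last = header.removeprefix("bytes=").partition("-")
            start = int(first)
            end = min(int(last), last_byte) if last else last_byte
            status = 206
        body = PAYLOAD[start : end + 1]
        self.send_response(status)
        if status == 206:
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(PAYLOAD)}")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass


def read_range(url, start, end):
    """Fetch bytes [start, end] (inclusive) with one ranged GET."""
    request = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    # Bypass any proxy configured in the environment for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        if response.status != 206:
            raise RuntimeError("server ignored the Range header")
        return response.read()


def demo():
    """Start the toy server, read 100 bytes from the middle, shut down."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), RangeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/file.bin"
        return read_range(url, 1000, 1099)
    finally:
        server.shutdown()
        server.server_close()
```

A lazy scan issues a handful of such requests (footer, then only the segments the projection and predicate need), which is why the bytes transferred can be a small fraction of the file size.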

Unnamespaced repo ids (e.g. hf://datasets/squad/...) are rejected: they are ambiguous without a Hub API call, so repo ids must be fully qualified namespace/name.

A follow-up PR stacked on this one adds a Hugging Face datasets builder and a torch-compatible map-style dataset.

Test plan

  • New test/test_hf.py (23 tests) runs against a local Range-supporting stand-in for the Hub resolve endpoint (test/hub_server.py, exposed as the shared local_hub fixture) — no network or HF account needed. Covers URL parsing, token resolution and 401→token→success auth, revision pinning, and laziness (a projected scan must download strictly less than the file, using only ranged reads).
  • Full vortex-python suite: 152 passed, 2 skipped, 1 xfailed.
  • basedpyright vortex-python: 0 errors. ruff check + ruff format --check: clean.
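
The token resolution exercised by these tests could look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's implementation: `find_token` and `auth_headers` are hypothetical names, the lookup order (HF_TOKEN, then HUGGING_FACE_HUB_TOKEN, then the cache file) is assumed to match huggingface_hub's, and the cache file is assumed to live at `$HF_HOME/token`, defaulting to `~/.cache/huggingface/token`. Overrides such as HF_TOKEN_PATH and XDG_CACHE_HOME are deliberately ignored.

```python
import os
from pathlib import Path


def find_token(env=None):
    """Return an access token from the environment or the token cache file.

    `env` defaults to os.environ; passing a plain dict makes the lookup
    easy to test without touching the real environment.
    """
    env = os.environ if env is None else env
    # Assumed precedence: HF_TOKEN first, then the legacy variable name.
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    # Assumed cache location: $HF_HOME/token, else ~/.cache/huggingface/token.
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    token_file = Path(hf_home) / "token"
    if token_file.is_file():
        cached = token_file.read_text().strip()
        if cached:
            return cached
    return None


def auth_headers(token):
    """Build the header dict to attach to Hub requests (empty when anonymous)."""
    return {"authorization": f"Bearer {token}"} if token else {}
```

Taking the environment as a parameter is a design choice for the sketch: it lets a test drive every branch with a temporary directory instead of a real account.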

https://claude.ai/code/session_01Q7vfwXk1FwrcgaDS8sHYth


Generated by Claude Code

…ing Face Hub

Support HfFileSystem-style hf:// URLs in vx.open() and
vortex.store.from_url(), plus a vortex.hf module with an open()
helper that accepts an explicit access token:

    vxf = vx.open("hf://datasets/org/name[@rev]/data/train.vortex")

URLs are translated to the Hub's resolve endpoint and read with
ranged HTTP requests, so lazy scans (projection, predicate pushdown,
row indices) only download the bytes they need. HF_ENDPOINT is
honored for mirrors and tests, and tokens for gated or private
repositories are resolved from HF_TOKEN (and friends) or the
huggingface_hub token cache file.

Tests run against a local Range-supporting stand-in for the Hub's
resolve endpoint (test/hub_server.py, exposed as the local_hub
fixture), covering URL parsing, token resolution and auth, revision
pinning, and that projected scans download strictly less than the
file using only ranged reads.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

codspeed-hq Bot commented Jun 11, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 2 improved benchmarks
❌ 1 regressed benchmark
✅ 1529 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bitwise_not_vortex_buffer_mut[128] | 215.3 ns | 244.4 ns | -11.93% |
| Simulation | encode_varbin[(1000, 4)] | 159.4 µs | 142.4 µs | +11.95% |
| Simulation | encode_varbin[(1000, 32)] | 164.6 µs | 148.1 µs | +11.2% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/vortex-hf-url-scheme-kib4lk (b649eea) with develop (3d7bbfb)

Add a docs page for the vortex.hf module (fixing the unresolved
:mod:`vortex.hf` cross-references that failed the Sphinx build with
warnings-as-errors) including an end-to-end example of converting a
Parquet shard to Vortex, publishing it to a Hub dataset repository,
and lazily reading it back over hf://.

Also resolve all basedpyright warnings in the new code, since CI
treats warnings as failures.

Signed-off-by: "Joe Isaacs" <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs marked this pull request as draft June 11, 2026 10:15
Labels

changelog/feature A new feature
