Commit d435b93

talmo and claude authored
docs: Add Remote loading guide (URLs, cloud, Google Drive, video) (#444)
* docs: Add Remote loading guide and document URL/cloud/Drive/video loading

  Consolidate the URL-loading documentation into a standalone Guides page (docs/remote.md) covering http/https + cloud (s3/gs/az) + Google Drive loading, streaming modes, caching/clear_remote_cache, authentication and security, embedded pkg.slp streaming, remote video (pyav), and error handling/troubleshooting. Replace the large examples.md section with a concise pointer, add the page to the nav, and add short autoref'd mentions across install, formats, and the Video model docs.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: Add reencode to codespell ignore-words-list

  Align the pyproject.toml codespell config with the CI workflow (.github/workflows/codespell.yml), which already ignores `reencode`. `reencode` is the literal `sio reencode` CLI subcommand name documented in docs/examples.md, docs/formats/index.md, and docs/model/video.md, so codespell must not rewrite it to "re-encode". This makes local codespell match CI.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent c1999fd commit d435b93

9 files changed

Lines changed: 370 additions & 267 deletions

docs/examples.md

Lines changed: 10 additions & 263 deletions
Original file line number | Diff line number | Diff line change
@@ -628,277 +628,24 @@ sio.save_coco(
628628

629629
## Loading from URLs
630630

631-
Load `.slp` files directly from HTTP/HTTPS and cloud-storage URLs. Reads are
632-
lazy and range-based: for the default streaming mode only the bytes actually
633-
needed (typically a small fraction of the file) are pulled over the network.
634-
635-
### Quickstart
636-
637-
[`load_slp`][sleap_io.load_slp] (and the universal
638-
[`load_file`][sleap_io.load_file]) accept a URL anywhere a local path is
639-
accepted:
640-
641-
```python
642-
import sleap_io as sio
643-
644-
# Just works for http/https out of the box
645-
labels = sio.load_slp("https://example.com/path/labels.slp")
646-
647-
# load_file also accepts URLs (dispatches by extension)
648-
labels = sio.load_file("https://example.com/path/labels.slp")
649-
```
650-
651-
`pkg.slp` files with embedded frames work too — the embedded `HDF5Video`
652-
backends reopen the remote file lazily when you read frames, reusing the same
653-
streaming configuration.
654-
655-
!!! info "What is and isn't supported"
656-
URL loading currently covers `.slp`/`.pkg.slp` (including through the
657-
universal [`load_file`][sleap_io.load_file]) and remote *media video* over
658-
`http`/`https` (see [Remote video](#remote-video) below). Other *labels*
659-
formats (NWB, COCO, Label Studio, JABS, DLC, TrackMate, LEAP, GeoJSON,
660-
Ultralytics) are not yet implemented over URLs and raise
661-
`NotImplementedError`; download the file locally first. These are tracked
662-
as follow-ups.
663-
664-
### Remote video
665-
666-
[`load_video`][sleap_io.load_video] (and [`load_file`][sleap_io.load_file] for
667-
video extensions) can read a media video directly from an `http`/`https` URL:
631+
[`load_slp`][sleap_io.load_slp] and [`load_file`][sleap_io.load_file] accept a
632+
URL anywhere a local path is accepted, with lazy range-based streaming by
633+
default:
668634

669635
```python
670636
import sleap_io as sio
671637

672-
# Reads frames lazily over the network; needs the [pyav] extra
673-
video = sio.load_video("https://example.com/path/video.mp4")
674-
frame = video[0] # frames are decoded on demand
675-
```
676-
677-
Supported container extensions are the same as for local media videos
678-
(`mp4`, `avi`, `mov`, `mj2`, `mkv`). Only `http` and `https` URLs are accepted
679-
for video — cloud schemes (`s3://`, `gs://`, …) are not supported for video
680-
loading. The URL's query string and fragment are ignored when detecting the
681-
extension, so pre-signed/tokenized URLs like
682-
`https://host/video.mp4?token=...` route correctly.
683-
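The extension rule described above (query string and fragment ignored, only `http`/`https` accepted for video) can be illustrated with a small standard-library sketch. This is not sleap-io's implementation; the helper names `url_extension` and `is_http_video_url` are hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Container extensions listed in the text above.
VIDEO_EXTS = {"mp4", "avi", "mov", "mj2", "mkv"}


def url_extension(url: str) -> str:
    """Return the lowercase extension of the URL path.

    urlsplit() separates the path from the ?query and #fragment, so a
    token appended to the URL does not affect the result.
    """
    path = urlsplit(url).path
    return PurePosixPath(path).suffix.lstrip(".").lower()


def is_http_video_url(url: str) -> bool:
    """True only for http/https URLs whose path ends in a video extension."""
    scheme = urlsplit(url).scheme.lower()
    return scheme in {"http", "https"} and url_extension(url) in VIDEO_EXTS
```

With this sketch, `https://host/video.mp4?token=...` is classified as video, while `s3://bucket/video.mp4` is rejected because of its scheme.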
684-
Remote video requires the `pyav` extra, which is selected automatically as the
685-
backend for URLs:
686-
687-
```bash
688-
pip install "sleap-io[pyav]" # remote video support (provides `av`)
689-
```
690-
691-
If the `av` package is missing, `load_video(url)` raises an `ImportError` with
692-
the install hint above.
693-
694-
!!! danger "Security: remote video hands untrusted data to FFmpeg"
695-
Decoding a remote video streams bytes from the URL into FFmpeg (via pyav).
696-
FFmpeg's demuxers and decoders are a large, historically
697-
vulnerability-prone attack surface (the CVE history for media parsers is
698-
extensive), so a malicious URL or stream can attempt to exploit the
699-
decoder running in your process.
700-
701-
To limit risk, sleap-io only ever passes `http`/`https` URLs through to the
702-
decoder (no other schemes, no shell/protocol indirection). You should:
703-
704-
- **Load remote video only from sources you trust.** Treat an arbitrary
705-
third-party URL the same as running untrusted code.
706-
- **Sandbox untrusted inputs.** If you must decode video from an untrusted
707-
source, do it in an isolated environment (container/VM with no
708-
credentials, restricted network, and a non-privileged user) and keep
709-
FFmpeg/pyav up to date.
710-
711-
### Google Drive
712-
713-
Google Drive file share links are recognized and resolved to a direct download
714-
automatically, so you can pass a Drive URL straight to
715-
[`load_slp`][sleap_io.load_slp] or [`load_file`][sleap_io.load_file]:
716-
717-
```python
718-
import sleap_io as sio
719-
720-
# Any of these Drive share-link shapes work:
721-
labels = sio.load_slp("https://drive.google.com/file/d/<FILE_ID>/view")
722-
labels = sio.load_slp("https://drive.google.com/uc?id=<FILE_ID>&export=download")
723-
labels = sio.load_slp("https://drive.google.com/open?id=<FILE_ID>")
724-
725-
# load_file resolves the Drive link, sniffs the content, and routes it:
726-
labels = sio.load_file("https://drive.google.com/file/d/<FILE_ID>/view")
727-
```
728-
729-
The file must be shared as **"Anyone with the link"** (no sign-in required).
730-
Because Drive download links carry no file extension and reject the `HEAD`/range
731-
requests that lazy streaming relies on, a Drive file is **fully downloaded into
732-
memory** during resolution (the `stream_mode`/cache keyword arguments do not
733-
apply). The two-hop "can't scan for viruses" confirmation page that Drive serves
734-
for larger files is handled transparently.
735-
736-
Some limitations:
737-
738-
- **Folder links are not supported** — pass a single-file share link
739-
(`…/file/d/<FILE_ID>/view`), not a `…/drive/folders/<ID>` URL. A folder URL
740-
raises a `ValueError`.
741-
- **Drive videos are not supported**: `load_video(<drive url>)` raises a
742-
`NotImplementedError`. Download the video file first, then load it locally.
743-
- **Quota / permission errors** — if Drive returns its "too many users have
744-
viewed or downloaded this file recently" page, a `RemoteIOError` is raised;
745-
retry later or re-check the file's sharing settings.
746-
747-
For `load_file`, the format is detected from the downloaded bytes (which costs
748-
one extra fetch). Pass an explicit `format=` (e.g. `format="slp"`) to skip the
749-
detection download.
750-
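To make the share-link shapes and the folder restriction above concrete, here is a minimal standard-library sketch of how a file ID might be pulled out of a Drive URL. It is an assumption-laden illustration, not sleap-io's resolver: the function name `drive_file_id` is hypothetical, and the real resolver also handles the download and confirmation-page steps, which are not shown.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the ".../file/d/<FILE_ID>/view" share-link shape.
_FILE_PATH = re.compile(r"^/file/d/([^/]+)")


def drive_file_id(url: str) -> str:
    """Extract the file ID from a single-file Drive share link.

    Raises ValueError for folder links or when no ID can be found,
    mirroring the error type described in the text above.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() != "drive.google.com":
        raise ValueError(f"not a Google Drive URL: {url}")
    if "/folders/" in parts.path:
        raise ValueError("folder links are not supported; pass a single-file share link")

    match = _FILE_PATH.match(parts.path)
    if match:
        return match.group(1)

    # Covers the "uc?id=<FILE_ID>" and "open?id=<FILE_ID>" shapes.
    ids = parse_qs(parts.query).get("id")
    if ids:
        return ids[0]
    raise ValueError(f"could not parse a file ID from: {url}")
```

All three share-link shapes shown earlier resolve to the same ID under this sketch, and a `…/drive/folders/<ID>` URL fails fast with a `ValueError`.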
751-
### Supported schemes and install matrix
752-
753-
| Scheme | Requires | Notes |
754-
|--------|----------|-------|
755-
| `http`, `https` | nothing extra | Works with a plain `pip install sleap-io` |
756-
| `s3` | `sleap-io[cloud]` | Amazon S3 (via `s3fs`) |
757-
| `gs`, `gcs` | `sleap-io[cloud]` | Google Cloud Storage (via `gcsfs`) |
758-
| `az`, `abfs` | `sleap-io[cloud]` | Azure Blob / ADLS (via `adlfs`) |
759-
760-
```bash
761-
pip install sleap-io # http/https only
762-
pip install "sleap-io[cloud]" # + s3, gs/gcs, az/abfs
763-
pip install "sleap-io[all]" # everything (cloud schemes included)
764-
```
765-
766-
```python
767-
# Cloud scheme (requires sleap-io[cloud])
768-
labels = sio.load_slp("s3://my-bucket/path/labels.slp")
769-
```
770-
771-
!!! warning "Missing cloud extra"
772-
Using a cloud scheme without the `[cloud]` extra raises an `ImportError`
773-
whose message names the missing package and the
774-
`pip install 'sleap-io[cloud]'` install hint.
775-
776-
### Streaming modes
777-
778-
Control how bytes are fetched with the `stream_mode` keyword argument:
779-
780-
| `stream_mode` | Backing strategy | Memory | Disk cache | Revalidation | Best for |
781-
|---------------|------------------|--------|------------|--------------|----------|
782-
| `"auto"` (default) | fsspec `blockcache` | Low (LRU of `max_blocks`) | None | n/a | One-off lazy reads, low memory |
783-
| `"blockcache"` | fsspec `blockcache` | Low | None | n/a | Same as `auto` (explicit) |
784-
| `"cache"` | fsspec `simplecache` | Whole file on disk | Persistent | None | Repeated opens of an immutable file |
785-
| `"filecache"` | fsspec `filecache` | Whole file on disk | Persistent | ETag / `Last-Modified` after `cache_expiry` | Repeated opens of a file that may change |
786-
| `"download"` | Full read into memory | Whole file in RAM | None | n/a | Small files, ephemeral environments |
787-
788-
```python
789-
# Default: lazy range reads, low memory
638+
# http/https works with a base install
790639
labels = sio.load_slp("https://example.com/labels.slp")
791640

792-
# Persistent on-disk cache with daily ETag revalidation
793-
labels = sio.load_slp(
794-
"https://example.com/labels.slp",
795-
stream_mode="filecache",
796-
cache_storage="~/.cache/sleap-io",
797-
cache_expiry=86400, # revalidate after a day
798-
)
799-
800-
# Ephemeral full download into memory (no disk cache)
801-
labels = sio.load_slp("https://example.com/labels.slp", stream_mode="download")
802-
```
803-
804-
The `"auto"`/`"blockcache"` reads can be tuned with `block_size` (range block
805-
size, default 1 MiB) and `max_blocks` (in-memory LRU cap per open file,
806-
default 32 → ~32 MiB per file).
807-
808-
### Authentication
809-
810-
Pass HTTP headers (such as a bearer token) with `headers=`:
811-
812-
```python
813-
labels = sio.load_slp(
814-
"https://my-org.example/private/labels.slp",
815-
headers={"Authorization": "Bearer <token>"},
816-
)
641+
# Cloud schemes need the [cloud] extra (s3fs / gcsfs / adlfs)
642+
labels = sio.load_slp("s3://my-bucket/labels.slp")
817643
```
818644

819-
!!! warning "Headers are stripped on cross-origin redirect"
820-
For security, sensitive headers (`Authorization`, `Cookie`,
821-
`Proxy-Authorization`) are **dropped automatically** if the request is
822-
redirected to a different origin (scheme/host/port). This prevents leaking
823-
credentials to a third-party host and is intentional — if a download
824-
redirects cross-origin (e.g. to a pre-signed CDN URL), put the credentials
825-
in the redirect target's query string rather than in `headers`.
826-
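The cross-origin rule in the warning above can be sketched in a few lines of standard-library Python. This only illustrates the origin comparison (scheme/host/port) and the header filter; it is not sleap-io's or aiohttp's code, and the names `origin` and `headers_for_redirect` are hypothetical.

```python
from urllib.parse import urlsplit

# Header names from the warning above, compared case-insensitively.
SENSITIVE = {"authorization", "cookie", "proxy-authorization"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Return the (scheme, host, port) triple that defines an origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # An omitted port means the scheme's default port.
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def headers_for_redirect(headers: dict, from_url: str, to_url: str) -> dict:
    """Drop sensitive headers when a redirect leaves the original origin."""
    if origin(from_url) == origin(to_url):
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE}
```

Under this sketch a redirect from `https://my-org.example/...` to a CDN host keeps harmless headers such as `Accept` but loses `Authorization`, which is why credentials for the redirect target need to travel in its query string.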
827-
Cloud schemes (`s3://`, `gs://`, …) use their own per-provider credential
828-
chains (environment variables, credential files, instance metadata) rather
829-
than `headers=`.
830-
831-
### Caching: location, override, and clearing
832-
833-
For `"cache"` and `"filecache"` modes, downloaded files live in the directory
834-
you pass as `cache_storage=`. sleap-io writes a small marker file there so it
835-
can later identify and clean up only its own cache files.
836-
837-
To clear the cache, call [`clear_remote_cache`][sleap_io.clear_remote_cache]
838-
with the **same** `cache_storage` you loaded with:
839-
840-
```python
841-
import sleap_io as sio
842-
843-
# Delete every sleap-io cache file in the directory
844-
sio.clear_remote_cache(cache_storage="~/.cache/sleap-io")
845-
846-
# Or only files older than an hour (3600 seconds)
847-
sio.clear_remote_cache(cache_storage="~/.cache/sleap-io", older_than=3600)
848-
```
849-
850-
!!! note "An explicit `cache_storage` is required to clear the cache"
851-
`clear_remote_cache` only operates on a directory that contains the
852-
sleap-io marker file, and it only deletes files matching fsspec's
853-
cache-key naming pattern — so it will never touch unrelated files even if
854-
you point it at a shared directory. It refuses to run on a directory with
855-
no marker, or on forbidden paths like `/` or `$HOME`. Because fsspec's
856-
*default* cache directory is a per-process temporary location that cannot
857-
be cleared reliably, you must pass the explicit `cache_storage` you used
858-
when loading.
859-
860-
### CI and ephemeral environments
861-
862-
In CI or other short-lived environments, prefer the default `stream_mode="auto"`
863-
(no persistent cache to manage), or scope a per-run cache to a temporary
864-
directory you control:
865-
866-
```python
867-
import os
868-
import sleap_io as sio
869-
870-
cache_dir = os.path.join(os.environ.get("RUNNER_TEMP", "/tmp"), "sleap-io-cache")
871-
labels = sio.load_slp(url, stream_mode="filecache", cache_storage=cache_dir)
872-
```
873-
874-
### Troubleshooting
875-
876-
- **`RemoteIOError`** — raised for HTTP-level failures (404 not found, 416 range
877-
past end of file, 5xx after retries, connection errors, timeouts). The
878-
exception carries a `status` (HTTP code or `None`) and a credential-redacted
879-
`url` so tokens never leak into logs or tracebacks. See
880-
[`RemoteIOError`][sleap_io.RemoteIOError].
881-
- **`ImportError` for cloud schemes** — install the cloud adapters with
882-
`pip install 'sleap-io[cloud]'` (covers `s3`, `gs`/`gcs`, `az`/`abfs`).
883-
- **`RuntimeWarning` about `aiohttp`** — remote loading needs
884-
`aiohttp >= 3.13.5` for safe cross-origin header stripping. If you see this
885-
warning, upgrade with `pip install --upgrade 'aiohttp>=3.13.5'`.
886-
- **`ImportError` from `load_video(url)`** — remote video loading needs the
887-
`pyav` extra; install it with `pip install 'sleap-io[pyav]'`. Only `http`/
888-
`https` URLs are supported for video. See [Remote video](#remote-video) for
889-
the security considerations of decoding untrusted remote video.
890-
- **Google Drive errors** — a `ValueError` means the link is a folder (pass a
891-
`…/file/d/<ID>/view` file link) or the file ID could not be parsed; a
892-
`RemoteIOError` mentioning a quota/permission page means the file is not shared
893-
publicly or Drive is rate-limiting downloads. See [Google Drive](#google-drive).
894-
895-
!!! note "See also"
896-
- [`load_slp`][sleap_io.load_slp]: Full URL keyword-argument reference
897-
- [`load_file`][sleap_io.load_file]: Universal loader with URL sniffing
898-
- [`load_video`][sleap_io.load_video]: Loads remote media video over http/https
899-
- [`clear_remote_cache`][sleap_io.clear_remote_cache]: Cache cleanup helper
900-
- [`RemoteIOError`][sleap_io.RemoteIOError]: Remote I/O error surface
901-
- [SLP Format](formats/slp.md): The on-disk `.slp` layout that URL loading streams
645+
This also covers `.pkg.slp` embedded-frame streaming, remote media video over
646+
`http`/`https`, and Google Drive share links. For streaming modes, caching,
647+
authentication, security notes, and troubleshooting, see the
648+
[Remote loading](remote.md) guide.
902649

903650
## Editing labels data
904651

docs/formats/index.md

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,11 @@ sleap-io provides a unified interface for reading and writing pose tracking data
1414

1515
::: sleap_io.io.main.save_video
1616

17+
Media videos can also be read from `http`/`https` URLs with
18+
[`load_video`][sleap_io.load_video] (requires the `pyav` extra; cloud schemes
19+
and Google Drive are not supported for video). See
20+
[Remote video](../examples.md#loading-from-urls).
21+
1722
### Norpix .seq Format
1823

1924
The `.seq` format is used by StreamPix / Norpix for high-speed video recording, commonly used in behavioral neuroscience. sleap-io provides native read support for `.seq` files via the [`SeqVideo`][sleap_io.SeqVideo] backend.
@@ -43,6 +48,8 @@ sio reencode recording.seq -o recording.mp4
4348

4449
The native SLEAP format stores complete pose tracking projects including videos, skeletons, and annotations. SLP is the primary format with full round-trip support for bounding boxes (format 1.7+), regions of interest (ROIs), and segmentation masks (format 1.5+).
4550

51+
`.slp` and `.pkg.slp` files can also be loaded directly from `http`/`https`, cloud (`s3://`, `gs://`, `az://`), and Google Drive URLs with lazy range-based streaming via [`load_slp`][sleap_io.load_slp] — see [Loading from URLs](../examples.md#loading-from-urls).
52+
4653
!!! tip "Detailed Format Specification"
4754
For comprehensive documentation of the SLP file format including HDF5 layout, data structures, and version history, see the **[SLP File Format Reference](slp.md)**.
4855

@@ -580,6 +587,8 @@ sleap-io automatically detects file formats based on:
580587
2. **File content**: For ambiguous extensions like `.h5` (JABS vs DLC) or `.json` (Label Studio vs COCO)
581588
3. **Explicit format**: Pass `format` parameter to override auto-detection
582589

590+
For URLs, ambiguous extensions (`.h5`, `.json`, `.csv`) are disambiguated with a magic-byte sniff via a Range request, controllable with [`load_file`][sleap_io.load_file]'s `sniff=` argument. See [Loading from URLs](../examples.md#loading-from-urls).
591+
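As a rough illustration of what a magic-byte sniff does, the sketch below classifies the first few bytes of a file (the kind of prefix a small Range request would return). It is an assumption, not sleap-io's detection logic: `sniff_kind` is a hypothetical name, only an HDF5 signature at offset 0 is checked, and telling JABS from DLC (both HDF5) or Label Studio from COCO (both JSON) needs further inspection that is not shown.

```python
# 8-byte signature that begins an HDF5 file whose superblock is at offset 0.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def sniff_kind(head: bytes) -> str:
    """Classify a file from its leading bytes as 'hdf5', 'json', or 'unknown'."""
    if head.startswith(HDF5_MAGIC):
        return "hdf5"
    # Skip an optional UTF-8 byte-order mark and leading whitespace.
    text = head.lstrip(b"\xef\xbb\xbf \t\r\n")
    if text[:1] in (b"{", b"["):
        return "json"
    return "unknown"
```

Because only a handful of leading bytes are needed, this kind of check costs far less than downloading the whole file just to pick a loader.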
583592
## Format Conversion Examples
584593

585594
### Convert Between Formats
@@ -658,6 +667,9 @@ Different formats have varying capabilities:
658667
******Ultralytics segmentation polygons stored as ROIs
659668
*******TrackMate auto-detects sibling `.tif`/`.tiff` video files
660669

670+
!!! note "Remote URL loading"
671+
Loading from a URL is currently supported only for SLEAP `.slp`/`.pkg.slp` (labels) and `http`/`https` media video; all other labels formats raise `NotImplementedError` over a URL — download the file locally first. See [Loading from URLs](../examples.md#loading-from-urls).
672+
661673
## See Also
662674

663675
- [Data Model](../model/index.md): Understanding the core data structures

docs/formats/slp.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -760,6 +760,9 @@ The `/negative_frames` dataset is optional. Files without negative frames will n
760760

761761
For large SLP files with hundreds of thousands of frames, sleap-io provides a lazy loading mode that defers [`Labels`][sleap_io.Labels] object creation until needed.
762762

763+
!!! tip "Streaming from a URL"
764+
`.slp`/`.pkg.slp` files can be opened straight from `http`/`https`, cloud, or Google Drive URLs via [`load_slp`][sleap_io.load_slp] with lazy range-based reads (embedded `pkg.slp` frames reopen the remote file on demand). See [Loading from URLs](../examples.md#loading-from-urls).
765+
763766
### Architecture
764767

765768
```

docs/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@ does *not* include labeling, training, or inference.
2323
- **Codecs** -- Convert to/from NumPy arrays, DataFrames (pandas/polars), and dictionaries ([guide](codecs.md))
2424
- **Video I/O** -- Read any video format via pluggable backends (FFMPEG, OpenCV, PyAV) with a NumPy-like interface ([model](model/video.md))
2525
- **Lazy loading** -- Load large SLP files up to 90x faster by deferring object creation ([details](formats/slp.md#lazy-loading))
26-
- **Remote URLs** -- Load `.slp` files directly from `https://`, `s3://`, `gs://`, and `az://` URLs with lazy range-based reads and optional persistent caching ([guide](examples.md#loading-from-urls))
26+
- **Remote URLs** -- Load `.slp`/`.pkg.slp` and remote media video directly from `https://`, `s3://`, `gs://`, `az://`, and Google Drive URLs with lazy range-based reads and optional persistent caching ([guide](remote.md))
2727
- **Dataset splits** -- Create train/val/test splits and export to formats like Ultralytics YOLO ([example](examples.md#make-trainingvalidationtest-splits))
2828

2929
## Installation

docs/install.md

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -128,7 +128,9 @@ sio --version
128128
uv tool install sleap-io # Basic installation
129129
```
130130

131-
Video support via imageio-ffmpeg is always included.
131+
Video support via imageio-ffmpeg is always included. The `[all]` extra also
132+
includes the cloud-storage adapters (`s3fs`, `gcsfs`, `adlfs`) needed to
133+
load `.slp` files from `s3://`/`gs://`/`az://` URLs.
132134

133135
See the [CLI documentation](cli.md) for a complete command reference.
134136

@@ -380,11 +382,16 @@ sleap-io uses optional dependencies for specific features:
380382
| Extra | Packages | Purpose |
381383
|-------|----------|---------|
382384
| `opencv` | `opencv-python` | Fastest video backend (2-3x faster) |
383-
| `pyav` | `av` | Balanced speed/features video backend |
385+
| `pyav` | `av` | Balanced speed/features video backend; required for loading remote media video over `http`/`https` |
386+
| `cloud` | `s3fs`, `gcsfs`, `adlfs` | Load `.slp` from cloud-storage URLs (`s3://`, `gs://`/`gcs://`, `az://`/`abfs://`) |
384387
| `mat` | `pymatreader` | LEAP `.mat` file support |
385388
| `polars` | `polars`, `pyarrow` | Fast dataframe operations |
386389
| `all` | All of the above | Everything included |
387390

391+
`http`/`https` URLs work with the base install; cloud-storage URLs need the
392+
`cloud` extra and remote media video needs the `pyav` extra. See
393+
[Loading from URLs](examples.md#loading-from-urls).
394+
388395
Install specific extras:
389396

390397
```bash
