Skip to content

Allow r2:// URL scheme#365

Draft
timmclaughlin wants to merge 1 commit into
Snowflake-Labs:mainfrom
timmclaughlin:feat/allow-r2-url-scheme
Draft

Allow r2:// URL scheme#365
timmclaughlin wants to merge 1 commit into
Snowflake-Labs:mainfrom
timmclaughlin:feat/allow-r2-url-scheme

Conversation

@timmclaughlin
Copy link
Copy Markdown
Contributor

Description

Allow Cloudflare R2 (`r2://`) URLs through pg_lake's PG-side URL allowlists.

Why this works

The DuckDB `S3FileSystem` natively recognizes `r2://` alongside `s3://`, `s3a://`, `s3n://`, `gcs://`, and `gs://` — see `duckdb-httpfs/src/s3fs.cpp` `S3FileSystem::CanHandleFile`. `duckdb_pglake`'s `RegionAwareS3FileSystem` already short-circuits all non-`s3://` URLs straight to the underlying `S3FileSystem` (region auto-detection is intentionally `s3://` only — R2 always uses `auto`), so no `duckdb_pglake` change is needed. The only thing gating `r2://` today is pg_lake's PG-side URL allowlist.

What this PR changes

  1. Add `r2://` to both URL allowlists:

    • `pg_lake_engine/src/copy/copy_format.c` `IsSupportedURL` — the project-wide check consulted by:
      • `pg_lake_copy` (`CREATE TABLE ... load_from`/`definition_from`)
      • `pg_lake_iceberg` (`lake_iceberg.metadata`/`files`/`snapshots`, the `pg_lake_iceberg.default_location_prefix` GUC)
      • `pg_lake_table` (`file_size`/`file_exists`/`file_preview`/`delete_file`, foreign-table `path`/`location` options, `CREATE TABLE` for iceberg / pg_lake_table)
      • `pg_lake_benchmark`
    • `pg_lake_table/src/util/s3_file_utils.c` `IsFileListSupportedUrl` — the tighter list-only check used by `lake_file.list` (excludes `http(s)://` since those can't enumerate directories).
  2. Refresh the now-stale `"only s3://, gs://, az://, azure://, and abfss://"` error strings across the codebase so they accurately reflect what `IsSupportedURL` accepts. This also picks up the previously-missing `hf://` — HuggingFace URLs have been accepted by `IsSupportedURL` for a while but the error strings were never updated. Updated the four pytest assertions that match those strings exactly.

Repro of the original failure

```sql
SELECT * FROM lake_file.list('r2://my-bucket/prefix/*');
-- ERROR: list_files: only s3://, gs://, az://, azure://, abfss://, and hf:// urls are supported
```

After this PR the same call falls through to `pgduck_server` / DuckDB which handles `r2://` via the standard R2 secret mechanism.


Checklist

  • I have tested my changes and added tests if necessary
  • I updated documentation if needed
  • I confirm that all my commits are signed off (DCO)

Notes for reviewers

  • I considered narrowing the patch to only `IsFileListSupportedUrl` + the `list_files` error string, but every other call site that gates on `IsSupportedURL` (CREATE FOREIGN TABLE, iceberg metadata functions, the location-prefix GUC, etc.) would have inconsistent behavior — `r2://` would work in some paths and not others. Adding `r2://` centrally in `IsSupportedURL` keeps the surface consistent.
  • The error-message refresh is mechanical: every occurrence of the stale prefix list now lists what `IsSupportedURL` actually accepts. Happy to split that out of this PR if you'd prefer two smaller commits.
  • Did not add an e2e test that hits a real R2 bucket — that would require account credentials in CI. The pytest assertions in `test_list_file.py`, `test_file_preview.py`, `test_create_table.py`, `test_create_as_select.py`, `test_create_iceberg_table_load_from.py`, `test_iceberg_functions.py`, `test_queries.py`, and `test_writable_iceberg.py` exercise the allowlist gate against invalid URLs and now verify the updated error string. Happy to add an R2-specific positive test if you have a preferred pattern.

The DuckDB S3FileSystem natively recognizes r2:// alongside s3://, s3a://,
s3n://, gcs://, and gs:// (see duckdb-httpfs s3fs.cpp::CanHandleFile), and
duckdb_pglake's RegionAwareS3FileSystem already short-circuits all non-
s3:// URLs straight to the underlying S3FileSystem (region auto-detection
is intentionally s3:// only — R2 always uses "auto"). The only thing
gating r2:// today is pg_lake's PG-side URL allowlist.

Add r2:// to both URL allowlists:

  * pg_lake_engine IsSupportedURL — the project-wide check used by
    pg_lake_copy (CREATE TABLE ... load_from/definition_from),
    pg_lake_iceberg (lake_iceberg.metadata/files/snapshots, the
    pg_lake_iceberg.default_location_prefix GUC),
    pg_lake_table (file_size/file_exists/file_preview/delete_file,
    foreign-table "path" and "location" options, CREATE TABLE
    iceberg/pg_lake_table), and pg_lake_benchmark.
  * pg_lake_table IsFileListSupportedUrl — the tighter list-only
    check used by lake_file.list (excludes http/https since those
    can't enumerate directories).

Refresh the now-stale "only s3://, gs://, az://, azure://, and abfss://"
error strings across the codebase so they accurately reflect what
IsSupportedURL accepts (this also picks up the missing hf:// that has
been accepted since HuggingFace support landed). Update the four
pytest assertions that match those strings exactly.

Signed-off-by: Tim McLaughlin <tim@gotab.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant