Commit f541c04

jonesphillip and Oxyjun authored
Adds documentation for data catalog (#21422)
* Adds documentation for R2 Data Catalog
* Added managing catalogs documentation and R2 Data Catalog as a product.
* Add changelog entry
* PCX review
* Fix PR comments/typos.
* Added PySpark example configuration.
* Update src/content/docs/r2/data-catalog/config-examples/spark-scala.mdx
* Added more context for data catalog auth
* Add access policy example for r2 data catalog API tokens

---------

Co-authored-by: Jun Lee <[email protected]>
1 parent 205760f commit f541c04

20 files changed: +1075 -83 lines changed
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,22 @@
1+
---
2+
title: R2 Data Catalog is a managed Apache Iceberg data catalog built directly into R2 buckets
3+
description: A managed Apache Iceberg data catalog built directly into R2 buckets
4+
products:
5+
- r2
6+
date: 2025-04-10T13:00:00Z
7+
hidden: true
8+
---
9+
10+
Today, we're launching [R2 Data Catalog](/r2/data-catalog/) in open beta, a managed Apache Iceberg catalog built directly into your [Cloudflare R2](/r2/) bucket.
11+
12+
If you're not already familiar with it, [Apache Iceberg](https://iceberg.apache.org/) is an open table format designed to handle large-scale analytics datasets stored in object storage, offering ACID transactions and schema evolution. R2 Data Catalog exposes a standard Iceberg REST catalog interface, so you can connect engines like [Spark](/r2/data-catalog/config-examples/spark-scala/), [Snowflake](/r2/data-catalog/config-examples/snowflake/), and [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/) to start querying your tables using the tools you already know.
13+
14+
To enable a data catalog on your R2 bucket, find **R2 Data Catalog** in your bucket's settings in the dashboard, or run:
15+
16+
```bash
17+
npx wrangler r2 bucket catalog enable my-bucket
18+
```
19+
20+
And that's it. You'll get a catalog URI and warehouse you can plug into your favorite Iceberg engines.
21+
22+
Visit our [getting started guide](/r2/data-catalog/get-started/) for step-by-step instructions on enabling R2 Data Catalog, creating tables, and running your first queries.
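
As a rough illustration of how an engine uses the catalog URI, warehouse, and token, the sketch below builds (but does not send) the `GET /v1/config` request that Iceberg REST clients typically issue first. The function name `build_config_request` and the example URI, warehouse, and token values are placeholders invented for this sketch, not values returned by R2.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_config_request(catalog_uri: str, warehouse: str, token: str) -> Request:
    """Build, without sending, the config request of the Iceberg REST protocol."""
    base = catalog_uri.rstrip("/")  # avoid a double slash before /v1
    query = urlencode({"warehouse": warehouse})
    return Request(
        f"{base}/v1/config?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values (assumptions); substitute the ones shown for your own catalog.
req = build_config_request(
    "https://catalog.example.com/my-catalog/",
    "my-warehouse",
    "my-token",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the call. Engines such as PyIceberg and Spark handle this step for you once they are given the same three values.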

Diff for: src/content/docs/r2/api/tokens.mdx

+44-12
@@ -45,12 +45,18 @@ Jurisdictional buckets can only be accessed via the corresponding jurisdictional
4545

4646
## Permissions
4747

48-
| Permission | Description |
49-
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
50-
| Admin Read & Write | Allows the ability to create, list and delete buckets, and edit bucket configurations in addition to list, write, and read object access. |
51-
| Admin Read only | Allows the ability to list buckets and view bucket configuration in addition to list and read object access. |
52-
| Object Read & Write | Allows the ability to read, write, and list objects in specific buckets. |
53-
| Object Read only | Allows the ability to read and list objects in specific buckets. |
48+
| Permission | Description |
49+
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
50+
| Admin Read & Write | Allows the ability to create, list, and delete buckets, edit bucket configuration, read, write, and list objects, and read and write to data catalog tables and associated metadata. |
51+
| Admin Read only | Allows the ability to list buckets and view bucket configuration, read and list objects, and read from the data catalog tables and associated metadata. |
52+
| Object Read & Write | Allows the ability to read, write, and list objects in specific buckets. |
53+
| Object Read only | Allows the ability to read and list objects in specific buckets. |
54+
55+
:::note
56+
57+
Currently, the **Admin Read & Write** or **Admin Read only** permission is required to use [R2 Data Catalog](/r2/data-catalog/).
58+
59+
:::
5460

5561
## Create API tokens via API
5662

@@ -90,7 +96,7 @@ All buckets in an account are represented as:
9096

9197
#### Permission groups
9298

93-
Determine what [permission groups](/fundamentals/api/how-to/create-via-api/#permission-groups) should be applied. There are four relevant permission groups for R2.
99+
Determine what [permission groups](/fundamentals/api/how-to/create-via-api/#permission-groups) should be applied.
94100

95101
<table>
96102
<tbody>
@@ -101,7 +107,7 @@ Determine what [permission groups](/fundamentals/api/how-to/create-via-api/#perm
101107
Resource
102108
</th>
103109
<th colspan="5" rowspan="1">
104-
Permission
110+
Description
105111
</th>
106112
<tr>
107113
<td colspan="5" rowspan="1">
@@ -111,7 +117,8 @@ Determine what [permission groups](/fundamentals/api/how-to/create-via-api/#perm
111117
Account
112118
</td>
113119
<td colspan="5" rowspan="1">
114-
Admin Read & Write
120+
Can create, delete, and list buckets, edit bucket configuration, and
121+
read, write, and list objects.
115122
</td>
116123
</tr>
117124
<tr>
@@ -122,7 +129,8 @@ Determine what [permission groups](/fundamentals/api/how-to/create-via-api/#perm
122129
Account
123130
</td>
124131
<td colspan="5" rowspan="1">
125-
Admin Read only
132+
Can list buckets and view bucket configuration, and read and list
133+
objects.
126134
</td>
127135
</tr>
128136
<tr>
@@ -133,7 +141,7 @@ Determine what [permission groups](/fundamentals/api/how-to/create-via-api/#perm
133141
Bucket
134142
</td>
135143
<td colspan="5" rowspan="1">
136-
Object Read & Write
144+
Can read, write, and list objects in buckets.
137145
</td>
138146
</tr>
139147
<tr>
@@ -144,7 +152,31 @@ Determine what [permission groups](/fundamentals/api/how-to/create-via-api/#perm
144152
Bucket
145153
</td>
146154
<td colspan="5" rowspan="1">
147-
Object Read only
155+
Can read and list objects in buckets.
156+
</td>
157+
</tr>
158+
<tr>
159+
<td colspan="5" rowspan="1">
160+
<code>Workers R2 Data Catalog Write</code>
161+
</td>
162+
<td colspan="5" rowspan="1">
163+
Account
164+
</td>
165+
<td colspan="5" rowspan="1">
166+
Can read from and write to data catalogs. This permission allows
167+
access to the Iceberg REST catalog interface.
168+
</td>
169+
</tr>
170+
<tr>
171+
<td colspan="5" rowspan="1">
172+
<code>Workers R2 Data Catalog Read</code>
173+
</td>
174+
<td colspan="5" rowspan="1">
175+
Account
176+
</td>
177+
<td colspan="5" rowspan="1">
178+
Can read from data catalogs. This permission allows read-only
179+
access to the Iceberg REST catalog interface.
148180
</td>
149181
</tr>
150182
</tbody>
@@ -0,0 +1,16 @@
1+
---
2+
pcx_content_type: navigation
3+
title: Connect to Iceberg engines
4+
head: []
5+
sidebar:
6+
order: 4
7+
group:
8+
hideIndex: true
9+
description: Find detailed setup instructions for Apache Spark and other common query engines.
10+
---
11+
12+
import { DirectoryListing } from "~/components";
13+
14+
Below are configuration examples to connect various Iceberg engines to [R2 Data Catalog](/r2/data-catalog/):
15+
16+
<DirectoryListing />
@@ -0,0 +1,50 @@
1+
---
2+
title: PyIceberg
3+
pcx_content_type: example
4+
---
5+
6+
Below is an example of using [PyIceberg](https://py.iceberg.apache.org/) to connect to R2 Data Catalog.
7+
8+
## Prerequisites
9+
10+
- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
11+
- [Create an R2 bucket](/r2/buckets/create-buckets/) and [enable the data catalog](/r2/data-catalog/manage-catalogs/#enable-r2-data-catalog-on-a-bucket).
12+
- [Create an R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
13+
- Install the [PyIceberg](https://py.iceberg.apache.org/#installation) and [PyArrow](https://arrow.apache.org/docs/python/install.html) libraries.
14+
15+
## Example usage
16+
17+
```py
18+
import pyarrow as pa
19+
from pyiceberg.catalog.rest import RestCatalog
20+
from pyiceberg.exceptions import NamespaceAlreadyExistsError
21+
22+
# Define catalog connection details (replace variables)
23+
WAREHOUSE = "<WAREHOUSE>"
24+
TOKEN = "<TOKEN>"
25+
CATALOG_URI = "<CATALOG_URI>"
26+
27+
# Connect to R2 Data Catalog
28+
catalog = RestCatalog(
29+
name="my_catalog",
30+
warehouse=WAREHOUSE,
31+
uri=CATALOG_URI,
32+
token=TOKEN,
33+
)
34+
35+
# Create default namespace
36+
catalog.create_namespace("default")
37+
38+
# Create simple PyArrow table
39+
df = pa.table({
40+
"id": [1, 2, 3],
41+
"name": ["Alice", "Bob", "Charlie"],
42+
})
43+
44+
# Create an Iceberg table
45+
test_table = ("default", "my_table")
46+
table = catalog.create_table(
47+
test_table,
48+
schema=df.schema,
49+
)
50+
```
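
The example above calls `create_namespace` directly, which raises `NamespaceAlreadyExistsError` if the script is run a second time. One possible way to make that step repeatable is sketched below. `ensure_namespace` and `InMemoryCatalog` are hypothetical names introduced here; the helper assumes only that the catalog exposes `list_namespaces()` (returning tuples) and `create_namespace()`, as PyIceberg catalogs do.

```python
from typing import List, Protocol, Sequence, Tuple


class NamespaceCatalog(Protocol):
    """The two catalog methods this sketch relies on."""

    def list_namespaces(self) -> Sequence[Tuple[str, ...]]: ...

    def create_namespace(self, namespace: str) -> None: ...


def ensure_namespace(catalog: NamespaceCatalog, name: str) -> bool:
    """Create `name` only if it is missing; return True when it was created."""
    if (name,) in list(catalog.list_namespaces()):
        return False
    catalog.create_namespace(name)
    return True


class InMemoryCatalog:
    """Hypothetical stand-in so the helper can be exercised without a network call."""

    def __init__(self) -> None:
        self._namespaces: List[Tuple[str, ...]] = []

    def list_namespaces(self) -> List[Tuple[str, ...]]:
        return list(self._namespaces)

    def create_namespace(self, namespace: str) -> None:
        self._namespaces.append((namespace,))


demo = InMemoryCatalog()
first = ensure_namespace(demo, "default")   # namespace is missing, so it is created
second = ensure_namespace(demo, "default")  # already present, so nothing happens
print(first, second)
```

With the real `catalog` object from the example, `ensure_namespace(catalog, "default")` could stand in for the bare `create_namespace` call. The check-then-create sequence is not atomic, so catching the already imported `NamespaceAlreadyExistsError` around `create_namespace` is an equally reasonable choice. Once the table exists, `table.append(df)` writes the PyArrow rows and `table.scan().to_arrow()` reads them back.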
@@ -0,0 +1,62 @@
1+
---
2+
title: Snowflake
3+
pcx_content_type: example
4+
---
5+
6+
Below is an example of using [Snowflake](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-catalog-integration-rest) to connect to R2 Data Catalog and query its data (read-only).
7+
8+
## Prerequisites
9+
10+
- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
11+
- [Create an R2 bucket](/r2/buckets/create-buckets/) and [enable the data catalog](/r2/data-catalog/manage-catalogs/#enable-r2-data-catalog-on-a-bucket).
12+
- [Create an R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
13+
- A [Snowflake](https://www.snowflake.com/) account with the necessary privileges to create external volumes and catalog integrations.
14+
15+
## Example usage
16+
17+
In your Snowflake [SQL worksheet](https://docs.snowflake.com/en/user-guide/ui-snowsight-worksheets-gs) or [notebook](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks), run the following commands:
18+
19+
```sql
20+
-- Create a database (if you don't already have one) to organize your external data
21+
CREATE DATABASE IF NOT EXISTS r2_example_db;
22+
23+
-- Create an external volume pointing to your R2 bucket
24+
CREATE OR REPLACE EXTERNAL VOLUME ext_vol_r2
25+
STORAGE_LOCATIONS = (
26+
(
27+
NAME = 'my_r2_storage_location'
28+
STORAGE_PROVIDER = 'S3COMPAT'
29+
STORAGE_BASE_URL = 's3compat://<bucket-name>'
30+
CREDENTIALS = (
31+
AWS_KEY_ID = '<access_key>'
32+
AWS_SECRET_KEY = '<secret_access_key>'
33+
)
34+
STORAGE_ENDPOINT = '<account_id>.r2.cloudflarestorage.com'
35+
)
36+
)
37+
ALLOW_WRITES = FALSE;
38+
39+
-- Create a catalog integration for R2 Data Catalog (read-only)
40+
CREATE OR REPLACE CATALOG INTEGRATION r2_data_catalog
41+
CATALOG_SOURCE = ICEBERG_REST
42+
TABLE_FORMAT = ICEBERG
43+
CATALOG_NAMESPACE = 'default'
44+
REST_CONFIG = (
45+
CATALOG_URI = '<catalog_uri>'
46+
CATALOG_NAME = '<warehouse_name>'
47+
)
48+
REST_AUTHENTICATION = (
49+
TYPE = BEARER
50+
BEARER_TOKEN = '<token>'
51+
)
52+
ENABLED = TRUE;
53+
54+
-- Create an Apache Iceberg table in your selected Snowflake database
55+
CREATE ICEBERG TABLE my_iceberg_table
56+
CATALOG = 'r2_data_catalog'
57+
EXTERNAL_VOLUME = 'ext_vol_r2'
58+
CATALOG_TABLE_NAME = 'my_table'; -- Name of existing table in your R2 data catalog
59+
60+
-- Query your Iceberg table
61+
SELECT * FROM my_iceberg_table;
62+
```
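
If the statements above are generated by a script rather than pasted into a worksheet, the placeholders need to be substituted safely. The sketch below renders the catalog integration statement from Python, doubling any single quotes inside values. `sql_literal` and `catalog_integration_sql` are hypothetical helper names, and the URI, warehouse, and token shown are made-up placeholders.

```python
import re


def sql_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def catalog_integration_sql(
    name: str,
    catalog_uri: str,
    warehouse: str,
    token: str,
    namespace: str = "default",
) -> str:
    """Render the CREATE CATALOG INTEGRATION statement used in the example above."""
    # The integration name is emitted unquoted, so restrict it to a plain identifier.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"not a simple identifier: {name!r}")
    return "\n".join(
        [
            f"CREATE OR REPLACE CATALOG INTEGRATION {name}",
            "  CATALOG_SOURCE = ICEBERG_REST",
            "  TABLE_FORMAT = ICEBERG",
            f"  CATALOG_NAMESPACE = {sql_literal(namespace)}",
            "  REST_CONFIG = (",
            f"    CATALOG_URI = {sql_literal(catalog_uri)}",
            f"    CATALOG_NAME = {sql_literal(warehouse)}",
            "  )",
            "  REST_AUTHENTICATION = (",
            "    TYPE = BEARER",
            f"    BEARER_TOKEN = {sql_literal(token)}",
            "  )",
            "  ENABLED = TRUE;",
        ]
    )


# Placeholder values (assumptions); use your own catalog URI, warehouse, and token.
sql = catalog_integration_sql(
    "r2_data_catalog",
    "https://catalog.example.com/my-catalog",
    "my-warehouse",
    "my-token",
)
print(sql)
```

Because the rendered statement contains the bearer token in plain text, it is best kept out of logs and version control.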
@@ -0,0 +1,71 @@
1+
---
2+
title: Spark (PySpark)
3+
pcx_content_type: example
4+
---
5+
6+
Below is an example of using [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to connect to R2 Data Catalog.
7+
8+
## Prerequisites
9+
10+
- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
11+
- [Create an R2 bucket](/r2/buckets/create-buckets/) and [enable the data catalog](/r2/data-catalog/manage-catalogs/#enable-r2-data-catalog-on-a-bucket).
12+
- [Create an R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
13+
- Install the [PySpark](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) library.
14+
15+
## Example usage
16+
17+
```py
18+
from pyspark.sql import SparkSession
19+
20+
# Define catalog connection details (replace variables)
21+
WAREHOUSE = "<WAREHOUSE>"
22+
TOKEN = "<TOKEN>"
23+
CATALOG_URI = "<CATALOG_URI>"
24+
25+
# Build Spark session with Iceberg configurations
26+
spark = SparkSession.builder \
27+
.appName("R2DataCatalogExample") \
28+
.config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,org.apache.iceberg:iceberg-aws-bundle:1.6.1') \
29+
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
30+
.config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
31+
.config("spark.sql.catalog.my_catalog.type", "rest") \
32+
.config("spark.sql.catalog.my_catalog.uri", CATALOG_URI) \
33+
.config("spark.sql.catalog.my_catalog.warehouse", WAREHOUSE) \
34+
.config("spark.sql.catalog.my_catalog.token", TOKEN) \
35+
.config("spark.sql.catalog.my_catalog.header.X-Iceberg-Access-Delegation", "vended-credentials") \
36+
.config("spark.sql.catalog.my_catalog.s3.remote-signing-enabled", "false") \
37+
.config("spark.sql.defaultCatalog", "my_catalog") \
38+
.getOrCreate()
39+
spark.sql("USE my_catalog")
40+
41+
# Create namespace if it does not exist
42+
spark.sql("CREATE NAMESPACE IF NOT EXISTS default")
43+
44+
# Create a table in the namespace using Iceberg
45+
spark.sql("""
46+
CREATE TABLE IF NOT EXISTS default.my_table (
47+
id BIGINT,
48+
name STRING
49+
)
50+
USING iceberg
51+
""")
52+
53+
# Create a simple DataFrame
54+
df = spark.createDataFrame(
55+
[(1, "Alice"), (2, "Bob"), (3, "Charlie")],
56+
["id", "name"]
57+
)
58+
59+
# Write the DataFrame to the Iceberg table
60+
df.write \
61+
.format("iceberg") \
62+
.mode("append") \
63+
.save("default.my_table")
64+
65+
# Read the data back from the Iceberg table
66+
result_df = spark.read \
67+
.format("iceberg") \
68+
.load("default.my_table")
69+
70+
result_df.show()
71+
```
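
The catalog-specific settings in the builder chain above all share the `spark.sql.catalog.my_catalog` prefix, so they can be generated from one place when the catalog name changes. The sketch below collects the same keys into a dictionary. `iceberg_rest_conf` is a hypothetical helper name, and the values passed in are the same placeholders used in the example.

```python
from typing import Dict


def iceberg_rest_conf(catalog: str, uri: str, warehouse: str, token: str) -> Dict[str, str]:
    """Return the Spark settings from the example above, keyed for `catalog`."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.token": token,
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        f"{prefix}.s3.remote-signing-enabled": "false",
        "spark.sql.defaultCatalog": catalog,
    }


conf = iceberg_rest_conf("my_catalog", "<CATALOG_URI>", "<WAREHOUSE>", "<TOKEN>")
for key in sorted(conf):
    # Print only the keys so the token never ends up in console output.
    print(key)
```

Each pair can then be applied with `builder = builder.config(key, value)` in a loop before calling `getOrCreate()`. The `spark.jars.packages` setting is left out of the helper because it depends on the Spark and Iceberg versions in use rather than on the catalog.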