251 changes: 251 additions & 0 deletions catalogs/ducklake/README.md
@@ -0,0 +1,251 @@
# DuckLake Catalog Connector

> **Note:** The DuckLake connector is available in Spice v2.0 or later.

The DuckLake Catalog Connector enables Spice to automatically discover and query all schemas and tables in a [DuckLake](https://ducklake.select/) catalog. DuckLake is an open lakehouse format that stores metadata in a SQL database (a local DuckDB file in this recipe) and data in Parquet files.

## Prerequisites

- [DuckDB CLI](https://duckdb.org/docs/installation/) installed (to create a DuckLake catalog).
- Spice installed (see the [Getting Started](https://docs.spiceai.org/getting-started) documentation).

## Step 1. Create a new directory and initialize a Spicepod

```bash
mkdir ducklake-catalog-recipe
cd ducklake-catalog-recipe
spice init
```

## Step 2. Create a DuckLake catalog with sample data

Open the DuckDB CLI in the recipe directory:

```bash
duckdb
```

Install and load the DuckLake and TPC-H extensions, then create a catalog and populate it:

```sql
INSTALL ducklake;
LOAD ducklake;
INSTALL tpch;
LOAD tpch;

-- Create a DuckLake catalog with local metadata storage
ATTACH 'ducklake:metadata.ducklake' AS my_lakehouse;

-- Generate TPC-H data (scale factor 0.01 for a quick demo)
CALL dbgen(sf = 0.01, catalog = 'my_lakehouse');
```

Verify the tables were created:

```sql
SHOW ALL TABLES;
```

```text
┌──────────────┬─────────┬──────────┬────────────────────┬─────────────────────────┬───────────┐
│ database     │ schema  │ name     │ column_names       │ column_types            │ temporary │
│ varchar      │ varchar │ varchar  │ varchar[]          │ varchar[]               │ boolean   │
├──────────────┼─────────┼──────────┼────────────────────┼─────────────────────────┼───────────┤
│ my_lakehouse │ main    │ customer │ [c_custkey, ...]   │ [INTEGER, VARCHAR, ...] │ false     │
│ my_lakehouse │ main    │ lineitem │ [l_orderkey, ...]  │ [INTEGER, INTEGER, ...] │ false     │
│ my_lakehouse │ main    │ nation   │ [n_nationkey, ...] │ [INTEGER, VARCHAR, ...] │ false     │
│ my_lakehouse │ main    │ orders   │ [o_orderkey, ...]  │ [INTEGER, INTEGER, ...] │ false     │
│ my_lakehouse │ main    │ part     │ [p_partkey, ...]   │ [INTEGER, VARCHAR, ...] │ false     │
│ my_lakehouse │ main    │ partsupp │ [ps_partkey, ...]  │ [INTEGER, INTEGER, ...] │ false     │
│ my_lakehouse │ main    │ region   │ [r_regionkey, ...] │ [INTEGER, VARCHAR, ...] │ false     │
│ my_lakehouse │ main    │ supplier │ [s_suppkey, ...]   │ [INTEGER, VARCHAR, ...] │ false     │
└──────────────┴─────────┴──────────┴────────────────────┴─────────────────────────┴───────────┘
```

Exit DuckDB:

```sql
.exit
```
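The interactive session above can also be captured in a script so the catalog is reproducible. The following is a minimal sketch, assuming the DuckDB CLI is on your `PATH` and reads SQL from standard input; the file name `setup.sql` and the function name `create_catalog` are illustrative choices, not part of the recipe.

```bash
# Write the Step 2 statements to a file so the catalog can be rebuilt with one command.
# setup.sql is a hypothetical file name chosen for this sketch.
cat > setup.sql <<'EOF'
INSTALL ducklake;
LOAD ducklake;
INSTALL tpch;
LOAD tpch;

ATTACH 'ducklake:metadata.ducklake' AS my_lakehouse;
CALL dbgen(sf = 0.01, catalog = 'my_lakehouse');
EOF

# Feed the file to the DuckDB CLI on standard input.
# Defined as a function so nothing runs until DuckDB is installed and you call it.
create_catalog() {
  duckdb < setup.sql
}
```

Run `create_catalog` from the recipe directory so that `metadata.ducklake` is created next to `spicepod.yaml`.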

## Step 3. Configure the DuckLake Catalog Connector in your Spicepod

Edit `spicepod.yaml` to add the DuckLake catalog:

```yaml
version: v1
kind: Spicepod
name: ducklake-catalog-recipe

catalogs:
  - from: ducklake:metadata.ducklake
    name: my_lakehouse
```
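Because the `from` value names a local file, it can help to confirm the file exists before starting the runtime. The following is a rough preflight sketch, assuming the part after `ducklake:` is a path resolved against the current directory; `check_catalog_file` is a hypothetical helper, not a Spice command.

```bash
# Preflight sketch: confirm the metadata file named in spicepod.yaml exists.
# Assumption: the text after "ducklake:" is a local path relative to the current directory.
check_catalog_file() {
  local spicepod="${1:-spicepod.yaml}"
  local path
  # Extract the value after "from: ducklake:" on the first matching line.
  path="$(sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*from:[[:space:]]*ducklake:\(.*\)$/\1/p' "$spicepod" | head -n 1)"
  if [ -z "$path" ]; then
    echo "no ducklake catalog found in $spicepod" >&2
    return 2
  fi
  case "$path" in
    *://*)
      # Remote locations (for example s3://) cannot be checked with a file test.
      echo "remote path, skipping: $path"
      return 0
      ;;
  esac
  if [ -f "$path" ]; then
    echo "found $path"
  else
    echo "missing $path" >&2
    return 1
  fi
}
```

Calling `check_catalog_file` in the recipe directory prints `found metadata.ducklake` once Step 2 has been completed.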

## Step 4. Start the Spice runtime

```bash
spice run
```

Observe that Spice discovers all schemas and tables:

```text
2026-03-02T10:00:00.000000Z INFO runtime::init::catalog: Registering catalog 'my_lakehouse' for ducklake
2026-03-02T10:00:00.500000Z INFO runtime::init::catalog: Registered catalog 'my_lakehouse' with 1 schema and 8 tables
```

## Step 5. Query the DuckLake catalog

In a new terminal, start the Spice SQL REPL:

```bash
spice sql
```

List all discovered tables:

```sql
SHOW TABLES;
```

```text
+---------------+--------------+--------------+------------+
| table_catalog | table_schema | table_name   | table_type |
+---------------+--------------+--------------+------------+
| my_lakehouse  | main         | customer     | BASE TABLE |
| my_lakehouse  | main         | lineitem     | BASE TABLE |
| my_lakehouse  | main         | nation       | BASE TABLE |
| my_lakehouse  | main         | orders       | BASE TABLE |
| my_lakehouse  | main         | part         | BASE TABLE |
| my_lakehouse  | main         | partsupp     | BASE TABLE |
| my_lakehouse  | main         | region       | BASE TABLE |
| my_lakehouse  | main         | supplier     | BASE TABLE |
| spice         | runtime      | task_history | BASE TABLE |
| spice         | runtime      | metrics      | BASE TABLE |
+---------------+--------------+--------------+------------+
```

Query the customer table:

```sql
SELECT c_custkey, c_name, c_mktsegment, c_acctbal
FROM my_lakehouse.main.customer
LIMIT 5;
```

```text
+-----------+--------------------+--------------+-----------+
| c_custkey | c_name             | c_mktsegment | c_acctbal |
+-----------+--------------------+--------------+-----------+
| 1         | Customer#000000001 | BUILDING     | 711.56    |
| 2         | Customer#000000002 | AUTOMOBILE   | 121.65    |
| 3         | Customer#000000003 | AUTOMOBILE   | 7498.12   |
| 4         | Customer#000000004 | MACHINERY    | 2866.83   |
| 5         | Customer#000000005 | HOUSEHOLD    | 794.47    |
+-----------+--------------------+--------------+-----------+
```

Run a cross-table query:

```sql
SELECT n.n_name AS nation, COUNT(*) AS num_customers, ROUND(AVG(c.c_acctbal), 2) AS avg_balance
FROM my_lakehouse.main.customer c
JOIN my_lakehouse.main.nation n ON c.c_nationkey = n.n_nationkey
GROUP BY n.n_name
ORDER BY num_customers DESC
LIMIT 5;
```
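Queries like the ones above can also be sent from a script instead of the REPL. The following is a sketch only, resting on two assumptions you should verify against the Spice documentation: that the runtime serves an HTTP API at `http://localhost:8090`, and that `POST /v1/sql` accepts a plain-text SQL statement. `SPICE_HTTP_URL`, `sql_endpoint`, and `run_query` are hypothetical names introduced for this example.

```bash
# Assumption: the Spice runtime serves its HTTP API at http://localhost:8090 and
# accepts a plain-text SQL statement via POST /v1/sql.
# SPICE_HTTP_URL is a hypothetical override variable used only by this sketch.
sql_endpoint() {
  echo "${SPICE_HTTP_URL:-http://localhost:8090}/v1/sql"
}

# Send one SQL statement to the running runtime and print the response body.
run_query() {
  curl --silent --show-error --request POST --data "$1" "$(sql_endpoint)"
}
```

With `spice run` active in another terminal, `run_query "SELECT COUNT(*) FROM my_lakehouse.main.customer"` would send the statement to the runtime under those assumptions.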

## Step 6. Enable read-write access (optional)

To enable write operations, update the catalog configuration with `access: read_write`:

```yaml
version: v1
kind: Spicepod
name: ducklake-catalog-recipe

catalogs:
  - from: ducklake:metadata.ducklake
    name: my_lakehouse
    access: read_write
```

Restart Spice, then open the SQL REPL in a new terminal and insert a row:

```bash
spice run
```

```bash
spice sql
```

```sql
INSERT INTO my_lakehouse.main.region (r_regionkey, r_name, r_comment)
VALUES (5, 'ANTARCTICA', 'A cold and remote region');
```

```text
+-------+
| count |
+-------+
| 1     |
+-------+
```

Verify the insert:

```sql
SELECT * FROM my_lakehouse.main.region ORDER BY r_regionkey;
```

## Using the DuckLake Data Connector

Instead of the catalog connector (which auto-discovers all tables), you can connect to specific tables using the DuckLake data connector:

```yaml
version: v1
kind: Spicepod
name: ducklake-data-connector-recipe

datasets:
  - from: ducklake:customer
    name: customer
    params:
      connection_string: metadata.ducklake
  - from: ducklake:orders
    name: orders
    params:
      connection_string: metadata.ducklake
```

This is useful when you only need specific tables or want to configure each dataset independently (e.g., with different acceleration settings).
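As an illustration of per-dataset configuration, the sketch below writes a Spicepod in which only `customer` is accelerated. It assumes the `acceleration` block with `enabled: true` applies to DuckLake datasets the same way it does to other Spice datasets; check the connector documentation before relying on it.

```bash
# Sketch: accelerate one dataset and leave the other as a direct (federated) read.
# Note: this overwrites spicepod.yaml in the current directory.
cat > spicepod.yaml <<'EOF'
version: v1
kind: Spicepod
name: ducklake-data-connector-recipe

datasets:
  - from: ducklake:customer
    name: customer
    params:
      connection_string: metadata.ducklake
    # Assumed setting: keep a locally accelerated copy of this table only.
    acceleration:
      enabled: true
  - from: ducklake:orders
    name: orders
    params:
      connection_string: metadata.ducklake
EOF
```

Configuring tables one by one like this is the trade-off against the catalog connector, which discovers every table but applies a single configuration to all of them.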

## Using with Cloud Storage (S3)

DuckLake supports storing metadata and data on cloud storage. To use S3:

1. Ensure AWS credentials are available via environment variables, `~/.aws/credentials`, or an IAM instance profile.

2. Create a DuckLake catalog on S3 (via DuckDB CLI):

   ```sql
   ATTACH 'ducklake:s3://my-bucket/lakehouse/metadata.ducklake' AS cloud_lakehouse;
   ```

3. Configure the Spice catalog:

   ```yaml
   catalogs:
     - from: ducklake:s3://my-bucket/lakehouse/metadata.ducklake
       name: cloud_lakehouse
   ```
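For the environment-variable option in step 1, a quick check before starting Spice can save a failed startup. This is a minimal sketch covering only the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variables; it does not inspect `~/.aws/credentials` or instance profiles, and `check_aws_env` is a hypothetical helper name.

```bash
# Report which of the standard AWS credential variables are missing from the environment.
# Returns 0 when both are set and 1 otherwise; credentials supplied through
# ~/.aws/credentials or an IAM instance profile are not detected by this check.
check_aws_env() {
  local missing=0
  local name
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    if [ -z "$(printenv "$name")" ]; then
      echo "unset: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A non-zero result is not necessarily an error: it only means the environment-variable route is not configured in the current shell.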

## Learn more

- [DuckLake website](https://ducklake.select/)
- [DuckLake Catalog Connector documentation](https://spiceai.org/docs/components/catalogs/ducklake)
- [DuckLake Data Connector documentation](https://spiceai.org/docs/components/data-connectors/ducklake)
- [`spice sql` CLI reference](https://docs.spiceai.org/cli/reference/sql)
7 changes: 7 additions & 0 deletions catalogs/ducklake/spicepod.yaml
@@ -0,0 +1,7 @@
version: v1
kind: Spicepod
name: ducklake-catalog-recipe

catalogs:
  - from: ducklake:metadata.ducklake
    name: my_lakehouse