Commit 85e1828

Merge pull request #8 from PostHog/feat/ducklake-minio-setup
Add DuckLake with MinIO object storage support
2 parents a23dba4 + f7036aa commit 85e1828

7 files changed

Lines changed: 892 additions & 6 deletions

README.md

Lines changed: 130 additions & 0 deletions
@@ -142,6 +142,136 @@ ATTACH 'ducklake:postgres:host=ducklake.example.com user=ducklake password=secre
 
 See [DuckLake documentation](https://ducklake.select/docs/stable/duckdb/usage/connecting) for more details.
 
+### Quick Start with Docker
+
+The easiest way to get started with DuckLake is using the included Docker Compose setup:
+
+```bash
+# Start PostgreSQL (metadata) and MinIO (object storage)
+docker compose up -d
+
+# Wait for services to be ready
+docker compose logs -f  # Look for "Bucket ducklake created successfully"
+
+# Start Duckgres with DuckLake configured
+./duckgres --config duckgres.yaml
+
+# Connect and start using DuckLake
+PGPASSWORD=postgres psql "host=localhost port=5432 user=postgres sslmode=require"
+```
+
+The `docker-compose.yaml` creates:
+
+**PostgreSQL** (metadata catalog):
+- Host: `localhost`
+- Port: `5433` (mapped to avoid conflicts)
+- Database: `ducklake`
+- User/Password: `ducklake` / `ducklake`
+
+**MinIO** (S3-compatible object storage):
+- S3 API: `localhost:9000`
+- Web Console: `http://localhost:9001`
+- Access Key: `minioadmin`
+- Secret Key: `minioadmin`
+- Bucket: `ducklake` (auto-created on startup)
+
+The included `duckgres.yaml` is pre-configured to use both services.
+
+### Object Storage Configuration
+
+DuckLake can store data files in S3-compatible object storage (AWS S3, MinIO, etc.). Two credential providers are supported:
+
+#### Option 1: Explicit Credentials (MinIO / Access Keys)
+
+```yaml
+ducklake:
+  metadata_store: "postgres:host=localhost port=5433 user=ducklake password=ducklake dbname=ducklake"
+  object_store: "s3://ducklake/data/"
+  s3_provider: "config"            # Explicit credentials (default if s3_access_key is set)
+  s3_endpoint: "localhost:9000"    # MinIO or custom S3 endpoint
+  s3_access_key: "minioadmin"
+  s3_secret_key: "minioadmin"
+  s3_region: "us-east-1"
+  s3_use_ssl: false
+  s3_url_style: "path"             # "path" for MinIO, "vhost" for AWS S3
+```
+
+#### Option 2: AWS Credential Chain (IAM Roles / Environment)
+
+For AWS S3 with IAM roles, environment variables, or config files:
+
+```yaml
+ducklake:
+  metadata_store: "postgres:host=localhost user=ducklake password=ducklake dbname=ducklake"
+  object_store: "s3://my-bucket/ducklake/"
+  s3_provider: "credential_chain"  # AWS SDK credential chain
+  s3_chain: "env;config"           # Which sources to check (optional)
+  s3_profile: "my-profile"         # AWS profile name (optional)
+  s3_region: "us-west-2"           # Override auto-detected region (optional)
+```
+
+The credential chain checks these sources in order:
+- `env` - Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
+- `config` - AWS config files (`~/.aws/credentials`, `~/.aws/config`)
+- `sts` - AWS STS assume role
+- `sso` - AWS Single Sign-On
+- `instance` - EC2 instance metadata (IAM roles)
+- `process` - External process credentials
+
+See [DuckDB S3 API docs](https://duckdb.org/docs/stable/core_extensions/httpfs/s3api#credential_chain-provider) for details.
+
+#### Environment Variables
+
+All S3 settings can be configured via environment variables:
+- `DUCKGRES_DUCKLAKE_OBJECT_STORE` - S3 path (e.g., `s3://bucket/path/`)
+- `DUCKGRES_DUCKLAKE_S3_PROVIDER` - `config` or `credential_chain`
+- `DUCKGRES_DUCKLAKE_S3_ENDPOINT` - S3 endpoint (for MinIO)
+- `DUCKGRES_DUCKLAKE_S3_ACCESS_KEY` - Access key ID
+- `DUCKGRES_DUCKLAKE_S3_SECRET_KEY` - Secret access key
+- `DUCKGRES_DUCKLAKE_S3_REGION` - AWS region
+- `DUCKGRES_DUCKLAKE_S3_USE_SSL` - Use HTTPS (true/false)
+- `DUCKGRES_DUCKLAKE_S3_URL_STYLE` - `path` or `vhost`
+- `DUCKGRES_DUCKLAKE_S3_CHAIN` - Credential chain sources
+- `DUCKGRES_DUCKLAKE_S3_PROFILE` - AWS profile name
+
+### Seeding Sample Data
+
+A seed script is provided to populate DuckLake with sample e-commerce and analytics data:
+
+```bash
+# Seed with default connection (localhost:5432, postgres/postgres)
+./scripts/seed_ducklake.sh
+
+# Seed with custom connection
+./scripts/seed_ducklake.sh --host 127.0.0.1 --port 5432 --user postgres --password postgres
+
+# Clean existing tables and reseed
+./scripts/seed_ducklake.sh --clean
+```
+
+The script creates the following tables:
+- `categories` - Product categories (5 rows)
+- `products` - E-commerce products (15 rows)
+- `customers` - Customer records (10 rows)
+- `orders` - Order headers (12 rows)
+- `order_items` - Order line items (20 rows)
+- `events` - Analytics events with JSON properties (15 rows)
+- `page_views` - Web analytics data (15 rows)
+
+Example queries after seeding:
+
+```sql
+-- Top products by price
+SELECT name, price FROM products ORDER BY price DESC LIMIT 5;
+
+-- Orders with customer info
+SELECT o.id, c.first_name, c.last_name, o.total_amount, o.status
+FROM orders o JOIN customers c ON o.customer_id = c.id;
+
+-- Event funnel analysis
+SELECT event_name, COUNT(*) FROM events GROUP BY event_name ORDER BY COUNT(*) DESC;
+```
+
 ## COPY Protocol
 
 Duckgres supports PostgreSQL's COPY protocol for efficient bulk data import and export:
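
The two credential options documented in the README hunk above correspond closely to the parameters of DuckDB's `CREATE SECRET` statement for S3. The Go sketch below shows one way the `s3_*` settings could be turned into such a statement. It is an illustration under assumptions, not the code this commit adds (the server-side changes are not part of this excerpt): the `S3Settings` type, the `buildSecretSQL` function, and the secret name `ducklake_s3` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// S3Settings is a hypothetical mirror of the s3_* keys from duckgres.yaml.
type S3Settings struct {
	Provider  string // "config" or "credential_chain"
	Endpoint  string
	AccessKey string
	SecretKey string
	Region    string
	UseSSL    bool
	URLStyle  string
	Chain     string
	Profile   string
}

// quote wraps a value in single quotes and doubles any embedded quote,
// which is how SQL string literals are escaped.
func quote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// buildSecretSQL assembles a DuckDB CREATE SECRET statement.
// Empty settings are left out so DuckDB falls back to its own defaults.
// The name is assumed to be a plain identifier that needs no quoting.
func buildSecretSQL(name string, s S3Settings) string {
	opts := []string{"TYPE s3"}
	add := func(key, val string) {
		if val != "" {
			opts = append(opts, key+" "+quote(val))
		}
	}
	if s.Provider == "credential_chain" {
		opts = append(opts, "PROVIDER credential_chain")
		add("CHAIN", s.Chain)
		add("PROFILE", s.Profile)
		add("REGION", s.Region)
	} else {
		opts = append(opts, "PROVIDER config")
		add("KEY_ID", s.AccessKey)
		add("SECRET", s.SecretKey)
		add("REGION", s.Region)
		add("ENDPOINT", s.Endpoint)
		add("URL_STYLE", s.URLStyle)
		opts = append(opts, fmt.Sprintf("USE_SSL %t", s.UseSSL))
	}
	return fmt.Sprintf("CREATE OR REPLACE SECRET %s (%s)", name, strings.Join(opts, ", "))
}

func main() {
	// Values taken from the MinIO quick-start configuration.
	minio := S3Settings{
		Provider:  "config",
		Endpoint:  "localhost:9000",
		AccessKey: "minioadmin",
		SecretKey: "minioadmin",
		Region:    "us-east-1",
		URLStyle:  "path",
	}
	fmt.Println(buildSecretSQL("ducklake_s3", minio))
}
```

Escaping the values matters here because access keys and profile names come from user-controlled configuration and end up inside a SQL string.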

docker-compose.yaml

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+services:
+  postgres:
+    image: postgres:16-alpine
+    container_name: ducklake-metadata
+    environment:
+      POSTGRES_USER: ducklake
+      POSTGRES_PASSWORD: ducklake
+      POSTGRES_DB: ducklake
+    ports:
+      - "5433:5432"
+    volumes:
+      - ducklake-data:/var/lib/postgresql/data
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U ducklake -d ducklake"]
+      interval: 5s
+      timeout: 5s
+      retries: 5
+
+  minio:
+    image: minio/minio:latest
+    container_name: ducklake-storage
+    command: server /data --console-address ":9001"
+    environment:
+      MINIO_ROOT_USER: minioadmin
+      MINIO_ROOT_PASSWORD: minioadmin
+    ports:
+      - "9000:9000"  # S3 API
+      - "9001:9001"  # Web console
+    volumes:
+      - minio-data:/data
+    healthcheck:
+      test: ["CMD", "mc", "ready", "local"]
+      interval: 5s
+      timeout: 5s
+      retries: 5
+
+  # Creates the ducklake bucket on startup
+  minio-init:
+    image: minio/mc:latest
+    container_name: ducklake-storage-init
+    depends_on:
+      minio:
+        condition: service_healthy
+    entrypoint: >
+      /bin/sh -c "
+      mc alias set minio http://minio:9000 minioadmin minioadmin;
+      mc mb minio/ducklake --ignore-existing;
+      mc anonymous set download minio/ducklake;
+      echo 'Bucket ducklake created successfully';
+      "
+
+volumes:
+  ducklake-data:
+  minio-data:

duckgres.example.yaml

Lines changed: 24 additions & 0 deletions
@@ -38,6 +38,30 @@ ducklake:
   #   - "postgres:host=ducklake.example.com user=ducklake password=secret dbname=ducklake"
   # metadata_store: "postgres:host=localhost user=ducklake password=secret dbname=ducklake"
 
+  # S3-compatible object storage for data files (optional)
+  # If not specified, data is stored alongside the metadata
+  # object_store: "s3://bucket/path/"
+
+  # S3 credential provider: "config" (explicit) or "credential_chain" (AWS SDK)
+  # Default: "config" if s3_access_key is set, otherwise "credential_chain"
+  # s3_provider: "config"
+
+  # Option 1: Explicit credentials (for MinIO or when you have access keys)
+  # s3_endpoint: "localhost:9000"    # MinIO or custom S3 endpoint
+  # s3_access_key: "minioadmin"      # Access key ID
+  # s3_secret_key: "minioadmin"      # Secret access key
+  # s3_region: "us-east-1"           # AWS region (default: us-east-1)
+  # s3_use_ssl: false                # Use HTTPS for S3 connections
+  # s3_url_style: "path"             # "path" or "vhost" (default: path)
+
+  # Option 2: AWS credential chain (for AWS S3 with IAM roles, env vars, etc.)
+  # Uses AWS SDK credential chain: env vars -> config files -> instance metadata
+  # See: https://duckdb.org/docs/stable/core_extensions/httpfs/s3api#credential_chain-provider
+  # s3_provider: "credential_chain"
+  # s3_chain: "env;config"           # Which sources to check (env, config, sts, sso, instance, process)
+  # s3_profile: "my-profile"         # AWS profile name (for config chain)
+  # s3_region: "us-west-2"           # Override auto-detected region
+
 # Rate limiting configuration (optional - these are the defaults)
 rate_limit:
   # Max failed auth attempts before banning an IP
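
The comment in the example config states the default for `s3_provider`: `config` when `s3_access_key` is set, otherwise `credential_chain`. That rule can be sketched in Go as follows. The function name `resolveS3Provider` is hypothetical, and rejecting unknown values is an assumption of this sketch rather than documented behavior.

```go
package main

import "fmt"

// resolveS3Provider applies the documented default: an explicit value wins,
// otherwise the presence of an access key selects the "config" provider.
func resolveS3Provider(provider, accessKey string) (string, error) {
	switch provider {
	case "config", "credential_chain":
		return provider, nil
	case "":
		if accessKey != "" {
			return "config", nil
		}
		return "credential_chain", nil
	default:
		// Assumption: unknown values are treated as a configuration error.
		return "", fmt.Errorf("unknown s3_provider %q", provider)
	}
}

func main() {
	for _, tc := range []struct{ provider, accessKey string }{
		{"", "minioadmin"},       // no provider, key set
		{"", ""},                 // nothing set
		{"credential_chain", ""}, // explicit provider
	} {
		p, err := resolveS3Provider(tc.provider, tc.accessKey)
		fmt.Println(p, err)
	}
}
```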

go.mod

Lines changed: 5 additions & 2 deletions
@@ -2,6 +2,11 @@ module github.com/posthog/duckgres
 
 go 1.25.4
 
+require (
+	github.com/duckdb/duckdb-go/v2 v2.5.3
+	gopkg.in/yaml.v3 v3.0.1
+)
+
 require (
 	github.com/apache/arrow-go/v18 v18.4.1 // indirect
 	github.com/duckdb/duckdb-go-bindings v0.1.23 // indirect
@@ -12,7 +17,6 @@ require (
 	github.com/duckdb/duckdb-go-bindings/windows-amd64 v0.1.23 // indirect
 	github.com/duckdb/duckdb-go/arrowmapping v0.0.26 // indirect
 	github.com/duckdb/duckdb-go/mapping v0.0.25 // indirect
-	github.com/duckdb/duckdb-go/v2 v2.5.3 // indirect
 	github.com/go-viper/mapstructure/v2 v2.4.0 // indirect
 	github.com/goccy/go-json v0.10.5 // indirect
 	github.com/google/flatbuffers v25.2.10+incompatible // indirect
@@ -28,5 +32,4 @@ require (
 	golang.org/x/sys v0.35.0 // indirect
 	golang.org/x/tools v0.36.0 // indirect
 	golang.org/x/xerrors v0.0.0-20240903120638-7835f813f4da // indirect
-	gopkg.in/yaml.v3 v3.0.1 // indirect
 )

main.go

Lines changed: 74 additions & 0 deletions
@@ -40,6 +40,22 @@ type RateLimitFileConfig struct {
 
 type DuckLakeFileConfig struct {
 	MetadataStore string `yaml:"metadata_store"` // e.g., "postgres:host=localhost user=ducklake password=secret dbname=ducklake"
+	ObjectStore   string `yaml:"object_store"`   // e.g., "s3://bucket/path/" for S3/MinIO storage
+
+	// S3 credential provider: "config" (explicit) or "credential_chain" (AWS SDK)
+	S3Provider string `yaml:"s3_provider"`
+
+	// Config provider settings (explicit credentials)
+	S3Endpoint  string `yaml:"s3_endpoint"`   // e.g., "localhost:9000" for MinIO
+	S3AccessKey string `yaml:"s3_access_key"` // S3 access key ID
+	S3SecretKey string `yaml:"s3_secret_key"` // S3 secret access key
+	S3Region    string `yaml:"s3_region"`     // S3 region (default: us-east-1)
+	S3UseSSL    bool   `yaml:"s3_use_ssl"`    // Use HTTPS for S3 connections
+	S3URLStyle  string `yaml:"s3_url_style"`  // "path" or "vhost" (default: path)
+
+	// Credential chain provider settings (AWS SDK credential chain)
+	S3Chain   string `yaml:"s3_chain"`   // e.g., "env;config" - which credential sources to check
+	S3Profile string `yaml:"s3_profile"` // AWS profile name for config chain
 }
 
 // loadConfigFile loads configuration from a YAML file
@@ -177,6 +193,34 @@ func main() {
 		if fileCfg.DuckLake.MetadataStore != "" {
 			cfg.DuckLake.MetadataStore = fileCfg.DuckLake.MetadataStore
 		}
+		if fileCfg.DuckLake.ObjectStore != "" {
+			cfg.DuckLake.ObjectStore = fileCfg.DuckLake.ObjectStore
+		}
+		if fileCfg.DuckLake.S3Provider != "" {
+			cfg.DuckLake.S3Provider = fileCfg.DuckLake.S3Provider
+		}
+		if fileCfg.DuckLake.S3Endpoint != "" {
+			cfg.DuckLake.S3Endpoint = fileCfg.DuckLake.S3Endpoint
+		}
+		if fileCfg.DuckLake.S3AccessKey != "" {
+			cfg.DuckLake.S3AccessKey = fileCfg.DuckLake.S3AccessKey
+		}
+		if fileCfg.DuckLake.S3SecretKey != "" {
+			cfg.DuckLake.S3SecretKey = fileCfg.DuckLake.S3SecretKey
+		}
+		if fileCfg.DuckLake.S3Region != "" {
+			cfg.DuckLake.S3Region = fileCfg.DuckLake.S3Region
+		}
+		cfg.DuckLake.S3UseSSL = fileCfg.DuckLake.S3UseSSL
+		if fileCfg.DuckLake.S3URLStyle != "" {
+			cfg.DuckLake.S3URLStyle = fileCfg.DuckLake.S3URLStyle
+		}
+		if fileCfg.DuckLake.S3Chain != "" {
+			cfg.DuckLake.S3Chain = fileCfg.DuckLake.S3Chain
+		}
+		if fileCfg.DuckLake.S3Profile != "" {
+			cfg.DuckLake.S3Profile = fileCfg.DuckLake.S3Profile
+		}
 	}
 
 	// Apply environment variables (override config file)
@@ -200,6 +244,36 @@ func main() {
 	if v := os.Getenv("DUCKGRES_DUCKLAKE_METADATA_STORE"); v != "" {
 		cfg.DuckLake.MetadataStore = v
 	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_OBJECT_STORE"); v != "" {
+		cfg.DuckLake.ObjectStore = v
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_PROVIDER"); v != "" {
+		cfg.DuckLake.S3Provider = v
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_ENDPOINT"); v != "" {
+		cfg.DuckLake.S3Endpoint = v
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_ACCESS_KEY"); v != "" {
+		cfg.DuckLake.S3AccessKey = v
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_SECRET_KEY"); v != "" {
+		cfg.DuckLake.S3SecretKey = v
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_REGION"); v != "" {
+		cfg.DuckLake.S3Region = v
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_USE_SSL"); v == "true" || v == "1" {
+		cfg.DuckLake.S3UseSSL = true
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_URL_STYLE"); v != "" {
+		cfg.DuckLake.S3URLStyle = v
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_CHAIN"); v != "" {
+		cfg.DuckLake.S3Chain = v
+	}
+	if v := os.Getenv("DUCKGRES_DUCKLAKE_S3_PROFILE"); v != "" {
+		cfg.DuckLake.S3Profile = v
+	}
 
 	// Apply CLI flags (highest priority)
 	if *host != "" {
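
The `main.go` hunks apply settings in layers: config file first, then environment variables, then CLI flags. One detail worth noting in the diff as shown: `DUCKGRES_DUCKLAKE_S3_USE_SSL` only ever sets the flag to `true` (for the values `true` or `1`), so an environment value of `false` does not override `s3_use_ssl: true` from the file. The sketch below shows one possible variant that lets the environment override in both directions by using `strconv.ParseBool`. The helper names `overrideString` and `overrideBool` and the injected `lookupFunc` type are hypothetical and not part of this commit.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc matches the signature of os.LookupEnv so tests can inject a fake.
type lookupFunc func(key string) (string, bool)

// overrideString replaces *dst when the variable is set and non-empty,
// the same rule the diff applies to string settings.
func overrideString(dst *string, lookup lookupFunc, key string) {
	if v, ok := lookup(key); ok && v != "" {
		*dst = v
	}
}

// overrideBool replaces *dst when the variable is set to any value accepted
// by strconv.ParseBool (e.g. "true", "false", "1", "0"). An unset or empty
// variable leaves *dst untouched; an unparseable value is reported.
func overrideBool(dst *bool, lookup lookupFunc, key string) error {
	v, ok := lookup(key)
	if !ok || v == "" {
		return nil
	}
	b, err := strconv.ParseBool(v)
	if err != nil {
		return fmt.Errorf("%s: %w", key, err)
	}
	*dst = b
	return nil
}

func main() {
	// Pretend these values came from the config file.
	region := "us-east-1"
	useSSL := true

	overrideString(&region, os.LookupEnv, "DUCKGRES_DUCKLAKE_S3_REGION")
	if err := overrideBool(&useSSL, os.LookupEnv, "DUCKGRES_DUCKLAKE_S3_USE_SSL"); err != nil {
		fmt.Println("invalid setting:", err)
		return
	}
	fmt.Println(region, useSSL)
}
```

Passing the lookup function as a parameter keeps the precedence logic testable without mutating the process environment.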
