Commit 1cf3329

Add TTL-based snapshot caching for DuckLake catalog
1 parent 80883e4 commit 1cf3329

4 files changed

Lines changed: 349 additions & 16 deletions

CLAUDE.md

Lines changed: 7 additions & 5 deletions
@@ -44,10 +44,11 @@ The codebase follows a layered architecture with clear separation of concerns:

 2. **DataFusion Integration Layer** (`src/catalog.rs`, `src/schema.rs`, `src/table.rs`)
 - Bridges DuckLake concepts to DataFusion's catalog system
-- `DuckLakeCatalog`: Implements `CatalogProvider`, uses dynamic metadata lookup (queries on every call to `schema()` and `schema_names()`)
+- `DuckLakeCatalog`: Implements `CatalogProvider`, uses dynamic metadata lookup with configurable snapshot resolution
 - `DuckLakeSchema`: Implements `SchemaProvider`, uses dynamic metadata lookup (queries on every call to `table()` and `table_names()`)
 - `DuckLakeTable`: Implements `TableProvider`, caches table structure and file lists at creation time
 - **No HashMaps**: Catalog and schema providers query metadata on-demand rather than caching
+- **Snapshot Resolution**: Configurable TTL (time-to-live) for balancing freshness and performance

 3. **Path Resolution** (`src/path_resolver.rs`)
 - Centralized utilities for parsing object store URLs and resolving hierarchical paths
@@ -77,7 +78,8 @@ The catalog uses a **pure dynamic lookup** approach with no caching at the catal
 - **DuckLakeCatalog** (`catalog.rs`):
 - `schema_names()`: Queries `list_schemas()` on every call
 - `schema()`: Queries `get_schema_by_name()` on every call
-- `new()`: O(1) - only fetches snapshot ID and data_path
+- `new()`: O(1) - only fetches data_path
+- **Snapshot Resolution**: Configurable via `SnapshotConfig`

 - **DuckLakeSchema** (`schema.rs`):
 - `table_names()`: Queries `list_tables()` on every call
@@ -92,12 +94,12 @@ The catalog uses a **pure dynamic lookup** approach with no caching at the catal
 **Benefits**:
 - O(1) memory usage regardless of catalog size
 - Fast catalog startup (no upfront schema/table listing)
-- Always fresh metadata (no stale cache issues)
-- Simple implementation (no cache invalidation logic)
+- Configurable freshness vs performance trade-off
+- Simple implementation (no complex cache invalidation logic)

 **Trade-offs**:
 - Small query overhead per metadata lookup (acceptable for read-only DuckDB connections)
-- Future optimization: Add optional caching layer via wrapper implementation
+- Snapshot resolution adds one SQL query per catalog operation (configurable via TTL)

 ### Data Flow

examples/basic_query.rs

Lines changed: 17 additions & 3 deletions
@@ -2,8 +2,9 @@
 //!
 //! This example demonstrates how to:
 //! 1. Create a DuckLake catalog from a DuckDB catalog file
-//! 2. Register it with DataFusion
-//! 3. Execute a simple SELECT query
+//! 2. Configure snapshot resolution with TTL (time-to-live)
+//! 3. Register it with DataFusion
+//! 4. Execute a simple SELECT query
 //!
 //! To run this example, you need:
 //! - A DuckDB database file with DuckLake tables
@@ -14,6 +15,8 @@
 use datafusion::execution::runtime_env::RuntimeEnv;
 use datafusion::prelude::*;
 use datafusion_ducklake::{DuckLakeCatalog, DuckdbMetadataProvider};
+// Uncomment when using custom snapshot config:
+// use datafusion_ducklake::SnapshotConfig;
 use object_store::ObjectStore;
 use object_store::aws::AmazonS3Builder;
 use std::env;
@@ -56,9 +59,20 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
 );
 runtime.register_object_store(&Url::parse("s3://ducklake-data/")?, s3);

-// Create the DuckLake catalog
+// Configure snapshot resolution behavior
+//
+// Option 1: Default configuration (TTL=0) - Always fresh, queries snapshot on every access
 let ducklake_catalog = DuckLakeCatalog::new(provider)?;

+// Option 2: Custom TTL - Balance freshness and performance
+// Caches snapshot for 5 seconds, then refreshes
+// let config = SnapshotConfig { ttl_seconds: Some(5) };
+// let ducklake_catalog = DuckLakeCatalog::new_with_config(provider, config)?;
+
+// Option 3: Cache forever - Maximum performance, snapshot frozen at catalog creation
+// let config = SnapshotConfig { ttl_seconds: None };
+// let ducklake_catalog = DuckLakeCatalog::new_with_config(provider, config)?;
+
 println!("✓ Connected to DuckLake catalog");

 let config = SessionConfig::new().with_default_catalog_and_schema("ducklake", "main");
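
Reviewer note: the diff above shows only how `SnapshotConfig` is used, not how the catalog applies the TTL. The following is a minimal standalone sketch of the three behaviours the example comments describe (`Some(0)` always refreshes, `Some(n)` reuses the snapshot for `n` seconds, `None` never refreshes). It is not the code from this commit: `SnapshotCache`, its `resolve` method, the `i64` snapshot ID type, and the injected `now` argument are all assumptions made for illustration, and the `SnapshotConfig` here is a local stand-in that only mirrors the `ttl_seconds` field seen in the diff.

```rust
use std::time::{Duration, Instant};

/// Local stand-in for the `SnapshotConfig` used in the example diff.
/// Only the `ttl_seconds` field is taken from the source; the rest is assumed.
#[derive(Debug, Clone, Copy)]
pub struct SnapshotConfig {
    /// Some(0): resolve on every access. Some(n): reuse for n seconds. None: never refresh.
    pub ttl_seconds: Option<u64>,
}

impl Default for SnapshotConfig {
    /// Mirrors the "Default configuration (TTL=0)" comment in the example.
    fn default() -> Self {
        Self { ttl_seconds: Some(0) }
    }
}

/// Hypothetical TTL cache around a snapshot ID.
pub struct SnapshotCache {
    config: SnapshotConfig,
    /// Last resolved snapshot ID and the time it was fetched.
    cached: Option<(i64, Instant)>,
}

impl SnapshotCache {
    pub fn new(config: SnapshotConfig) -> Self {
        Self { config, cached: None }
    }

    /// Returns the cached snapshot ID while it is still fresh; otherwise calls
    /// `fetch` (standing in for the SQL query) and stores the result.
    /// `now` is passed in so the behaviour can be tested without sleeping.
    pub fn resolve<F>(&mut self, now: Instant, fetch: F) -> i64
    where
        F: FnOnce() -> i64,
    {
        if let Some((id, fetched_at)) = self.cached {
            let fresh = match self.config.ttl_seconds {
                // No TTL: the first resolved snapshot is kept forever.
                None => true,
                // With ttl == 0 this comparison is never true, so every call refetches.
                Some(ttl) => now.duration_since(fetched_at) < Duration::from_secs(ttl),
            };
            if fresh {
                return id;
            }
        }
        let id = fetch();
        self.cached = Some((id, now));
        id
    }
}
```

One difference worth flagging: with `ttl_seconds: None` this sketch freezes the snapshot at the first `resolve` call, whereas the example comment says the snapshot is frozen at catalog creation. A real implementation would also need interior mutability (for example a lock around the cached value), since DataFusion's `CatalogProvider` methods take `&self`.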
