Commit 3bbdda5

refactor: Migrate to cpp based duckdb layout (#5)
* Save work Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix read Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix Signed-off-by: Xuanwo <github@xuanwo.io>
* Refactor code Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix build Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix reader Signed-off-by: Xuanwo <github@xuanwo.io>
* Remove locks Signed-off-by: Xuanwo <github@xuanwo.io>
* Refactor Signed-off-by: Xuanwo <github@xuanwo.io>
* Update claude Signed-off-by: Xuanwo <github@xuanwo.io>
* Format code Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix CI Signed-off-by: Xuanwo <github@xuanwo.io>
* Make clippy happy Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix CI Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix Ci Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix cmake Signed-off-by: Xuanwo <github@xuanwo.io>
* Use release for test Signed-off-by: Xuanwo <github@xuanwo.io>
* try fix ci Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix for windows Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix CMake Signed-off-by: Xuanwo <github@xuanwo.io>
* Add target dir Signed-off-by: Xuanwo <github@xuanwo.io>
* Save work Signed-off-by: Xuanwo <github@xuanwo.io>
* Try CI Signed-off-by: Xuanwo <github@xuanwo.io>
* Disable wasm Signed-off-by: Xuanwo <github@xuanwo.io>
* We will tidy by ourselves Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix CI Signed-off-by: Xuanwo <github@xuanwo.io>
* Fix CI Signed-off-by: Xuanwo <github@xuanwo.io>

---------

Signed-off-by: Xuanwo <github@xuanwo.io>
1 parent 4fc98c8 commit 3bbdda5

32 files changed

Lines changed: 1592 additions & 1377 deletions
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+name: Main Extension Distribution Pipeline
+on:
+  push:
+    branches: [main]
+  pull_request:
+  workflow_dispatch:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || '' }}-${{ github.base_ref || '' }}-${{ github.ref != 'refs/heads/main' && github.sha || '' }}
+  cancel-in-progress: true
+
+jobs:
+  duckdb-stable-build:
+    name: Build extension binaries
+    uses: duckdb/extension-ci-tools/.github/workflows/_extension_distribution.yml@v1.3.2
+    with:
+      duckdb_version: v1.3.2
+      ci_tools_version: v1.3.2
+      extension_name: lance
+      exclude_archs: "wasm_mvp;wasm_eh;wasm_threads"
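
The `concurrency.group` expression in the workflow above is dense, so here is a rough sketch of how the key is assembled. It assumes that `github.head_ref || ''` and `github.base_ref || ''` simply yield an empty string when the value is unset, and that `github.ref != 'refs/heads/main' && github.sha || ''` reduces to a plain conditional. The `GithubContext` struct and `concurrency_group` function are hypothetical names used only for illustration; they are not part of this commit or of GitHub Actions.

```rust
/// Hypothetical stand-in for the fields the workflow reads from the `github` context.
struct GithubContext<'a> {
    workflow: &'a str, // github.workflow
    git_ref: &'a str,  // github.ref
    head_ref: &'a str, // github.head_ref (assumed empty outside pull requests)
    base_ref: &'a str, // github.base_ref (assumed empty outside pull requests)
    sha: &'a str,      // github.sha
}

/// Rebuilds the concurrency group string the way the expression reads:
/// five parts joined by `-`, where the last part is the commit SHA
/// unless the ref is `refs/heads/main`, in which case it is empty.
fn concurrency_group(ctx: &GithubContext) -> String {
    let sha_part = if ctx.git_ref != "refs/heads/main" {
        ctx.sha
    } else {
        ""
    };
    format!(
        "{}-{}-{}-{}-{}",
        ctx.workflow, ctx.git_ref, ctx.head_ref, ctx.base_ref, sha_part
    )
}
```

Under these assumptions, two pushes to `main` share one group key, so `cancel-in-progress: true` applies between them, while runs on any other ref carry the commit SHA in their key.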

.github/workflows/build.yml

Lines changed: 0 additions & 71 deletions
This file was deleted.

.gitmodules

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+[submodule "duckdb"]
+	path = duckdb
+	url = https://github.com/duckdb/duckdb
+	branch = main
 [submodule "extension-ci-tools"]
 	path = extension-ci-tools
 	url = https://github.com/duckdb/extension-ci-tools
+	branch = main

CLAUDE.md

Lines changed: 11 additions & 85 deletions
@@ -12,27 +12,29 @@ This is a DuckDB extension written in Rust that enables native SQL querying of L
 ```bash
 # Initial setup (only needed once)
 git submodule update --init --recursive
-make configure
 
 # Build commands
-make release # Production build → build/release/lance.duckdb_extension
-make debug # Debug build → build/debug/lance.duckdb_extension
-make clean # Clean build artifacts
-make clean_all # Clean everything including configure
+GEN=ninja make release # Production build → build/release/lance.duckdb_extension
+GEN=ninja make debug # Debug build → build/debug/lance.duckdb_extension
+GEN=ninja make clean # Clean build artifacts
+GEN=ninja make clean_all # Clean everything including configure
 
 # Quick Rust checks (without full build)
 cargo check
 cargo clippy --all-targets --all-features
 ```
 
 ### Testing
+
+release build can be slow, use `test_debug` for quick test.
+
 ```bash
 # Run all tests (builds release and runs sqllogictest)
-make test
+GEN=ninja make test
 
 # Run with specific build
-make test_debug # Test with debug build
-make test_release # Test with release build
+GEN=ninja make test_debug # Test with debug build
+GEN=ninja make test_release # Test with release build
 
 # Run DuckDB with extension for manual testing
 duckdb -unsigned -c "LOAD 'build/release/lance.duckdb_extension'; SELECT * FROM lance_scan('test/test_data.lance');"
@@ -41,7 +43,7 @@ duckdb -unsigned -c "LOAD 'build/release/lance.duckdb_extension'; SELECT * FROM
 ### Development Iteration
 ```bash
 # Fast iteration cycle
-cargo build --release && make test_release
+cargo build --release && make test_debug
 
 # Check for issues without full build
 cargo clippy --all-targets --all-features
@@ -73,59 +75,11 @@ The extension follows a three-layer architecture:
 The extension uses different names to avoid conflicts:
 - **Extension name**: `lance` (what users see)
 - **Rust crate name**: `lance_duckdb` (avoids crate conflict)
-- **Entry point**: `lance_init_c_api` (generated from extension name)
-
-This is controlled in `Makefile`:
-```makefile
-EXTENSION_NAME=lance
-RUST_CRATE_NAME=lance_duckdb
-```
-
-#### Async Bridge Pattern
-Lance uses async APIs while DuckDB extensions are synchronous:
-```rust
-// Create runtime in init
-let runtime = Arc::new(Runtime::new()?);
-
-// Block on async operations
-let dataset = runtime.block_on(async {
-    Dataset::open(&path).await
-})?;
-```
-
-#### Current Data Loading Strategy
-**Important**: Currently loads ALL data into memory during `init()`:
-```rust
-// In LanceScanInitData
-batches: Arc<Mutex<Vec<RecordBatch>>>, // All data loaded here
-```
-
-This works for small-medium datasets but needs streaming for production.
-
-### Dependency Version Constraints
-
-**Critical**: Arrow versions MUST match exactly between Lance and DuckDB:
-- Lance 0.32.1 → Arrow 55.1
-- No version ranges allowed (use exact versions)
-
-### Known Limitations
-
-1. **Replacement Scan**: Not implemented due to `duckdb-rs` API limitations
-   - Users must use `lance_scan('file.lance')` instead of `FROM 'file.lance'`
-   - Requires access to raw database handle not exposed by duckdb-rs
-
-2. **Type Conversion**: Currently simplified to strings
-   - Production needs direct Arrow→DuckDB memory mapping
-
-3. **Memory Usage**: Loads entire dataset into memory
-   - Needs streaming implementation for large datasets
 
 ## Test Data & Testing
 
 ### Test Dataset
 Location: `test/test_data.lance`
-- 5 records: id (1-5), name (Alice-Eve), age (25-45), score (78.5-95.5)
-- Created by: `cargo run --example create_test_data`
 
 ### Test Format
 Uses DuckDB's sqllogictest format in `test/sql/`:
@@ -135,37 +89,9 @@ Uses DuckDB's sqllogictest format in `test/sql/`:
 
 ## Common Issues & Solutions
 
-### Build Failures
-1. **Cargo hangs**: Kill with `pkill -9 cargo rustc`, then `make clean`
-2. **Version mismatch**: Check `TARGET_DUCKDB_VERSION=v1.3.2` in Makefile
-3. **Missing symbols**: Ensure `USE_UNSTABLE_C_API=1` is set
-
 ### Extension Loading
 ```sql
 -- Always use -unsigned flag for local builds
 duckdb -unsigned
 LOAD 'build/release/lance.duckdb_extension';
 ```
-
-### Type Errors
-Current implementation converts to strings. If seeing type mismatches, check:
-1. Arrow schema extraction in `bind()`
-2. Type mapping in `types.rs`
-3. Data conversion in `func()`
-
-## Future Improvements Priority
-
-1. **High Priority**
-   - Streaming reads (replace Vec<RecordBatch> with iterator)
-   - Proper Arrow→DuckDB type mapping
-   - Predicate pushdown to Lance
-
-2. **Medium Priority**
-   - Replacement scan when API available
-   - Projection pushdown
-   - Better error messages
-
-3. **Low Priority**
-   - Write support (COPY TO)
-   - Vector index integration
-   - Statistics for query optimization

CMakeLists.txt

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+cmake_minimum_required(VERSION 3.5)
+
+# Set extension name here
+set(TARGET_NAME lance)
+
+# No external dependencies for now
+
+set(EXTENSION_NAME ${TARGET_NAME}_extension)
+set(LOADABLE_EXTENSION_NAME ${TARGET_NAME}_loadable_extension)
+
+project(${TARGET_NAME})
+include_directories(src/include)
+
+set(EXTENSION_SOURCES src/lance_extension.cpp src/lance_scan.cpp)
+
+build_static_extension(${TARGET_NAME} ${EXTENSION_SOURCES})
+build_loadable_extension(${TARGET_NAME} " " ${EXTENSION_SOURCES})
+
+# No external libraries to link
+
+install(
+  TARGETS ${EXTENSION_NAME}
+  EXPORT "${DUCKDB_EXPORT_SET}"
+  LIBRARY DESTINATION "${INSTALL_LIB_DIR}"
+  ARCHIVE DESTINATION "${INSTALL_LIB_DIR}")
