Commit d5133a8

feat: add lance_dataset_drop_columns for metadata-only column removal (#42)
## Summary

First of three PRs against #41 (schema evolution). Exposes upstream's `drop_columns`, a metadata-only manifest commit that removes the named columns from the schema without rewriting any data files. Materializing the projection is left to a later `_compact_files` (and a future cleanup operation, once exposed, removes the old version's files).

Mutates the dataset in place under an exclusive write lock; scanners already in flight keep their pre-drop snapshot view via the existing Arc clone-on-write, same as `_delete` / `_update` / `_compact_files`.

## Surface

```c
int32_t lance_dataset_drop_columns(
    LanceDataset* dataset,
    const char* const* columns,
    size_t num_columns
);
```

Inputs are validated up front with per-index error messages so the precise cause is observable from `lance_last_error_message()`. NULL handle, NULL pointer array, zero count, NULL or empty-string entries, and non-UTF-8 names all return `LANCE_ERR_INVALID_ARGUMENT`; upstream's own rejections (unknown column, attempt to drop every column) map to the same code.

The C++ wrapper takes `const std::vector<std::string>&` and follows the `update` / `merge_insert` sibling convention: it passes `col_ptrs.data()` unconditionally. An empty vector flows through the Rust-side `num_columns == 0` guard so the error message says "num_columns must be > 0" rather than the misleading "columns must not be NULL".

## Tests

Eleven new Rust integration tests covering single-drop, multi-drop, version bump, data preservation (downcasts the surviving Arrow columns and checks the actual values, not just shape), and the full rejection surface (NULL dataset / NULL array / zero count / NULL entry / empty-string entry / unknown column / drop-all).

C and C++ smoke tests snapshot `ArrowSchema.n_children` pre/post drop, exercise the drop-last-column rejection path, and verify the version is unchanged when a drop fails.

`cargo test` and `cargo test --test compile_and_run_test -- --ignored` both green.

## Follow-ups

- `lance_dataset_alter_columns`: rename / nullability / type change
- `lance_dataset_add_columns`: SQL expressions / AllNulls / ArrowArrayStream

The README roadmap entry stays unticked until all three ship.
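
To make the "metadata-only" wording in the summary concrete, here is a minimal, self-contained Rust sketch of a toy manifest. `ToyManifest`, `DropError`, and this `drop_columns` function are hypothetical names invented for illustration; they are not Lance's types and say nothing about the real manifest layout. The sketch only models the behavior the description states: a successful drop yields a new version with fewer fields and the same data files, while unknown columns and a drop-all attempt are rejected and leave the current version untouched.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for a dataset manifest (hypothetical, not Lance's format):
/// a version number, the visible schema, and the data files it points at.
#[derive(Clone, Debug, PartialEq)]
struct ToyManifest {
    version: u64,
    fields: Vec<String>,
    data_files: Vec<String>,
}

/// Rejections modeled after the ones listed in the description.
#[derive(Debug, PartialEq)]
enum DropError {
    UnknownColumn(String),
    WouldDropAll,
}

/// Build the next manifest with `columns` removed from the schema.
/// `data_files` is carried over unchanged, which is the sense in which the
/// operation is metadata-only in this model.
fn drop_columns(current: &ToyManifest, columns: &[&str]) -> Result<ToyManifest, DropError> {
    let to_drop: BTreeSet<&str> = columns.iter().copied().collect();

    // Every requested name must exist in the current schema.
    for name in &to_drop {
        if !current.fields.iter().any(|f| f.as_str() == *name) {
            return Err(DropError::UnknownColumn((*name).to_string()));
        }
    }

    let remaining: Vec<String> = current
        .fields
        .iter()
        .filter(|f| !to_drop.contains(f.as_str()))
        .cloned()
        .collect();

    // At least one field has to survive the drop.
    if remaining.is_empty() {
        return Err(DropError::WouldDropAll);
    }

    Ok(ToyManifest {
        version: current.version + 1,
        fields: remaining,
        data_files: current.data_files.clone(),
    })
}
```

Because the function returns a new value instead of mutating its input, a failed drop trivially leaves the version unchanged, which mirrors what the smoke tests are said to verify.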
1 parent bd01a95 commit d5133a8

7 files changed

Lines changed: 565 additions & 0 deletions

include/lance/lance.h

Lines changed: 31 additions & 0 deletions
@@ -453,6 +453,37 @@ int32_t lance_dataset_compact_files(
     LanceCompactionMetrics* out_metrics
 );
 
+/* ─── lance_dataset_drop_columns ──────────────────────────────────────────── */
+
+/**
+ * Drop one or more columns from the dataset's schema, committing a new
+ * manifest. This is a metadata-only operation: the data files on storage
+ * are not rewritten until a later `lance_dataset_compact_files` call
+ * materializes the projection (after which the previous version's files
+ * can be removed by a future cleanup operation).
+ *
+ * Mutates `dataset` in place — the same handle remains valid afterward
+ * and sees the new version. Scanners already in flight against this
+ * dataset keep their pre-drop schema view.
+ *
+ * @param dataset      Open dataset (not consumed). Mutated in place to
+ *                     see the new version. Must not be NULL.
+ * @param columns      Array of NUL-terminated UTF-8 column names to drop.
+ *                     Must not be NULL; entries must be non-NULL and
+ *                     non-empty.
+ * @param num_columns  Length of `columns`. Must be > 0.
+ * @return 0 on success, -1 on error. Error codes:
+ *   LANCE_ERR_INVALID_ARGUMENT for NULL/empty inputs, NULL or empty
+ *   entries, non-UTF-8 column names, unknown columns, or an attempt
+ *   to drop every column;
+ *   LANCE_ERR_COMMIT_CONFLICT for a concurrent writer.
+ */
+int32_t lance_dataset_drop_columns(
+    LanceDataset* dataset,
+    const char* const* columns,
+    size_t num_columns
+);
+
 /**
  * Export the dataset schema via Arrow C Data Interface.
  * @param out  Pointer to caller-allocated ArrowSchema struct
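
The header comment says scanners already in flight keep their pre-drop schema view while the handle moves to the new version. One way to picture that is the sketch below, which assumes the handle wraps a lock around a shared, immutable snapshot. `ToyHandle` and its methods are hypothetical stand-ins, and using `RwLock<Arc<...>>` with `Arc::make_mut` is an assumption about how clone-on-write could be arranged, not a description of the crate's `LanceDataset` internals.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the handle's inner state: a lock around a
/// shared snapshot of the schema (a plain list of field names here).
struct ToyHandle {
    inner: RwLock<Arc<Vec<String>>>,
}

impl ToyHandle {
    fn new(fields: &[&str]) -> Self {
        let fields: Vec<String> = fields.iter().map(|f| f.to_string()).collect();
        Self {
            inner: RwLock::new(Arc::new(fields)),
        }
    }

    /// What a scanner might do when it starts: clone the Arc (cheap),
    /// not the schema itself, and release the read lock right away.
    fn snapshot(&self) -> Arc<Vec<String>> {
        let guard = self.inner.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Writer path: take the exclusive lock, then mutate through
    /// `Arc::make_mut`, which first copies the schema if any earlier
    /// snapshot still shares it.
    fn drop_column(&self, name: &str) {
        let mut guard = self.inner.write().unwrap();
        Arc::make_mut(&mut *guard).retain(|f| f.as_str() != name);
    }
}
```

In this model a snapshot taken before `drop_column` still lists every field, while a snapshot taken afterward sees the reduced schema; the two no longer point at the same allocation.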

include/lance/lance.hpp

Lines changed: 26 additions & 0 deletions
@@ -437,6 +437,32 @@ class Dataset {
         return metrics;
     }
 
+    /// Drop columns from the dataset's schema and commit a new manifest.
+    /// Metadata-only — data files remain until a later `compact_files()`
+    /// call rewrites them. Mutates this dataset in place; the handle
+    /// continues to point at the new version.
+    ///
+    /// `columns` must be non-empty. Throws lance::Error on failure (empty
+    /// list, unknown column, attempt to drop every column, commit
+    /// conflict, ...).
+    void drop_columns(const std::vector<std::string>& columns) {
+        std::vector<const char*> col_ptrs;
+        col_ptrs.reserve(columns.size());
+        for (const auto& c : columns) {
+            col_ptrs.push_back(c.c_str());
+        }
+        // Pass `col_ptrs.data()` unconditionally — matches the `update`
+        // and `merge_insert` siblings whose inputs are also required to
+        // be non-empty. The Rust layer rejects `num_columns == 0` before
+        // dereferencing the pointer, so an empty vector still surfaces
+        // INVALID_ARGUMENT with the precise "num_columns must be > 0"
+        // message rather than the misleading "columns must not be NULL".
+        if (lance_dataset_drop_columns(
+                handle_.get(), col_ptrs.data(), columns.size()) != 0) {
+            check_error();
+        }
+    }
+
     /// Export the schema as an Arrow C Data Interface struct.
     void schema(ArrowSchema* out) const {
         if (lance_dataset_schema(handle_.get(), out) != 0) {

src/drop_columns.rs

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: Apache-2.0
+// SPDX-FileCopyrightText: Copyright The Lance Authors
+
+//! Drop columns C API: remove columns from the dataset's schema, committing a
+//! new manifest. Metadata-only — data files are not rewritten until a later
+//! `lance_dataset_compact_files` call materializes the projection.
+//!
+//! Mutates the dataset in place under an exclusive write lock; existing
+//! scanners that already cloned the inner Arc keep their pre-drop schema view.
+
+use std::ffi::c_char;
+
+use lance_core::Result;
+use snafu::location;
+
+use crate::dataset::LanceDataset;
+use crate::error::ffi_try;
+use crate::helpers;
+use crate::runtime::block_on;
+
+/// Drop one or more columns from the dataset's schema and commit a new
+/// manifest. This is a metadata-only operation: data files remain on storage
+/// until they are rewritten by `lance_dataset_compact_files` (and then
+/// cleaned up by version cleanup).
+///
+/// - `dataset`: Open dataset (mutated; same handle remains valid afterward).
+///   Must not be NULL.
+/// - `columns`: Pointer to an array of NUL-terminated C strings naming the
+///   columns to drop. Must not be NULL. Entries must be non-NULL and
+///   non-empty UTF-8.
+/// - `num_columns`: Length of the `columns` array. Must be non-zero.
+///
+/// Returns 0 on success, -1 on error. Error codes:
+/// `LANCE_ERR_INVALID_ARGUMENT` for NULL/empty args, NULL or empty entries,
+/// non-UTF-8 names, unknown columns, or an attempt to drop every column
+/// (upstream rejects that since a Lance dataset must retain at least one
+/// field). `LANCE_ERR_COMMIT_CONFLICT` for a concurrent writer.
+#[unsafe(no_mangle)]
+pub unsafe extern "C" fn lance_dataset_drop_columns(
+    dataset: *mut LanceDataset,
+    columns: *const *const c_char,
+    num_columns: usize,
+) -> i32 {
+    ffi_try!(
+        unsafe { drop_columns_inner(dataset, columns, num_columns) },
+        neg
+    )
+}
+
+unsafe fn drop_columns_inner(
+    dataset: *mut LanceDataset,
+    columns: *const *const c_char,
+    num_columns: usize,
+) -> Result<i32> {
+    if dataset.is_null() {
+        return Err(lance_core::Error::InvalidInput {
+            source: "dataset must not be NULL".into(),
+            location: location!(),
+        });
+    }
+    if columns.is_null() {
+        return Err(lance_core::Error::InvalidInput {
+            source: "columns must not be NULL".into(),
+            location: location!(),
+        });
+    }
+    if num_columns == 0 {
+        return Err(lance_core::Error::InvalidInput {
+            source: "num_columns must be > 0".into(),
+            location: location!(),
+        });
+    }
+
+    // Materialize the column names up front so any per-index validation
+    // error fires before the dataset's write lock is taken — matches the
+    // pre-lock validation pattern used by `update.rs`.
+    let mut names: Vec<String> = Vec::with_capacity(num_columns);
+    for i in 0..num_columns {
+        // SAFETY: `columns` is non-NULL (checked above) and the caller
+        // guarantees the array has at least `num_columns` entries.
+        let entry = unsafe { *columns.add(i) };
+        // SAFETY: each entry is either NULL (rejected below) or a
+        // NUL-terminated C string the caller keeps alive for this call.
+        let name = unsafe { helpers::parse_c_string(entry)? }
+            .filter(|s| !s.is_empty())
+            .ok_or_else(|| lance_core::Error::InvalidInput {
+                source: format!("columns[{i}] must not be NULL or empty").into(),
+                location: location!(),
+            })?;
+        names.push(name.to_string());
+    }
+
+    // SAFETY: `dataset` is non-NULL (checked above) and the caller guarantees
+    // it points to a live `LanceDataset`. `with_mut` takes an exclusive
+    // write lock on the inner `Arc<Dataset>` before yielding `&mut Dataset`,
+    // so a shared `&*dataset` borrow here is sound — interior mutability
+    // is the synchronization point.
+    let ds = unsafe { &*dataset };
+    let names_refs: Vec<&str> = names.iter().map(String::as_str).collect();
+    ds.with_mut(|d| block_on(d.drop_columns(&names_refs)))?;
+    Ok(0)
+}
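
The pre-lock validation above can be tried out in isolation with a standard-library-only sketch. `collect_names` and `check` are hypothetical helpers written for this illustration; `collect_names` stands in for the checks in `drop_columns_inner` but is not the crate's `helpers::parse_c_string`, errors are plain strings instead of `lance_core::Error`, and the non-UTF-8 message text is an assumption since the diff does not show it.

```rust
use std::ffi::{c_char, CStr, CString};

/// Hypothetical stand-in for the pre-lock validation shown in the diff.
///
/// # Safety
/// `columns` must be NULL or point to `num_columns` readable pointers, each
/// NULL or a NUL-terminated string that stays alive for the call.
unsafe fn collect_names(
    columns: *const *const c_char,
    num_columns: usize,
) -> Result<Vec<String>, String> {
    // Same check order as the diff: NULL array first, then zero count.
    if columns.is_null() {
        return Err("columns must not be NULL".to_string());
    }
    if num_columns == 0 {
        return Err("num_columns must be > 0".to_string());
    }
    let mut names = Vec::with_capacity(num_columns);
    for i in 0..num_columns {
        // SAFETY: the caller guarantees `num_columns` readable entries.
        let entry = unsafe { *columns.add(i) };
        if entry.is_null() {
            return Err(format!("columns[{i}] must not be NULL or empty"));
        }
        // SAFETY: non-NULL entries are NUL-terminated per the contract.
        let name = unsafe { CStr::from_ptr(entry) }
            .to_str()
            // Message text assumed; the diff does not show the real one.
            .map_err(|_| format!("columns[{i}] is not valid UTF-8"))?;
        if name.is_empty() {
            return Err(format!("columns[{i}] must not be NULL or empty"));
        }
        names.push(name.to_owned());
    }
    Ok(names)
}

/// Safe wrapper for exercising the sketch: owns the C strings and passes
/// their pointers through. `None` becomes a NULL entry.
fn check(names: &[Option<&str>]) -> Result<Vec<String>, String> {
    let owned: Vec<Option<CString>> = names
        .iter()
        .map(|n| n.map(|s| CString::new(s).expect("no interior NUL")))
        .collect();
    let ptrs: Vec<*const c_char> = owned
        .iter()
        .map(|c| c.as_ref().map_or(std::ptr::null(), |c| c.as_ptr()))
        .collect();
    // SAFETY: `ptrs` holds exactly `ptrs.len()` entries and `owned`
    // outlives the call.
    unsafe { collect_names(ptrs.as_ptr(), ptrs.len()) }
}
```

With the checks in this order, the zero-count message is reached only when the array pointer itself is non-NULL; a NULL array with a zero count reports the NULL message. An empty Rust slice has a non-NULL (dangling) pointer, so `check(&[])` lands on the zero-count branch in this sketch.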

src/lib.rs

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,7 @@ mod batch;
 mod compact;
 mod dataset;
 mod delete;
+mod drop_columns;
 mod error;
 mod fragment_writer;
 mod helpers;
@@ -37,6 +38,7 @@ pub use batch::*;
 pub use compact::*;
 pub use dataset::*;
 pub use delete::*;
+pub use drop_columns::*;
 pub use error::{
     LanceErrorCode, lance_free_string, lance_last_error_code, lance_last_error_message,
 };
