Environment
- duckdb CLI: v1.4.4
- DuckDB lance extension: version 7d49b91 (core, not community)
- pylance: 4.0.1
- lancedb: 0.30.2
- Lance file format: 2.2 (data_storage_version="2.2" / storage_options={"new_table_data_storage_version": "2.2"})
- OS: macOS (Darwin 24.6.0, arm64)
- Python: 3.14
Symptom
ATTACH succeeds. SHOW ALL TABLES works. Selecting non-JSON columns works. Any query that projects the JSON column — or even operates on it (octet_length, CAST, hex, substring) — fails:
Conversion Error: DuckDB does not support Strings over 4GB
The column is exposed as BLOB in DuckDB, but the values are not actually oversized — the test rows are tens of bytes each. The error appears to be a misclassification at the Arrow→DuckDB bridge: the lance.json encoding (large_binary with ARROW:extension:name="lance.json") is handed to DuckDB's string path without the length/bounds normalization that regular LargeBinary → BLOB goes through.
Minimal reproducer
import lancedb, pyarrow as pa, json, subprocess, tempfile

with tempfile.TemporaryDirectory() as d:
    db = lancedb.connect(
        d,
        storage_options={"new_table_data_storage_version": "2.2"},
    )
    schema = pa.schema([
        pa.field("id", pa.int64()),
        pa.field("attrs", pa.json_()),
    ])
    rows = [
        {"id": i, "attrs": json.dumps({"amount": i * 10, "kind": "x" if i % 2 else "y"})}
        for i in range(1, 6)
    ]
    db.create_table("t", data=rows, schema=schema)

    # Verify the on-disk schema is the lance.json extension type
    # (large_binary + ARROW:extension:name == b'lance.json')
    tbl = db.open_table("t")
    print(tbl.schema.field("attrs").metadata)
    # => {b'ARROW:extension:name': b'lance.json', b'ARROW:extension:metadata': b''}

    # SHOW ALL TABLES works — exposed as BLOB
    print(subprocess.run(
        ["duckdb", "-c",
         f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SHOW ALL TABLES;"],
        capture_output=True, text=True).stdout)
    # => id BIGINT, attrs BLOB ✅

    # SELECT on a non-JSON column works
    print(subprocess.run(
        ["duckdb", "-c",
         f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id FROM ldb.t;"],
        capture_output=True, text=True).stdout)
    # => 1..5 ✅

    # SELECT including attrs fails
    r = subprocess.run(
        ["duckdb", "-c",
         f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id, attrs FROM ldb.t;"],
        capture_output=True, text=True)
    print("stderr:", r.stderr)
    # => Conversion Error: DuckDB does not support Strings over 4GB ❌

    # All of these fail with the same error:
    for expr in [
        "attrs",
        "octet_length(attrs)",
        "CAST(attrs AS VARCHAR)",
        "hex(attrs)",
        "substring(CAST(attrs AS VARCHAR), 1, 10)",
    ]:
        r = subprocess.run(
            ["duckdb", "-c",
             f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id, {expr} FROM ldb.t;"],
            capture_output=True, text=True)
        print(expr, "->", r.stderr.strip().splitlines()[0])
Expected output of that last loop:
attrs -> Conversion Error: DuckDB does not support Strings over 4GB
octet_length(attrs) -> Conversion Error: DuckDB does not support Strings over 4GB
CAST(attrs AS VARCHAR) -> Conversion Error: DuckDB does not support Strings over 4GB
hex(attrs) -> Conversion Error: DuckDB does not support Strings over 4GB
substring(CAST(attrs AS VARCHAR), 1, 10) -> Conversion Error: DuckDB does not support Strings over 4GB
Data that actually lives on disk
Payloads are tiny; the largest attrs value is {"amount":50,"kind":"y"} (24 bytes). So the "over 4GB" message is misleading: it's a type-path error, not a size problem.
Reading the dataset directly via pylance returns the values as JSON strings without issue:
ds = tbl.to_lance()
ds.to_table().column("attrs").to_pylist()
=> ['{"amount":10,"kind":"x"}', '{"amount":20,"kind":"y"}', ...]
And DataFusion SQL against the same dataset works perfectly:
(ds.sql("SELECT json_extract(attrs, '$.amount') AS a FROM dataset")
   .build().to_stream_reader().read_all().to_pydict())
=> {'a': ['10', '20', '30', '40', '50']}
So the issue is specifically the DuckDB extension's handling of the lance.json / arrow.json extension-typed column.
Expected behavior
One of:
- Expose the column as VARCHAR (or JSON) — lance.json values are already UTF-8 JSON text; the underlying storage type is large_binary but the bytes are valid UTF-8.
- Expose it as BLOB but handle it correctly in projections and scalar functions that accept BLOB / VARCHAR casts.