ATTACH (TYPE LANCE) on a dataset with lance.json (pa.json_()) columns fails with "DuckDB does not support Strings over 4GB" on any query that projects the JSON column #201

Environment

  • duckdb CLI: v1.4.4
  • DuckDB lance extension: version 7d49b91 (core, not community)
  • pylance: 4.0.1
  • lancedb: 0.30.2
  • Lance file format: 2.2 (data_storage_version="2.2" / storage_options={"new_table_data_storage_version": "2.2"})
  • OS: macOS (Darwin 24.6.0, arm64)
  • Python: 3.14

Symptom

ATTACH succeeds. SHOW ALL TABLES works. Selecting non-JSON columns works. Any query that projects the JSON column — or even operates on it (octet_length, CAST, hex, substring) — fails:

Conversion Error: DuckDB does not support Strings over 4GB

The column is exposed as BLOB in DuckDB, but the values are not actually oversized; the test rows are tens of bytes each. The error appears to be a misclassification at the Arrow→DuckDB bridge: the lance.json encoding (large_binary with ARROW:extension:name="lance.json") is handed to DuckDB's string path without the length/bounds normalization that regular LargeBinary → BLOB goes through (see the control sketch after the reproducer output below).

Minimal reproducer

  import lancedb, pyarrow as pa, json, subprocess, tempfile

  with tempfile.TemporaryDirectory() as d:
      db = lancedb.connect(
          d,
          storage_options={"new_table_data_storage_version": "2.2"},
      )
      schema = pa.schema([
          pa.field("id", pa.int64()),
          pa.field("attrs", pa.json_()),
      ])
      rows = [
          {"id": i, "attrs": json.dumps({"amount": i * 10, "kind": "x" if i % 2 else "y"})}
          for i in range(1, 6)
      ]
      db.create_table("t", data=rows, schema=schema)

      # Verify the on-disk schema is the lance.json extension type
      # (large_binary storage + ARROW:extension:name == b'lance.json')
      tbl = db.open_table("t")
      print(tbl.schema.field("attrs").metadata)
      # => {b'ARROW:extension:name': b'lance.json', b'ARROW:extension:metadata': b''}

      # SHOW ALL TABLES works — exposed as BLOB
      print(subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SHOW ALL TABLES;"],
          capture_output=True, text=True).stdout)
      # => id BIGINT, attrs BLOB  ✅

      # SELECT on non-JSON column works
      print(subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id FROM ldb.t;"],
          capture_output=True, text=True).stdout)
      # => 1..5  ✅

      # SELECT including attrs fails
      r = subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id, attrs FROM ldb.t;"],
          capture_output=True, text=True)
      print("stderr:", r.stderr)
      # => Conversion Error: DuckDB does not support Strings over 4GB  ❌

      # All of these fail with the same error:
      for expr in [
          "attrs",
          "octet_length(attrs)",
          "CAST(attrs AS VARCHAR)",
          "hex(attrs)",
          "substring(CAST(attrs AS VARCHAR), 1, 10)",
      ]:
          r = subprocess.run(
              ["duckdb", "-c",
               f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id, {expr} FROM ldb.t;"],
              capture_output=True, text=True)
          print(expr, "->", r.stderr.strip().splitlines()[0])

Expected output of that last loop:
attrs -> Conversion Error: DuckDB does not support Strings over 4GB
octet_length(attrs) -> Conversion Error: DuckDB does not support Strings over 4GB
CAST(attrs AS VARCHAR) -> Conversion Error: DuckDB does not support Strings over 4GB
hex(attrs) -> Conversion Error: DuckDB does not support Strings over 4GB
substring(CAST(attrs AS VARCHAR), 1, 10) -> Conversion Error: DuckDB does not support Strings over 4GB
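
For contrast, here is a control case supporting the bridge-misclassification reading: the same payloads stored as plain pa.large_binary(), with no extension annotation, should project cleanly. This is a hedged sketch reusing the reproducer's setup; the table and variable names are illustrative:

  # Control case: same bytes, plain large_binary, no lance.json annotation.
  with tempfile.TemporaryDirectory() as d2:
      db2 = lancedb.connect(
          d2,
          storage_options={"new_table_data_storage_version": "2.2"},
      )
      schema2 = pa.schema([
          pa.field("id", pa.int64()),
          pa.field("attrs", pa.large_binary()),
      ])
      rows2 = [
          {"id": i, "attrs": json.dumps({"amount": i * 10},
                                        separators=(",", ":")).encode()}
          for i in range(1, 6)
      ]
      db2.create_table("t2", data=rows2, schema=schema2)
      r = subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d2}' AS ldb (TYPE LANCE); SELECT id, attrs FROM ldb.t2;"],
          capture_output=True, text=True)
      print(r.stdout)  # expected: five BLOB rows, no conversion error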

Data that actually lives on disk

Payloads are tiny: the largest attrs value is {"amount":50,"kind":"y"} (24 bytes). So the "over 4GB" text is misleading; it's a type-path error, not a size problem.

Reading the dataset directly via pylance returns the values as JSON strings without issue:

ds = tbl.to_lance()
ds.to_table().column("attrs").to_pylist()

=> ['{"amount":10,"kind":"x"}', '{"amount":20,"kind":"y"}', ...]

And DataFusion SQL against the same dataset works perfectly:

ds.sql("SELECT json_extract(attrs, '$.amount') AS a FROM dataset")
.build().to_stream_reader().read_all().to_pydict()

=> {'a': ['10', '20', '30', '40', '50']}

So the issue is specifically the DuckDB extension's handling of the lance.json / arrow.json extension-typed column.
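
That can be cross-checked from Python: stripping the extension annotation and handing the raw storage to DuckDB via its Arrow integration makes the same bytes queryable. A hedged workaround sketch (assumes the ds handle above and the duckdb Python package; if pylance already surfaces attrs as plain strings, the cast is a no-op):

import duckdb

arrow_tbl = ds.to_table()
attrs = arrow_tbl.column("attrs").combine_chunks()
# Drop any extension annotation and cast the raw bytes to a string column
# (the payloads are valid UTF-8 JSON text).
storage = attrs.storage if isinstance(attrs, pa.ExtensionArray) else attrs
idx = arrow_tbl.schema.get_field_index("attrs")
stripped = arrow_tbl.set_column(
    idx, pa.field("attrs", pa.large_string()),
    storage.cast(pa.large_string()))

con = duckdb.connect()
con.register("t", stripped)
print(con.sql("SELECT id, attrs FROM t").fetchall())
# expected: [(1, '{"amount":10,"kind":"x"}'), (2, '{"amount":20,"kind":"y"}'), ...]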

Expected behavior

One of:

  1. Expose the column as VARCHAR (or JSON): lance.json values are UTF-8 JSON text, even though the underlying storage type is large_binary.
  2. Expose it as BLOB but handle it correctly in projections and scalar functions that accept BLOB / VARCHAR casts.
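
For illustration, with option 1 the obvious downstream usage would work unchanged. This is a hedged sketch of post-fix behavior only (json_extract_string is a standard DuckDB JSON function; attrs arriving as VARCHAR is the proposal, not the current behavior):

  # Hypothetical once fixed (option 1): attrs surfaces as VARCHAR/JSON.
  r = subprocess.run(
      ["duckdb", "-c",
       f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); "
       "SELECT id, json_extract_string(attrs, '$.kind') AS kind FROM ldb.t;"],
      capture_output=True, text=True)
  print(r.stdout)  # would list kind = x / y per row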
