ATTACH (TYPE LANCE) on a dataset with lance.json (pa.json_()) columns fails with "DuckDB does not support Strings over 4GB" on any query that projects the JSON column #201

Environment

  • duckdb CLI: v1.4.4
  • DuckDB lance extension: version 7d49b91 (core, not community)
  • pylance: 4.0.1
  • lancedb: 0.30.2
  • Lance file format: 2.2 (data_storage_version="2.2" / storage_options={"new_table_data_storage_version": "2.2"})
  • OS: macOS (Darwin 24.6.0, arm64)
  • Python: 3.14

Symptom

ATTACH succeeds. SHOW ALL TABLES works. Selecting non-JSON columns works. Any query that projects the JSON column — or even operates on it (octet_length, CAST, hex, substring) — fails:

Conversion Error: DuckDB does not support Strings over 4GB

The column is exposed as BLOB in DuckDB, but the values are not actually oversized; the test rows are tens of bytes each. The error appears to be a misclassification at the Arrow→DuckDB bridge: the lance.json encoding (large_binary with ARROW:extension:name="lance.json") is handed to DuckDB's string path without the length/bounds normalization that regular LargeBinary → BLOB goes through (see the control sketch after the reproducer output below).

Minimal reproducer

  import lancedb, pyarrow as pa, json, subprocess, tempfile

  with tempfile.TemporaryDirectory() as d:
      db = lancedb.connect(
          d,
          storage_options={"new_table_data_storage_version": "2.2"},
      )
      schema = pa.schema([
          pa.field("id", pa.int64()),
          pa.field("attrs", pa.json_()),
      ])
      rows = [
          {"id": i, "attrs": json.dumps({"amount": i * 10, "kind": "x" if i % 2 else "y"})}
          for i in range(1, 6)
      ]
      db.create_table("t", data=rows, schema=schema)

      # Verify the on-disk schema is the lance.json extension type
      # (large_binary storage + ARROW:extension:name == b'lance.json')
      tbl = db.open_table("t")
      print(tbl.schema.field("attrs").metadata)
      # => {b'ARROW:extension:name': b'lance.json', b'ARROW:extension:metadata': b''}

      # SHOW ALL TABLES works — exposed as BLOB
      print(subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SHOW ALL TABLES;"],
          capture_output=True, text=True).stdout)
      # => id BIGINT, attrs BLOB  ✅

      # SELECT on non-JSON column works
      print(subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id FROM ldb.t;"],
          capture_output=True, text=True).stdout)
      # => 1..5  ✅

      # SELECT including attrs fails
      r = subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id, attrs FROM ldb.t;"],
          capture_output=True, text=True)
      print("stderr:", r.stderr)
      # => Conversion Error: DuckDB does not support Strings over 4GB  ❌

      # All of these fail with the same error:
      for expr in [
          "attrs",
          "octet_length(attrs)",
          "CAST(attrs AS VARCHAR)",
          "hex(attrs)",
          "substring(CAST(attrs AS VARCHAR), 1, 10)",
      ]:
          r = subprocess.run(
              ["duckdb", "-c",
               f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); SELECT id, {expr} FROM ldb.t;"],
              capture_output=True, text=True)
          print(expr, "->", r.stderr.strip().splitlines()[0])

Expected output of that last loop:
attrs -> Conversion Error: DuckDB does not support Strings over 4GB
octet_length(attrs) -> Conversion Error: DuckDB does not support Strings over 4GB
CAST(attrs AS VARCHAR) -> Conversion Error: DuckDB does not support Strings over 4GB
hex(attrs) -> Conversion Error: DuckDB does not support Strings over 4GB
substring(CAST(attrs AS VARCHAR), 1, 10) -> Conversion Error: DuckDB does not support Strings over 4GB
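
For contrast, here is a control case supporting the bridge-misclassification reading: the same payloads stored as plain pa.large_binary(), with no extension annotation, should project cleanly. This is a hedged sketch reusing the reproducer's setup; the table and variable names are illustrative:

  # Control case: same bytes, plain large_binary, no lance.json annotation.
  with tempfile.TemporaryDirectory() as d2:
      db2 = lancedb.connect(
          d2,
          storage_options={"new_table_data_storage_version": "2.2"},
      )
      schema2 = pa.schema([
          pa.field("id", pa.int64()),
          pa.field("attrs", pa.large_binary()),
      ])
      rows2 = [
          {"id": i, "attrs": json.dumps({"amount": i * 10},
                                        separators=(",", ":")).encode()}
          for i in range(1, 6)
      ]
      db2.create_table("t2", data=rows2, schema=schema2)
      r = subprocess.run(
          ["duckdb", "-c",
           f"LOAD lance; ATTACH '{d2}' AS ldb (TYPE LANCE); SELECT id, attrs FROM ldb.t2;"],
          capture_output=True, text=True)
      print(r.stdout)  # expected: five BLOB rows, no conversion error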

Data that actually lives on disk

Payloads are tiny: the largest attrs value is {"amount":50,"kind":"y"} (24 bytes). So the "over 4GB" text is misleading; it's a type-path error, not a size problem.

Reading the dataset directly via pylance returns the values as JSON strings without issue:

ds = tbl.to_lance()
ds.to_table().column("attrs").to_pylist()

=> ['{"amount":10,"kind":"x"}', '{"amount":20,"kind":"y"}', ...]

And DataFusion SQL against the same dataset works perfectly:

ds.sql("SELECT json_extract(attrs, '$.amount') AS a FROM dataset")
.build().to_stream_reader().read_all().to_pydict()

=> {'a': ['10', '20', '30', '40', '50']}

So the issue is specifically the DuckDB extension's handling of the lance.json / arrow.json extension-typed column.
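
That can be cross-checked from Python: stripping the extension annotation and handing the raw storage to DuckDB via its Arrow integration makes the same bytes queryable. A hedged workaround sketch (assumes the ds handle above and the duckdb Python package; if pylance already surfaces attrs as plain strings, the cast is a no-op):

import duckdb

arrow_tbl = ds.to_table()
attrs = arrow_tbl.column("attrs").combine_chunks()
# Drop any extension annotation and cast the raw bytes to a string column
# (the payloads are valid UTF-8 JSON text).
storage = attrs.storage if isinstance(attrs, pa.ExtensionArray) else attrs
idx = arrow_tbl.schema.get_field_index("attrs")
stripped = arrow_tbl.set_column(
    idx, pa.field("attrs", pa.large_string()),
    storage.cast(pa.large_string()))

con = duckdb.connect()
con.register("t", stripped)
print(con.sql("SELECT id, attrs FROM t").fetchall())
# expected: [(1, '{"amount":10,"kind":"x"}'), (2, '{"amount":20,"kind":"y"}'), ...]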

Expected behavior

One of:

  1. Expose the column as VARCHAR (or JSON): lance.json values are UTF-8 JSON text, even though the underlying storage type is large_binary.
  2. Expose it as BLOB but handle it correctly in projections and scalar functions that accept BLOB / VARCHAR casts.
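
For illustration, with option 1 the obvious downstream usage would work unchanged. This is a hedged sketch of post-fix behavior only (json_extract_string is a standard DuckDB JSON function; attrs arriving as VARCHAR is the proposal, not the current behavior):

  # Hypothetical once fixed (option 1): attrs surfaces as VARCHAR/JSON.
  r = subprocess.run(
      ["duckdb", "-c",
       f"LOAD lance; ATTACH '{d}' AS ldb (TYPE LANCE); "
       "SELECT id, json_extract_string(attrs, '$.kind') AS kind FROM ldb.t;"],
      capture_output=True, text=True)
  print(r.stdout)  # would list kind = x / y per row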
