V1.5 variegata -> main #214

Closed
Dtenwolde wants to merge 6073 commits into main from v1.5-variegata

Conversation

@Dtenwolde
Collaborator

No description provided.

Laurens Kuiper and others added 30 commits February 13, 2026 14:17
Commit should have been lost in merge somewhere
… type) and use column data count instead of row group count for shredding
…nup relevant HTTP-transaction specific state
…ection - call SetAppendRequiresNewRowGroup so that new appends will go to a new row group
…imitive types to be inlined (duckdb#20951)

Follow-up / includes duckdb#20947 (actual
diff [here](
https://github.com/Mytherin/duckdb/compare/swapuntypedtyped...Mytherin:duckdb:optionaluntyped?expand=1))

When shredding variants, we store them using the following schema
(shredded on `STRUCT(a DATE, b INT)`):

```sql
shredded STRUCT(
	typed_value STRUCT(
		a STRUCT(
			typed_value DATE,
			untyped_value_index UINTEGER
		),
		b STRUCT(
			typed_value INT,
			untyped_value_index UINTEGER
		)
	),
	untyped_value_index UINTEGER
)
```

When they are fully shredded, `untyped_value_index` is entirely `NULL`.
As such, it doesn't contain any useful information. This PR reworks our
shredding and unshredding code to allow:

* (1) `untyped_value_index` to be left out entirely if the type is fully
shredded
* (2) `typed_value` to be inlined in the parent if `typed_value` has a
primitive (i.e. non-nested) type

If the above struct is fully shredded, we would instead get the
following representation:

```sql
shredded STRUCT(
	typed_value STRUCT(
		a DATE,
		b INT
	)
)
```

This doesn't save much storage space, given that the various
`untyped_value_index` layers are all stored as constant `NULL` in this
situation, but it saves us a bunch of unnecessary column metadata. More
importantly, this allows us to look only at the shredded schema instead
of looking at the variant stats to realize a field is fully shredded.
This is useful for simplifying shredded execution. We also save on
allocating many empty vectors for reading fully shredded data.
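As a rough illustration of the two rules, here is a small Python sketch. The dict-based schema encoding and the `simplify` helper are hypothetical, not DuckDB's internal representation:

```python
# Hypothetical sketch of the shredded-schema simplification described above.
# A schema node is a dict with a "typed_value" (either a primitive type name
# or a dict of child fields) and an "untyped_value_index" column.

def simplify(node, fully_shredded):
    """Apply the two new rules:
    1. drop untyped_value_index entirely when the type is fully shredded,
    2. inline typed_value into the parent when it has a primitive type."""
    typed = node["typed_value"]
    if isinstance(typed, dict):
        # nested typed_value: recurse into the children, keep the wrapper
        children = {name: simplify(child, fully_shredded)
                    for name, child in typed.items()}
        out = {"typed_value": children}
    elif fully_shredded:
        return typed  # rule 2: a primitive typed_value is inlined
    else:
        out = {"typed_value": typed}
    if not fully_shredded:
        out["untyped_value_index"] = "UINTEGER"  # rule 1 does not apply
    return out

# The example from the PR: a variant shredded on STRUCT(a DATE, b INT)
schema = {
    "typed_value": {
        "a": {"typed_value": "DATE", "untyped_value_index": "UINTEGER"},
        "b": {"typed_value": "INT", "untyped_value_index": "UINTEGER"},
    },
    "untyped_value_index": "UINTEGER",
}

# Fully shredded: the index columns disappear and a/b collapse to DATE/INT.
print(simplify(schema, fully_shredded=True))
# -> {'typed_value': {'a': 'DATE', 'b': 'INT'}}
```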

Noticed in duckdb-wasm, where the two types have different sizes.

The current way to enable (opt-in) and disable the PEG parser is not
really intuitive (`SET allow_parser_override_extension=strict`). So I've
added two options, `enable_peg_parser` and `disable_peg_parser`, which
are essentially aliases for the old `SET`.

Internally they call `SET allow_parser_override_extension=strict/default`:
`strict` = exclusively use the PEG parser,
`default` = exclusively use the old parser.
At least for now that's the default behaviour.
It also gives a nicer error message when the user tries this setting
while the `autocomplete` extension is not loaded.
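The new options can be thought of as thin aliases over the existing setting. A minimal sketch in Python; the `resolve` helper and the alias table are illustrative, only the option and setting names come from this PR:

```python
# Illustrative alias table: the two new options simply expand to the
# existing (less intuitive) SET statement.

ALIASES = {
    "enable_peg_parser": ("allow_parser_override_extension", "strict"),    # PEG parser only
    "disable_peg_parser": ("allow_parser_override_extension", "default"),  # old parser only
}

def resolve(option):
    """Translate a convenience option into the equivalent SET statement."""
    setting, value = ALIASES[option]
    return f"SET {setting}={value}"

print(resolve("enable_peg_parser"))
# -> SET allow_parser_override_extension=strict
```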
With PR duckdb/extension-ci-tools#306 integrated, it is now necessary to
specify both `linux_amd64_musl` and `linux_arm64_musl` as "opt-in" for
extension builds (they are now disabled by default).

Ref: duckdblabs/duckdb-internal#4616
pdet and others added 29 commits February 25, 2026 11:20
…uckdb#21069)

Three independent fixes:

* 1st commit, by @ccfelius: avoid requiring GetEncryptionUtil when
reading non-encrypted parquet files
* 2nd and 3rd commits, by me: given that in the read path the mbedTLS
encryption util (which is always available) is workable, attempt to
`auto-LOAD httpfs`, but do not `auto-INSTALL httpfs`
* 4th commit, together with @ccfelius: fix the error message to make
clear one might also get there while writing encrypted parquet files

After this PR, the behaviour will be as follows:

* when reading plain parquet files, httpfs is NOT required for
encryption (it might be needed for the file system)
* when writing plain parquet files, httpfs is NOT required for
encryption
* when reading encrypted parquet files, httpfs is loaded IFF it is
available locally AND autoload_known_extensions is set; otherwise we
fall back to mbedTLS (which might be slower but is always available)
* when writing encrypted parquet files, httpfs is loaded IFF it is
available locally (and the settings allow it); otherwise it's INSTALLED
and LOADED (according to the settings); otherwise we fail with a
relevant error message

The same behaviour should hold for plaintext or encrypted duckdb files.
Basically, writing a file is meant as an instruction to auto-install and
auto-load `httpfs` (after checking the settings), while in the read path
it's implicitly OK to load (to speed up decryption) but no install
should be attempted (that is, no networking should be triggered).
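The decision tree described above can be sketched as follows. The helper and its flags are illustrative; the real logic lives in DuckDB's parquet encryption code paths:

```python
# Hypothetical sketch of when httpfs is loaded/installed for parquet encryption.

def pick_encryption_util(writing, encrypted, httpfs_local, autoload, autoinstall):
    """Return the encryption util (and actions) for one read/write request."""
    if not encrypted:
        return "none required"            # plain files never need httpfs for encryption
    if httpfs_local and autoload:
        return "httpfs (LOAD only)"       # available locally: just load, no networking
    if not writing:
        return "mbedTLS (fallback)"       # read path: never install, fall back
    if autoinstall and autoload:
        return "httpfs (INSTALL + LOAD)"  # write path may install over the network
    raise RuntimeError("httpfs required to write encrypted parquet files")

# Reading an encrypted file without autoload falls back to mbedTLS:
print(pick_encryption_util(writing=False, encrypted=True,
                           httpfs_local=True, autoload=False, autoinstall=True))
# -> mbedTLS (fallback)
```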
…age (duckdb#21020)

This PR is a work in progress.

Originally when adding the `GEOMETRY` type to core, the idea was to keep
the new type logically separate from the old type (as defined in
`spatial`), but introduce implicit casts between the two to make this
distinction mostly transparent to the end user.

In this PR we instead "merge" the two types by moving the conversion
down to the storage layer (and serialization boundaries) instead of the
execution engine. This makes the migration completely seamless. Newer
DuckDB versions targeting an older storage version write the exact same
geometry format that older versions of DuckDB can read/write with
spatial. But in-memory and all throughout the execution engine (e.g.
where users/extensions/client libraries interface with DuckDB) you still
get the new representation.

As far as I can tell there are 4 places where we need to perform the
conversion between new and old format.

- `Vector::De/Serialize`
- `Value::De/Serialize`
- ART index keys
- Storage (`ColumnData`)

However this gets a bit tricky, as the new and the old geometry type can
no longer be distinguished from one another. They have the same logical
type and the same physical type, and while their byte representations
differ, it is not possible to easily determine if a given binary blob is
one or the other. How do we even know if we need to perform the
conversion or not?

- During Vector/Value de/serialization we solve this by writing a new
optional field specifying the format if we are targeting a newer
serialization version. When reading, we can then determine if we get old
or new geometry data based on whether this field is present.

- For ART indexes ~~we can detect the storage version of the database
that the ART index is in~~ we now persist a "storage_version" optional
value in the `IndexStorageInfo`, which allows us to track which storage
version the index was created in, and therefore know if we need to
convert to the old format when creating ART keys for insertions or
lookups.

- In the storage we can detect the storage version of the database the
table is in, and make use of the new "geometry shredding" functionality
to simply treat the old storage format as if it were just another
shredding layout. This works well because the `GeoColumnData`
abstraction ensures that the disk representation of a shredded geometry
column looks almost identical to a standard column of the same type as
the shredding layout. We then employ a pattern similar to the
vector/value serialization: the `GeometryPersistentColumnData` field
where we normally store the shredding layout type for the segment acts
as the indicator of whether this is a new or old geometry column, based
on whether it is present or not.


There is one place left where things get a bit more complex, and that is
statistics.
The old geometry type (as defined in spatial) is just a type alias over
`BLOB`, and therefore contains `StringStats` (which were never used for
any optimizations), while the new geometry type has its own
`GeometryStats` type (which is very useful for optimizations!). While
it's not possible to convert from one to the other, we can simply write
`StringStats::Unknown()` or `GeometryStats::Unknown()` depending on
which way we cross the conversion boundary, so that's fine.

The hard part is to figure out _what_ stats we actually get during
deserialization. Stats are stored at multiple levels in our storage
(both at table level and row-group level), but don't actually store
their "type" themselves, as that is determined by the table schema. But
the table schema won't tell you if this is an old or new geometry type
(thus having string or geometry stats) since the types are now the same.
I've tried to keep it that way and therefore _not_ introduce a new
discriminator field like I did for value/vector serialization, and
~~instead tried to infer the stats we expect by passing down the
`Catalog` in the deserializer. If the catalog has an older storage
format, we know we're about to deserialize string stats; otherwise we
expect geometry stats.~~

~~This works for de/serializing stats in our storage, but I'm not sure
if this is the best way to go about it, or if we can always rely on a
catalog being present when deserializing stats elsewhere. AFAIK stats
serialization is busted when serializing plans anyway, but maybe there
is some other way to infer if a blob of serialized geometry (or string)
stats comes from an older or newer duckdb.~~

__Update__: I've added a `HasProperty` method to the `Deserializer`,
which allows us to peek the next field id. With this we can detect
whether the stats are old string stats or new geometry stats without
additional context. This should squash my doubts re: stats as part of
plan serialization.
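A toy model of that peek-based disambiguation; field ids, class names, and payloads below are made up, only the `HasProperty` idea comes from the PR:

```python
# Toy field-tagged stream: an ordered list of (field_id, payload) pairs.
# Made-up field ids: 200 = string stats, 300 = geometry stats.

class ToyDeserializer:
    def __init__(self, fields):
        self.fields = fields
        self.pos = 0

    def has_property(self, field_id):
        """Peek whether the next field carries this id (mirrors HasProperty)."""
        return self.pos < len(self.fields) and self.fields[self.pos][0] == field_id

    def read(self):
        _, payload = self.fields[self.pos]
        self.pos += 1
        return payload

def deserialize_stats(d):
    # Old databases serialized string stats, new ones geometry stats; peeking
    # the next field id tells them apart without any extra context.
    if d.has_property(300):
        return ("GeometryStats", d.read())
    return ("StringStats", d.read())

print(deserialize_stats(ToyDeserializer([(200, "max_len=32")]))[0])      # -> StringStats
print(deserialize_stats(ToyDeserializer([(300, "bbox=(0,0,1,1)")]))[0])  # -> GeometryStats
```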

-----------------------

Additionally, this PR also introduces some guards when trying to create
tables using newly added types (such as `VARIANT` and `GEOMETRY(<WITH
CRS>)`) when targeting older storage formats, and adjusts tests
accordingly.
* Re-enable regular comparison joins for simple AsOf joins.
Change the syntax of `ALTER DATABASE <name> RENAME TO <alias>` to `ALTER
DATABASE <name> SET ALIAS TO <alias>` to better reflect what's going on.

```c
if (strcasecmp($7, "alias") != 0) {
	ereport(ERROR,
	        (errcode(PG_ERRCODE_SYNTAX_ERROR),
	         errmsg("expected SET ALIAS TO, got SET %s TO", $7),
	         parser_errposition(@7)));
}
```

This is a bit of a hacky solution, but adding `ALIAS` to the list of
unreserved keywords would disallow using `ALIAS` as an implicit alias
(`SELECT 1 alias`), since those can only be identifiers (non-keywords).

Adding the newly added `order_options` to be copied over when copying
`TableScanBindData`, making sure we don't run into unexpected errors.
* Re-enable regular comparison joins for simple AsOf joins.
…uckdb#21018)

Per the ADBC specification
(https://github.com/apache/arrow-adbc/blob/7b38cf4543330592a40a5023b4e6b93f8f34d7ff/c/include/arrow-adbc/adbc.h#L1561-L1664),
list-typed fields at or above the requested depth must be empty lists,
not null, even when filters exclude all items.
Null is reserved for fields below the requested depth level.
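The rule can be sketched like this (a toy model; the depth names and helper are illustrative, not the ADBC C API):

```python
# Toy model of the ADBC GetObjects depth rule: when filters match nothing,
# list-typed fields at or above the requested depth must be empty lists,
# while fields below the requested depth stay null (None).

DEPTHS = ["catalogs", "db_schemas", "tables", "columns"]

def empty_result_field(requested_depth, level):
    """Value of the child-list field at one level of an all-filtered result."""
    if level <= DEPTHS.index(requested_depth):
        return []    # at/above requested depth: empty list, never null
    return None      # below requested depth: null is correct here

# Requesting depth "db_schemas" with filters that exclude all items:
print([empty_result_field("db_schemas", i) for i in range(len(DEPTHS))])
# -> [[], [], None, None]
```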

Found with the adbc driver validation test suite.
https://github.com/adbc-drivers/validation

cc @lidavidm
)

This fixes up
duckdb@def4d91
for cases where a previous test happened to fail.
@Dtenwolde Dtenwolde closed this Feb 27, 2026