V1.5 variegata -> main #214

Closed
Dtenwolde wants to merge 6073 commits into main from v1.5-variegata

Conversation

@Dtenwolde
Collaborator

No description provided.

Laurens Kuiper and others added 30 commits February 13, 2026 14:17
Commit should have been lost in merge somewhere
… type) and use column data count instead of row group count for shredding
…nup relevant HTTP-transaction specific state
…ection - call SetAppendRequiresNewRowGroup so that new appends will go to a new row group
…imitive types to be inlined (duckdb#20951)

Follow-up / includes duckdb#20947 (actual
diff [here](
https://github.com/Mytherin/duckdb/compare/swapuntypedtyped...Mytherin:duckdb:optionaluntyped?expand=1))

When shredding variants, we store them using the following schema
(shredded on `STRUCT(a DATE, b INT)`):

```sql
shredded STRUCT(
	typed_value STRUCT(
		a STRUCT(
			typed_value DATE,
			untyped_value_index UINTEGER
		),
		b STRUCT(
			typed_value INT,
			untyped_value_index UINTEGER
		)
	),
	untyped_value_index UINTEGER
)
```

When they are fully shredded, `untyped_value_index` is entirely `NULL`.
As such, it doesn't contain any useful information. This PR reworks our
shredding and unshredding code to allow:

* (1) `untyped_value_index` to be left out entirely if the type is fully
shredded
* (2) `typed_value` to be inlined in the parent if `typed_value` has a
primitive (i.e. non-nested) type

If the above struct is fully shredded, we would instead get the
following representation:

```sql
shredded STRUCT(
	typed_value STRUCT(
		a DATE,
		b INT
	)
)
```

This doesn't save much storage space, given that the various
`untyped_value_index` layers are all stored as constant `NULL` in this
situation, but it saves us a bunch of unnecessary column metadata. More
importantly, this allows us to look only at the shredded schema instead
of looking at the variant stats to realize a field is fully shredded.
This is useful for simplifying shredded execution. We also save on
allocating many empty vectors for reading fully shredded data.
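As a rough illustration of the two rules, here is a small Python sketch. The dict-based schema encoding and the `simplify` helper are hypothetical, not DuckDB's internal representation:

```python
# Hypothetical sketch of the shredded-schema simplification described above.
# A schema node is a dict with a "typed_value" (either a primitive type name
# or a dict of child fields) and an "untyped_value_index" column.

def simplify(node, fully_shredded):
    """Apply the two new rules:
    1. drop untyped_value_index entirely when the type is fully shredded,
    2. inline typed_value into the parent when it has a primitive type."""
    typed = node["typed_value"]
    if isinstance(typed, dict):
        # nested typed_value: recurse into the children, keep the wrapper
        children = {name: simplify(child, fully_shredded)
                    for name, child in typed.items()}
        out = {"typed_value": children}
    elif fully_shredded:
        return typed  # rule 2: a primitive typed_value is inlined
    else:
        out = {"typed_value": typed}
    if not fully_shredded:
        out["untyped_value_index"] = "UINTEGER"  # rule 1 does not apply
    return out

# The example from the PR: a variant shredded on STRUCT(a DATE, b INT)
schema = {
    "typed_value": {
        "a": {"typed_value": "DATE", "untyped_value_index": "UINTEGER"},
        "b": {"typed_value": "INT", "untyped_value_index": "UINTEGER"},
    },
    "untyped_value_index": "UINTEGER",
}

# Fully shredded: the index columns disappear and a/b collapse to DATE/INT.
print(simplify(schema, fully_shredded=True))
# -> {'typed_value': {'a': 'DATE', 'b': 'INT'}}
```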

Noticed in duckdb-wasm, where the two types have different sizes.

The current way to enable (opt-in) and disable the PEG parser is not
really intuitive (`SET allow_parser_override_extension=strict`). So I've
added two options, `enable_peg_parser` and `disable_peg_parser`, which
are essentially aliases for the old `SET`.

Internally they call `SET allow_parser_override_extension=strict/default`:
`strict` = exclusively use the PEG parser,
`default` = exclusively use the old parser.
At least for now that's the default behaviour.
It also gives a nicer error message when the user tries this setting
while the `autocomplete` extension is not loaded.
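The new options can be thought of as thin aliases over the existing setting. A minimal sketch in Python; the `resolve` helper and the alias table are illustrative, only the option and setting names come from this PR:

```python
# Illustrative alias table: the two new options simply expand to the
# existing (less intuitive) SET statement.

ALIASES = {
    "enable_peg_parser": ("allow_parser_override_extension", "strict"),    # PEG parser only
    "disable_peg_parser": ("allow_parser_override_extension", "default"),  # old parser only
}

def resolve(option):
    """Translate a convenience option into the equivalent SET statement."""
    setting, value = ALIASES[option]
    return f"SET {setting}={value}"

print(resolve("enable_peg_parser"))
# -> SET allow_parser_override_extension=strict
```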
With PR duckdb/extension-ci-tools#306 integrated, it is now necessary to
specify both `linux_amd64_musl` and `linux_arm64_musl` as "opt-in" for
extension builds (they are now disabled by default).

Ref: duckdblabs/duckdb-internal#4616
pdet and others added 29 commits February 25, 2026 11:20
…uckdb#21069)

Three independent fixes:

* 1st commit, by @ccfelius: avoid requiring GetEncryptionUtil when
reading non-encrypted parquet files
* 2nd and 3rd commits, by me: given that in the read path the mbedTLS
encryption util (which is always available) is workable, attempt to
`auto-LOAD httpfs`, but do not `auto-INSTALL httpfs`
* 4th commit, together with @ccfelius: fix the error message to make
clear one might also get there while writing encrypted parquet files

After this PR, the behaviour will be as follows:

* when reading plain parquet files, httpfs is NOT required for
encryption (it might be needed for the file system)
* when writing plain parquet files, httpfs is NOT required for
encryption
* when reading encrypted parquet files, httpfs is loaded IFF it is
available locally AND autoload_known_extensions is set; otherwise we
fall back to mbedTLS (which might be slower but is always available)
* when writing encrypted parquet files, httpfs is loaded IFF it is
available locally (and the settings allow it); otherwise it's INSTALLED
and LOADED (according to the settings); otherwise we fail with a
relevant error message

The same behaviour should hold for plaintext or encrypted duckdb files.
Basically, writing a file is meant as an instruction to auto-install and
auto-load `httpfs` (after checking the settings), while in the read path
it's implicitly OK to load (to speed up decryption) but no install
should be attempted (that is, no networking should be triggered).
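The decision tree described above can be sketched as follows. The helper and its flags are illustrative; the real logic lives in DuckDB's parquet encryption code paths:

```python
# Hypothetical sketch of when httpfs is loaded/installed for parquet encryption.

def pick_encryption_util(writing, encrypted, httpfs_local, autoload, autoinstall):
    """Return the encryption util (and actions) for one read/write request."""
    if not encrypted:
        return "none required"            # plain files never need httpfs for encryption
    if httpfs_local and autoload:
        return "httpfs (LOAD only)"       # available locally: just load, no networking
    if not writing:
        return "mbedTLS (fallback)"       # read path: never install, fall back
    if autoinstall and autoload:
        return "httpfs (INSTALL + LOAD)"  # write path may install over the network
    raise RuntimeError("httpfs required to write encrypted parquet files")

# Reading an encrypted file without autoload falls back to mbedTLS:
print(pick_encryption_util(writing=False, encrypted=True,
                           httpfs_local=True, autoload=False, autoinstall=True))
# -> mbedTLS (fallback)
```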
…age (duckdb#21020)

This PR is a work in progress.

Originally when adding the `GEOMETRY` type to core, the idea was to keep
the new type logically separate from the old type (as defined in
`spatial`), but introduce implicit casts between the two to make this
distinction mostly transparent to the end user.

In this PR we instead "merge" the two types by moving the conversion
down to the storage layer (and serialization boundaries) instead of the
execution engine. This makes the migration completely seamless. Newer
DuckDB versions targeting an older storage version write the exact same
geometry format that older versions of DuckDB can read/write with
spatial. But in-memory and all throughout the execution engine (e.g.
where users/extensions/client libraries interface with DuckDB) you still
get the new representation.

As far as I can tell there are 4 places where we need to perform the
conversion between new and old format.

- `Vector::De/Serialize`
- `Value::De/Serialize`
- ART index keys
- Storage (`ColumnData`)

However this gets a bit tricky, as the new and the old geometry type can
no longer be distinguished from one another. They have the same logical
type and the same physical type, and while their byte representations
differ, it is not possible to easily determine if a given binary blob is
one or the other. How do we even know if we need to perform the
conversion or not?

- During Vector/Value de/serialization we solve this by writing a new
optional field specifying the format if we are targeting a newer
serialization version. When reading, we can then determine if we get old
or new geometry data based on whether this field is present.

- For ART indexes ~~we can detect the storage version of the database
that the ART index is in~~ we now persist a "storage_version" optional
value in the `IndexStorageInfo`, which allows us to track which storage
version the index was created in, and therefore know if we need to
convert to the old format when creating ART keys for insertions or
lookups.

- In the storage we can detect the storage version of the database the
table is in, and make use of the new "geometry shredding" functionality
to simply treat the old storage format as if it were just another
shredding layout. This works well because the `GeoColumnData`
abstraction ensures that the disk representation of a shredded geometry
column looks almost identical to a standard column of the same type as
the shredding layout. We then employ a pattern similar to the
vector/value serialization: the `GeometryPersistentColumnData` field
where we normally store the shredding layout type for the segment acts
as the indicator of whether this is a new or old geometry column, based
on whether it is present or not.


There is one place left where things get a bit more complex, and that is
statistics.
The old geometry type (as defined in spatial) is just a type alias over
`BLOB`, and therefore contains `StringStats` (which were never used for
any optimizations), while the new geometry type has its own
`GeometryStats` type (which is very useful for optimizations!). While
it's not possible to convert from one to the other, we can simply write
`StringStats::Unknown()` or `GeometryStats::Unknown()` depending on
which way we cross the conversion boundary, so that's fine.

The hard part is to figure out _what_ stats we actually get during
deserialization. Stats are stored at multiple levels in our storage
(both at table level and row-group level), but don't actually store
their "type" themselves, as that is determined by the table schema. But
the table schema won't tell you if this is an old or new geometry type
(thus having string or geometry stats) since the types are now the same.
I've tried to keep it that way and therefore _not_ introduce a new
discriminator field like I did for value/vector serialization, and
~~instead tried to infer the stats we expect by passing down the
`Catalog` in the deserializer. If the catalog has an older storage
format, we know we're about to deserialize string stats; otherwise we
expect geometry stats.~~

~~This works for de/serializing stats in our storage, but I'm not sure
if this is the best way to go about it, or if we can always rely on a
catalog being present when deserializing stats elsewhere. AFAIK stats
serialization is busted when serializing plans anyway, but maybe there
is some other way to infer if a blob of serialized geometry (or string)
stats comes from an older or newer duckdb.~~

__Update__: I've added a `HasProperty` method to the `Deserializer`,
which allows us to peek the next field id. With this we can detect
whether the stats are old string stats or new geometry stats without
additional context. This should squash my doubts re: stats as part of
plan serialization.
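A toy model of that peek-based disambiguation; field ids, class names, and payloads below are made up, only the `HasProperty` idea comes from the PR:

```python
# Toy field-tagged stream: an ordered list of (field_id, payload) pairs.
# Made-up field ids: 200 = string stats, 300 = geometry stats.

class ToyDeserializer:
    def __init__(self, fields):
        self.fields = fields
        self.pos = 0

    def has_property(self, field_id):
        """Peek whether the next field carries this id (mirrors HasProperty)."""
        return self.pos < len(self.fields) and self.fields[self.pos][0] == field_id

    def read(self):
        _, payload = self.fields[self.pos]
        self.pos += 1
        return payload

def deserialize_stats(d):
    # Old databases serialized string stats, new ones geometry stats; peeking
    # the next field id tells them apart without any extra context.
    if d.has_property(300):
        return ("GeometryStats", d.read())
    return ("StringStats", d.read())

print(deserialize_stats(ToyDeserializer([(200, "max_len=32")]))[0])      # -> StringStats
print(deserialize_stats(ToyDeserializer([(300, "bbox=(0,0,1,1)")]))[0])  # -> GeometryStats
```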

-----------------------

Additionally, this PR also introduces some guards when trying to create
tables using newly added types (such as `VARIANT` and `GEOMETRY(<WITH
CRS>)`) when targeting older storage formats, and adjusts tests
accordingly.
* Re-enable regular comparison joins for simple AsOf joins.
Change the syntax of `ALTER DATABASE <name> RENAME TO <alias>` to `ALTER
DATABASE <name> SET ALIAS TO <alias>` to better reflect what's going on.

```c
if (strcasecmp($7, "alias") != 0) {
	ereport(ERROR,
	        (errcode(PG_ERRCODE_SYNTAX_ERROR),
	         errmsg("expected SET ALIAS TO, got SET %s TO", $7),
	         parser_errposition(@7)));
}
```

This is a bit of a hacky solution, but adding `ALIAS` to the list of
unreserved keywords would disallow using `ALIAS` as an implicit alias
(`SELECT 1 alias`), since those can only be identifiers (non-keywords).

Adding the newly added `order_options` to be copied over when copying
`TableScanBindData`, making sure we don't run into unexpected errors.
* Re-enable regular comparison joins for simple AsOf joins.
…uckdb#21018)

Per the ADBC specification
(https://github.com/apache/arrow-adbc/blob/7b38cf4543330592a40a5023b4e6b93f8f34d7ff/c/include/arrow-adbc/adbc.h#L1561-L1664),
list-typed fields at or above the requested depth must be empty lists,
not null, even when filters exclude all items.
Null is reserved for fields below the requested depth level.
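The rule can be sketched like this (a toy model; the depth names and helper are illustrative, not the ADBC C API):

```python
# Toy model of the ADBC GetObjects depth rule: when filters match nothing,
# list-typed fields at or above the requested depth must be empty lists,
# while fields below the requested depth stay null (None).

DEPTHS = ["catalogs", "db_schemas", "tables", "columns"]

def empty_result_field(requested_depth, level):
    """Value of the child-list field at one level of an all-filtered result."""
    if level <= DEPTHS.index(requested_depth):
        return []    # at/above requested depth: empty list, never null
    return None      # below requested depth: null is correct here

# Requesting depth "db_schemas" with filters that exclude all items:
print([empty_result_field("db_schemas", i) for i in range(len(DEPTHS))])
# -> [[], [], None, None]
```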

Found with the adbc driver validation test suite.
https://github.com/adbc-drivers/validation

cc @lidavidm
)

This fixes up
duckdb@def4d91
for cases where a previous test happened to fail.
@Dtenwolde Dtenwolde closed this Feb 27, 2026