…e types in shredded variants
Commit should have been lost in merge somewhere
now throws a useful error
… type) and use column data count instead of row group count for shredding
…nup relevant HTTP-transaction specific state
…ection - call SetAppendRequiresNewRowGroup so that new appends will go to a new row group
…imitive types to be inlined (duckdb#20951)

Follow-up / includes duckdb#20947 (actual diff [here](https://github.com/Mytherin/duckdb/compare/swapuntypedtyped...Mytherin:duckdb:optionaluntyped?expand=1))

When shredding variants we store them using the following schema (shredded on `STRUCT(a DATE, b INT)`):

```sql
shredded STRUCT(
    typed_value STRUCT(
        a STRUCT(
            typed_value DATE,
            untyped_value_index UINTEGER
        ),
        b STRUCT(
            typed_value INT,
            untyped_value_index UINTEGER
        )
    ),
    untyped_value_index UINTEGER
)
```

When a variant is fully shredded, `untyped_value_index` is entirely `NULL`, so it doesn't contain any useful information. This PR reworks our shredding and unshredding code to allow:

* (1) `untyped_value_index` to be left out entirely if the type is fully shredded
* (2) `typed_value` to be inlined in the parent if `typed_value` has a primitive (i.e. non-nested) type

If the above struct is fully shredded, we instead get the following representation:

```sql
shredded STRUCT(
    typed_value STRUCT(
        a DATE,
        b INT
    )
)
```

This doesn't save much storage space, given that the various `untyped_value_index` layers are all stored as constant `NULL` in this situation, but it saves us a bunch of unnecessary column metadata. More importantly, it allows us to look only at the shredded schema, instead of the variant stats, to realize that a field is fully shredded, which is useful for simplifying shredded execution. We also save on allocating many empty vectors when reading fully shredded data.
…d a compiler error??
Noticed in duckdb-wasm, where the two types have different sizes
…ext-page-pointer, which we are using here
The current way to enable (opt-in) and disable the PEG parser is not really intuitive (`SET allow_parser_override_extension=strict`). So I've added options that are essentially aliases for the old `SET`: `enable_peg_parser` and `disable_peg_parser`. Internally they call `SET allow_parser_override_extension=strict/default`:

* `strict` = exclusively use the PEG parser
* `default` = exclusively use the old parser (at least for now, this is the default behaviour)

This also gives a nicer error message when the user tries this setting while the `autocomplete` extension is not loaded.
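A sketch of how the new aliases might be used; the option names come from this PR, but the boolean assignment form is an assumption:

```sql
-- Sketch: convenience aliases for the PEG parser toggle
-- (assumes the autocomplete extension is loaded).
SET enable_peg_parser = true;   -- alias for: SET allow_parser_override_extension = 'strict'
SET disable_peg_parser = true;  -- alias for: SET allow_parser_override_extension = 'default'
```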
The update includes only the duckdb/odbc-scanner#154 PR with the CMake fix. Ref: duckdblabs/duckdb-internal#7433
With the duckdb/extension-ci-tools#306 PR integrated, it is now necessary to specify both `linux_amd64_musl` and `linux_arm64_musl` as "opt-in" for extension builds (they are now disabled by default). Ref: duckdblabs/duckdb-internal#4616
…uckdb#21069) Three independent fixes:

* 1st commit, by @ccfelius: avoid requiring GetEncryptionUtil when reading non-encrypted Parquet files
* 2nd and 3rd commit, by me: since in the read path the mbedTLS encryption util (which is always available) is workable, attempt to `auto-LOAD httpfs`, but do not `auto-INSTALL httpfs`
* 4th commit, together with @ccfelius: fix the error message to make clear one might also get there while writing encrypted Parquet files

After this PR, the behaviour is as follows:

* when reading plain Parquet files, httpfs is NOT required for encryption (it might be for the file system)
* when writing plain Parquet files, httpfs is NOT required for encryption
* when reading encrypted Parquet files, httpfs is loaded IFF it is available locally AND `autoload_known_extensions` is set; otherwise we fall back to mbedTLS (which might be slower but is always available)
* when writing encrypted Parquet files, httpfs is loaded IFF it is available locally (and settings allow); otherwise it is INSTALLED and LOADED (according to settings); otherwise we fail with a relevant error message

The same behaviour should hold for plaintext or encrypted DuckDB files. Basically, writing a file is meant as an instruction to auto-install and auto-load `httpfs` (after checking settings), while in the read path it is implicitly OK to load (to speed up decryption) but no install should be attempted (that is, no networking should be triggered).
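The read-path behaviour above can be illustrated with a sketch; the file names and key material are hypothetical, and `add_parquet_key`/`encryption_config` follow DuckDB's existing Parquet encryption interface:

```sql
-- Reading a plain Parquet file: httpfs is NOT required for encryption purposes.
SELECT * FROM read_parquet('plain.parquet');

-- Reading an encrypted Parquet file: httpfs is auto-LOADed only if it is
-- already available locally and autoload_known_extensions is set; it is
-- never auto-INSTALLed. Otherwise we fall back to mbedTLS.
PRAGMA add_parquet_key('key256', '01234567891123456789212345678931');
SELECT * FROM read_parquet('encrypted.parquet', encryption_config = {footer_key: 'key256'});
```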
…age (duckdb#21020) This PR is a work in progress.

Originally, when adding the `GEOMETRY` type to core, the idea was to keep the new type logically separate from the old type (as defined in `spatial`), but introduce implicit casts between the two to make this distinction mostly transparent to the end user. In this PR we instead "merge" the two types by moving the conversion down to the storage layer (and serialization boundaries) instead of the execution engine. This makes the migration completely seamless: newer DuckDB versions targeting an older storage version write the exact same geometry format that older versions of DuckDB can read/write with spatial, but in memory and throughout the execution engine (e.g. where users/extensions/client libraries interface with DuckDB) you still get the new representation.

As far as I can tell there are 4 places where we need to perform the conversion between new and old format:

- `Vector::De/Serialize`
- `Value::De/Serialize`
- ART index keys
- Storage (`ColumnData`)

However, this gets a bit tricky, as the new and the old geometry type can no longer be distinguished from one another. They have the same logical type and the same physical type, and while their byte representations differ, it is not possible to easily determine whether a given binary blob is one or the other. How do we even know if we need to perform the conversion or not?

- During Vector/Value de/serialization we solve this by writing a new optional field specifying the format if we are targeting a newer serialization version. When reading, we can then determine whether we get old or new geometry data based on whether this field is present.
- For ART indexes ~~we can detect the storage version of the database that the ART index is in~~ we now persist a "storage_version" optional value in the `IndexStorageInfo`, which allows us to track which storage version the index was created in, and therefore know whether we need to convert to the old format when creating ART keys for insertion or lookups.
- In the storage we can detect the storage version of the database the table is in, and use the new "geometry shredding" functionality to simply treat the old storage format as if it is just any other shredding layout. This works well because the `GeoColumnData` abstraction ensures that the disk representation of a shredded geometry column looks almost identical to a standard column of the same type as the shredding layout. We then employ a similar pattern as the vector/value serialization: the `GeometryPersistentColumnData`, where we normally store the shredding layout type for the segment, acts as the indicator of whether this is a new or old geometry column, based on whether it is present.

There is one place left where things get a bit more complex, and that is statistics. The old geometry type (as defined in spatial) is just a type alias over `BLOB`, and therefore contains `StringStats` (which were never used for any optimizations), while the new geometry type has its own `GeometryStats` type (which is very useful for optimizations!). While it's not possible to convert from one to the other, we can simply write `StringStats::Unknown()` or `GeometryStats::Unknown()` depending on which way we go when crossing the conversion boundary, so that's fine. The hard part is figuring out _what_ stats we actually get during deserialization. Stats are stored at multiple levels in our storage (both at the table level and the row-group level), but don't actually store their "type" themselves, as that is determined by the table schema. But the table schema won't tell you whether this is an old or new geometry type (and thus whether it has string or geometry stats), since the types are now the same. I've tried to keep it that way and therefore _not_ introduced a new discriminator field like I did for value/vector serialization, and ~~instead tried to infer the stats we expect by passing down the `Catalog` in the deserializer. If the catalog has an older storage format, we know we're about to deserialize string stats; otherwise we expect geometry stats.~~ ~~This works for de/serializing stats in our storage, but I'm not sure if this is the best way to go about it, or if we can always rely on a catalog being present when deserializing stats elsewhere. AFAIK stats serialization is busted when serializing plans anyway, but maybe there is some other way to infer whether a blob of serialized geometry (or string) stats comes from an older or newer DuckDB.~~

__Update__: I've added a `HasProperty` method to the `Deserializer` which allows us to peek the next field id. With this we can detect whether the stats are old string stats or new geometry stats without additional context. This should squash my doubts re: stats as part of plan serialization.

-----------------------

Additionally, this PR introduces some guards when trying to create tables using newly added types (such as `VARIANT` and `GEOMETRY(<WITH CRS>)`) when targeting older storage formats, and adjusts tests accordingly.
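As a sketch of the new guards on older storage formats; the syntax follows DuckDB's existing `ATTACH ... (STORAGE_VERSION ...)` option, and the file name and version string here are hypothetical:

```sql
-- Attach a database pinned to an older storage format...
ATTACH 'old.db' (STORAGE_VERSION 'v1.0.0');
-- ...creating tables with newly added types is now expected to error out,
-- instead of silently writing data that older readers cannot understand.
CREATE TABLE old.t (v VARIANT);
```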
* Re-enable regular comparison joins for simple AsOf joins.
Change the syntax of `ALTER DATABASE <name> RENAME TO <alias>` to `ALTER
DATABASE <name> SET ALIAS TO <alias>` to better reflect what's going on.
```
if (strcasecmp($7, "alias") != 0) {
ereport(ERROR,
(errcode(PG_ERRCODE_SYNTAX_ERROR),
errmsg("expected SET ALIAS TO, got SET %s TO", $7),
parser_errposition(@7)));
}
```
This is a bit of a hacky solution, but adding `ALIAS` to the list of
unreserved keywords would disallow using `ALIAS` as an implicit alias
(`SELECT 1 alias`), since those can only be identifiers (non-keywords).
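Usage under the new syntax (database names are hypothetical):

```sql
-- Old syntax:  ALTER DATABASE db1 RENAME TO db2;
-- New syntax:
ALTER DATABASE db1 SET ALIAS TO db2;
```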
Copy the newly added `order_options` over when copying `TableScanBindData`, making sure we don't run into unexpected errors.
…d from JSON "null" values (duckdb#21098) This PR fixes duckdb#21070
…uckdb#21018) Per the ADBC specification (https://github.com/apache/arrow-adbc/blob/7b38cf4543330592a40a5023b4e6b93f8f34d7ff/c/include/arrow-adbc/adbc.h#L1561-L1664), list-typed fields at or above the requested depth must be empty lists, not null, even when filters exclude all items. Null is reserved for fields below the requested depth level. Found with the adbc driver validation test suite. https://github.com/adbc-drivers/validation cc @lidavidm
This fixes up duckdb@def4d91 for cases where a previous test happens to fail.