Description
@wjones127 this is very interesting, but having read https://github.com/apache/spark/tree/master/common/variant, I'm struggling to understand which queries would be fast with the Open Variant Type (OVT). That document dives straight into the details of the format without taking time to motivate it or explain where it's expected to be useful, let alone give any quantitative measure of how much faster it should be.
To make these questions more concrete, here are some direct questions:
- will OVT allow vectorised operations directly on the contained data?
- will serializing data to OVT be faster than to JSON, if so by how much?
- same for deserializing?
- what's the relative size of OVT vs JSON?
- using a binary format like this means we have to decode rows to JSON or similar whenever we want to show them to users; any idea what the overhead of that will be? (I guess this is the same as the "deserializing" question.)
Will the following queries be faster than datafusion-functions-json, and if so by roughly how much?
```sql
select count(*) from my_table where ovt_column ? 'foo'
select count(*) from my_table where not(ovt_column ? 'foo')
select count(*) from my_table where ovt_column->'foo'->'bar' ? 'spam'
select count(*) from my_table where ovt_column->'foo' = 42
select count(*) from my_table where ovt_column->'foo'->'bar'->'spam' = 42
select sum(ovt_column->'foo') from my_table
```
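For concreteness, this is roughly how I'd express the same queries today with datafusion-functions-json over a plain JSON string column (an untested sketch; `json_column` is just a placeholder name):

```sql
-- rough datafusion-functions-json equivalents over a JSON string column
select count(*) from my_table where json_contains(json_column, 'foo');
select count(*) from my_table where not json_contains(json_column, 'foo');
select count(*) from my_table where json_contains(json_column, 'foo', 'bar', 'spam');
select count(*) from my_table where json_get_int(json_column, 'foo') = 42;
select count(*) from my_table where json_get_int(json_column, 'foo', 'bar', 'spam') = 42;
select sum(json_get_int(json_column, 'foo')) from my_table;
```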
In particular, is there any way to exclude some rows from a filter by knowing they don't contain a specific key?
I guess benchmarks vs datafusion-functions-json would be extremely interesting.
Some background: based on my experience building jiter, I have the kernel of an idea for a binary format to encode JSON data, but I want to avoid building it if OVT is really going to work, especially since OVT is (to one degree or another) a standard endorsed by Apache / Spark.