Description
@wjones127 this is very interesting, but having read https://github.com/apache/spark/tree/master/common/variant, I'm struggling to understand which queries would be fast with the Open Variant Type (OVT). That document dives straight into the details of the format without taking time to motivate it or explain where it's expected to be useful, let alone give any quantitative measure of how much faster it should be.
To make these questions more concrete, here are some direct questions:
- will OVT allow vectorised operations directly on the contained data?
- will serializing data to OVT be faster than to JSON, if so by how much?
- same for deserializing?
- what's the relative size of OVT vs JSON?
- using a binary format like this means we have to decode rows to JSON or similar whenever we want to show them to users; any idea what the overhead of that will be? (I guess this is the same as the "deserializing" question.)
Will the following queries be faster than datafusion-functions-json, and if so by roughly how much?
```sql
select count(*) from my_table where ovt_column ? 'foo'
select count(*) from my_table where not(ovt_column ? 'foo')
select count(*) from my_table where ovt_column->'foo'->'bar' ? 'spam'
select count(*) from my_table where ovt_column->'foo' = 42
select count(*) from my_table where ovt_column->'foo'->'bar'->'spam' = 42
select sum(ovt_column->'foo') from my_table
```
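For concreteness, this is roughly how I'd express the same queries today with datafusion-functions-json over a plain JSON string column (an untested sketch; `json_column` is just a placeholder name):

```sql
-- rough datafusion-functions-json equivalents over a JSON string column
select count(*) from my_table where json_contains(json_column, 'foo');
select count(*) from my_table where not json_contains(json_column, 'foo');
select count(*) from my_table where json_contains(json_column, 'foo', 'bar', 'spam');
select count(*) from my_table where json_get_int(json_column, 'foo') = 42;
select count(*) from my_table where json_get_int(json_column, 'foo', 'bar', 'spam') = 42;
select sum(json_get_int(json_column, 'foo')) from my_table;
```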
In particular, is there any way to exclude some rows from a filter by knowing they don't contain a specific key?
I guess benchmarks vs datafusion-functions-json would be extremely interesting.
Some background: based on my experience building jiter, I have the kernel of an idea for a binary format to encode JSON data, but I want to avoid building it if OVT is really going to work, especially since OVT is (to one degree or another) a standard endorsed by Apache / Spark.