`oximeter`'s byte array datum could use some thought #4551

To implement #4311 (support for missing samples), it makes sense to store each datum type in a nullable column in ClickHouse. For example, a `Cumulative<u32>` would be stored in a column of type `Nullable(UInt32)`. However, the `Nullable` wrapper [cannot contain _composite types_](https://clickhouse.com/docs/en/sql-reference/data-types/nullable). That rules out arrays, which we use to represent both histograms and arbitrary byte arrays. Histograms we can handle unambiguously: it's not possible to create an `oximeter::Histogram<T>` with zero bins, so an empty array can serve as a sentinel for a missing sample. The same trick doesn't work for byte arrays, where an empty array is presumably a legitimate value, or at least it raises a number of questions:
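The empty-array sentinel for histograms can be sketched roughly as follows. This is only an illustration under stated assumptions: `counts_to_column` and `column_to_counts` are hypothetical helpers, not part of `oximeter`, and the bin counts are modeled as a plain `Vec<u64>` rather than the real histogram type.

```rust
/// Hypothetical helper: turn an optional list of histogram bin counts into
/// the value stored in a non-nullable `Array(UInt64)` column. A missing
/// sample is written as the empty array.
fn counts_to_column(counts: Option<Vec<u64>>) -> Vec<u64> {
    match counts {
        Some(c) => {
            // Relies on the invariant from the text: a real histogram
            // always has at least one bin, so it can never collide with
            // the sentinel.
            assert!(!c.is_empty(), "a present histogram needs at least one bin");
            c
        }
        None => Vec::new(),
    }
}

/// Hypothetical helper: reverse the mapping when reading the column back.
fn column_to_counts(column: Vec<u64>) -> Option<Vec<u64>> {
    if column.is_empty() {
        None
    } else {
        Some(column)
    }
}

fn main() {
    let present = counts_to_column(Some(vec![0, 3, 1]));
    assert_eq!(column_to_counts(present), Some(vec![0, 3, 1]));

    let missing = counts_to_column(None);
    assert_eq!(column_to_counts(missing), None);
}
```

The mapping is only reversible because the empty array is never a valid histogram; a byte array has no such unused value to borrow.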

1. If we can't store bytes in the database as `Nullable(Array(UInt8))`, what type can we use? ClickHouse's `String` type is a C++-style string, meaning it's really just an array of octets with no encoding enforced. We could use it, but it interacts poorly with the JSON encoding we currently use to talk to the database: I don't believe it's possible to round-trip an escaped "byte string" to the database via JSON. It would be possible if we switched to a binary serialization format, such as `RowBinary`.
2. Even if we _could_ do some work to encode the bytes as a string on the way in and out of the database, the result is difficult to use correctly. First, anyone accessing the database out of band of this client needs to know the encoding and how to reverse it. Second, the need to encode or decode the value is not reflected in the database's type information. There is no separate `BLOB` type, since that's just an alias for `String`, so one would need to know which table a value was extracted from to know whether decoding is required. That becomes even trickier if / when we allow selecting data from more than one table. E.g., what happens if someone selects a "true" string and a byte array from two tables? It's not impossible to handle, it just can't happen directly at the serialization layer.
3. We could always choose to encode the `Datum::Bytes` object itself. I.e., instead of storing a raw byte array inside it, we could decide to always base64-encode the data and store it as a string. This has the benefit of being more obvious, but still presents some of the problems in (2).
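To make the ambiguity in (2) and (3) concrete, here is a small sketch that uses only the standard library. `base64_encode` is a hand-written helper for illustration (a real implementation would more likely use an existing crate), and the two "column values" are made-up examples.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Minimal standard base64 encoder with `=` padding, for illustration only.
fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three input bytes into a 24-bit group.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        // Emit four 6-bit symbols, padding where input bytes were absent.
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 63) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 63) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}

fn main() {
    // A value from a hypothetical "bytes" table, encoded on the way in.
    let from_bytes_table = base64_encode(b"hello");
    // A value from a hypothetical "string" table, stored verbatim.
    let from_string_table = String::from("aGVsbG8=");

    // Both come back from the database as the same type and the same
    // contents. Nothing in the value says which one needs decoding.
    assert_eq!(from_bytes_table, from_string_table);
}
```

Any fix has to carry the "this is encoded" fact somewhere outside the value itself, such as the table or schema it came from, which is why it can't live purely in the serialization layer.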

There is another option: remove support for byte arrays entirely. No code produces byte array samples today. The type was intended as an escape hatch of sorts, a way to collect arbitrary information. But one can always do that through the string type, with the application deciding how the arbitrary data is encoded. For example, nothing prevents someone from storing JSON or base64-encoded bytes in a `Datum::String` today.
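As a rough sketch of what that would look like from the application's side: the `Datum` enum below is a local stand-in with only a string variant, not the real `oximeter` type, and hex is used as the encoding purely to keep the example short. The choice of encoding would be up to the application.

```rust
/// Local stand-in for the string variant of the datum type (assumption:
/// the real enum has many more variants and lives in `oximeter`).
#[derive(Debug, PartialEq)]
enum Datum {
    String(String),
}

/// Application-chosen encoding: lowercase hex, two characters per byte.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Reverse of `to_hex`. Returns `None` for input that is not valid hex.
fn from_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

fn main() {
    let payload = [0x00u8, 0xff, 0x10];

    // The producer encodes its bytes and reports an ordinary string datum.
    let datum = Datum::String(to_hex(&payload));

    // The consumer, which knows the application's convention, decodes it.
    let Datum::String(text) = &datum;
    assert_eq!(text.as_str(), "00ff10");
    assert_eq!(from_hex(text), Some(payload.to_vec()));
}
```

The database only ever sees a string here, so nullability and JSON transport work as they do for any other string datum; the encoding convention is the application's responsibility.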

I'm partial to this last choice, but there could also be a way to correctly and unambiguously store bytes that I've not considered.
