oximeter's byte array datum could use some thought #4551

Open
@bnaecker

To implement #4311, supporting missing samples, it makes sense to store each datum type in a nullable column in ClickHouse. For example, a Cumulative<u32> would be represented in a column with type Nullable(UInt32). However, the nullable wrapper type cannot contain composite types. That rules out arrays, which we use to represent both histograms and arbitrary byte arrays. The former we can handle unambiguously: it's not possible to create an oximeter::Histogram<T> with zero bins, so we can use an empty array as a sentinel for a missing sample. This doesn't work for byte arrays, where an empty array is a legitimate value, or it at least raises a number of questions:

  1. If we can't store bytes in the database as Nullable(Array(UInt8)), what type can we use? ClickHouse's String type is a C++-style string, which means it's really just an array of octets. We could use this, but it interacts poorly with the JSON encoding we're currently using to talk to the database. I don't believe it's possible to round-trip an escaped "byte string" to the database via JSON. We could do this if we switch to a binary serialization format, such as RowBinary.
  2. Even if we could do some work to encode the bytes as a string on the way in and out of the database, the result is difficult to use correctly. First, if you access the database outside of this client, you need to know the encoding and how to reverse it. Second, the fact that you need to encode or decode the value is not present in the database's type information. There is no separate BLOB type, since that's just an alias for String, so one would need to know which table the value was extracted from to know whether decoding was required. That becomes even trickier if / when we allow selecting data from more than one table. E.g., what happens if someone selects a "true" string and a byte array from two tables? It's not impossible to handle, it just can't happen directly at the serialization layer.
  3. We could always choose to encode the Datum::Bytes object itself. I.e., instead of storing a byte array inside that, we could just decide to always base-64 encode the data, and store things as a string. This has the benefit of being more obvious, but still presents some of the problems in (2).
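To make the sentinel argument concrete, here is a minimal sketch of why an empty array works as a missing-sample marker for histograms but not for byte arrays. The helpers `encode_array` and `decode_array` are hypothetical and are not part of oximeter; they only model writing to and reading from a non-nullable array column.

```rust
// Hypothetical helpers, not part of oximeter: map an optional sample to the
// array stored in a non-nullable array column, and back again.

/// Encode an optional array sample; a missing sample becomes the empty array.
fn encode_array<T: Clone>(sample: Option<&[T]>) -> Vec<T> {
    sample.map(|s| s.to_vec()).unwrap_or_default()
}

/// Decode a stored array, treating the empty array as the missing-sample sentinel.
fn decode_array<T>(stored: Vec<T>) -> Option<Vec<T>> {
    if stored.is_empty() {
        None
    } else {
        Some(stored)
    }
}

fn main() {
    // A histogram always has at least one bin, so the round trip is lossless.
    let bins: Vec<u64> = vec![0, 3, 7];
    assert_eq!(decode_array(encode_array(Some(&bins[..]))), Some(bins.clone()));
    assert_eq!(decode_array(encode_array::<u64>(None)), None);

    // A zero-length byte array is a legitimate value, but it collides with
    // the sentinel and comes back as a missing sample.
    let empty: Vec<u8> = Vec::new();
    assert_eq!(decode_array(encode_array(Some(&empty[..]))), None);
}
```

The last assertion is the problem in miniature: a present-but-empty byte array and a missing sample are indistinguishable once stored.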

There is another option: we remove support for byte arrays. No code produces byte array samples today. The type was intended as an escape hatch of sorts, or a way to collect arbitrary information. But one can still always do that through the string type, where the application decides how the arbitrary data is encoded. For example, nothing prevents someone from storing JSON or base-64 encoded bytes in the Datum::String today.

I'm partial to this last choice, but there could also be a way to correctly and unambiguously store bytes that I've not considered.
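As a rough illustration of the "let the application encode it" option, the sketch below stores bytes in a string datum. It makes several assumptions: the `Datum` enum here is a one-variant stand-in for the real oximeter type, `to_hex_datum` and `from_hex_datum` are hypothetical helpers, and hex is used instead of base-64 only so the example needs nothing beyond the standard library.

```rust
/// Hypothetical stand-in for the string variant of oximeter's datum type.
#[derive(Debug, PartialEq)]
enum Datum {
    String(String),
}

/// Application-side encoding: bytes become a lowercase hex string.
fn to_hex_datum(bytes: &[u8]) -> Datum {
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        s.push_str(&format!("{:02x}", b));
    }
    Datum::String(s)
}

/// Application-side decoding: returns None if the string is not valid hex.
fn from_hex_datum(datum: &Datum) -> Option<Vec<u8>> {
    let Datum::String(s) = datum;
    // Reject odd lengths and any non-hex character up front.
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

fn main() {
    let datum = to_hex_datum(&[0xde, 0xad, 0x00]);
    assert_eq!(datum, Datum::String("dead00".to_string()));
    assert_eq!(from_hex_datum(&datum), Some(vec![0xde, 0xad, 0x00]));

    // An ordinary string is not mistaken for encoded bytes by this decoder,
    // but only the application knows that decoding should be attempted at all.
    assert_eq!(from_hex_datum(&Datum::String("hello".to_string())), None);
}
```

The encoding and its interpretation live entirely in the application, so the database only ever sees an ordinary string.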
