Skip to content

Delta encoding for Span and Id #234

@sampsyo

Description

@sampsyo

This issue is a follow-up proposal to #233, which should be attempted first. Read that issue first for the background.

Whereas "start/length" encoding as described in #233 is probably a clean win with few downsides, there are even more aggressive approaches that do have significant costs. Namely: consider a "start-only" or even "length-only" representation. (The latter would essentially be delta encoding.) This would mean that Span would turn into a single 32- or 16-bit number, with these important trade-offs:

  • These require that all the ranges in a pool are contiguous, i.e., range R's end is always right next to range R+1's start. There are no gaps, and the ranges appear in order.
  • The span information is no longer self-contained; it requires looking at neighboring values to be useful. Namely: the Span<T> by itself would not suffice to find the items you want. You'd need to have something like an Id<Span<T>>, i.e., an index of a span, so that you can look up the neighboring span too, which is required to find the actual range of data you need.
  • In the case of a "length-only" representation, it's even worse: you can't do random access to spans; you always need to iterate from the beginning of the Pool<Span<T>>, summing up the lengths along the way, to find the indices you need. Or, rather, doing random access to look up a specific span would require O(n) time instead of O(1). This might be OK for some types, however, for which it is typical for client code to iterate from the beginning anyway. But it may not be appropriate for all pools, and we'd probably need to maintain multiple representation for different data types.

Aside from Span<T>, the "length-only" representation generalizes to Id<T> as well: this would be an "offset-only" representation where you only encode the offset from the previous Id<T> in a pool. The trade-offs are similar.

This could get to be a pretty involved project, but the idea would be to think carefully about where one of these tricky representations could be profitable. Then, we'll prototype it for that specific type of span, come up with some nice high-level interfaces that make the ranges ergonomic to access, and measure the impact on file size & performance.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions