Consider supporting 1, 2, 4, and 24-bit trie values #4670

Open
@hsivonen

The trie builder always operates on 32-bit values and can then narrow the main backing array value to 8 or 16 bits at serialization time.

We already use a byte array as unaligned backing storage. We should consider slightly extending how index-based reads map onto the backing byte array, so that it can support more compact value widths:

If the byte array had one extra byte at the end, we could use 32-bit unaligned loads to read 24-bit values (masking off the highest 8 bits) without going out of bounds. See also #4669.
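The 24-bit read described above could look roughly like the following. This is only a sketch, not the existing trie code: the names `read_u24` and `pack_u24` are hypothetical, and it assumes the values are stored little-endian, so that the extra byte picked up by the 32-bit load ends up in the highest 8 bits and can be masked off.

```rust
/// Hypothetical helper: packs each value into 3 little-endian bytes and
/// appends one padding byte so a 4-byte load of the last value stays in bounds.
fn pack_u24(values: &[u32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 3 + 1);
    for &v in values {
        // Keep only the low 24 bits (the first three little-endian bytes).
        out.extend_from_slice(&v.to_le_bytes()[..3]);
    }
    out.push(0); // the "one extra byte at the end"
    out
}

/// Hypothetical helper: reads the 24-bit value at `index` using a 32-bit
/// unaligned little-endian load, then masks off the highest 8 bits.
/// Returns `None` if the 4-byte window would fall outside `bytes`.
fn read_u24(bytes: &[u8], index: usize) -> Option<u32> {
    let start = index.checked_mul(3)?;
    let end = start.checked_add(4)?;
    let chunk = bytes.get(start..end)?;
    let word = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    Some(word & 0x00FF_FFFF)
}
```

Without the padding byte, the load for the last value would need a 4-byte window that extends one byte past the end of the array, which is why the extra byte matters.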

For 1-, 2-, and 4-bit values, we could shift and mask the index to read sub-byte fields from an array whose byte length is 1/8, 1/4, or 1/2 of what it would be with 8 bits as the narrowest value.

1 bit is useful for accessing a binary property faster than from a fragmented inversion list.
2 bits is useful for bundling two co-occurring binary properties.
4 bits is useful for enumerated properties with few distinct values, e.g. Joining_Type.
24 bits is useful for scalar values.
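To illustrate the shift-and-mask idea for the 1-, 2-, and 4-bit widths, here is a minimal sketch. The names `pack_bits` and `read_packed` are hypothetical, and the bit order within a byte (lowest index in the lowest bits) is an assumption made for the example, not something the issue specifies.

```rust
/// Hypothetical helper: packs `values` into `width`-bit fields, lowest index
/// in the lowest bits of each byte. `width` must be 1, 2, or 4.
fn pack_bits(values: &[u8], width: u32) -> Vec<u8> {
    assert!(matches!(width, 1 | 2 | 4), "width must be 1, 2, or 4");
    let per_byte = (8 / width) as usize; // 8, 4, or 2 values per byte
    let mask = (1u8 << width) - 1;
    let mut out = vec![0u8; (values.len() + per_byte - 1) / per_byte];
    for (i, &v) in values.iter().enumerate() {
        out[i / per_byte] |= (v & mask) << ((i % per_byte) as u32 * width);
    }
    out
}

/// Hypothetical helper: reads the `width`-bit value at `index`.
/// The high part of the index selects the byte; the low part selects the
/// shift within that byte. Returns `None` for an unsupported width or an
/// index past the end of `bytes`.
fn read_packed(bytes: &[u8], index: usize, width: u32) -> Option<u8> {
    if !matches!(width, 1 | 2 | 4) {
        return None;
    }
    let per_byte = (8 / width) as usize;
    let byte = *bytes.get(index / per_byte)?;
    let shift = (index % per_byte) as u32 * width;
    let mask = (1u8 << width) - 1;
    Some((byte >> shift) & mask)
}
```

Because `per_byte` is a power of two, the division and remainder reduce to a shift and a mask of the index. Note that the bounds check here is per byte, so unused fields in the final byte read as zero rather than as out of range.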

Metadata

Assignees

No one assigned

    Labels

    C-unicode (Component: Props, sets, tries), T-enhancement (Type: Nice-to-have but not required)

    Type

    No type

    Projects

    Status

    Not a 2.0 blocker
