Skip to content

[Question] The Future of the Block API #4138

Open
@Sewer56

Description

The current ZStandard Manual states the following:

Block level API (deprecated)
If the normal API doesn't provide the functionality you need, please open a GitHub issue.

So here I am.

Context

In low latency real time applications such as games, it is necessary to send messages
at a high frequency, to a high amount of clients.

Suppose the following scenario:

  • You are a regular player hosting a 64-player game (no dedicated server)
  • Using a client(many) - host(1) topology.

In this sort of setup, you have the 1 host send packets to 63 clients at an approximate rate of 50/60/100Hz.

A single 'message' sent every tick may look something like this:

  • IP + UDP Header: 28 bytes
  • Network Library Header: 2 bytes
  • Payload: ~2016 bytes [before compression]
    • Assumes average 32 bytes per player. With data 63 players (all players except recipient).

A message like this will need to be sent for each 63 players, resulting in 6300 messages per second at a 100Hz tickrate.

Problem Statement

Important

Working at this scale while targeting regular user machines, every bit of effort must be taken to ensure efficiency.

The payload size in particular is especially sensitive. The MTU of a packet
is around the 1440-1460 range in practice for most ipv4 endpoints I interact with.

What this means that is the complete packet (IP + UDP + Lib + Payload) exceeds ~1440, it must be
split into 2 packets.

When that happens, the host will have to incur:

  • ~32 bytes of overhead per client
    • Think 30 for second header and 2 for two extra fragmentation headers.
  • Doubles probability of dropped or delayed packets.
    • Since we have to send a second packet for each state now.
    • And therefore possible 'jitter' in game.

This overhead can add up quickly.
In this scenario, just this 32 byte overhead alone increases the bandwidth requirement by 1.5Mbit/s.

This alone is the difference between (for example):

  • Supporting 63 vs 64 players.
  • Experiencing packet loss when streaming AV1 video of your current experience.

Note

Big companies can afford to run dedicated servers and swallow inefficiencies however...

Open source library/code/mod authors like me, just doing free stuff for the community don't have these sorts of resources.
All infrastructure is hosted out of pocket at a loss.

Goals

Tip

The goal is to reduce the need to fragment at all costs.

To reduce the need to fragment, the packet size must be minimized as much as possible.
That includes manually taking control of block compression/decompression.

The difference between a 1440 and 1445 byte payload (min frame header + block header) can mean the difference between sending 1 and 2 packets. It is crucial I save every byte possible.

The following ideas come to mind (focusing on zstd integration here).

Using Custom Block Ranges and Dictionaries

Note

This requires some context, so apologies in advance.

Payload size can be reduced by taking manual control of block creation process.
By this I mean controlling which slice of data goes in which block.

Different sections of data can have different distributions of bytes; which in turn affects
the efficiency of huffman coding (and friends).


For this example, suppose we have an array of the following structure:

f32 x;
f32 y;
f32 z;

It is possible to maximize compression ratio by first reordering the data into structure of arrays:

x0, x1, y0, y1, z0, z1

And then reordering the bytes to group the each corresponding byte together:

(x0[0], x1[0]), (x0[1], x1[1]), (x0[2], x1[2]), (x0[3], x1[3]), (y0[0], y1[0]), (y0[1], y1[1]) ...

This increases the likelihood of repeated data, as many floats will have the same matching first and/or second byte.
And lastly, doing a bit of delta encoding (subtract current byte from previous byte):

i.e.

  • x0 = 1327.784 = 44 A5 F9 18
  • x1 = 1320.432 = 44 A5 0D D4
(44 44) (A5 A5) (F9 0D) (18 D4)

After delta encoding:

(00 00) (00 00) ..

This increases number of repeated 0 byte sequences and distribution of 0 bytes for huffman coding.


I would like to be able to isolate the float data inside a single block, because the byte distribution
between this, and for example manually bit packed data will differ.

Likewise, I would also want to apply a custom dictionary for each manually created block.
Since that will have context tuned for the very specific data I am working with.

The dictionary would also have greater relevance, where as bit packed data in previous sliding window might not.

Custom Block Header

Because the MTU is ~1450 and max message length is not anywhere near the max block length,
it's possible to shave bytes from the block headers:

typedef struct {
  int Last_Block : 1;
  int Block_Type : 2;
  int Block_Size : 13;
} Block_Header_New;

Here I trimmed the block header by 1 byte, which gives a max block size of 8192 bytes.
Greater than MTU and still easily larger than any packet will ever be.

Splitting Block Headers

Tip

Even better, it's possible to rearrange the data such that additional bytes can be saved.

Suppose I have a fixed (2) number of blocks. I would to split the block header into 2 locations.

  • Unused/Trimmed Space in Network Library Header
  • Start of Payload

I can store 2 bits in the Network Library Header to indicate if the block is compressed or not.
At the start of the payload I can then store 2 of the following:

typedef struct {
  int Block_Size : 12;
} Block_Header_Len;

This saves yet another byte.

Note

For efficient packet handling, the compressed payload must be byte aligned.

The concern here is moreso that a lack of a raw/block compression API forces us to use the standard zstd block format.

All in All

I'm not sure how, without a block compression API it would be possible to get the maximum out of ZStandard;
especially when the difference in a use case like this between a few bytes here (MTU) can be the barrier between
a bad and a very good experience.

Note

These are estimate savings.

Given the earlier example with the 2016 byte payload, the ability to compress blocks provides the
following potential and likely savings:

  • ~32 bytes from not exceeding MTU
    • If other savings are sufficient to avoid fragmentation.
  • ~100-150 bytes from controlling the blocks manually
    • This is a pure estimate; I haven't got enough real data to test with; I believe it would be in this range however.
    • This is from splitting into blocks manually and using a custom dictionary.
  • 2 bytes (6 - 4) from storing 2 block headers in a custom way.
  • 2 bytes from not including a zstd frame header.

Somewhere around 150 bytes per message. Or 7.5Mbps of upload speed, if translated onto earlier example.

Describe the solution you'd like

N/A

zstd manual requested to open an issue if the required functionality is not present.

In this case the 'requirements' are:

  • Custom byte ranges for starting/ending blocks
  • Dictionary per block
  • Omitting frame header
  • Custom block header/metadata encoding.

The existing APIs make these optimisations possible.
On the other hand, I'm not sure if the alternative suggestion of using the 'normal' API covers this.

Main concern lies in fragmenting vs not fragmenting.
Doubling possibility of transmission failure is a very unwanted effect.

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions