[RFC] New basis of operation: Reliable Datagram Pipe

Currently we configure USB as 4 or 2 completely unrelated streams: either two IN/two OUT, or one IN/one OUT (with bigger buffers), and expect applets to fit into this scheme. We also have an I2C sideband where applets can add arbitrary registers.

This scheme is problematic for many reasons:
- You only get full performance with one IN/one OUT pipe, and the difference in performance is *drastic*.
- Also you can't currently get this on Windows. (#231 item 2)
- There are serious race condition issues with the I2C registers that applets have to add workarounds for and which are a constant source of bugs. #147 points this out for SPI but it's true everywhere.
- I2C registers are not reset with the applet.
- With async applets, registers have *even more* race condition issues. Applet registers should be in the applet clock domain! But they are all in the I2C / main clock domain.
- You can never have more than 4 applets using USB pipes. Usually you cannot have more than 2, and sometimes not even more than 1 (a few applets use e.g. two IN pipes). This is a blocker for #180, which at the moment is just infeasible.
- Having this weird scheme makes virtualizing Glasgow (e.g. to run the CLI on your PC and have the hardware be connected to your RPi elsewhere) very difficult, bordering on impossible.
- It also drags USB semantics everywhere, making it hard to do a future migration to Ethernet.
- Timeouts are... hard, because you don't have a good way to tell the applet to stop busy looping (like in https://github.com/GlasgowEmbedded/glasgow/issues/187) without just resetting it. You can use I2C registers for that but the protocol is awkward, and it introduces race conditions with async applets (doubly so if the clock is sourced externally).

Considering just the case of many applets and many pipes in isolation, a few solutions were proposed:
- using constant overhead byte stuffing to add metadata
- making the 1st byte of the USB packet (which inherently delimits the stream) the applet number

Both of these are bad in that the coding overhead of this in Python will be *massive*. Whatever scheme we use, we need it to be faster, not slower, and this means handling >40 MB/s with USB scheduling in pure Python. So what do we do?

I propose a scheme called RDP: **Reliable Datagram Pipe**.
- It is essentially bootleg PCI Express.
- Communication with the device is done through a dual-simplex channel.
- Each individual datagram on the pipe (IN or OUT; both use the same format) consists of *endpoint* (uleb128), *credits* (uleb128), *length* (uleb128), and *payload* (bytes).
  - The endpoint is an arbitrary number used for routing datagrams in the FPGA and in the software. Its assignment is out of scope for this proposal.
  - The credits field is the number of additional datagrams that can be sent in the opposite direction.
  - The length is a number describing the length of the payload.
  - The payload is an arbitrary sequence of bytes.
- The unit of transmission is a datagram, i.e. one 4-byte datagram is not the same as two 2-byte datagrams sent to the same endpoint. This is not a stream oriented scheme.
  - Many endpoints on the FPGA will contain no internal buffer, and will generate or parse the datagram payload on the fly.
  - If you want to make a stream oriented scheme out of it anyway, you can ask the demultiplexer to give you a stream (in software) or ignore the end-of-packet signal of the stream (in gateware).
- Datagrams are guaranteed to be delivered, but the latency (in either direction) is not bounded. In practice, what this means is that if you know that datagram number N has been delivered (e.g. through feedback), you also know that datagrams number 1..N-1 have been delivered. If you have no feedback you don't know when or if they're delivered.
  - Communicating with a non-existent endpoint is undefined behavior.
- Datagrams *to the same endpoint* are guaranteed to be delivered in-order.
  - The order of delivery to different endpoints is unspecified.
- Each endpoint has a maximum length for IN and OUT directions, which is communicated out of band.
  - Exceeding the maximum length results in undefined behavior.
- When a datagram is received in one (e.g. OUT) direction, it automatically consumes a credit for that direction. When the credits (the maximum/initial amount of which is communicated out of band) are exhausted the datagram source must pause. When the datagrams in that direction are processed, the next datagram that is sent in the opposite direction (e.g. IN) has a non-zero amount of credits attached to it. If there is no datagram to be sent in the opposite direction, a credit-only datagram (zero length, no payload, non-zero credit amount) can be sent. 
  - Credit-only datagrams carry no data and aren't forwarded to the application layer. To send an actual empty datagram, send it with no credits attached.
  - If a datagram is received but there are no resources to process it (for any reason, including "the sender did not respect the credit amount") the datagram is defined to be dropped.
  - If more credits are released than the maximum/initial amount, the behavior is undefined.
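
As a rough illustration of the wire format described above, here is a minimal pure-Python sketch of encoding and decoding a single datagram. The function names (`encode_uleb128`, `decode_uleb128`, `encode_datagram`, `decode_datagram`) are made up for this example and are not existing Glasgow APIs; error handling for malformed input is deliberately minimal.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("uleb128 cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one uleb128 value; return (value, offset of the next byte)."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]  # IndexError here means the input was truncated
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, offset


def encode_datagram(endpoint: int, credits: int, payload: bytes = b"") -> bytes:
    """Serialize endpoint, credits, length (all uleb128), then the payload."""
    return (encode_uleb128(endpoint) + encode_uleb128(credits)
            + encode_uleb128(len(payload)) + payload)


def decode_datagram(data: bytes, offset: int = 0):
    """Parse one datagram; return (endpoint, credits, payload, next_offset)."""
    endpoint, offset = decode_uleb128(data, offset)
    credits, offset = decode_uleb128(data, offset)
    length, offset = decode_uleb128(data, offset)
    payload = bytes(data[offset:offset + length])
    if len(payload) != length:
        raise ValueError("truncated datagram")
    return endpoint, credits, payload, offset + length


# Two datagrams back to back: a 2-byte payload for endpoint 1, then a
# credit-only datagram (zero length, 2 credits) for endpoint 300.
wire = encode_datagram(1, 0, b"\xde\xad") + encode_datagram(300, 2)
first = decode_datagram(wire)             # (1, 0, b"\xde\xad", 5)
second = decode_datagram(wire, first[3])  # (300, 2, b"", 9)
```

Under this format the only difference between a credit-only datagram and a real empty datagram is whether the credits field is non-zero, so the receiver has to check that before forwarding anything to the application layer.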

The scheme has the following advantages:
- Coding overhead is of the kind that Python can handle very well. We already have many protocols that use type-length-payload encoding (e.g. JTAG) and we achieve very high bit rates with them because Python can concatenate strings *really* quickly and it can match and slice only slightly slower. Unlike COBS or first-byte-of-every-512-byte-chunk-is-the-address, this scheme can be easily implemented in Python in such a way that if you have large payloads on average, your transfer rate is limited by USB.
- It isn't difficult to parse on the FPGA either.
  - In fact it seems so straightforward (especially for small endpoint values) that I think the parsers for it should be often nested. I.e. an endpoint addresses an applet, and the first bytes of the payload are uleb128 of a sub-endpoint within an applet.
- It has a straightforward mapping to the Internet protocol stack.
  - Probably the most reasonable one is "one TCP socket for everything", considering head-of-line blocking is not an issue.
- Credit-based flow control means no head-of-line blocking and (with TCP encapsulation) no buffer bloat. (At least, outside of the Glasgow software.)
  - Dropping datagrams that exceed the available credits means it's easy to do quasi-realtime tasks such as "streaming video and hoping the other side has enough bandwidth for it" without losing framing sync even if you do end up dropping datagrams.
- It has an unlimited number of endpoints, meaning we can support any number of applets that fit into the gateware.
- Mapping an endpoint to an applet means that the communication with the gateware of that applet can be abstracted as an ordered sequence of datagrams instead of the current stream.
- Using datagrams means that we can e.g. map each register to an endpoint and have it generate or overwrite its value with the payload bytes starting at the first one. This requires no additional buffering (only a state machine) and has no additional overhead.
- Actually, most registers (the main exception is applet reset registers) will be mapped to sub-endpoints within the applet.
  - You could potentially do everything with top-level endpoints, but that seems like it would really bloat the interconnect.
- Asynchronous applets are trivial: they are attached only through their IN and OUT pipes and the async reset register. Any applet registers use sub-endpoints and exist in the applet clock domain.
- There is no real need to make applets fit a contrived "SERDES+sideband" scheme anymore. Any sideband signals/registers are allocated to endpoints/sub-endpoints since they are so cheap.
- To make a timeout, add another (sub-)endpoint that pokes the state machine. The length of a datagram has to be known in advance, ergo any transmission proceeds after buffering at the source (or is otherwise guaranteed to make forward progress, meaning that the interconnect isn't stuck waiting for the transmission to complete while it's waiting for an event that never happens).
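
To make the credit-based flow control a bit more concrete, here is a toy model of one direction of one endpoint. It only tracks counts and does no I/O. The class and method names (`CreditedSender`, `CreditedReceiver`, `take_credits`, etc.) are hypothetical, and raising an exception on over-release is a choice made for this sketch, since the proposal leaves that case undefined.

```python
from collections import deque


class CreditedSender:
    """Source side: may only send a datagram while it holds a credit."""

    def __init__(self, initial_credits: int):
        self.max_credits = initial_credits  # communicated out of band
        self.credits = initial_credits
        self.pending = deque()

    def queue(self, payload: bytes) -> None:
        self.pending.append(payload)

    def release(self, credits: int) -> None:
        # Over-release is undefined in the proposal; this sketch rejects it.
        if self.credits + credits > self.max_credits:
            raise ValueError("more credits released than initially granted")
        self.credits += credits

    def sendable(self) -> list[bytes]:
        """Pop as many queued payloads as the current credits allow."""
        out = []
        while self.pending and self.credits > 0:
            out.append(self.pending.popleft())
            self.credits -= 1  # each datagram consumes one credit
        return out


class CreditedReceiver:
    """Sink side: drops datagrams it has no room for, returns credits later."""

    def __init__(self, slots: int):
        self.slots = slots
        self.buffer = deque()
        self.dropped = 0
        self.unreturned = 0

    def receive(self, payload: bytes) -> None:
        if len(self.buffer) >= self.slots:
            self.dropped += 1  # no resources: the datagram is dropped
            return
        self.buffer.append(payload)

    def process_one(self) -> bytes:
        payload = self.buffer.popleft()
        self.unreturned += 1  # one slot freed, one credit owed to the source
        return payload

    def take_credits(self) -> int:
        """Credits to attach to the next datagram in the opposite direction."""
        credits, self.unreturned = self.unreturned, 0
        return credits


sender = CreditedSender(initial_credits=2)
receiver = CreditedReceiver(slots=2)
for i in range(3):
    sender.queue(bytes([i]))
for payload in sender.sendable():  # only two go out, the third waits
    receiver.receive(payload)
stalled = sender.sendable()        # [] because the credits are exhausted
receiver.process_one()
sender.release(receiver.take_credits())
resumed = sender.sendable()        # [b"\x02"]
```

A source that ignores its credit count simply ends up in the `dropped` counter on the other side, which is the behavior the quasi-realtime use case relies on.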

Using uleb128 as the encoding of the endpoint, credit, length fields has the following advantages:
- Very compact
- We won't ever run out of them
- Parsing overhead on the FPGA is trivial
- Parsing overhead on the Python side is less trivial, but still not too bad even in pure Python
  - This is simple and self-contained enough it can probably be easily accelerated with e.g. Cython
- Only pay for what you use
  - If you have just 2 endpoints you do not even generate an RTL decision tree bigger than 3 cases for the first byte
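
To put rough numbers on "very compact", the sketch below computes how many header bytes a datagram needs for given field values. `uleb128_size` and `header_size` are hypothetical helpers written for this example; the only fact they rely on is that uleb128 stores 7 bits of value per byte.

```python
def uleb128_size(value: int) -> int:
    """Number of bytes uleb128 needs for a non-negative integer."""
    if value < 0:
        raise ValueError("uleb128 cannot encode negative numbers")
    # 7 value bits per byte, and zero still takes one byte.
    return max(1, (value.bit_length() + 6) // 7)


def header_size(endpoint: int, credits: int, length: int) -> int:
    """Total bytes used by the endpoint, credits, and length fields."""
    return uleb128_size(endpoint) + uleb128_size(credits) + uleb128_size(length)


small = header_size(endpoint=1, credits=0, length=64)   # 3 bytes
bulk = header_size(endpoint=1, credits=0, length=512)   # 4 bytes
```

With fewer than 128 endpoints and payloads under 16 KiB, the header never exceeds 4 bytes unless 128 or more credits are released in one datagram.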

Downsides:
- The uleb128 encoding may be somewhat costly to parse on the Python side
- Unbounded latency (but we can't really make it bounded with Python and off-the-shelf OSes anyway)
  - But here it is *really* unbounded. Cut-through routing means an applet can transmit several GB of data and for that entire time it will prevent anything else from being transmitted.
    - Can be solved with fragmentation below applet level, but it gets complicated quick.

Open questions:
- Is it worth allowing multiple levels of cut-through routing within the FPGA? I.e. instead of a single endpoint number, have a sequence of them at the beginning of the packet. Each router parses the number and then immediately starts forwarding the rest to the next one. The same happens in reverse for transmission.
  - Could be a good way to handle sub-endpoints without introducing complex credit accounting.
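
One way to picture the multi-level idea is as a software model in which each router peels one uleb128 number off the front and hands the rest down. This is only a sketch of the addressing part: it ignores the credits and length fields entirely, models routers as nested dicts, and uses invented sink names (`applet-reset`, `uart.data`, `uart.divisor`) purely for illustration.

```python
def decode_uleb128(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one uleb128 value; return (value, offset of the next byte)."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, offset


def route(tree: dict, packet: bytes):
    """Walk nested routers, consuming one endpoint number per level.

    Returns (sink, path, remaining bytes). A dict is a router; anything
    else is a sink. A missing key raises KeyError, standing in for the
    "non-existent endpoint" case that the proposal leaves undefined.
    """
    node = tree
    offset = 0
    path = []
    while isinstance(node, dict):
        hop, offset = decode_uleb128(packet, offset)
        path.append(hop)
        node = node[hop]
    return node, tuple(path), packet[offset:]


# Hypothetical topology: endpoint 0 is a top-level reset register,
# endpoint 1 is an applet with two sub-endpoints of its own.
topology = {
    0: "applet-reset",
    1: {0: "uart.data", 1: "uart.divisor"},
}
sink, path, rest = route(topology, bytes([1, 1, 0x34, 0x12]))
# sink == "uart.divisor", path == (1, 1), rest == b"\x34\x12"
```

In this model a router never needs to know how deep the tree below it is, which is what would let sub-endpoints exist without the top-level interconnect growing.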
