Consider: Slipstream #322

@gavofyork

JAM Slipstream

Goal

Provide for low-latency block-building on a JAM network with minimal sacrifices to throughput and decentralisation.

Background

JAM has a 6 second block time. This results in finality latencies of a minimum of 18 seconds, and typically around 20 seconds, as guaranteeing, assurance (availability) and auditing must each happen prior to a vote on finality.

Normally, a strict time to finality isn't so important, since the value under consideration is small enough that sufficient confidence of irreversion is gained either once the data is reported on-chain or once assurance is complete and Accumulation happens.

Thus we have multiple relevant latencies, depending on precisely how much economic security is required:

$L_{rep} = P + B = 12$
$L_{accum} = P + 2B = 18$
$L_{final} = P + 2B + A + F = 27$

where

  • $A$ is the auditing tranche period, 8 seconds in regular JAM
  • $P$ is the work-package period, 6 seconds in regular JAM
  • $B$ is the block period, equal to $P$ at 6 seconds in regular JAM
  • $F$ is the finality overhead, around 1 second in regular JAM
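
As a quick sanity check, here is a minimal sketch (in Rust, using the symbols above as plain constants) which reproduces these figures:

```rust
// Regular JAM timing parameters, in seconds, as defined above.
const P: u32 = 6; // work-package period
const B: u32 = 6; // block period, equal to P
const A: u32 = 8; // auditing tranche period
const F: u32 = 1; // finality overhead (approximate)

fn main() {
    let l_rep = P + B; // latency until the report is on-chain
    let l_accum = P + 2 * B; // latency until Accumulation
    let l_final = P + 2 * B + A + F; // latency until finality
    println!("L_rep = {l_rep}s, L_accum = {l_accum}s, L_final = {l_final}s");
    // Prints: L_rep = 12s, L_accum = 18s, L_final = 27s
}
```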

(Side note: $L_{accum}$ presently assumes that packages must be agreed as generally available before Accumulation happens. This need not strictly be the case; we could optimistically Accumulate reports even before they are agreed available, as long as we do not finalise them until they are both available and audited. This reduces the time to apparent (non-final) Accumulation while increasing the risk of apparent (non-final) reversion. To counteract this increased risk, Guarantors would need to become heavily slashable in case their reports never became Available and a chain reversion happened. In this case we could use time $L_{rep}$.)

In principle JAM can reduce this latency across the board simply by reducing $P$ and $B$, say from 6 seconds to 1 second. Unfortunately this comes with a very simple associated problem brought about by the structure of ELVES:

All auditors are required to audit packages statelessly. This means that any data (save data read from the on-chain preimage store) cannot be expected to be immediately available to the auditor and must be included in the D3L by the guarantor and fetched from it by the auditor.

This creates a "data boundary" between data generated within Refinement and data generated between Refinements. Essentially, Refinement computations have trivial and free access to all compute done within their work-package period, but must rely on D3L access for computation done before that. By having a higher temporal quantization (and thus materially increasing latency) we can (possibly massively) improve the efficiency with which we use D3L bandwidth.

For applications with high temporal data locality, this is a huge advantage. We may hope to assume that the bandwidth to the D3L remains the same for cores at 2MB/s regardless of block/package period. If we reduce these periods from 6 seconds to 1 second, then rather than one stateless work-package with 12MB of D3L access every 6 seconds, we have six stateless work-packages each with 2MB of D3L access, one every second over that same 6 second period. But each is independently audited on a separate set of auditing nodes.

A similar degradation happens with Reports; instead of 48KB every 6 seconds, we have 8KB every single second, but each must be produced without the context of the surrounding 5 seconds.

If no processing output from an earlier one-second package is used in later packages, then the D3L/Report efficiency is largely equivalent. However, this is rarely the case: consider running CoreVM with a regular application. A large portion of D3L throughput would be used to fetch and store RAM pages.

In the case of the relatively meagre DOOM this amounts to 130 largely unchanging pages (with the rest of D3L being used for framebuffer output). This is approximately 520KB, or around half a megabyte. These pages would account for 25% of D3L usage in the 1 second scenario, yet only 4% in the current 6 second scenario. Similarly, passing the page hashes into Accumulation for ratification and update would previously take a little over 8KB every 6 seconds, around 17% of the Reporting bandwidth, leaving plenty of space for the frame-buffer stream's DA contents to also be committed. In the 1 second scenario, this changes to 8KB every 1 second, already saturating the Reporting bandwidth before we consider any other information which may need to be ratified or updated.
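
To make the arithmetic explicit, a rough sketch of the fractions cited above (the 4KB page size is an assumption, consistent with 130 pages ≈ 520KB; the other figures are taken from the text):

```rust
fn main() {
    let page_bytes = 130 * 4 * 1024; // ~130 pages at 4KB each ≈ 520KB
    let d3l_1s = 2 * 1024 * 1024;    // 2MB of D3L bandwidth per 1-second package
    let d3l_6s = 6 * d3l_1s;         // 12MB over a 6-second package
    println!("pages / 1s D3L = {:.0}%", 100.0 * page_bytes as f64 / d3l_1s as f64); // ≈ 25%
    println!("pages / 6s D3L = {:.0}%", 100.0 * page_bytes as f64 / d3l_6s as f64); // ≈ 4%

    let hashes_bytes = 8 * 1024; // ~8KB of page hashes into Accumulation
    let report_6s = 48 * 1024;   // 48KB of Reporting bandwidth per 6 seconds
    let report_1s = 8 * 1024;    // 8KB per 1-second Report
    println!("hashes / 6s Report = {:.0}%", 100.0 * hashes_bytes as f64 / report_6s as f64); // ≈ 17%
    println!("hashes / 1s Report = {:.0}%", 100.0 * hashes_bytes as f64 / report_1s as f64); // = 100%: saturated
}
```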

This should come as little surprise. By reducing the latency, we increase the frequency with which the system attains coherency. This increase in coherency comes at a cost to the amount of data which the system can process, which may be characterised as a reduction in either (data-processing) speed or overall (data) size.

It would therefore be helpful to find a means of both reducing apparent latency to coherence/Accumulation while keeping similar levels of throughput.

Summary

Slipstream JAM proposes to reduce the block period to 1 second, but keep the (context-free) auditing period at 6 seconds. Coretime pools and queues would continue to work on 6-second boundaries, with the first report made within a 6-second shared-context Chunk determining the whole period's authorizer.

Refinement happens in a Chunk of six 1-second execution periods, with each period yielding information for a single report. Imports, exports and Reports must be individually valid on a per-package basis as normal, since it cannot be certain that each will be guaranteed, assured and Accumulated.

However, later packages may re-use the imports of earlier packages. Furthermore, packages have two export lists and the export host call has two corresponding modes. The Permanent export list/mode functions exactly like the present export system in all circumstances. The Provisional export list/mode functions in the same way only on the condition that no further packages are provided in the Chunk. If and only if a subsequent package is provided in the Chunk, then the Provisional export list of the previous work-report is disregarded and its exports are not (and need not be!) provided to the D3L.
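
The following is a hypothetical Rust sketch of how the two export lists might be resolved when a Chunk is assembled; the type and function names are illustrative only, not part of any defined JAM interface:

```rust
/// Exports of a single work-package, split into the two modes described
/// above (names are hypothetical).
struct PackageExports {
    /// Always published to the D3L, exactly as in present JAM.
    permanent: Vec<Vec<u8>>,
    /// Published only if this turns out to be the final package of the Chunk.
    provisional: Vec<Vec<u8>>,
}

/// Resolve which export segments must actually reach the D3L for a Chunk:
/// every package's Permanent list, plus the Provisional list of the final
/// package only. Earlier packages' Provisional lists are disregarded.
fn d3l_exports(chunk: &[PackageExports]) -> Vec<&[u8]> {
    let mut out: Vec<&[u8]> = chunk
        .iter()
        .flat_map(|p| p.permanent.iter().map(Vec::as_slice))
        .collect();
    if let Some(last) = chunk.last() {
        out.extend(last.provisional.iter().map(Vec::as_slice));
    }
    out
}
```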

Additional information for fetching and exports would become available at each 1-second period.

Availability for each package happens individually, keeping the assurance pipeline minimal.

Auditing for the entire 6-second Chunk happens at once, with auditors fetching each package in turn and executing. Fetching and execution can be pipelined, further minimising both latency to Accumulation and to finality.

Chain reversion, in the case of a failed audit, would be to the block in which the first package of the failed Chunk was reported; the position of the individual failing package within the Chunk is disregarded. This leads to potentially rewinding the chain 5 seconds further back than strictly needed, but would allow for data ratification to be spread through the Chunk's six Reports rather than bunched up in order to be strictly before the data is used in Refinement, allowing a small "overdraft" of correctness checking. It would also be no worse than the present system in real terms.
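
A tiny hypothetical sketch of this reversion rule, where the target is the block just before the first package of the failed Chunk was reported:

```rust
/// Hypothetical sketch: given the block heights at which each package of a
/// Chunk was reported, a failed audit of any package reverts the chain to
/// just before the first report, discarding the entire Chunk.
fn revert_target(chunk_report_heights: &[u64]) -> Option<u64> {
    chunk_report_heights
        .first()
        .map(|first| first.saturating_sub(1))
}
```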

Problems

This approach does not entirely satisfactorily address the loss of Reporting bandwidth, nor of D3L bandwidth. CoreVM page updates would need to be staged across each of the yield points during the overall Refinement execution of the Chunk of packages. If none of the final pages and their hashes are known until the final package is executed, this presents a conceptual problem.

Two further issues are apparent:

  • D3L access characteristics, especially exporting towards the end of the 6 seconds.
  • Audit batching results in a potentially higher latency to finality in the case of cores which are anti-phase.

One qualification and two solutions are apparent. Firstly, in CoreVM, there are essentially three types of data for which the portion of Reporting bandwidth into Accumulation is used:
1a. ratification of prior state (RAM page hashes which Refine is reading);
2a. commitment to posterior state (RAM page hashes which Refine has written);
3a. commitment to output data (video buffers/audio buffers/&c. which is being streamed by Refine).

There are also three corresponding types of I/O which come from the D3L:
1b. importing of prior state (RAM pages which Refine is reading);
2b. exporting of posterior state (RAM pages which Refine has written);
3b. yielding of output data (video buffers/audio buffers/&c. which is being streamed by Refine).

Of these, only types 1 and 2 benefit from the context-sharing of larger work-packages, since in the case of commitment to (and yielding of) data output, each piece of output data has no structural relevance to those prior. Resources for streaming output can be evenly distributed over time without any expected loss in efficiency.

This framework seems largely inescapable and is also part of the design for an advanced ZK-based payments service.

It is also important to remember that prior state ratification (1a) necessarily happens prior to posterior state commitment (2a) and can be done in any of the Chunk's 6 reports. Prior state importing (1b) must happen before usage, and if usage of imports is dense and homogeneous, these imports may all happen with a bias towards the first package in the Chunk.

This leaves (2a) and (2b), posterior state commitment and state exporting, as the remaining problem. It can be phrased as the problem that the final second of DOOM execution may update all 130 pages, preventing any previous Report from making this commitment and more-than-saturating both the final Report and the D3L export bandwidth. In short, state mutation may not have high correlation between RAM locality and time.

Phase Affinity

The means of managing the problems brought about by both (1b) and (2) is to stagger Reporting itself across cores, optionally providing the bulk of Reporting and export bandwidth at the end of the Chunk and the bulk of import bandwidth at the beginning.

The idea here is to give each core a Phase Affinity, meaning that while the whole system has a 1 second block time, the package Chunks may be found at a different 1 second phase depending on the core.

Reporting with Phase

Materially, this means that some cores may report more information to the chain and have greater D3L reading or writing than others at certain blocks. In the case of Reporting, since preimage provision is not time-critical, we have an ability to be flexible over precisely how much data is reported by any particular core provided that the average is in line with requirements.

Guarantors (and, by extension, package builders) would be given a baseline of 8KB of Reported data per package of the Chunk and a minimum of 2KB, with the ability to defer up to 6KB of any package's Reporting bandwidth to later packages within the Chunk. The sixth and final package could Report as much as 38KB.
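
A minimal sketch of this deferral rule (the constants come from the text; the schedule-checking function is an assumption about how a builder might validate a Chunk):

```rust
const BASELINE: u32 = 8 * 1024; // 8KB baseline Report per package
const MINIMUM: u32 = 2 * 1024;  // 2KB minimum per package (at most 6KB deferred)

/// Check a proposed per-package Report-size schedule for one 6-package Chunk.
/// Deferral only flows forwards, so every prefix of the Chunk must fit within
/// the baseline budget accrued so far, and each package must still carry its
/// minimum.
fn valid_report_schedule(sizes: &[u32; 6]) -> bool {
    let mut budget = 0u32;
    for &size in sizes {
        if size < MINIMUM {
            return false; // deferred more than 6KB from this package
        }
        budget += BASELINE;
        if size > budget {
            return false; // spending Report bandwidth not yet accrued
        }
        budget -= size;
    }
    true
}
```

For instance, the schedule [2, 2, 2, 2, 2, 38] (in KB) passes, reproducing the 38KB maximum cited above.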

(Note: This could also be done without Phase Affinity, but could result in grossly oversize blocks once every 6 seconds, with Reports taking up to 10MB, and the need for that data to be not only distributed, but also acted upon, within 1 second. Whether this is viable can only properly be attested empirically with the JAM Toaster.)

D3L Access with Phase

Importing and exporting would have a similar front- and back-weighting system: Importing would have a baseline of 2MB per package of the Chunk, with the ability to front-load as much of this as desired (but no deferring). Exporting is the opposite: again a baseline of 2MB per package of the Chunk, but with the ability to defer up to 1MB from earlier packages to later ones.
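
Under one possible reading of these rules, a sketch of the corresponding budget checks (constants from the text; the function shapes are assumptions):

```rust
const IMPORT_BASE: u64 = 2 << 20;  // 2MB import baseline per package
const EXPORT_BASE: u64 = 2 << 20;  // 2MB export baseline per package
const EXPORT_DEFER: u64 = 1 << 20; // at most 1MB of exports deferred per package

/// Imports may be front-loaded but never deferred: budget can only flow
/// backwards, so every *suffix* of the Chunk must fit within its own
/// remaining baseline budget.
fn valid_imports(sizes: &[u64; 6]) -> bool {
    (0..6).all(|i| sizes[i..].iter().sum::<u64>() <= (6 - i) as u64 * IMPORT_BASE)
}

/// Exports may be deferred by at most 1MB per package: every *prefix* must
/// fit within its accrued budget, and each package keeps its non-deferrable
/// share. E.g. [1, 1, 1, 1, 1, 7] (in MB) passes.
fn valid_exports(sizes: &[u64; 6]) -> bool {
    let mut budget = 0u64;
    for &size in sizes {
        if size + EXPORT_DEFER < EXPORT_BASE {
            return false; // deferred more than 1MB from this package
        }
        budget += EXPORT_BASE;
        if size > budget {
            return false; // exporting bandwidth not yet accrued
        }
        budget -= size;
    }
    true
}
```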

Extrinsic Access with Phase

Accumulation Gas Limits with Phase

Accumulation computation is likely to need to happen on the same throughput basis as Reporting. If more data is Reported, we would expect a greater amount of gas to be provided in order to Accumulate that data and fold it into service state.

Even though block gas limits are strict, this should not present a significant problem, since JAM already has an effective queue mechanism and service Accumulation is already expected to be fully asynchronous.
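
As a sketch of the implied relationship (the proportionality is the proposal's; the constants here are placeholders, not proposed values):

```rust
/// Hypothetical scaling: Accumulation gas grows with the quantity of data
/// Reported, so deferring Reporting bandwidth to a later package defers the
/// corresponding gas with it.
const BASE_GAS: u64 = 10_000_000; // placeholder per-package floor
const GAS_PER_BYTE: u64 = 1_000;  // placeholder marginal gas per Reported byte

fn accumulation_gas_limit(reported_bytes: u64) -> u64 {
    BASE_GAS + GAS_PER_BYTE * reported_bytes
}
```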

Guarantor Groups with Phase

If Phase Affinity is utilised, then one further question needs considering: how guarantor subgroups move between phases as they move between cores. In the worst case, each minute would see a system-wide stall of up to 3 seconds. One possibility here is to manage phase affinity in such a way as to minimise the phase mismatch which guarantor groups may be forced to transition through.

This could be done by ensuring that guarantor rotations happen largely in-phase, a fairly easy prospect if we assign phase on the basis of core index magnitude. In this case only a full shuffle would cause a partial stall of the guarantor pipeline, happening only once every hour. Cores would experience, on average, a 3 second stall once per hour and a 1 second stall when rotating onto a new phase, which would happen once every two hours, giving a total of 3.5 seconds of expected stall per hour, a rate which would be wholly indiscernible to workflows at the present 20-second latency, and arguably an immaterial rate even with a 1-2 second latency.

Under Phase Affinity, it would be reasonable to allow greater freedom over how packages making up a Chunk distribute their Reporting bandwidth, since we could have confidence that systematic gluts would be unlikely to manifest and could be controlled against in their extremes.

Effect of change

With this approach we could expect a similar overall throughput profile to the present 6-second JAM. CoreVM in particular should be able to achieve near-equivalent levels of throughput performance.

Overall system latency, measured as elapsed time from initial package submission to Accumulation of the package's implications into service state and the appearance of Refinement exports in the D3L, would be dramatically reduced.

Returning to our previous formulae we can now present:

$L'_{rep} = P' + B' = 2$
$L'_{accum} = P' + 2B' = 3$
$L'_{final} = P' + 2B' + A + F = 12$

where

  • $P'$ is the work-package period, 1 second in Slipstream JAM
  • $B'$ is the block period, equal to $P'$ at 1 second in Slipstream JAM
  • $F$ is the finality overhead, still around 1 second
  • $A$ is the auditing overhead, still 8 seconds

With a round-trip latency as low as 2 seconds, certain use-cases become potentially viable. In particular, near-real-time multimedia use-cases open up. Decentralised, directed media applications with real-time democratic feedback and control mechanisms may become viable. Gaming becomes an interesting possibility.

Further reductions may also be possible. Empirical testing is yet to be conducted (but is right around the corner). Rough modelling suggests that nodes need enjoy only 500Mb/s of connectivity in order to deliver the present JAM performance levels. If this is borne out in examination of a real system, then we might reasonably reduce the block time further, increasing minimum requirements closer to 1Gb/s of internet connectivity in order to account for the increased proportion of fixed overhead that greater block-frequency would imply.

If we attained a 500ms block period (with, e.g., 12-block Chunking), we could enjoy round-trip latencies as low as 1 second.
