Strategy for robust dynamic memory, readback, and async #366

@raphlinus

One of the stickier points is how to handle robust dynamic memory. The fundamental problem is that the pipeline creates intermediate data structures (grids of tiles containing coarse winding numbers and path segments, per-tile command lists) whose size depends dynamically on the scene being rendered. For example, just changing a transform can significantly affect the number of tiles covered by a path, and thus the size of these data structures.

The standard GPU compute shader execution model cannot express what we need. At the time a command buffer is submitted, all buffers have a predetermined size. There is no way to dynamically allocate memory based on computation done inside a compute shader (note that CUDA doesn't have this limitation: kernels can simply call malloc or invoke C++ new).
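
To make that constraint concrete, here is a minimal wgpu-flavored sketch (not Vello's actual allocation code; `max_tiles` and the per-tile size are made up, and exact wgpu signatures vary by version): the size passed to `create_buffer` is fixed before any shader has run.

```rust
// Minimal illustration of the constraint, not Vello's actual allocation code.
// Everything a compute pass will write into must be sized on the CPU before
// submission; `max_tiles` is a guess, and nothing computed inside the shader
// can grow this allocation.
fn create_tile_buffer(device: &wgpu::Device, max_tiles: u64) -> wgpu::Buffer {
    const TILE_SIZE_BYTES: u64 = 16; // hypothetical per-tile footprint
    device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("tile intermediates"),
        size: max_tiles * TILE_SIZE_BYTES,
        usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_SRC,
        mapped_at_creation: false,
    })
}
```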

Another potential way to address the fundamental problem is to divide work into chunks so that intermediate results fit in fixed-size buffers. This would be especially appealing in resource-constrained environments, where calling malloc may not be guaranteed to succeed or may cause undesirable resource contention. However, it requires the ability to launch work dependent on computation done in a shader. Again, CUDA can do this (for example, with device graph launch), but it is not a common capability of compute shaders, much less WebGPU.

The previous incarnation, piet-gpu, had a solution (see #175), but with drawbacks. Basically, rendering a scene required a round trip: waiting on a fence, then reading back a buffer from the GPU with a success/failure indication, with reallocation and retry on failure. However, this requires the ability to do blocking readback (which is missing in WebGPU), and it also blocks the calling thread (usually the UI runloop) until the result of the computation is available (which is the reason it's missing in WebGPU).
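
For illustration, here is a rough sketch of what that retry loop looks like on native wgpu. This is not the #175 code: `record` and `grow` are hypothetical hooks standing in for the real pipeline, and the exact `map_async`/`poll` signatures vary across wgpu versions.

```rust
/// Sketch of the piet-gpu-style retry loop on native. `record` records the
/// whole pipeline and returns the command buffer plus a small MAP_READ buffer
/// holding a failure flag; `grow` reallocates the intermediate buffers.
fn render_with_retry(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    mut record: impl FnMut(&wgpu::Device) -> (wgpu::CommandBuffer, wgpu::Buffer),
    mut grow: impl FnMut(&wgpu::Device),
) {
    loop {
        let (commands, failure_buf) = record(device);
        queue.submit(Some(commands));

        // Blocking readback of the failure flag; this is the part WebGPU lacks.
        let slice = failure_buf.slice(..);
        slice.map_async(wgpu::MapMode::Read, |r| r.unwrap());
        device.poll(wgpu::Maintain::Wait); // blocks the calling (UI) thread
        let failed = slice.get_mapped_range()[0] != 0;
        failure_buf.unmap();

        if !failed {
            break; // success: the frame rendered with the current buffer sizes
        }
        grow(device); // reallocate larger buffers, then retry the whole frame
    }
}
```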

There's no good solution to this problem, only a set of tradeoffs. We need to decide what to implement. The best choice will depend on the details of what's being rendered and which application it's integrated with. In no particular order, here are some of the choices:

  1. When the maximum complexity of the scenes being rendered is known in advance, then the buffer sizes can simply be determined in advance. On failure, the scene would fail to render. This may well be the best choice for games and applications in which the UI is not rendering user-provided content. It allows the entire rendering pipeline to be launched as "fire and forget" with no blocking.

  2. We could do analysis CPU-side to determine the memory usage, before launching the render. This is simple and poses no integration challenges, but such analysis is slow. In fact, it's probably comparable to running the compute pipeline on CPU and just using the GPU for fine rasterization, a modality we're considering as a compatibility fallback. It may be a viable choice when the scene complexity is low.

  3. We can implement blocking in a similar fashion to piet-gpu (this is closest to the current direction of the code). That would be native-only, so it would require another approach for Web deployment. It also potentially creates integration issues, as calls to the Vello renderer would have to support blocking, and somebody would have to poll the wgpu device (the built-in polling thread proposal, gfx-rs/wgpu#1871, is potentially relevant in that case). A further downside is that the return to the UI runloop would be delayed, likely impacting other tasks, including responsiveness to accessibility requests.

  4. We can have an async task that fully owns the GPU and associated resources. It would operate as a loop that receives a scene through an async channel, submits as many command buffers as needed with await points for the readback, then returns to the top of the loop after the last such submission. The UI runloop would create a scene, send it over the channel, and immediately return to the runloop. On native, the task would run in the thread pool of an async executor (such as tokio), and on the Web it would be invoked via spawn_local. This is appealing in use cases where a Vello-driven UI would be the sole user of the GPU, but it poses serious challenges when the GPU is to be shared. (A rough sketch of this approach follows the list.)

  5. We can have a similar async task, but share access to the GPU by wrapping at least the wgpu Device object (and likely other types) in an Arc<Mutex<_>>. This makes it possible, at least in theory, to integrate with other GPU clients, but it complicates that integration, as those clients have to cooperate with the locking discipline. It's been suggested that wgpu implement Clone on Device and related types to directly support such sharing, and it's worth noting that this is not a problem in JavaScript, where all such references are implicitly shareable.

  6. We can consider other possibilities where async is not represented by await points in an async Rust function, but rather by a state machine of some kind. The host would be responsible for advancing the state machine on completion of submitted command buffers (a minimal enum sketch also appears after the list). This is potentially the most flexible approach, but it is complex, and it also requires the host to support async.

  7. Similar to (1) but with mechanisms in place to recover from error and allocate for the next frame. To minimize visual disturbance, there could be a texture holding the previous frame. The fine shader could then blit from that texture when it detects failure. This is potentially the least invasive solution regarding integration, but deliberately makes jank possible.
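
To make choice 4 concrete, here is a rough sketch under stated assumptions: tokio is assumed for the channels, `Scene` is a placeholder type, `record`/`grow` again stand in for the real pipeline, and exact wgpu signatures vary by version. The point is that the UI runloop only ever does a non-blocking send and never awaits the GPU.

```rust
use tokio::sync::{mpsc, oneshot};

/// Placeholder for Vello's encoded scene.
struct Scene;

/// Sketch of choice 4: an async task that exclusively owns the GPU. The UI
/// runloop sends a Scene over the channel and returns immediately; all
/// readback-driven resubmission happens here. On native this would be spawned
/// on an executor such as tokio; on the Web, via wasm_bindgen_futures::spawn_local.
async fn render_task(
    device: wgpu::Device,
    queue: wgpu::Queue,
    mut scenes: mpsc::Receiver<Scene>,
    mut record: impl FnMut(&wgpu::Device, &Scene) -> (wgpu::CommandBuffer, wgpu::Buffer),
    mut grow: impl FnMut(&wgpu::Device),
) {
    while let Some(scene) = scenes.recv().await {
        loop {
            let (commands, failure_buf) = record(&device, &scene);
            queue.submit(Some(commands));
            if !read_failure_flag(&device, &failure_buf).await {
                break; // frame rendered; go back to waiting for the next scene
            }
            grow(&device); // reallocate larger buffers and resubmit this scene
        }
    }
}

/// Map the failure flag and await the result instead of blocking the caller.
/// On native something still has to poll the device; this sketch blocks the
/// worker thread with Maintain::Wait for simplicity (a dedicated polling
/// thread, cf. gfx-rs/wgpu#1871, would avoid that). On the Web the browser
/// drives the mapping.
async fn read_failure_flag(device: &wgpu::Device, buf: &wgpu::Buffer) -> bool {
    let (tx, rx) = oneshot::channel();
    let slice = buf.slice(..);
    slice.map_async(wgpu::MapMode::Read, move |r| {
        let _ = tx.send(r);
    });
    device.poll(wgpu::Maintain::Wait);
    rx.await.expect("map_async callback dropped").expect("buffer map failed");
    let failed = slice.get_mapped_range()[0] != 0;
    buf.unmap();
    failed
}
```

The UI side would just hold the `mpsc::Sender<Scene>` and hand off a scene each frame (e.g. with try_send) without ever touching the device.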

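And a minimal sketch of the state-machine flavor of choice 6, with invented state names: no async runtime is required, and the host drives `advance` from whatever completion notification it already has (callbacks, fences, per-frame polling).

```rust
/// Sketch of choice 6: the render state as an explicit machine the host drives.
enum RenderState {
    /// Nothing in flight; ready to accept a new scene.
    Idle,
    /// A command buffer has been submitted; waiting on the failure-flag readback.
    AwaitingReadback { attempt: u32 },
    /// The frame completed successfully.
    Done,
}

impl RenderState {
    /// Called by the host when the readback for the in-flight submission has
    /// completed; `failed` is the value of the failure flag that was read back.
    fn advance(self, failed: bool) -> RenderState {
        match self {
            RenderState::AwaitingReadback { attempt } if failed => {
                // The host should reallocate buffers and resubmit, then wait again.
                RenderState::AwaitingReadback { attempt: attempt + 1 }
            }
            RenderState::AwaitingReadback { .. } => RenderState::Done,
            other => other, // Idle / Done: nothing to advance
        }
    }
}
```
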
A few other comments. There are other applications that can potentially benefit from readback (one is to do hit testing and collision detection on GPU). However, in GL it has historically been associated with horrible performance (for underlying reasons similar to what's been outlined above). In game contexts, a reasonable tradeoff may be to defer the results of readback for one frame, but that's less desirable here, as it results in a frame not being rendered.

It is worth exploring whether there may be practical extensions to the GPU execution model that eliminate the need for CPU readback. As mentioned above, the ability to allocate memory or launch work from within a shader would help enormously.

I'm interested in hearing more about applications and use cases to decide what we should be building. Some of these choices are fairly complex; others create integration problems. No single choice seems like a clear win.

A bit more discussion is in the "Runloop, async, wgpu" Zulip thread, and there are links from there to other resources.
