Commit 0cb0b98

Start next version of The Command Queue

1 parent f432061

4 files changed: +143 -19 lines changed

next/getting-started/adapter-and-device/the-adapter.md

Lines changed: 1 addition & 1 deletion
@@ -223,7 +223,7 @@ This works, as long as the **callback mode** we set in the callback info is at l
As of version `v24.0.0.2`, `wgpu-native` does not implement `wgpuInstanceProcessEvents`. In this very case, we may skip it because the adapter request ends right within the call to `wgpuInstanceRequestAdapter`.
```

- This is an OK solution, although we still need to manage ourselves the `requestEnded` test and the `sleep()` operation. This solution is **all right for the adapter/device request** and I do not want to make this chapter any longer, so we will wait for chapter [The Command Queue](../the-command-queue.md) to see **another way**, which gives finer control over the pending asynchronous operations.
+ This is an OK solution, although we still need to manage the `requestEnded` test and the `sleep()` operation ourselves. This solution is **all right for the adapter/device request** and I do not want to make this chapter any longer, so we will wait for chapter [Playing with buffers](../playing-with-buffers.md) to see **another way**, which gives finer control over the pending asynchronous operations.
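
For reference, the busy-wait mentioned here looks roughly like the following sketch, where `requestEnded` is a flag raised by the request callback (illustrative names; requires `#include <thread>` and `#include <chrono>`):

```C++
// Minimal sketch of the busy-wait around the asynchronous request.
while (!requestEnded) {
    // Sleep a bit so that we do not monopolize the CPU while waiting.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
```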

##### With emscripten

next/getting-started/index.md

Lines changed: 2 additions & 0 deletions
@@ -11,4 +11,6 @@ project-setup
hello-webgpu
adapter-and-device/index
the-command-queue
playing-with-buffers
our-first-shader
```

next/getting-started/playing-with-buffers.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
Playing with buffers
====================

**WIP** *In this version of the guide, this chapter moves back into the "getting started" section, between the command queue and the first (compute) shader.*

In this chapter:

- We see **how to create and manipulate buffers**.
- We refine our control of **asynchronous operations**.

Buffers
-------
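
As a first taste, creating a buffer may look as follows. This is a minimal sketch using `wgpuDeviceCreateBuffer`; the size and usage flags below are arbitrary illustrative choices:

```C++
// Describe and create a small buffer usable as a copy source and destination.
WGPUBufferDescriptor bufferDesc = {};
bufferDesc.size = 16; // 16 bytes
bufferDesc.usage = WGPUBufferUsage_CopyDst | WGPUBufferUsage_CopySrc;
bufferDesc.mappedAtCreation = false;
WGPUBuffer buffer = wgpuDeviceCreateBuffer(device, &bufferDesc);

// [...] use the buffer

// Release the buffer once we no longer use it:
wgpuBufferRelease(buffer);
```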

Asynchronous operations
-----------------------

##### The good way

**To keep track of ongoing asynchronous operations**, each function that starts such an operation **returns a `WGPUFuture`**, which is some sort of internal ID that **identifies the operation**:

```C++
WGPUFuture adapterRequest = wgpuInstanceRequestAdapter(instance, &options, callbackInfo);
```

```{note}
Although it is technically just an integer value, the `WGPUFuture` should be treated as an **opaque handle**, i.e., one should not try to deduce anything from the very value of this ID.
```

This *future* can then be passed to `wgpuInstanceWaitAny` to mean "wait until this asynchronous operation completes"! Here is the signature of `wgpuInstanceWaitAny`:

```C++
WGPUWaitStatus wgpuInstanceWaitAny(WGPUInstance instance, size_t futureCount, WGPUFutureWaitInfo * futures, uint64_t timeoutNS);
```

Note that the futures are not passed directly: each one is wrapped in a `WGPUFutureWaitInfo` entry, which pairs the `WGPUFuture` with a `completed` flag. Also mind the unit of the timeout, which is expressed in nanoseconds:

```C++
WGPUFutureWaitInfo waitInfo = {};
waitInfo.future = adapterRequest;
uint64_t timeoutNS = 200 * 1000 * 1000; // 200 ms
WGPUWaitStatus status = wgpuInstanceWaitAny(instance, 1, &waitInfo, timeoutNS);
```

next/getting-started/the-command-queue.md

Lines changed: 101 additions & 18 deletions
@@ -26,54 +26,137 @@ One important thing to keep in mind when doing graphics programming: we have **t
1. **The code we write runs on the CPU**, and some of it triggers operations on the GPU. The only exception is *shaders*, which actually run on the GPU.
2. Processors are "**far away**", meaning that communicating between them **takes time**.

They are not too far, but for high-performance applications like real-time graphics, or when manipulating large amounts of data like in machine learning, this matters. For two reasons:

### Bandwidth

Since the GPU is meant for **massive parallel data processing**, its performance can easily be **bound by the memory transfers** rather than the actual computation.

As it turns out, the **memory bandwidth limits** are more often hit within the GPU itself, **between its storage and its compute units**, but the CPU-GPU bandwidth is also limited, which one feels when trying to transfer large textures too often, for instance.

```{note}
The connection between the **CPU memory** (RAM) and the **GPU memory** (VRAM) depends on the type of GPU. Some GPUs are **integrated** within the same chip as the CPU, so they share the same memory. A **discrete** GPU is typically connected through a PCIe wire. And an **external** GPU would be connected with a Thunderbolt wire, for instance. Each has a different bandwidth.
```

### Latency

**Even the smallest bit of information** needs some time for the round trip to and back from the GPU. As a consequence, functions that send instructions to the GPU return almost immediately: they **do not wait for the instruction to have actually been executed**, because that would require waiting for the GPU to transfer back the "I'm done" information.

Instead, the commands intended for the GPU are **batched** and fired through a **command queue**. The GPU consumes this queue **whenever it is ready**. This is what we detail in this chapter.

### Timelines

The CPU side of our program, i.e., the C++ code that we write, lives in the **Content timeline**. The other side of the command queue is in the **Queue timeline**, running on the GPU.

```{note}
There is also a **Device timeline** defined in [WebGPU's documentation](https://www.w3.org/TR/webgpu/#programming-model-timelines). It corresponds to the GPU operations for which our code actually waits for an immediate answer (called "synchronous" calls), but unlike in the JavaScript API, it is roughly the same as the content timeline in our C++ case.
```

In the remainder of this chapter:

- We see **how to manipulate the queue**.
- We refine our control of **asynchronous operations**.

Manipulating the Queue
----------------------

### Queue operations

Our WebGPU device has **a single queue**, which is used to send both **commands** and **data**. We can get it with `wgpuDeviceGetQueue`.

```{lit} C++, Get Queue
WGPUQueue queue = wgpuDeviceGetQueue(device);
```

Naturally, we must also release the queue once we no longer use it, at the end of the program:

```{lit} C++, Release things (prepend)
// At the end
wgpuQueueRelease(queue);
```

```{note}
**Other graphics APIs** allow one to build **multiple queues** per device, and future versions of WebGPU might as well. But for now, one queue is already more than enough for us to play with!
```

Looking at `webgpu.h`, we find mainly **3 different means** to submit work to this queue (their signatures are sketched right after this list):

- `wgpuQueueSubmit` sends **commands**, i.e., instructions of what to execute on the GPU.
- `wgpuQueueWriteBuffer` sends **data** from a CPU-side buffer to a **GPU-side buffer**.
- `wgpuQueueWriteTexture` sends **data** from a CPU-side buffer to a **GPU-side texture**.
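
For reference, here is what two of these signatures look like in `webgpu.h` (a sketch; exact pointer/const qualifiers may vary slightly across header versions):

```C++
// Submit an array of command buffers to the queue.
void wgpuQueueSubmit(WGPUQueue queue, size_t commandCount, WGPUCommandBuffer const * commands);

// Copy `size` bytes from the CPU-side memory `data` into `buffer`, starting at `bufferOffset`.
void wgpuQueueWriteBuffer(WGPUQueue queue, WGPUBuffer buffer, uint64_t bufferOffset, void const * data, size_t size);
```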

We can note that all these functions have a `void` return type: they send instructions/data to the GPU and return immediately, **without waiting for an answer from the GPU**.

The only way to **get information back** is through `wgpuQueueOnSubmittedWorkDone`, which is an **asynchronous operation** whose callback gets invoked once the GPU confirms that it has (tried to) execute the commands. We show an example below.
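
Registering this callback could look as follows. This is only a sketch, following the same callback-info pattern as the adapter request; the exact callback signature depends on the version of `webgpu.h` in use (requires `#include <iostream>`):

```C++
// Sketch: get notified once the work submitted so far has been executed.
auto onQueueWorkDone = [](WGPUQueueWorkDoneStatus status, void* /* userdata1 */, void* /* userdata2 */) {
    std::cout << "Queued work finished with status: " << status << std::endl;
};

WGPUQueueWorkDoneCallbackInfo callbackInfo = {};
callbackInfo.mode = WGPUCallbackMode_AllowProcessEvents;
callbackInfo.callback = onQueueWorkDone;
wgpuQueueOnSubmittedWorkDone(queue, callbackInfo);
```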

### Submitting commands

We submit commands using the following procedure:

```C++
wgpuQueueSubmit(queue, /* number of commands */, /* pointer to the command array */);
```

We recognize here the typical way of sending arrays (briefly mentioned in chapter [The Adapter](adapter-and-device/the-adapter.md)). WebGPU is a C API, so whenever it needs to receive an array of things, we first provide **the array size** and then **a pointer to the first element**.

#### Array argument

If we have a **single element**, it is simply done like so:

```C++
// With a single command:
WGPUCommandBuffer command = /* [...] */;
wgpuQueueSubmit(queue, 1, &command);
wgpuCommandBufferRelease(command); // release command buffer once submitted
```

If we know at **compile time** ("statically") the number of commands, we may use a C array, or a `std::array` (which is safer):

```C++
// With a statically known number of commands:
WGPUCommandBuffer commands[3];
commands[0] = /* [...] */;
commands[1] = /* [...] */;
commands[2] = /* [...] */;
wgpuQueueSubmit(queue, 3, commands);
```

```C++
// Or, safer and with no need to repeat the array size:
// (requires #include <array>)
std::array<WGPUCommandBuffer, 3> commands;
commands[0] = /* [...] */;
commands[1] = /* [...] */;
commands[2] = /* [...] */;
wgpuQueueSubmit(queue, commands.size(), commands.data());
```

Or, if command buffers are **dynamically accumulated**, we use a `std::vector`:

```C++
// With a dynamic number of commands:
// (requires #include <vector>)
std::vector<WGPUCommandBuffer> commands;
commands.push_back(/* [...] */);
if (someRuntimeCondition) {
    commands.push_back(/* [...] */);
}
wgpuQueueSubmit(queue, commands.size(), commands.data());
```

#### Command buffers

In any case, do not forget to **release** the command buffers once they have been submitted:

```C++
// Release:
for (auto cmd : commands) {
    wgpuCommandBufferRelease(cmd);
}
```

> 🤔 Hey but what about **creating these buffers**, to begin with?

A command buffer, which has type `WGPUCommandBuffer`, is not a buffer that we directly create! This buffer uses a special format that is left to the discretion of your driver/hardware. To build this buffer, we use a **command encoder**.

### Command encoder

**WIP line**
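
In a nutshell, the typical encoder workflow looks like this. This is a minimal sketch with default (zero-initialized) descriptors, and no actual commands recorded yet:

```C++
// Create a command encoder from the device.
WGPUCommandEncoderDescriptor encoderDesc = {};
WGPUCommandEncoder encoder = wgpuDeviceCreateCommandEncoder(device, &encoderDesc);

// [...] record commands through the encoder here

// Finalize the encoding into a command buffer, then release the encoder.
WGPUCommandBufferDescriptor cmdBufferDesc = {};
WGPUCommandBuffer command = wgpuCommandEncoderFinish(encoder, &cmdBufferDesc);
wgpuCommandEncoderRelease(encoder);

// The command buffer can now be submitted, and released right after submission:
wgpuQueueSubmit(queue, 1, &command);
wgpuCommandBufferRelease(command);
```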
