---
title: 0050 - Formalized Memory and Execution Model
params:
  authors:
    - llvm-beanz: Chris Bieneman
  sponsors:
    - llvm-beanz: Chris Bieneman
  status: Under Consideration
---

* PRs: [hlsl-specs#321](https://github.com/microsoft/hlsl-specs/pull/321)

## Introduction

This proposal seeks to define the memory and execution models for HLSL. The goal
is a pair of models that are understandable to users and portable across a wide
variety of GPU hardware, and that strike a balance between full portability and
performance.

## Motivation

The HLSL and DXIL memory and execution models have never been fully defined.
This forces reliance on Windows device certification for behavior conformance,
and on a general approach in which the implementation defines the behavior.

This state was never ideal, but as HLSL becomes more widely portable and DirectX
moves to SPIRV, the lack of a documented memory and execution model makes it
nearly impossible to ensure portability across bytecode formats.

The SPIRV-defined memory and execution model is constantly evolving and seeks to
address some of the specific problems discussed in this proposal. However, SPIRV
is not designed to be written by humans. HLSL's memory and execution models are
not required to match SPIRV's; SPIRV only needs to be expressive enough to
represent programs written against the models HLSL defines.

This proposal is structured with multiple proposed solutions. Those will become
_Alternatives Considered_ as they are eliminated from consideration. The
proposals are split into two groups that capture the execution and memory models
separately, although the two are tightly connected in the final representation.

## Execution Model Proposals

All execution model proposals assume behaviors documented in
[SPV_KHR_maximal_reconvergence](https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/KHR/SPV_KHR_maximal_reconvergence.asciidoc),
as well as additional reconvergence requirements for `OpSwitch` to match the
DXIL `switch` behavior such that tangles are formed by branch target (not
selector value), and tangles are expected to reconverge at labels in the event
of fall through.

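The difference between forming tangles by branch target and by selector value can be sketched with a `switch` in which two selector values share one target. This is only an illustration: the `Out` buffer and the `TangleSize` name are hypothetical, and the sketch assumes all four threads of the group execute in a single wave.

```hlsl
RWStructuredBuffer<uint> Out : register(u0); // hypothetical output buffer

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  uint TangleSize = 0;
  switch (GI) {
  case 0:
  case 1:
    // Selector values 0 and 1 share this branch target. If tangles are formed
    // by branch target, threads 0 and 1 would be in one tangle here and each
    // would count 2. If tangles were formed by selector value, each would
    // count 1.
    TangleSize = WaveActiveCountBits(true);
    break;
  default:
    // Threads 2 and 3 take this target.
    TangleSize = WaveActiveCountBits(true);
    break;
  }
  // All four threads are expected to reconverge after the switch.
  Out[GI] = TangleSize;
}
```

The sketch deliberately avoids non-empty fall through, so it only exercises the branch-target half of the requirement.
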
### Proposed solution #1 : Full Lockstep

A full lockstep execution model is the simplest to understand. Under full
lockstep, all the threads in a wave must behave as if they share the same
program counter whether they are in the same tangle or not.

This execution model provides some of the strictest guarantees for memory
ordering and behavior. Consider the following code snippet:

<a name="example1"></a>

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  else if (GI == 2)
    X = 2;
}
```

In full lockstep this program is well-defined. Because all threads must act as
if they share a program counter, no thread can execute the `else if` condition
or its block until thread 0 has executed the body of the `if`. This enforces
strict ordering of memory operations on `groupshared` memory.

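As a sketch of what that ordering implies, the variant below reads `X` back after a barrier. The `Out` buffer and its register binding are hypothetical, added only to make the result observable. Under full lockstep as described above, the store of 0 would always precede the store of 2, so every thread would be expected to read 2.

```hlsl
RWStructuredBuffer<int> Out : register(u0); // hypothetical output buffer
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  else if (GI == 2)
    X = 2;
  // Wait for all threads and make the groupshared writes visible.
  GroupMemoryBarrierWithGroupSync();
  // Under full lockstep the store of 2 is ordered after the store of 0.
  Out[GI] = X;
}
```
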
### Proposed solution #2 : Lockstep Within A Tangle

The lockstep within a tangle execution model is a slightly relaxed variant of
the full lockstep model. It allows each tangle to have an independent program
counter. In this model the [example from solution #1](#example1) is undefined
because once thread 0 splits to form its own tangle, the second tangle can
continue executing until it reaches a reconvergence point. This creates a data
race on `X` between the writes from threads 0 and 2.

This model requires a more precise definition of reconvergence points, but it
provides some ordering guarantees that make it safe. For example, the following
adjusted program is well-defined in this model.

<a name="example2"></a>

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  if (GI == 2) // Reconverges here because this is a new statement, not an else.
    X = 2;
}
```

### Proposed solution #3 : Independent Threads

This model is the most complicated, but also the most flexible for compiler
backend optimization. In this model threads are allowed to have fully
independent program counters, which only need to synchronize across tangles at
designated sync points (e.g. wave operations and barriers).

Under this model the [example from solution #2](#example2) is undefined because
the program contains no synchronization points, so each thread can execute
independently and the memory ordering is not guaranteed.

To illustrate a particular challenge with this model, consider the following
example:

<a name="example3"></a>

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  GroupMemoryBarrierWithGroupSync(); // sync point!
  if (X == 0)
    X = 2; // How many threads are in the tangle here?
}
```

One problem with this model is that it lacks a clearly defined memory ordering,
which can impact tangle formation. If a thread is allowed to execute the second
`if` body before all threads have finished evaluating the condition, tangle
formation becomes unintuitive and potentially undefined.

This can be made slightly stricter by requiring that branch statements (`if`,
`else`, `switch`, `for`, `while`, etc.) are thread sync points. It likely also
requires atomic operations to behave as sync points.

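To make the role of atomics concrete, the sketch below rewrites the second conditional write from the example above as a single compare-and-exchange. It assumes, as this section suggests but does not yet specify, that the barrier and the atomic both act as sync points. `Original` is an illustrative local name.

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  GroupMemoryBarrierWithGroupSync(); // sync point, as in the example above
  int Original;
  // Atomically store 2 only if X still holds 0. The test and the store are
  // one indivisible operation, so the outcome does not depend on which
  // threads have finished evaluating a separate condition.
  InterlockedCompareExchange(X, 0, 2, Original);
}
```

Folding the test and the write into one atomic avoids the question of tangle membership inside the `if` body, at the cost of every thread issuing the atomic.
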
## Memory Model Proposals

Nearly all modern GPUs implement some version of the [Heterogeneous System
Architecture
(HSA)](https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture)
standards. This makes it reasonable for HLSL's memory model to derive from HSA.

Specific concerns that must be addressed:
* What are the ordering requirements, if any, for memory operations to aliasing
  memory across a wave?
* What are the ordering requirements, if any, for memory operations to aliasing
  memory across a set of tangled threads?

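The first question can be illustrated with a small sketch in which every thread in a group stores to the same `groupshared` location and then loads it. The `Out` buffer is hypothetical, and the sketch assumes the four threads share a wave. Which store each load observes is what the memory model would need to define; nothing in this proposal settles it yet.

```hlsl
RWStructuredBuffer<uint> Out : register(u0); // hypothetical output buffer
groupshared uint X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  // Four stores from different threads alias the same location.
  X = GI;
  // With no barrier or atomic in between, the value loaded here depends on
  // the ordering rules the memory model chooses.
  Out[GI] = X;
}
```
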
## Appendix 1: Magic Decoder Ring

| DirectX Term | Khronos Term | Description |
| ------------ | ------------ | ----------- |
| thread, lane | invocation | The computation performed on a single element as described in the program. |
| | tangle | A grouping of co-executing threads. |
| wave | subgroup | A group of threads which may form one or more tangles and are executed on a shared SIMD or other compute unit. |
| threadgroup | workgroup | A group of threads which may be subdivided into one or more waves and comprise a larger computation. |