Commit 37043ec

[New] Initial draft for memory and execution model (microsoft#505)
This is the start of a proposal for the memory and execution model for HLSL.
1 parent a219918 commit 37043ec

1 file changed

Lines changed: 164 additions & 0 deletions

@@ -0,0 +1,164 @@
---
title: 0050 - Formalized Memory and Execution Model
params:
  authors:
    - llvm-beanz: Chris Bieneman
  sponsors:
    - llvm-beanz: Chris Bieneman
  status: Under Consideration
---

* PRs: [hlsl-specs#321](https://github.com/microsoft/hlsl-specs/pull/321)

## Introduction

This proposal seeks to define the memory and execution models for HLSL. The goal
is a pair of models that is understandable to users, is portable across a wide
variety of GPU hardware, and strikes a balance between full portability and
performance.

## Motivation

The HLSL and DXIL memory and execution models have never been fully defined.
This forces reliance on Windows device certification for behavior conformance,
and a general approach in which the implementation defines the behavior.

This state was never ideal. However, as HLSL becomes more widely portable and
DirectX moves to SPIR-V, the lack of a documented memory and execution model
makes it nearly impossible to ensure portability across bytecode formats.

The SPIR-V memory and execution model is constantly evolving and seeks to
address some of the specific problems discussed in this proposal. However,
SPIR-V is not designed to be written by humans. HLSL's memory and execution
models are not required to match SPIR-V's; SPIR-V only needs to be expressive
enough to represent programs written against the model defined by HLSL.

This proposal is structured with multiple proposed solutions. Those will become
_Alternatives Considered_ as they are eliminated from consideration. The
proposals are split into two groups that capture the execution and memory models
separately, although the two are tightly connected in the final representation.

## Execution Model Proposals

All execution model proposals assume the behaviors documented in
[SPV_KHR_maximal_reconvergence](https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/KHR/SPV_KHR_maximal_reconvergence.asciidoc),
as well as additional reconvergence requirements for `OpSwitch` that match the
DXIL `switch` behavior: tangles are formed by branch target (not selector
value), and tangles are expected to reconverge at labels in the event of fall
through.
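
To make the branch-target rule concrete, here is a minimal sketch that is not
part of the proposal. The `Out` buffer and its register binding are assumptions
made for illustration, and the sketch assumes all four threads execute in a
single wave. Two selector values share one label, so under the assumed behavior
the threads that take that label would be expected to form one tangle.

```hlsl
// Illustrative only: Out and its binding are hypothetical.
RWStructuredBuffer<uint> Out : register(u0);

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  switch (GI) {
  case 0:
  case 1:
    // Selector values 0 and 1 share this branch target, so threads 0 and 1
    // would be expected to form a single tangle here.
    Out[GI] = WaveActiveCountBits(true);
    break;
  default:
    // Threads 2 and 3 branch to this target and form a second tangle.
    Out[GI] = WaveActiveCountBits(true);
    break;
  }
}
```

If tangles are formed by branch target, each thread would be expected to count
two active threads. If they were instead formed by selector value, threads 0 and
1 would each count only themselves.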

### Proposed solution #1 : Full Lockstep

A full lockstep execution model is the simplest to understand. Under full
lockstep, all the threads in a wave must behave as if they share the same
program counter, whether or not they are in the same tangle.

This execution model provides some of the strictest guarantees for memory
ordering and behavior. Consider the following code snippet:

<a name="example1"></a>

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  else if (GI == 2)
    X = 2;
}
```

In full lockstep this program is well-defined. Because all threads must act as
if they share a program counter, no thread can execute the `else if` condition
or its block until thread 0 has executed the body of the `if`. This enforces
strict ordering of memory operations to `groupshared` memory.

### Proposed solution #2 : Lockstep Within A Tangle

The lockstep within a tangle execution model is a slightly relaxed variant of
the full lockstep model. It allows each tangle to have an independent program
counter. In this model the [example from solution #1](#example1) is undefined:
once thread 0 splits off to form its own tangle, the second tangle can continue
executing until it reaches a reconvergence point. This creates a data race
between threads 0 and 2 writing to `X`.

This model requires a more precise definition of reconvergence points, but it
provides some ordering guarantees that make it safe. For example, the following
adjusted program is well-defined in this model.

<a name="example2"></a>

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  if (GI == 2) // Reconverges here because this is a new statement, not an else.
    X = 2;
}
```

### Proposed solution #3 : Independent Threads

This model is the most complicated, but also the most flexible for compiler
backend optimization. In this model threads are allowed to have fully
independent program counters, which only need to synchronize across tangles at
designated sync points (e.g. wave operations and barriers).

Under this model the [example from solution #2](#example2) is undefined because
the program contains no synchronization points, so each thread can execute
independently and the memory ordering is not guaranteed.
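
As a sketch of how that example could be given a defined order under this model,
the variant below places a group barrier between the two writes. It assumes, as
described above, that barriers are designated sync points; it is an illustration
and not a proposed requirement.

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  // Assumed sync point: every thread in the group waits here, and groupshared
  // accesses issued before the barrier complete before any thread continues.
  GroupMemoryBarrierWithGroupSync();
  if (GI == 2)
    X = 2;
}
```

With the barrier in place, the write by thread 0 is ordered before the write by
thread 2 regardless of how the individual program counters advance.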

To illustrate a particular challenge with this model, consider the following
example:

<a name="example3"></a>

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  GroupMemoryBarrierWithGroupSync(); // sync point!
  if (X == 0)
    X = 2; // How many threads are in the tangle here?
}
```

One problem with this model is that memory ordering is not clearly defined,
which can impact tangle formation. If a thread is allowed to execute the second
`if` body before all threads have finished evaluating the condition, tangle
formation becomes unintuitive and potentially undefined.

This can be made slightly stricter by requiring that branch statements (`if`,
`else`, `switch`, `for`, `while`, etc.) are thread sync points. It likely
requires atomic operations to behave as sync points as well.
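
Whether or not atomics end up being sync points, the racing read-then-write in
that example can be expressed as a single atomic operation. The sketch below is
one possible rewrite under that assumption, not a behavior this proposal
mandates. It uses the existing `InterlockedCompareExchange` intrinsic, and the
`Original` local is introduced only for illustration.

```hlsl
groupshared int X;

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  if (GI == 0)
    X = 0;
  GroupMemoryBarrierWithGroupSync(); // sync point
  int Original;
  // Atomically store 2 into X only if X still holds 0. The test and the write
  // happen as one indivisible step, so no thread branches on a stale X.
  InterlockedCompareExchange(X, 0, 2, Original);
  // Exactly one thread observes Original == 0; the others observe 2.
}
```

This form removes the data-dependent branch, so the result no longer depends on
which threads are in the tangle when the write happens.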

## Memory Model Proposals

Basically all modern GPUs implement some version of the [Heterogeneous System
Architecture
(HSA)](https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture)
standards. This makes it reasonable for HLSL's memory model to derive from HSA.

Specific concerns that must be addressed:
* What are the ordering requirements, if any, for memory operations to aliasing
  memory across a wave?
* What are the ordering requirements, if any, for memory operations to aliasing
  memory across a set of tangled threads?
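
The first question can be made concrete with a small sketch. The `Data` array,
the `Out` buffer, and its register binding are assumptions made for
illustration, and the sketch assumes all four threads execute in one wave. Each
thread writes its own slot and then reads its neighbor's slot with no barrier in
between.

```hlsl
// Illustrative only: Data, Out, and the binding are hypothetical.
groupshared uint Data[4];
RWStructuredBuffer<uint> Out : register(u0);

[numthreads(4,1,1)]
void main(uint GI : SV_GroupIndex) {
  Data[GI] = GI;
  // No barrier: is this read ordered after the neighboring thread's write
  // when both threads are in the same wave?
  Out[GI] = Data[GI ^ 1];
}
```

Whether this program is well-defined, and what `Out` may contain, is exactly
what the chosen memory model would need to answer.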

## Appendix 1: Magic Decoder Ring

| DirectX Term | Khronos Term | Description |
| ------------ | ------------ | ----------- |
| thread, lane | invocation | The computation performed on a single element as described in the program. |
| | tangle | A grouping of co-executing threads. |
| wave | subgroup | A group of threads which may form one or more tangles and are executed on a shared SIMD or other compute unit. |
| threadgroup | workgroup | A group of threads which may be subdivided into one or more waves and comprise a larger computation. |
