
Commit 1c23b24

[GrCUDA-21] Documentation update (#17)
* updated docs to configure oci instances
* updated readme/setup script to use graal 21.2
* fixed errors in setup script
* updated readme
* removed outdated doc, useful things moved to benchmark files
* updated design documentation
* fixed python benchmarks using outdated paths
* updated java from 11 to 8+
1 parent ab27a67 commit 1c23b24

22 files changed: +775 −916 lines changed

README.md

Lines changed: 168 additions & 139 deletions
Large diffs are not rendered by default.

demos/image_pipeline_local/image_pipeline.py

Lines changed: 3 additions & 3 deletions
@@ -219,7 +219,7 @@ def pipeline_bw(img):
     cmap = plt.cm.gray if BW else None
     ax[0].imshow(img, cmap=cmap)
     ax[1].imshow(other[0], cmap=cmap)
-    ax[2].imshow(np.dot(other[1][...,:3], [0.33, 0.33, 0.33]), cmap='gray') # other[1], cmap=plt.cm.gray)
+    ax[2].imshow(np.dot(other[1][...,:3], [0.33, 0.33, 0.33]), cmap='gray')
     ax[3].imshow(other[2], cmap=cmap)
     ax[4].imshow(np.dot(other[3][...,:3], [0.33, 0.33, 0.33]), cmap='gray')
     ax[5].imshow(other[4], cmap=cmap)
@@ -239,8 +239,8 @@ def pipeline_bw(img):
     for j, x in enumerate(tmp):
         other2[j][:, :, i] = x
 
-    # fig, axes = plt.subplots(2, 2, figsize=(6, 6))
-    # ax = axes.ravel()
+    fig, axes = plt.subplots(2, 2, figsize=(6, 6))
+    ax = axes.ravel()
 
     cmap = plt.cm.gray if BW else None
     ax[0].imshow(img, cmap=cmap)

docs/grcuda-scheduler-architecture.md

Lines changed: 21 additions & 45 deletions
@@ -1,5 +1,8 @@
 # Extending GrCUDA with a dynamic computational DAG
 
+This is an ever-changing design document that tracks the state of the asynchronous GrCUDA scheduler, as published in [DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime](https://ieeexplore.ieee.org/abstract/document/9460491).
+We do our best to keep this document updated and to reflect the latest changes to GrCUDA. If you find any inconsistency, please report it as a GitHub issue.
+
 The main idea is to **represent GrCUDA computations as vertices of a DAG**, connected using their dependencies (e.g. the output of a kernel is used as input in another one).
 * The DAG allows scheduling parallel computations on different streams and avoids synchronization when it is not necessary
 * See `projects/resources/python/examples` and `projects/resources/python/benchmark/bench` for simple examples of how this technique can be useful
@@ -11,9 +14,9 @@ The main idea is to **represent GrCUDA computations as vertices of a DAG**, conn
 (e.g. how many CUDA streams we need, or how large each GPU block should be)
 
 **How it works, in a few words**
-* The class `GpuExecutionContext` tracks GPU computational elements (e.g. `kernels`) declarations and invocations
-* When a new computation is created, or when it is called, it notifies `GpuExecutionContext` so that it updates the `DAG` by computing the data dependencies of the new computation
-* `GpuExecutionContext` uses the DAG to understand if the new computation can start immediately, or it must wait for other computations to finish
+* The class `GrCUDAExecutionContext` tracks GPU computational elements (e.g. `kernels`) declarations and invocations
+* When a new computation is created, or when it is called, it notifies `GrCUDAExecutionContext` so that it updates the `DAG` by computing the data dependencies of the new computation
+* `GrCUDAExecutionContext` uses the DAG to understand if the new computation can start immediately, or it must wait for other computations to finish
 * Different computations are overlapped using different CUDA streams, assigned by the `GrCUDAStreamManager` based on dependencies and free resources
 * Computations on the GPU are asynchronous and are scheduled on streams without explicit synchronization points, as CUDA guarantees that computations are stream-ordered
 * Synchronization between streams happens with CUDA events, without blocking the host CPU thread
@@ -24,17 +27,17 @@ The main idea is to **represent GrCUDA computations as vertices of a DAG**, conn
 * The DAG supports kernel invocation, and array accesses (both `DeviceArray` and `MultiDimDeviceArray`)
 * Kernels are executed in parallel, on different streams, whenever possible
 * **Main classes used by the scheduler**
-  1. `GpuExecutionContext`: takes care of scheduling and executing computations, it is the director of the orchestration and manages the DAG
+  1. `GrCUDAExecutionContext`: takes care of scheduling and executing computations, it is the director of the orchestration and manages the DAG
   2. `GrCUDAComputationalElement`: abstract class that wraps GrCUDA computations, e.g. kernel executions and array accesses.
-  It provides `GpuExecutionContext` with functions used to compute dependencies or decide if the computation must be done synchronously (e.g. array accesses)
+  It provides `GrCUDAExecutionContext` with functions used to compute dependencies or decide if the computation must be done synchronously (e.g. array accesses)
   3. `ExecutionDAG`: the DAG representing the dependencies between computations, it is composed of vertices that wrap each `GrCUDAComputationalElement`
   4. `GrCUDAStreamManager`: class that handles the creation and the assignment of streams to kernels, and the synchronization between different streams or the host thread
 * **Basic execution flow**
   1. The host language (i.e. the user) calls an `InteropLibrary` object that can be associated to a `GrCUDAComputationalElement`, e.g. a kernel execution or an array access
-  2. A new `GrCUDAComputationalElement` is created and registered to the `GpuExecutionContext`, to represent the computation
-  3. `GpuExecutionContext` adds the computation to the DAG and computes its dependencies
-  4. Based on the dependencies, the `GpuExecutionContext` associates a stream to the computation through `GrCUDAStreamManager`
-  5. `GpuExecutionContext` executes the computation on the chosen stream, performing synchronization if necessary
+  2. A new `GrCUDAComputationalElement` is created and registered to the `GrCUDAExecutionContext`, to represent the computation
+  3. `GrCUDAExecutionContext` adds the computation to the DAG and computes its dependencies
+  4. Based on the dependencies, the `GrCUDAExecutionContext` associates a stream to the computation through `GrCUDAStreamManager`
+  5. `GrCUDAExecutionContext` executes the computation on the chosen stream, performing synchronization if necessary
 * GPU computations do not require synchronization w.r.t. previous computations on the stream where they executed, as CUDA guarantees stream-ordered execution.
 CUDA streams are synchronized with (asynchronous) CUDA events, without blocking the host.
 CPU computations that require a GPU result are synchronized with `cudaStreamSynchronize` only on the necessary streams
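As a concrete illustration of the execution flow described in the hunk above (not part of this commit's diff), here is a minimal GraalPython sketch. It assumes the `polyglot.eval`, device-array, and `buildkernel` syntax from the GrCUDA README; the kernel source and launch sizes are invented for the example.

```python
import polyglot

N = 1000

# Each GrCUDA expression is evaluated through the polyglot API; every use of
# the returned objects is registered with GrCUDAExecutionContext as a
# GrCUDAComputationalElement.
x = polyglot.eval(language="grcuda", string=f"float[{N}]")
y = polyglot.eval(language="grcuda", string=f"float[{N}]")
buildkernel = polyglot.eval(language="grcuda", string="buildkernel")

kernel_code = """
__global__ void square(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * a[i];
}
"""
square = buildkernel(kernel_code, "square", "pointer, sint32")

# Two launches with disjoint arguments: the ExecutionDAG has no edge between
# them, so GrCUDAStreamManager can assign them to different CUDA streams and
# they run concurrently, with no explicit stream or sync in user code.
square(32, 32)(x, N)
square(32, 32)(y, N)

# An array read is a synchronous computational element: GrCUDA waits only for
# the streams the accessed array depends on, then returns the value.
print(x[0])
```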
@@ -47,47 +50,22 @@ The main idea is to **represent GrCUDA computations as vertices of a DAG**, conn
 * The `cudaStreamAttachMemAsync` is also exposed, to exclusively associate a managed memory array to a given stream.
 This is used, on Pre-Pascal GPUs, to access arrays on CPU while a kernel is using other arrays on GPU
 * Most of the new code is unit-tested and integration-tested, and there is a Python benchmarking suite to measure execution time with different settings
-* For example, the file `projects/resources/python/benchmark/bench/bench_8` is a fairly complex image processing pipeline that automatically manages up to 4 different streams
-Compared to sequential scheduling, we are up to **2.5x faster**!
+* For example, the file `projects/resources/python/benchmark/bench/bench_8` is a fairly complex image processing pipeline that automatically manages up to 4 different streams.
 * **Streams** are managed internally by the GrCUDA runtime: we keep track of existing streams that are currently empty, and schedule computations on them in a FIFO order.
 New streams are created only if no existing stream is available
 * **Read-only** input arguments can be specified with the `const` keyword; they will be ignored in the dependency computations if possible:
 for example, if there are 2 kernels that use the same read-only input array, they will be executed concurrently
 
-* **Current limitations**
-  1. ~~Dependency computation does not consider disjoint parameter subsets.~~ **Now available!**
-  Consider 3 kernels, `K1(X, Y)`, `K2(X)`, `K3(Y)`: `K2` and `K3` are both depending on `K1`, but are using different inputs, and can run in parallel.
-  2. ~~Synchronization happens on the main execution thread.~~ **We now support fully asynchronous GPU execution on a single CPU thread thanks to CUDA events**
-  in the example before, calling `K2` requires to sync on the stream used by `K1`, and `K3` starts only after `K2` has started.
-  If `Y` was read-only in `K1`, this wait would have been unnecessary
-  3. **Scalar values are not considered for dependencies**: they are read-only when used as input, but there could be output-input dependencies in library functions with scalar output ([API Design, point 4](#api-design))
-  4. Read-only arguments are visible on the **default stream**, instead of having their visibility limited to the stream where they are used.
-  This makes the arguments visible to kernels running on different streams, but accesses by the CPU to these arguments require full device synchronization.
-  This limitation is likely to occur only on Pre-Pascal devices, however
-  5. **Library functions are not considered for asynchronous execution**: not all library functions expose a stream interface for asynchronous execution, and they are currently ignored by the DAG scheduler.
-  We need to add, at the very least, a synchronization point before synchronous library functions
-
 ## Open questions
 
 ### Questions on API design (i.e. how do we provide the best user experience)
 
-1. ~~How to understand if a parameter is read-only? ([API Design, point 4](#api-design))~~ We can use the `const` keyword in the computation signature
-2. How do we track scalar values in library function outputs? ([API Design, point 5](#api-design))
-3. How can user specify options cleanly? ([API Design, point 2](#api-design))
+1. How do we track scalar values in library function outputs? ([API Design, point 5](#api-design))
+   * It is not clear if such a library exists; for now we have not seen such a situation.
+2. How can the user specify options cleanly? ([API Design, point 2](#api-design))
 * Using only context startup options is limiting, but it simplifies the problem (we don't have to worry about changing how the DAG is built at runtime)
 * If we want to provide more flexibility, we can add functions to the DSL, but that's not very clean
 
-### Questions on internal development (i.e. how do we do something in the most powerful/flexible/efficient way)
-
-1. How to handle library functions? They usually have no stream options ([API Design, point 6](#api-design))
-2. How to handle pre-registered libraries and external functions? Same problem as question 1
-
-### Other questions (e.g. things I don't understand about GrCUDA/GraalVM/Truffle)
-
-1. What are `map` and `shred` functions? Are they exposed to the outside?
-2. When doing unit-testing, can we access internal data structures of the guest language (e.g. to monitor the state of the DAG)
-3. In Graalpython, can we create new guest polyglot contexts at runtime, with user-specified options?
-
 ***
 
 ## Detailed development notes
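To make the `const` read-only annotation from the hunk above concrete (an illustrative sketch, not part of the commit), a kernel's GrCUDA signature can mark an input pointer as read-only so that two launches sharing that input create no dependency. The `buildkernel` API and the NFI-style signature string are assumed from the GrCUDA README.

```python
import polyglot

N = 1000
x = polyglot.eval(language="grcuda", string=f"float[{N}]")
out1 = polyglot.eval(language="grcuda", string=f"float[{N}]")
out2 = polyglot.eval(language="grcuda", string=f"float[{N}]")
buildkernel = polyglot.eval(language="grcuda", string="buildkernel")

kernel_code = """
__global__ void scale(const float *in, float *out, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}
"""
# `const pointer` marks the first argument as read-only, mirroring the CUDA
# signature, so the dependency computation can ignore it when possible.
scale = buildkernel(kernel_code, "scale", "const pointer, pointer, float, sint32")

# Both launches only read `x` and write disjoint outputs: no dependency edge
# is created between them, so they can be scheduled on different streams.
scale(32, 32)(x, out1, 2.0, N)
scale(32, 32)(x, out2, 0.5, N)
```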
@@ -103,25 +81,23 @@ Dependencies are inferred automatically, instead of being manually specified by
 2. The API needs ways to modify the scheduling policy, if desired (e.g. go back to fully synchronized execution)
   * Context startup option? Easy, but cannot be modified
   * Expose a function in the GrCUDA DSL? More flexibility, but changing options using the DSL is not very clean
-3. ~~How to handle CPU control flow? In GrCUDA we are not aware of `if` and `for` loops on the host side~~
-  * The DAG is built dynamically: we need to update it as we receive scheduling orders, and decide if we can execute or not. We don't care about the original control flow
-4. How do we identify if a **parameter is read-only**? If two kernels use the same parameter but only read from it, they can execute in parallel
+3. How do we identify if a **parameter is read-only**? If two kernels use the same parameter but only read from it, they can execute in parallel
   * This is not trivial: LLVM can understand, for example, if a scalar value is read-only, but doing that with an array is not always possible
   * Users might have to specify which parameters are read-only in the kernel signature, which is still better than using explicit handles
   * For now, we let programmers manually specify read-only array arguments using the `const` keyword, as done in `CUDA`
-5. How do we handle scalar values? We could also have dependencies due to scalar values (e.g. a computation is started only if the error in the next iteration is above a threshold)
+4. How do we handle scalar values? We could also have dependencies due to scalar values (e.g. a computation is started only if the error in the next iteration is above a threshold)
   * Currently, only reads from `DeviceArray` (and similar) return scalar values, and they must be done synchronously, as the result is immediately exposed to the guest language.
   * Array reads (and writes) are done synchronously by the host, and we guarantee that no kernel that uses the affected array is running
   * Kernels do not return scalar values, and scalar outputs are stored in a size-1 array (which we can treat as any other array)
   * Then the programmer can pass the size-1 array to another computation (handled like any array), or extract the value with an array read that triggers synchronization
   * Scalar values are only problematic when considering library functions that return them
   * One idea could be to *box* scalar values with Truffle nodes and store the actual value using a `Future`.
   If the user reads or writes the value, we wait for the GPU computation to end. Then the scalar value can be unboxed to avoid further overheads.
-  * But running library functions on streams is problematic (see problem 6), so this solution might not be required
+  * But running library functions on streams is problematic, so this solution might not be required
 6. Library functions: library functions are more complex to handle as they could also have code running on the host side.
   * They also do not expose streams, so it could be difficult to pipeline them
   * In some cases they might expose streams in the signature, we can probably find them by parsing the signature
-  * They can also return scalars (see problem 5)
+  * They can also return scalars
   * If we run them on threads, we parallelize at least the CPU side
 
 ### What is a computational element in GrCUDA?
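The size-1 array pattern for scalar kernel outputs described in the hunk above can be sketched as follows (illustrative, not part of the commit; the `buildkernel` API is assumed from the GrCUDA README). The kernel writes its scalar result into a 1-element device array instead of returning it, and the host read is the synchronization point.

```python
import polyglot

N = 1000
x = polyglot.eval(language="grcuda", string=f"float[{N}]")
# Size-1 array that holds the scalar output; the DAG treats it like any array.
result = polyglot.eval(language="grcuda", string="float[1]")
buildkernel = polyglot.eval(language="grcuda", string="buildkernel")

kernel_code = """
__global__ void sum_all(const float *in, float *out, int n) {
    // Single-thread reduction, just to illustrate the pattern.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float acc = 0;
        for (int i = 0; i < n; i++) acc += in[i];
        out[0] = acc;
    }
}
"""
sum_all = buildkernel(kernel_code, "sum_all", "const pointer, pointer, sint32")
sum_all(1, 1)(x, result, N)

# The size-1 array can be passed to further kernels like any other array;
# reading it from the host waits only for the streams it depends on.
print(result[0])
```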
@@ -149,7 +125,7 @@ Library functions (non-kernels) can also be loaded, using `BindFunction`
 Invocations to computational elements are wrapped in classes that extend a generic `GrCUDAComputationalElement`.
 `GrCUDAComputationalElement` is used to build the vertices of the DAG and exposes interfaces to compute data dependencies with other `GrCUDAComputationalElements` and to schedule the computation
 
-### Other notes on GrCUDA architecture
+### Other notes on the internal GrCUDA architecture
 
 These notes relate to the structure of the original GrCUDA repository. You can skip them if you are already familiar with it!
 