
Commit 1c23b24

[GrCUDA-21] Documentation update (#17)
* updated docs to configure oci instances
* updated readme/setup script to use graal 21.2
* fixed errors in setup script
* updated readme
* removed outdated doc, useful things moved to benchmark files
* updated design documentation
* fixed python benchmarks using outdated paths
* updated java from 11 to 8+
1 parent ab27a67 commit 1c23b24

22 files changed: +775 −916 lines changed

README.md

Lines changed: 168 additions & 139 deletions
Large diffs are not rendered by default.

demos/image_pipeline_local/image_pipeline.py

Lines changed: 3 additions & 3 deletions
@@ -219,7 +219,7 @@ def pipeline_bw(img):
     cmap = plt.cm.gray if BW else None
     ax[0].imshow(img, cmap=cmap)
     ax[1].imshow(other[0], cmap=cmap)
-    ax[2].imshow(np.dot(other[1][...,:3], [0.33, 0.33, 0.33]), cmap='gray') # other[1], cmap=plt.cm.gray)
+    ax[2].imshow(np.dot(other[1][...,:3], [0.33, 0.33, 0.33]), cmap='gray')
     ax[3].imshow(other[2], cmap=cmap)
     ax[4].imshow(np.dot(other[3][...,:3], [0.33, 0.33, 0.33]), cmap='gray')
     ax[5].imshow(other[4], cmap=cmap)
@@ -239,8 +239,8 @@ def pipeline_bw(img):
     for j, x in enumerate(tmp):
         other2[j][:, :, i] = x
 
-    # fig, axes = plt.subplots(2, 2, figsize=(6, 6))
-    # ax = axes.ravel()
+    fig, axes = plt.subplots(2, 2, figsize=(6, 6))
+    ax = axes.ravel()
 
     cmap = plt.cm.gray if BW else None
     ax[0].imshow(img, cmap=cmap)

docs/grcuda-scheduler-architecture.md

Lines changed: 21 additions & 45 deletions
@@ -1,5 +1,8 @@
 # Extending GrCUDA with a dynamic computational DAG
 
+This is an ever-changing design document that tracks the state of the asynchronous GrCUDA scheduler, as published in [DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime](https://ieeexplore.ieee.org/abstract/document/9460491).
+We do our best to keep this document updated and to reflect the latest changes to GrCUDA. If you find any inconsistency, please report it as a GitHub issue.
+
 The main idea is to **represent GrCUDA computations as vertices of a DAG**, connected using their dependencies (e.g. the output of a kernel is used as input in another one).
 * The DAG allows scheduling parallel computations on different streams and avoids synchronization when it is not necessary
 * See `projects/resources/python/examples` and `projects/resources/python/benchmark/bench` for simple examples of how this technique can be useful
@@ -11,9 +14,9 @@ The main idea is to **represent GrCUDA computations as vertices of a DAG**, conn
 (e.g. how many CUDA streams we need, or how large each GPU block should be)
 
 **How it works, in a few words**
-* The class `GpuExecutionContext` tracks GPU computational elements (e.g. `kernels`) declarations and invocations
-* When a new computation is created, or when it is called, it notifies `GpuExecutionContext` so that it updates the `DAG` by computing the data dependencies of the new computation
-* `GpuExecutionContext` uses the DAG to understand if the new computation can start immediately, or it must wait for other computations to finish
+* The class `GrCUDAExecutionContext` tracks GPU computational elements (e.g. `kernels`) declarations and invocations
+* When a new computation is created, or when it is called, it notifies `GrCUDAExecutionContext` so that it updates the `DAG` by computing the data dependencies of the new computation
+* `GrCUDAExecutionContext` uses the DAG to understand if the new computation can start immediately, or it must wait for other computations to finish
 * Different computations are overlapped using different CUDA streams, assigned by the `GrCUDAStreamManager` based on dependencies and free resources
 * Computations on the GPU are asynchronous and are scheduled on streams without explicit synchronization points, as CUDA guarantees that computations are stream-ordered
 * Synchronization between streams happens with CUDA events, without blocking the host CPU thread
@@ -24,17 +27,17 @@ The main idea is to **represent GrCUDA computations as vertices of a DAG**, conn
 * The DAG supports kernel invocation, and array accesses (both `DeviceArray` and `MultiDimDeviceArray`)
 * Kernels are executed in parallel, on different streams, whenever possible
 * **Main classes used by the scheduler**
-  1. `GpuExecutionContext`: takes care of scheduling and executing computations, it is the director of the orchestration and manages the DAG
+  1. `GrCUDAExecutionContext`: takes care of scheduling and executing computations, it is the director of the orchestration and manages the DAG
   2. `GrCUDAComputationalElement`: abstract class that wraps GrCUDA computations, e.g. kernel executions and array accesses.
-  It provides `GpuExecutionContext` with functions used to compute dependencies or decide if the computation must be done synchronously (e.g. array accesses)
+  It provides `GrCUDAExecutionContext` with functions used to compute dependencies or decide if the computation must be done synchronously (e.g. array accesses)
   3. `ExecutionDAG`: the DAG representing the dependencies between computations, it is composed of vertices that wrap each `GrCUDAComputationalElement`
   4. `GrCUDAStreamManager`: class that handles the creation and the assignment of streams to kernels, and the synchronization between different streams or the host thread
 * **Basic execution flow**
   1. The host language (i.e. the user) calls an `InteropLibrary` object that can be associated to a `GrCUDAComputationalElement`, e.g. a kernel execution or an array access
-  2. A new `GrCUDAComputationalElement` is created and registered to the `GpuExecutionContext`, to represent the computation
-  3. `GpuExecutionContext` adds the computation to the DAG and computes its dependencies
-  4. Based on the dependencies, the `GpuExecutionContext` associates a stream to the computation through `GrCUDAStreamManager`
-  5. `GpuExecutionContext` executes the computation on the chosen stream, performing synchronization if necessary
+  2. A new `GrCUDAComputationalElement` is created and registered to the `GrCUDAExecutionContext`, to represent the computation
+  3. `GrCUDAExecutionContext` adds the computation to the DAG and computes its dependencies
+  4. Based on the dependencies, the `GrCUDAExecutionContext` associates a stream to the computation through `GrCUDAStreamManager`
+  5. `GrCUDAExecutionContext` executes the computation on the chosen stream, performing synchronization if necessary
 * GPU computations do not require synchronization w.r.t. previous computations on the stream where they executed, as CUDA guarantees stream-ordered execution.
 CUDA streams are synchronized with (asynchronous) CUDA events, without blocking the host.
 CPU computations that require a GPU result are synchronized with `cudaStreamSynchronize` only on the necessary streams
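As a concrete illustration of the execution flow described in the hunk above (not part of this commit's diff), here is a minimal GraalPython sketch. It assumes the `polyglot.eval`, device-array, and `buildkernel` syntax from the GrCUDA README; the kernel source and launch sizes are invented for the example.

```python
import polyglot

N = 1000

# Each GrCUDA expression is evaluated through the polyglot API; every use of
# the returned objects is registered with GrCUDAExecutionContext as a
# GrCUDAComputationalElement.
x = polyglot.eval(language="grcuda", string=f"float[{N}]")
y = polyglot.eval(language="grcuda", string=f"float[{N}]")
buildkernel = polyglot.eval(language="grcuda", string="buildkernel")

kernel_code = """
__global__ void square(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * a[i];
}
"""
square = buildkernel(kernel_code, "square", "pointer, sint32")

# Two launches with disjoint arguments: the ExecutionDAG has no edge between
# them, so GrCUDAStreamManager can assign them to different CUDA streams and
# they run concurrently, with no explicit stream or sync in user code.
square(32, 32)(x, N)
square(32, 32)(y, N)

# An array read is a synchronous computational element: GrCUDA waits only for
# the streams the accessed array depends on, then returns the value.
print(x[0])
```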
@@ -47,47 +50,22 @@ The main idea is to **represent GrCUDA computations as vertices of a DAG**, conn
 * The `cudaStreamAttachMemAsync` is also exposed, to exclusively associate a managed memory array to a given stream.
 This is used, on Pre-Pascal GPUs, to access arrays on CPU while a kernel is using other arrays on GPU
 * Most of the new code is unit-tested and integration-tested, and there is a Python benchmarking suite to measure execution time with different settings
-* For example, the file `projects/resources/python/benchmark/bench/bench_8` is a fairly complex image processing pipeline that automatically manages up to 4 different streams
-Compared to sequential scheduling, we are up to **2.5x faster**!
+* For example, the file `projects/resources/python/benchmark/bench/bench_8` is a fairly complex image processing pipeline that automatically manages up to 4 different streams.
 * **Streams** are managed internally by the GrCUDA runtime: we keep track of existing streams that are currently empty, and schedule computations on them in a FIFO order.
 New streams are created only if no existing stream is available
 * **Read-only** input arguments can be specified with the `const` keyword; they will be ignored in the dependency computations if possible:
 for example, if there are 2 kernels that use the same read-only input array, they will be executed concurrently
 
-* **Current limitations**
-  1. ~~Dependency computation does not consider disjoint parameter subsets.~~ **Now available!**
-  Consider 3 kernels, `K1(X, Y)`, `K2(X)`, `K3(Y)`: `K2` and `K3` are both depending on `K1`, but are using different inputs, and can run in parallel.
-  2. ~~Synchronization happens on the main execution thread.~~ **We now support fully asynchronous GPU execution on a single CPU thread thanks to CUDA events**
-  in the example before, calling `K2` requires to sync on the stream used by `K1`, and `K3` starts only after `K2` has started.
-  If `Y` was read-only in `K1`, this wait would have been unnecessary
-  3. **Scalar values are not considered for dependencies**: they are read-only when used as input, but there could be output-input dependencies in library functions with scalar output ([API Design, point 4](#api-design))
-  4. Read-only arguments are visible on the **default stream**, instead of having their visibility limited to the stream where they are used.
-  This makes the arguments visible to kernels running on different streams, but accesses by the CPU to these arguments require full device synchronization.
-  This limitation is likely to occur only on Pre-Pascal devices, however
-  5. **Library functions are not considered for asynchronous execution**: not all library functions expose a stream interface for asynchronous execution, and they are currently ignored by the DAG scheduler.
-  We need to add, at the very least, a synchronization point before synchronous library functions
-
 ## Open questions
 
 ### Questions on API design (i.e. how do we provide the best user experience)
 
-1. ~~How to understand if a parameter is read-only? ([API Design, point 4](#api-design))~~ We can use the `const` keyword in the computation signature
-2. How do we track scalar values in library function outputs? ([API Design, point 5](#api-design))
-3. How can user specify options cleanly? ([API Design, point 2](#api-design))
+1. How do we track scalar values in library function outputs? ([API Design, point 5](#api-design))
+   * It is not clear if such a library exists; for now we have not seen such a situation.
+2. How can the user specify options cleanly? ([API Design, point 2](#api-design))
 * Using only context startup options is limiting, but it simplifies the problem (we don't have to worry about changing how the DAG is built at runtime)
 * If we want to provide more flexibility, we can add functions to the DSL, but that's not very clean
 
-### Questions on internal development (i.e. how do we do something in the most powerful/flexible/efficient way)
-
-1. How to handle library functions? They usually have no stream options ([API Design, point 6](#api-design))
-2. How to handle pre-registered libraries and external functions? Same problem as question 1
-
-### Other questions (e.g. things I don't understand about GrCUDA/GraalVM/Truffle)
-
-1. What are `map` and `shred` functions? Are they exposed to the outside?
-2. When doing unit-testing, can we access internal data structures of the guest language (e.g. to monitor the state of the DAG)
-3. In Graalpython, can we create new guest polyglot contexts at runtime, with user-specified options?
-
 ***
 
 ## Detailed development notes
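To make the `const` read-only annotation from the hunk above concrete (an illustrative sketch, not part of the commit), a kernel's GrCUDA signature can mark an input pointer as read-only so that two launches sharing that input create no dependency. The `buildkernel` API and the NFI-style signature string are assumed from the GrCUDA README.

```python
import polyglot

N = 1000
x = polyglot.eval(language="grcuda", string=f"float[{N}]")
out1 = polyglot.eval(language="grcuda", string=f"float[{N}]")
out2 = polyglot.eval(language="grcuda", string=f"float[{N}]")
buildkernel = polyglot.eval(language="grcuda", string="buildkernel")

kernel_code = """
__global__ void scale(const float *in, float *out, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}
"""
# `const pointer` marks the first argument as read-only, mirroring the CUDA
# signature, so the dependency computation can ignore it when possible.
scale = buildkernel(kernel_code, "scale", "const pointer, pointer, float, sint32")

# Both launches only read `x` and write disjoint outputs: no dependency edge
# is created between them, so they can be scheduled on different streams.
scale(32, 32)(x, out1, 2.0, N)
scale(32, 32)(x, out2, 0.5, N)
```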
@@ -103,25 +81,23 @@ Dependencies are inferred automatically, instead of being manually specified by
 2. The API needs ways to modify the scheduling policy, if desired (e.g. go back to fully synchronized execution)
   * Context startup option? Easy, but cannot be modified
   * Expose a function in the GrCUDA DSL? More flexibility, but changing options using the DSL is not very clean
-3. ~~How to handle CPU control flow? In GrCUDA we are not aware of `if` and `for` loops on the host side~~
-  * The DAG is built dynamically: we need to update it as we receive scheduling orders, and decide if we can execute or not. We don't care about the original control flow
-4. How do we identify if a **parameter is read-only**? If two kernels use the same parameter but only read from it, they can execute in parallel
+3. How do we identify if a **parameter is read-only**? If two kernels use the same parameter but only read from it, they can execute in parallel
   * This is not trivial: LLVM can understand, for example, if a scalar value is read-only, but doing that with an array is not always possible
   * Users might have to specify which parameters are read-only in the kernel signature, which is still better than using explicit handles
   * For now, we let programmers manually specify read-only array arguments using the `const` keyword, as done in `CUDA`
-5. How do we handle scalar values? We could also have dependencies due to scalar values (e.g. a computation is started only if the error in the next iteration is above a threshold)
+4. How do we handle scalar values? We could also have dependencies due to scalar values (e.g. a computation is started only if the error in the next iteration is above a threshold)
   * Currently, only reads from `DeviceArray` (and similar) return scalar values, and they must be done synchronously, as the result is immediately exposed to the guest language.
   * Array reads (and writes) are done synchronously by the host, and we guarantee that no kernel that uses the affected array is running
   * Kernels do not return scalar values, and scalar outputs are stored in a size-1 array (which we can treat as any other array)
   * Then the programmer can pass the size-1 array to another computation (handled like any array), or extract the value with an array read that triggers synchronization
   * Scalar values are only problematic when considering library functions that return them
   * One idea could be to *box* scalar values with Truffle nodes and store the actual value using a `Future`.
   If the user reads or writes the value, we wait for the GPU computation to end. Then the scalar value can be unboxed to avoid further overheads.
-  * But running library functions on streams is problematic (see problem 6), so this solution might not be required
+  * But running library functions on streams is problematic, so this solution might not be required
 6. Library functions: library functions are more complex to handle as they could also have code running on the host side.
   * They also do not expose streams, so it could be difficult to pipeline them
   * In some cases they might expose streams in the signature, we can probably find them by parsing the signature
-  * They can also return scalars (see problem 5)
+  * They can also return scalars
   * If we run them on threads, we parallelize at least the CPU side
 
 ### What is a computational element in GrCUDA?
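The size-1 array pattern for scalar kernel outputs described in the hunk above can be sketched as follows (illustrative, not part of the commit; the `buildkernel` API is assumed from the GrCUDA README). The kernel writes its scalar result into a 1-element device array instead of returning it, and the host read is the synchronization point.

```python
import polyglot

N = 1000
x = polyglot.eval(language="grcuda", string=f"float[{N}]")
# Size-1 array that holds the scalar output; the DAG treats it like any array.
result = polyglot.eval(language="grcuda", string="float[1]")
buildkernel = polyglot.eval(language="grcuda", string="buildkernel")

kernel_code = """
__global__ void sum_all(const float *in, float *out, int n) {
    // Single-thread reduction, just to illustrate the pattern.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float acc = 0;
        for (int i = 0; i < n; i++) acc += in[i];
        out[0] = acc;
    }
}
"""
sum_all = buildkernel(kernel_code, "sum_all", "const pointer, pointer, sint32")
sum_all(1, 1)(x, result, N)

# The size-1 array can be passed to further kernels like any other array;
# reading it from the host waits only for the streams it depends on.
print(result[0])
```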
@@ -149,7 +125,7 @@ Library functions (non-kernels) can also be loaded, using `BindFunction`
 Invocations to computational elements are wrapped in classes that extend a generic `GrCUDAComputationalElement`.
 `GrCUDAComputationalElement` is used to build the vertices of the DAG and exposes interfaces to compute data dependencies with other `GrCUDAComputationalElements` and to schedule the computation
 
-### Other notes on GrCUDA architecture
+### Other notes on the internal GrCUDA architecture
 
 These notes relate to the structure of the original GrCUDA repository. You can skip them if you are already familiar with it!
 