3 changes: 3 additions & 0 deletions docs/source/API/core-index.rst
@@ -37,6 +37,8 @@ API: Core
- Utility functionality part of Kokkos Core.
* - `Detection Idiom <core/Detection-Idiom.html>`__
- Used to recognize, in an SFINAE-friendly way, the validity of any C++ expression.
* - `Graph and related <core/Graph.html>`__
- Kokkos Graph abstraction.
* - `Macros <core/Macros.html>`__
- Global macros defined by Kokkos, used for architectures, general settings, etc.

@@ -60,4 +62,5 @@ API: Core
./core/Utilities
./core/Detection-Idiom
./core/Macros
./core/Graph
./core/Profiling
12 changes: 12 additions & 0 deletions docs/source/API/core/Graph.axpby.kokkos.graph.cpp
@@ -0,0 +1,12 @@
auto graph = Kokkos::Experimental::create_graph(exec_A, [&](auto root) {
  auto node_xpy = root.then_parallel_for(N, MyAxpby{x, y, alpha, beta});
  auto node_zpy = root.then_parallel_for(N, MyAxpby{z, y, gamma, beta});

  auto node_dotp = Kokkos::Experimental::when_all(node_xpy, node_zpy)
                       .then_parallel_reduce(N, MyDotp{x, z}, dotp);
});

graph.submit(exec_A);

exec_A.fence();
77 changes: 77 additions & 0 deletions docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
@@ -0,0 +1,77 @@
/**
 * This is some external library function to which we pass a sender.
 * The sender might either be a regular @c Kokkos execution space instance
 * or a graph-node-sender-like object.
 * Asynchronicity within the function is either provided by the graph
 * or must be handled the regular way (creating several space instances).
 */
sender library_stuff(sender start)
{
sender auto exec_A = start;
sender auto exec_B = start;

if constexpr (!Kokkos::is_a_sender<decltype(start)>) {
  /// How do we partition?
  exec_B = Kokkos::partition_space(start, 1);
}

auto node_xpy = Kokkos::parallel_for(exec_A, policy(N), MyAxpby{x, y, alpha, beta});
auto node_zpy = Kokkos::parallel_for(exec_B, policy(N), MyAxpby{z, y, gamma, beta});

/// In the non-graph case, how do we enforce that e.g. node_zpy is done and launch
/// the parallel-reduce on the same execution space instance as node_xpy, without
/// writing any additional code?

/// No need to fence, because @c Kokkos::when_all will take care of that.
return Kokkos::parallel_reduce(
Kokkos::when_all(node_xpy, node_zpy),
policy(N),
MyDotp{x, z}, dotp
);
}

int main()
{
/// A @c Kokkos execution space instance is a context (i.e. a source
/// of asynchronous execution such as a thread pool or a GPU stream)
const Kokkos::DefaultExecutionSpace context {};

/// A scheduler is a lightweight handle to an execution context.
stdexec::scheduler auto scheduler = context.get_scheduler();

/**
* Start the chain of nodes with an "empty" node, similar to @c std::execution::schedule.
* Under the hood, it creates the @c Kokkos::Graph.
* All nodes created from this sender will share a handle to the underlying @c Kokkos::Graph.
*/
stdexec::sender auto start = Kokkos::Experimental::Graph::schedule(scheduler);

/// @c Kokkos::parallel_for would behave much like @c std::execution::bulk.
stdexec::sender auto my_work = Kokkos::Experimental::Graph::parallel_for(start, policy(N), ForFunctor{...});

/// Pass our chain to some external library function.
stdexec::sender auto subgraph = library_stuff(my_work);

/// Add some work again.
stdexec::sender auto my_other_work = Kokkos::Experimental::Graph::parallel_scan(subgraph, policy(N), ScanFunctor{...});

/// @c Kokkos::Graph has a free function for instantiating the underlying graph.
/// All nodes connected to the same handle (i.e. that are on the same chain) are notified
/// that they cannot be used as senders anymore,
/// because they are locked in an instantiated graph. In other words, the chain is a DAG, and it
/// cannot change anymore.
stdexec::sender auto executable_chain = Kokkos::Graph::instantiate(my_other_work);

/// Submission is a no-op if the passed sender is a @c Kokkos execution space instance.
/// Otherwise, it submits the underlying graph.
Kokkos::Graph::submit(scheduler, executable_chain);

stdexec::sync_wait(scheduler);

/// Submit the chain again, using another scheduler.
/// In essence, what @c Kokkos::Graph::submit can do is pretty much what
/// @c std::execution::starts_on does: it allows the sender to be executed elsewhere.
Kokkos::Graph::submit(another_scheduler, executable_chain);
}
8 changes: 8 additions & 0 deletions docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp
@@ -0,0 +1,8 @@
Kokkos::parallel_for(policy_t(exec_A, 0, N), MyAxpby{x, y, alpha, beta});
Kokkos::parallel_for(policy_t(exec_B, 0, N), MyAxpby{z, y, gamma, beta});

exec_B.fence();

Kokkos::parallel_reduce(policy_t(exec_A, 0, N), MyDotp{x, z}, dotp);

exec_A.fence();
166 changes: 166 additions & 0 deletions docs/source/API/core/Graph.old.rst
@@ -0,0 +1,166 @@
# What are the semantics of `Kokkos::Graph`?

What semantics is `Kokkos::Graph` allowed to have?

Questions:

1. Do we document the allowed semantics for which the user is covered by `Kokkos`, or do we try to enforce the semantics through object states?
2. What about the execution space instance? It seems that `submit` should allow one to be passed.
3. Multi-GPU.
4. A runtime aggregate node is still not possible, see https://github.com/kokkos/kokkos/issues/6060.
5. Missing documentation online?

`Kokkos::Graph` should support the functionality listed in https://www.olcf.ornl.gov/wp-content/uploads/2021/10/013_CUDA_Graphs.pdf, slide 4.

## Usage

How would people use `Kokkos::Graph`?

### The simplest usage I could come up with

The graph is known in advance (at compile time) and can be created in the lambda (*i.e.* not using hidden `impl` stuff).
Once created, the user expects that the graph can be re-submitted several times. The user does not want to add/remove nodes once it has been submitted for the first time (no fancy stuff).
The user does not care about streams whatsoever.

1. Create some `data` in a view, and a `functor` to act on it.
2. Create the `graph` and add a parallel-for `node` using the `functor` acting on `data`.
3. Submit the graph as much as you want.

```c++
template <typename Mem>
struct Functor
{
    Kokkos::View<int*, Mem> data;

    template <std::integral T>
    KOKKOS_FUNCTION
    void operator()(const T index) const { ... <data> ... }
};

int main()
{
    const Kokkos::View<int*, Exec> data(...);

    auto graph = Kokkos::Experimental::create_graph<Exec>([&](auto root) {
        [[maybe_unused]] const auto node = root.then_parallel_for(0, ..., Functor<Mem>{ .data = data });
    });

    graph.submit();
}
```

### More advanced usage

The graph is not known in advance and cannot be easily/prettily created in the lambda (*e.g.* the user attaches nodes dynamically depending on some complex setup like partitioning).
Once created, the user still expects that the graph can be re-submitted several times.
The user cares about streams for orchestration.

We need to use some `impl` stuff for such a case.

```c++
/**
* Create the graph.
*
* 1. Damien said there are other ways to do that w/o using Impl, but I could not find them. It seems that TestGraph.hpp only uses
* the Kokkos::Experimental::create_graph that takes a closure.
* It seems that 'construct_graph' should somehow be promoted to the public API. Is there any reason not to do so?
* 2. The execution space instance is not used until the executable graph is launched with 'cudaGraphLaunch'.
* Therefore, it's questionable whether it should be part of the Kokkos::Graph state or not (it's an Impl detail though).
*/
auto graph = Kokkos::Impl::GraphAccess::construct_graph(exec_a);
auto root = Kokkos::Impl::GraphAccess::create_root_ref(graph);

/**
* Fill the graph with nodes, according to a complex DAG topology.
* The nodes might be added conditionally (conditions might change at runtime, e.g. MPI partitioning).
*
* ROOT
* / \
* N11 N12
* | | \
* N21 N22 N23
* \ / /
* \ / /
* N31
*
* @todo Add @c if nodes. See also https://developer.nvidia.com/blog/dynamic-control-flow-in-cuda-graphs-with-conditional-nodes/.
*/
std::vector<generic_node_t> N31_predecessors;

if(condition_branch_1) // branch 1
{
auto N11 = root.then_parallel_for(...label..., ...policy..., ...body...);
auto N21 = root.then_parallel_for(...label..., ...policy..., ...body...);
N31_predecessors.push_back(N21);
}

if(condition_branch_2) // branch 2
{
auto N12 = root.then_parallel_for(...name..., ...policy..., ...body...);
auto N22 = root.then_parallel_for(...name..., ...policy..., ...body...);
auto N23 = root.then_parallel_for(...name..., ...policy..., ...body...);
N31_predecessors.push_back(N22);
N31_predecessors.push_back(N23);
}

//! This is currently impossible. See also https://github.com/kokkos/kokkos/issues/6060.
auto N31_ready = Kokkos::Experimental::when_all(N31_predecessors);
auto N31 = N31_ready.then_parallel_for(...name..., ...policy..., ...body...);

/**
* The topology of the graph has been defined.
* It now has to be instantiated.
* According to:
* - https://www.olcf.ornl.gov/wp-content/uploads/2021/10/013_CUDA_Graphs.pdf (slide 9)
* - https://developer.nvidia.com/blog/employing-cuda-graphs-in-a-dynamic-environment/
* the topology cannot change once the graph has been instantiated,
* but the nodes parameters may be updated (cudaGraphExecUpdate).
*/
graph.instantiate(...);

/**
* Launch the graph on some execution space instance.
* Re-launch onto another execution space instance.
* According to cudaGraphLaunch, a stream is allowed and it makes sense.
*
* @todo Check for @c HIP and @c SYCL.
*/
graph.submit(exec_b);
graph.submit(exec_c);
```

## What to do, prioritizing

### Promote `construct_graph` to the public API

This allows for advanced use cases that do not fit well with the current closure-based construction API.

Retrieving the root node should also be promoted to the public API.

### `Kokkos::Graph::instantiate`

**Add** `Kokkos::Graph::instantiate` to the public API.

This allows the user to control when the executable graph gets instantiated.

It can be called only once.

Adding nodes after instantiation is prohibited.

### `Kokkos::Graph::submit`

**Change** the public API to accept an execution space instance.

Note that it is simply used to order the graph launch into some work queue.

### Remove the execution space instance from `Kokkos::Graph` state

The title says it all.

### Allow dynamic aggregate node

**Add** a `Kokkos::Experimental::when_all` that allows for a vector/list of nodes to be passed.
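Taken together, the items above would allow the advanced example to be written without any `Impl` machinery. A rough sketch of how the resulting public API might look (all names and signatures here are proposals, not existing Kokkos API):

```c++
// Hypothetical promoted 'construct_graph' and root-node access.
auto graph = Kokkos::Experimental::create_graph(exec_a);
auto root  = Kokkos::Experimental::root_node(graph);

std::vector<node_t> predecessors;
predecessors.push_back(root.then_parallel_for(...name..., ...policy..., ...body...));
predecessors.push_back(root.then_parallel_for(...name..., ...policy..., ...body...));

//! Proposed 'when_all' overload taking a container of nodes.
auto ready = Kokkos::Experimental::when_all(predecessors);
auto last  = ready.then_parallel_for(...name..., ...policy..., ...body...);

graph.instantiate();   // may be called only once; topology is frozen afterwards
graph.submit(exec_b);  // launch ordered into exec_b's work queue
graph.submit(exec_c);  // re-submission on another execution space instance
```

This keeps the closure-based `create_graph` for the simple case while covering dynamic DAG construction.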

## Go further

We might want to get the design of `Kokkos::Graph` close to `std::execution` (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html).