3 changes: 3 additions & 0 deletions docs/source/API/core-index.rst
@@ -37,6 +37,8 @@ API: Core
- Utility functionality part of Kokkos Core.
* - `Detection Idiom <core/Detection-Idiom.html>`__
- Used to recognize, in an SFINAE-friendly way, the validity of any C++ expression.
* - `Graph and related <core/Graph.html>`__
- Kokkos Graph abstraction.
* - `Macros <core/Macros.html>`__
- Global macros defined by Kokkos, used for architectures, general settings, etc.

@@ -60,4 +62,5 @@ API: Core
./core/Utilities
./core/Detection-Idiom
./core/Macros
./core/Graph
./core/Profiling
12 changes: 12 additions & 0 deletions docs/source/API/core/Graph.axpby.kokkos.graph.cpp
@@ -0,0 +1,12 @@
auto graph = Kokkos::Experimental::create_graph(exec_A, [&](auto root) {
  auto node_xpy = root.then_parallel_for(N, MyAxpby{x, y, alpha, beta});
  auto node_zpy = root.then_parallel_for(N, MyAxpby{z, y, gamma, beta});

  auto node_dotp = Kokkos::Experimental::when_all(node_xpy, node_zpy)
                       .then_parallel_reduce(N, MyDotp{x, z}, dotp);
});

graph.submit(exec_A);

exec_A.fence();
77 changes: 77 additions & 0 deletions docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
@@ -0,0 +1,77 @@
/**
 * This is some external library function to which we pass a sender.
 * The sender might either be a regular @c Kokkos execution space instance
 * or a graph-node-sender-like object.
 * Asynchronicity within the function is either provided by the graph
 * or must be handled the regular way (creating several space instances).
 */
sender library_stuff(sender start)
{
sender auto exec_A = start;
sender auto exec_B = start;

if constexpr (!Kokkos::is_a_sender<decltype(start)>) {
  /// How do we partition?
  exec_B = Kokkos::partition_space(start, 1);
}

auto node_xpy = Kokkos::parallel_for(exec_A, policy(N), MyAxpby{x, y, alpha, beta});
auto node_zpy = Kokkos::parallel_for(exec_B, policy(N), MyAxpby{z, y, gamma, beta});

/// In the non-graph case, how do we enforce that e.g. node_zpy is done and launch
/// the parallel-reduce on the same execution space instance as node_xpy, without
/// writing any additional code?

/// No need to fence, because @c Kokkos::when_all will take care of that.
return Kokkos::parallel_reduce(
Kokkos::when_all(node_xpy, node_zpy),
policy(N),
MyDotp{x, z}, dotp
);
}

int main()
{
/// A @c Kokkos execution space instance is a context (i.e. a source
/// of asynchronous execution such as a thread pool or a GPU stream)
const Kokkos::DefaultExecutionSpace context {};

/// A scheduler is a lightweight handle to an execution context.
stdexec::scheduler auto scheduler = context.get_scheduler();

/**
* Start the chain of nodes with an "empty" node, similar to @c std::execution::schedule.
* Under the hood, it creates the @c Kokkos::Graph.
* All nodes created from this sender will share a handle to the underlying @c Kokkos::Graph.
*/
stdexec::sender auto start = Kokkos::Experimental::Graph::schedule(scheduler);

/// @c Kokkos::parallel_for would behave much like @c std::execution::bulk.
stdexec::sender auto my_work = Kokkos::Experimental::Graph::parallel_for(start, policy(N), ForFunctor{...});

/// Pass our chain to some external library function.
stdexec::sender auto subgraph = library_stuff(my_work);

/// Add some work again.
stdexec::sender auto my_other_work = Kokkos::Experimental::Graph::parallel_scan(subgraph, policy(N), ScanFunctor{...});

/// @c Kokkos::Graph has a free function for instantiating the underlying graph.
/// All nodes connected to the same handle (i.e. that are on the same chain) are notified
/// that they cannot be used as senders anymore,
/// because they are locked in an instantiated graph. In other words, the chain is a DAG, and it
/// cannot change anymore.
stdexec::sender auto executable_chain = Kokkos::Graph::instantiate(my_other_work);

/// Submission is a no-op if the passed sender is a @c Kokkos execution space instance.
/// Otherwise, it submits the underlying graph.
Kokkos::Graph::submit(scheduler, executable_chain);

stdexec::sync_wait(scheduler);

/// Submit the chain again, using another scheduler.
/// In essence, what @c Kokkos::Graph::submit can do is pretty much what
/// @c std::execution::starts_on does: it allows the sender to be executed elsewhere.
Kokkos::Graph::submit(another_scheduler, executable_chain);
}
8 changes: 8 additions & 0 deletions docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp
@@ -0,0 +1,8 @@
Kokkos::parallel_for(policy_t(exec_A, 0, N), MyAxpby{x, y, alpha, beta});
Kokkos::parallel_for(policy_t(exec_B, 0, N), MyAxpby{z, y, gamma, beta});

exec_B.fence();

Kokkos::parallel_reduce(policy_t(exec_A, 0, N), MyDotp{x, z}, dotp);

exec_A.fence();
166 changes: 166 additions & 0 deletions docs/source/API/core/Graph.old.rst
@@ -0,0 +1,166 @@
# What are the semantics of `Kokkos::Graph`?

What semantics is `Kokkos::Graph` allowed to have?

Questions:

1. Do we document the allowed semantics for which the user is covered by `Kokkos`, or do we try to enforce the semantics through object states?
2. What about the execution space instance? It seems that `submit` should allow one to be passed.
3. Multi-GPU.
4. A runtime aggregate node is still not possible, see https://github.com/kokkos/kokkos/issues/6060.
5. Missing documentation online?

`Kokkos::Graph` should support the functionality listed in https://www.olcf.ornl.gov/wp-content/uploads/2021/10/013_CUDA_Graphs.pdf, slide 4.

## Usage

How would people use `Kokkos::Graph`?

### The simplest usage I could come up with

The graph is known in advance (at compile time) and can be created in the lambda (*i.e.* not using hidden `impl` stuff).
Once created, the user expects that the graph can be re-submitted several times. The user does not want to add/remove nodes once it has been submitted for the first time (no fancy stuff).
The user does not care about streams whatsoever.

1. Create some `data` in a view, and a `functor` to act on it.
2. Create the `graph` and add a parallel-for `node` using the `functor` acting on `data`.
3. Submit the graph as much as you want.

```c++
template <typename Mem>
struct Functor
{
    Kokkos::View<int*, Mem> data;

    template <std::integral T>
    KOKKOS_FUNCTION
    void operator()(const T index) const { ... <data> ... }
};

int main()
{
    const Kokkos::View<int*, Exec> data(...);

    auto graph = Kokkos::Experimental::create_graph<Exec>([&](auto root) {
        [[maybe_unused]] const auto node = root.then_parallel_for(0, ..., Functor<Mem>{ .data = data });
    });

    graph.submit();
}
```

### More advanced usage

The graph is not known in advance and cannot be easily/prettily created in the lambda (*e.g.* the user attaches nodes dynamically depending on some complex setup like partitioning).
Once created, the user still expects that the graph can be re-submitted several times.
The user cares about streams for orchestration.

We need to use some `impl` stuff for such a case.

```c++
/**
* Create the graph.
*
* 1. Damien said there are other ways to do that w/o using Impl, but I could not find them. It seems that TestGraph.hpp only uses
* the Kokkos::Experimental::create_graph that takes a closure.
* It seems that 'construct_graph' should somehow be promoted to the public API. Is there any reason not to do so?
* 2. The execution space instance is not used until the executable graph is launched with 'cudaGraphLaunch'.
* Therefore, it's questionable whether it should be part of the Kokkos::Graph state or not (it's an Impl detail though).
*/
auto graph = Kokkos::Impl::GraphAccess::construct_graph(exec_a);
auto root = Kokkos::Impl::GraphAccess::create_root_ref(graph);

/**
* Fill the graph with nodes, according to a complex DAG topology.
* The nodes might be added conditionally (conditions might change at runtime, e.g. MPI partitioning).
*
* ROOT
* / \
* N11 N12
* | | \
* N21 N22 N23
* \ / /
* \ / /
* N31
*
* @todo Add @c if nodes. See also https://developer.nvidia.com/blog/dynamic-control-flow-in-cuda-graphs-with-conditional-nodes/.
*/
std::vector<generic_node_t> N31_predecessors;

if(condition_branch_1) // branch 1
{
auto N11 = root.then_parallel_for(...label..., ...policy..., ...body...);
auto N21 = root.then_parallel_for(...label..., ...policy..., ...body...);
N31_predecessors.push_back(N21);
}

if(condition_branch_2) // branch 2
{
auto N12 = root.then_parallel_for(...name..., ...policy..., ...body...);
auto N22 = root.then_parallel_for(...name..., ...policy..., ...body...);
auto N23 = root.then_parallel_for(...name..., ...policy..., ...body...);
N31_predecessors.push_back(N22);
N31_predecessors.push_back(N23);
}

//! This is currently impossible. See also https://github.com/kokkos/kokkos/issues/6060.
auto N31_ready = Kokkos::Experimental::when_all(N31_predecessors);
auto N31 = N31_ready.then_parallel_for(...name..., ...policy..., ...body...);

/**
* The topology of the graph has been defined.
* It now has to be instantiated.
* According to:
* - https://www.olcf.ornl.gov/wp-content/uploads/2021/10/013_CUDA_Graphs.pdf (slide 9)
* - https://developer.nvidia.com/blog/employing-cuda-graphs-in-a-dynamic-environment/
* the topology cannot change once the graph has been instantiated,
* but the nodes parameters may be updated (cudaGraphExecUpdate).
*/
graph.instantiate(...);

/**
* Launch the graph on some execution space instance.
* Re-launch onto another execution space instance.
* According to cudaGraphLaunch, a stream is allowed and it makes sense.
*
* @todo Check for @c HIP and @c SYCL.
*/
graph.submit(exec_b);
graph.submit(exec_c);
```

## What to do, prioritizing

### Promote `construct_graph` to the public API

This allows for advanced use cases that do not fit well with the current closure-based construction API.

Retrieving the root node should also be promoted to the public API.

### `Kokkos::Graph::instantiate`

**Add** `Kokkos::Graph::instantiate` to the public API.

This allows the user to control when the executable graph gets instantiated.

It can be called only once.

Adding nodes after instantiation is prohibited.

### `Kokkos::Graph::submit`

**Change** the public API to accept an execution space instance.

Note that it is simply used to order the graph launch into some work queue.

### Remove the execution space instance from `Kokkos::Graph` state

The title says it all.

### Allow dynamic aggregate node

**Add** a `Kokkos::Experimental::when_all` that allows for a vector/list of nodes to be passed.
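Taken together, the items above would allow the advanced example to be written without any `Impl` machinery. A rough sketch of how the resulting public API might look (all names and signatures here are proposals, not existing Kokkos API):

```c++
// Hypothetical promoted 'construct_graph' and root-node access.
auto graph = Kokkos::Experimental::create_graph(exec_a);
auto root  = Kokkos::Experimental::root_node(graph);

std::vector<node_t> predecessors;
predecessors.push_back(root.then_parallel_for(...name..., ...policy..., ...body...));
predecessors.push_back(root.then_parallel_for(...name..., ...policy..., ...body...));

//! Proposed 'when_all' overload taking a container of nodes.
auto ready = Kokkos::Experimental::when_all(predecessors);
auto last  = ready.then_parallel_for(...name..., ...policy..., ...body...);

graph.instantiate();   // may be called only once; topology is frozen afterwards
graph.submit(exec_b);  // launch ordered into exec_b's work queue
graph.submit(exec_c);  // re-submission on another execution space instance
```

This keeps the closure-based `create_graph` for the simple case while covering dynamic DAG construction.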

## Go further

We might want to get the design of `Kokkos::Graph` close to `std::execution` (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html).