Commit 23c7de5 (parent ae0c89d): C API RFC

File changed: rfcs/20240806-c-api/README.md (+126 lines)
# C API Design Document (RFC)

## Introduction
The oneCCL communication library’s current API is defined in the [oneAPI
specification][ccl-spec]. However, the APIs used by similar collective
communication libraries differ from those used by oneCCL; for example, see
[NCCL][nccl-spec] from Nvidia, [RCCL][rccl-spec] from AMD, and
[HCCL][hccl-spec] from Habana. This RFC asks for feedback about aligning the
oneCCL API more closely with the other vendor libraries, since this
facilitates integration with frameworks and upstreaming to open source
projects.
One difference between oneCCL and the other vendors’ communication libraries
is that all of the others have a C API, while oneCCL has a C++ API. This is
because oneCCL was designed to integrate with SYCL, which is based on C++.
One of the goals of oneCCL is to support hardware from different vendors, such
as the Intel Data Center GPU Max Series, the Intel Core and Intel Xeon
families, Intel Gaudi, and Nvidia or AMD GPUs, among others.
[ccl-spec]: https://uxlfoundation.github.io/oneAPI-spec/spec/elements/oneCCL/source/index.html
[hccl-spec]: https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/C_API.html
[nccl-spec]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api.html
[rccl-spec]: https://rocm.docs.amd.com/projects/rccl/en/latest/api-reference/api-library.html#api-library
## Proposal

The proposal is to define a C-like API that aligns with the current APIs of
the other communication libraries, while introducing a few changes, described
next:
1. Most APIs are C-based, like those of the other communication libraries.
   C++ data structures are hidden behind handles returned to the user, such as
   `ccl::stream` and `ccl::comm`.
2. The API is extended with two C++ API functions to support `sycl::queue`:

   - `onecclResult_t onecclCreateStream(sycl::queue, &oneccl_stream)`
   - `onecclResult_t onecclReleaseStream(oneccl_stream)`

   Once the `sycl::queue` is registered, it is hidden behind the oneCCL stream
   handle.
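As an illustration of the intended flow, here is a compilable sketch of the two stream functions. The types and stub bodies are assumptions for illustration only: the real `sycl::queue` comes from a SYCL compiler, and the handle type (`onecclStream_t` here) would be defined by the eventual oneCCL header.

```cpp
#include <cassert>

// Hypothetical stand-ins: not the real SYCL or oneCCL types.
namespace sycl { class queue {}; }
enum onecclResult_t { onecclSuccess, onecclInvalidArgument };
struct onecclStreamImpl { sycl::queue* q; };
typedef onecclStreamImpl* onecclStream_t;

// Stub implementations of the two proposed C++ entry points.
onecclResult_t onecclCreateStream(sycl::queue& q, onecclStream_t* stream) {
    if (!stream) return onecclInvalidArgument;
    *stream = new onecclStreamImpl{&q};  // queue hidden behind the handle
    return onecclSuccess;
}

onecclResult_t onecclReleaseStream(onecclStream_t stream) {
    if (!stream) return onecclInvalidArgument;
    delete stream;
    return onecclSuccess;
}

// Register a queue, use the handle, release it.
bool stream_roundtrip() {
    sycl::queue q;  // would be an in-order queue in real code
    onecclStream_t stream = nullptr;
    if (onecclCreateStream(q, &stream) != onecclSuccess) return false;
    // ... pass `stream` to collective calls instead of the sycl::queue ...
    return onecclReleaseStream(stream) == onecclSuccess;
}
```

After registration, the rest of the API can stay pure C: only these two functions ever see a SYCL type.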
3. Add functions to allow users to explicitly control the lifetime of objects,
   instead of relying on the C++ destructors:

   - `onecclResult_t onecclCommFinalize(comm)`
   - `onecclResult_t onecclCommDestroy(comm)`
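A hypothetical sketch of how the explicit two-step lifetime could behave. The state machine, error code, and the rule that destroy requires a prior finalize are assumptions (loosely following NCCL's finalize/destroy split), not the actual oneCCL implementation:

```cpp
#include <cassert>

// Hypothetical stub communicator with an explicit lifetime state.
enum onecclResult_t { onecclSuccess, onecclInvalidUsage };
enum class CommState { Active, Finalized };
struct onecclComm { CommState state = CommState::Active; };
typedef onecclComm* onecclComm_t;

onecclResult_t onecclCommFinalize(onecclComm_t comm) {
    // Would wait for pending collectives; afterwards no new ops are allowed.
    comm->state = CommState::Finalized;
    return onecclSuccess;
}

onecclResult_t onecclCommDestroy(onecclComm_t comm) {
    // In this sketch, destroying an unfinalized communicator is an error.
    if (comm->state != CommState::Finalized) return onecclInvalidUsage;
    delete comm;
    return onecclSuccess;
}

// Exercise the intended order: finalize first, then destroy.
bool lifetime_demo() {
    onecclComm_t comm = new onecclComm{};
    bool bad_destroy_rejected = (onecclCommDestroy(comm) == onecclInvalidUsage);
    bool finalized = (onecclCommFinalize(comm) == onecclSuccess);
    bool destroyed = (onecclCommDestroy(comm) == onecclSuccess);
    return bad_destroy_rejected && finalized && destroyed;
}
```

The point of the split is that teardown happens at a moment the user chooses, rather than whenever a C++ destructor happens to run.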
4. Drop support for out-of-order SYCL queues. The current oneCCL library
   supports out-of-order SYCL queues, but this feature is not used by the
   users of the library. In general, the collective operations are submitted
   to an in-order queue. When out-of-order behavior is required, commands are
   submitted to a different in-order queue, and the two queues are
   synchronized.
5. Drop support for SYCL buffers. Only [Unified Shared Memory][usm-example] is
   supported.

[usm-example]: https://www.intel.com/content/www/us/en/developer/articles/code-sample/dpcpp-usm-code-sample.html
### APIs

The tables below contain the NCCL API, the corresponding newly proposed oneCCL
API, and the current oneCCL API.
#### APIs related to communicator creation
| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`cudaError_t cudaSetDevice(device)`|`onecclResult_t onecclSetDevice(device)` (1)| N/A |
|`ncclResult_t ncclGetUniqueId(id)`|`onecclResult_t onecclGetUniqueId(id)`|`ccl::create_main_kvs(); ccl::create_kvs(main_addr);`|
|`ncclResult_t ncclCommInitRank(comm, size, id, rank)`|`onecclResult_t onecclCommInitRank(comm, size, id, rank)`|`comm ccl::create_communicator(size, rank, device, context, kvs); comms ccl::create_communicators(size, rank, device, context, kvs)`|
|`ncclResult_t ncclCommInitRankConfig(comm, size, id, rank, attr)`|`onecclResult_t onecclCommInitRankConfig(comm, size, id, rank, attr)`|`comm ccl::create_communicator(size, rank, device, context, kvs, attr)`|
|`ncclResult_t ncclCommInitAll(comms, ndev, dev_list)`|`onecclResult_t onecclCommInitAll(comms, ndev, dev_list)`| Not currently available. Working on adding support.|
|`ncclCommSplit` | Not implemented | Not implemented |
|`ncclResult_t ncclCommFinalize(comm)`|`onecclResult_t onecclCommFinalize(comm)`| N/A |
|`ncclResult_t ncclCommDestroy(comm)`|`onecclResult_t onecclCommDestroy(comm)`| Destructor |
(1) Notice that `cudaSetDevice(device)` is a CUDA call, not an NCCL call. If
an equivalent call is available in SYCL (or the calling language), the
proposed `onecclSetDevice(device)` will not be needed.
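To make the proposed creation flow concrete, here is a sketch with stub implementations. The struct layouts and the constant id are invented; only the call sequence reflects the table above: obtain a unique id (on rank 0), share it out of band (e.g. via MPI or a file), then call `onecclCommInitRank` on every rank with the same id.

```cpp
#include <cstring>

// Hypothetical stubs for the proposed C API; real layouts are unspecified.
enum onecclResult_t { onecclSuccess };
struct onecclUniqueId { char data[16]; };
struct onecclComm { int rank; int size; };
typedef onecclComm* onecclComm_t;

onecclResult_t onecclGetUniqueId(onecclUniqueId* id) {
    std::memcpy(id->data, "demo-unique-id!", 16);  // invented constant id
    return onecclSuccess;
}

onecclResult_t onecclCommInitRank(onecclComm_t* comm, int size,
                                  onecclUniqueId id, int rank) {
    (void)id;  // a real library would rendezvous through the id
    *comm = new onecclComm{rank, size};
    return onecclSuccess;
}

// Simulate both ranks of a 2-rank job inside one process.
int init_demo() {
    onecclUniqueId id;
    onecclGetUniqueId(&id);  // rank 0 only; then broadcast `id` out of band
    onecclComm_t comm0 = nullptr, comm1 = nullptr;
    onecclCommInitRank(&comm0, /*size=*/2, id, /*rank=*/0);
    onecclCommInitRank(&comm1, /*size=*/2, id, /*rank=*/1);
    int total = comm0->size + comm1->size;  // both see the same world size
    delete comm0;
    delete comm1;
    return total;
}
```

This mirrors the NCCL bootstrap model and would replace the current kvs-based `ccl::create_main_kvs()`/`ccl::create_kvs(main_addr)` exchange.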
#### APIs related to Collective Communication operations
| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclAllGather(sendbuff, recvbuff, count, datatype, comm, stream)`|`onecclResult_t onecclAllgather(sendbuff, recvbuff, count, datatype, comm, oneccl_stream)`|`ccl::event communicator::allgather (2) (sendbuff, recvbuff, count, datatype, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclAllReduce(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclAllreduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::allreduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, stream)`|`onecclResult_t onecclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, oneccl_stream)`|`ccl::event communicator::broadcast (3) (sendbuff, recvbuff, count, datatype, root, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclReduce(sendbuff, recvbuff, count, datatype, op, root, comm, stream)`|`onecclResult_t onecclReduce(sendbuff, recvbuff, count, datatype, op, root, comm, oneccl_stream)`|`ccl::event communicator::reduce(sendbuff, recvbuff, count, datatype, op, root, comm, oneccl_stream, deps)`|
|`ncclResult_t ncclReduceScatter(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclReduceScatter(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::reduce_scatter(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream, deps)`|
| N/A |`onecclAlltoall`, `onecclAlltoallv` (could be deprecated)|`communicator::alltoall`, `communicator::alltoallv`|
| N/A |`onecclBarrier` (could be deprecated in favor of an Allreduce of 1 byte)|`ccl::event communicator::barrier`|
- (2) Currently oneCCL contains Allgatherv, but this will be deprecated in
  the future.
- (3) The current API is slightly different, but the next oneCCL release will
  align the Broadcast signature with the one shown here.
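As a reference for what these calls compute, here is a hypothetical host-side model of `onecclAllreduce` with a sum reduction (plain `std::vector` in place of device buffers, ranks simulated as rows): every rank ends up with the elementwise sum of all ranks' send buffers.

```cpp
#include <vector>

// Host-side model of an allreduce with op = sum. Row r of `sendbufs` is
// rank r's send buffer; the returned row r is rank r's receive buffer.
std::vector<std::vector<int>> model_allreduce_sum(
        const std::vector<std::vector<int>>& sendbufs) {
    const std::size_t count = sendbufs[0].size();
    std::vector<int> sum(count, 0);
    for (const auto& buf : sendbufs)
        for (std::size_t i = 0; i < count; ++i)
            sum[i] += buf[i];  // elementwise reduction across ranks
    // Allreduce: every rank receives the identical reduced buffer.
    return std::vector<std::vector<int>>(sendbufs.size(), sum);
}
```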
#### Group APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclGroupStart()`|`onecclResult_t onecclGroupStart()`| N/A |
|`ncclResult_t ncclGroupEnd()` |`onecclResult_t onecclGroupEnd()` | N/A |
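A hypothetical sketch of the group semantics these two functions would provide, following the NCCL model: operations submitted between `onecclGroupStart()` and `onecclGroupEnd()` are recorded and only launched as one batch at group end, which is how paired send/recv avoids deadlock. The stubs below are invented; in particular, the real `onecclGroupEnd` would return `onecclResult_t`, while this sketch returns the batch size for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical stub state: ops recorded in an open group vs. launched.
static std::vector<std::string> g_pending;
static std::vector<std::string> g_issued;
static bool g_in_group = false;

void onecclGroupStart() { g_in_group = true; }

void onecclSend(int peer) {
    std::string op = "send->" + std::to_string(peer);
    if (g_in_group) g_pending.push_back(op); else g_issued.push_back(op);
}

void onecclRecv(int peer) {
    std::string op = "recv<-" + std::to_string(peer);
    if (g_in_group) g_pending.push_back(op); else g_issued.push_back(op);
}

// Launch everything recorded since onecclGroupStart() as one batch.
// (A real API would return onecclResult_t; the batch size is returned
// here only so the behavior is observable.)
std::size_t onecclGroupEnd() {
    g_issued.insert(g_issued.end(), g_pending.begin(), g_pending.end());
    std::size_t n = g_pending.size();
    g_pending.clear();
    g_in_group = false;
    return n;
}

// A send and a matching recv are deferred, then issued together.
std::size_t grouped_ops_demo() {
    onecclGroupStart();
    onecclSend(1);        // recorded, not yet issued
    onecclRecv(1);        // recorded, not yet issued
    return onecclGroupEnd();
}
```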
#### Point-to-Point APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclSend(sendbuf, count, datatype, peer, comm, stream)`|`onecclResult_t onecclSend(sendbuf, count, datatype, peer, comm, oneccl_stream)`|`ccl::event communicator::send(sendbuf, count, datatype, peer, comm, oneccl_stream)`|
|`ncclResult_t ncclRecv(…)`|`onecclResult_t onecclRecv(…)`|`communicator::recv`|
#### Other APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclCommCount(comm, size)`|`onecclResult_t onecclCommCount(comm, size)`|`size communicator::size()`|
|`ncclResult_t ncclCommCuDevice(comm, device)`|`onecclResult_t onecclCommGetDevice(comm, device)`|`device communicator::get_device()`|
|`ncclResult_t ncclCommUserRank(comm, rank)`|`onecclResult_t onecclCommUserRank(comm, rank)`|`rank communicator::rank()`|
|`ncclResult_t ncclGetVersion(version)`|`onecclResult_t onecclGetVersion(version)`|`version ccl::get_library_version()`|
|`ncclCommAbort` | Not implemented | N/A |
|`ncclCommGetAsyncError`| Not implemented | N/A |
|`ncclGetLastError` | Not implemented | N/A |
|`ncclGetErrorString`| Not implemented | N/A |
