# C API Design Document (RFC)


## Introduction

The oneCCL communication library's current API is defined in the [oneAPI
specification][ccl-spec]. However, the APIs used by similar collective
communication libraries differ from the oneCCL API; for example, see
[NCCL][nccl-spec] from Nvidia, [RCCL][rccl-spec] from AMD, and
[HCCL][hccl-spec] from Habana. This RFC asks for feedback on aligning the
oneCCL API more closely with those of other vendor libraries, since this
facilitates integration with frameworks and upstreaming to open source
projects.

One difference between oneCCL and other vendors' communication libraries is
that all the other libraries expose a C API, while oneCCL exposes a C++ API.
This is because oneCCL was designed to integrate with SYCL, which is based on
C++. One of the goals of oneCCL is to support hardware from different vendors,
such as the Intel Data Center GPU Max Series, the Intel Core and Intel Xeon
families, Intel Gaudi, and Nvidia or AMD GPUs, among others.

[ccl-spec]: https://uxlfoundation.github.io/oneAPI-spec/spec/elements/oneCCL/source/index.html
[hccl-spec]: https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/C_API.html
[nccl-spec]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api.html
[rccl-spec]: https://rocm.docs.amd.com/projects/rccl/en/latest/api-reference/api-library.html#api-library

## Proposal

The proposal is to define a C-like API that aligns with the current APIs of
other communication libraries, while introducing a few changes, as described
next:

1. Most APIs are C-based, like those of other communication libraries. C++
   data structures, such as `ccl::stream` and `ccl::comm`, are hidden behind
   handles returned to the user.

2. The API is extended with two C++ functions to support `sycl::queue`:

   - `onecclResult_t onecclCreateStream(sycl::queue, &oneccl_stream)`
   - `onecclResult_t onecclReleaseStream(oneccl_stream)`

   Once the `sycl::queue` is registered, it is hidden behind the oneCCL
   stream handle.

3. Add functions that allow users to explicitly control the lifetime of
   objects, instead of relying on C++ destructors:

   - `onecclResult_t onecclCommFinalize(comm)`
   - `onecclResult_t onecclCommDestroy(comm)`

4. Drop support for out-of-order SYCL queues. The current oneCCL library
   supports out-of-order SYCL queues, but this feature is not used by the
   users of the library. In general, collective operations are submitted to
   an in-order queue. When out-of-order behavior is required, commands are
   submitted to a different in-order queue, and the two queues are
   synchronized.

5. Drop support for SYCL buffers. Only [Unified Shared Memory][usm-example]
   is supported.

[usm-example]: https://www.intel.com/content/www/us/en/developer/articles/code-sample/dpcpp-usm-code-sample.html

### APIs

The tables below show the NCCL API, the corresponding proposed oneCCL C API,
and the current oneCCL C++ API.

#### APIs related to communicator creation

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`cudaError_t cudaSetDevice(device)`|`onecclResult_t onecclSetDevice(device)` (1)| N/A |
|`ncclResult_t ncclGetUniqueId(id)`|`onecclResult_t onecclGetUniqueId(id)`|`ccl::create_main_kvs(); ccl::create_kvs(main_addr);`|
|`ncclResult_t ncclCommInitRank(comm, size, id, rank)`|`onecclResult_t onecclCommInitRank(comm, size, id, rank)`|`comm ccl::create_communicator(size, rank, device, context, kvs)`; `comms ccl::create_communicators(size, rank, device, context, kvs)`|
|`ncclResult_t ncclCommInitRankConfig(comm, size, id, rank, attr)`|`onecclResult_t onecclCommInitRankConfig(comm, size, id, rank, attr)`|`comm ccl::create_communicator(size, rank, device, context, kvs, attr)`|
|`ncclResult_t ncclCommInitAll(comms, ndev, dev_list)`|`onecclResult_t onecclCommInitAll(comms, ndev, dev_list)`| Not currently available. Working on adding support.|
|`ncclCommSplit` | Not implemented | Not implemented |
|`ncclResult_t ncclCommFinalize(comm)`|`onecclResult_t onecclCommFinalize(comm)`| N/A |
|`ncclResult_t ncclCommDestroy(comm)`|`onecclResult_t onecclCommDestroy(comm)`| Destructor |

(1) Note that `cudaSetDevice(device)` is a CUDA call, not an NCCL call. If an
equivalent call is available in SYCL (or the calling language), the proposed
`onecclSetDevice(device)` will not be needed.

#### APIs related to Collective Communication operations

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclAllGather(sendbuff, recvbuff, count, datatype, comm, stream)`|`onecclResult_t onecclAllGather(sendbuff, recvbuff, count, datatype, comm, oneccl_stream)`|`ccl::event communicator::allgather(sendbuff, recvbuff, count, datatype, oneccl_stream, deps)` (2)|
|`ncclResult_t ncclAllReduce(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclAllReduce(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::allreduce(sendbuff, recvbuff, count, datatype, op, oneccl_stream, deps)`|
|`ncclResult_t ncclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, stream)`|`onecclResult_t onecclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, oneccl_stream)`|`ccl::event communicator::broadcast(sendbuff, recvbuff, count, datatype, root, oneccl_stream, deps)` (3)|
|`ncclResult_t ncclReduce(sendbuff, recvbuff, count, datatype, op, root, comm, stream)`|`onecclResult_t onecclReduce(sendbuff, recvbuff, count, datatype, op, root, comm, oneccl_stream)`|`ccl::event communicator::reduce(sendbuff, recvbuff, count, datatype, op, root, oneccl_stream, deps)`|
|`ncclResult_t ncclReduceScatter(sendbuff, recvbuff, count, datatype, op, comm, stream)`|`onecclResult_t onecclReduceScatter(sendbuff, recvbuff, count, datatype, op, comm, oneccl_stream)`|`ccl::event communicator::reduce_scatter(sendbuff, recvbuff, count, datatype, op, oneccl_stream, deps)`|
| N/A |`onecclAlltoall`, `onecclAlltoallv` (could be deprecated)|`communicator::alltoall`, `communicator::alltoallv`|
| N/A |`onecclBarrier` (could be deprecated in favor of an Allreduce of 1 byte)|`ccl::event communicator::barrier`|

(2) oneCCL currently provides Allgatherv, but it will be deprecated in the
future.

(3) The current Broadcast API is slightly different, but the next oneCCL
release will align it with the signature shown here.

#### Group APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclGroupStart()`|`onecclResult_t onecclGroupStart()`| N/A |
|`ncclResult_t ncclGroupEnd()` |`onecclResult_t onecclGroupEnd()` | N/A |

#### Point to Point APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclSend(sendbuff, count, datatype, peer, comm, stream)`|`onecclResult_t onecclSend(sendbuff, count, datatype, peer, comm, oneccl_stream)`|`ccl::event communicator::send(sendbuff, count, datatype, peer, oneccl_stream)`|
|`ncclResult_t ncclRecv(…)`|`onecclResult_t onecclRecv(…)`|`communicator::recv`|

#### Other APIs

| NCCL | oneCCL (proposed C) | oneCCL (current, C++) |
|-------------------|------------------------------|-------------------------|
|`ncclResult_t ncclCommCount(comm, size)`|`onecclResult_t onecclCommCount(comm, size)`|`size communicator::size()`|
|`ncclResult_t ncclCommCuDevice(comm, device)`|`onecclResult_t onecclCommGetDevice(comm, device)`|`device communicator::get_device()`|
|`ncclResult_t ncclCommUserRank(comm, rank)`|`onecclResult_t onecclCommUserRank(comm, rank)`|`rank communicator::rank()`|
|`ncclResult_t ncclGetVersion(version)`|`onecclResult_t onecclGetVersion(version)`|`version ccl::get_library_version()`|
|`ncclCommAbort` | Not implemented | N/A |
|`ncclCommGetAsyncError`| Not implemented | N/A |
|`ncclGetLastError` | Not implemented | N/A |
|`ncclGetErrorString`| Not implemented | N/A |