Persistent Collectives X (PCX) is a collective communication library aimed at providing high-performance, low-latency persistent collectives over RDMA devices, while reducing CPU/GPU utilization to near zero.
Note: PCX is currently in a PoC stage.
PCX is a communication library that provides standard collective communication operations for CPUs and GPUs using RDMA over InfiniBand Verbs. PCX supports communication for a single multi-threaded process as well as for multi-process applications running within a single node or distributed across an arbitrary number of nodes.
PCX is targeted at accelerating distributed deep learning training and any other HPC application that requires high-performance communication.
Currently, PCX supports only the all-reduce operation. Other operations, such as all-gather, reduce, broadcast, and reduce-scatter, will be supported in the future.
Two algorithms, Ring and Recursive Doubling, implement the all-reduce collective operation, and the user can choose either one of them.
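For reference, an all-reduce leaves every rank with the same reduced (for example, summed) result computed over all ranks' input buffers. The self-contained sketch below only illustrates that semantics on a single process; it does not use PCX:

```cpp
#include <cstdio>
#include <vector>

// Illustration of sum all-reduce semantics: three "ranks" each contribute a
// buffer, and afterwards every rank holds the element-wise sum of all buffers.
int main() {
  std::vector<std::vector<int>> inputs = {{1, 2}, {3, 4}, {5, 6}};  // one buffer per rank
  std::vector<int> result(inputs[0].size(), 0);
  for (const auto &in : inputs)
    for (size_t i = 0; i < in.size(); ++i)
      result[i] += in[i];
  std::printf("%d %d\n", result[0], result[1]);  // prints "9 12" -- the value every rank would hold
  return 0;
}
```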
PCX takes advantage of the persistent nature of the communication and uses several technologies, including RDMA, CORE-Direct, NIC calculation capabilities, GPU-Direct RDMA (also known as PeerDirect), and network device memory (when available), in combination with an (almost) complete software bypass of the data path.
PCX provides an easy-to-use, high-level C++ interface to the built-in algorithms for the collective operations.
In addition, PCX can be used as a low-level library by anyone who wants to implement custom collective algorithms from the low-level "building blocks" that PCX exposes.
PCX requires Mellanox's ConnectX-5 (or later) NICs.
The basic requirements for PCX are:
- MLNX_OFED 4.5
 
In order to take full advantage of the library, PCX also requires:
- GPU-Direct RDMA (in case the data for the collective operation resides on the GPU)
 
Note: There is work in progress to remove the dependency on MLNX_OFED 4.5 and use only rdma-core.
The following example shows how Gloo can use PCX for the all-reduce operation (using the Ring algorithm):
#include "third-party/pcx/pcx_allreduce_alg_chunked_ring.h"

#include "gloo/algorithm.h"
#include "gloo/context.h"

#include <memory>
#include <vector>

namespace gloo {

// A Gloo Algorithm wrapper that delegates the all-reduce to PCX's
// chunked-ring implementation.
template <typename T>
class PcxAllreduceRing : public Algorithm {
public:
  PcxAllreduceRing(
      const std::shared_ptr<Context> &context,
      const std::vector<T *> &ptrs,
      const int count,
      const ReductionFunction<T> *fn = ReductionFunction<T>::sum)
      : Algorithm(context),
        pcxAlg(contextSize_,                 // number of ranks in the communicator
               contextRank_,                 // rank of this process
               ptrs,                         // input/output buffers
               count,                        // number of elements per buffer
               this->context_->nextSlot(),   // communication slot (tag) from the Gloo context
               this->context_->nextSlot(),   // second communication slot (tag)
               (void *)&(this->context_)) {} // opaque pointer to the Gloo context

  // Executes the all-reduce.
  void run() {
    pcxAlg.run();
  }

private:
  PcxAllreduceChunkedRing<T> pcxAlg;
};

} // namespace gloo
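As a rough usage sketch (not taken from the PCX repository), the wrapper above could be driven as follows, assuming `context` is an already-connected `std::shared_ptr<gloo::Context>` obtained through Gloo's usual rendezvous flow:

```cpp
// Hypothetical usage of the PcxAllreduceRing wrapper defined above.
// `context` is assumed to be a fully connected std::shared_ptr<gloo::Context>
// created through Gloo's rendezvous mechanism (not shown here).
const int count = 1024;
std::vector<float> buffer(count, 1.0f);    // local input/output data
std::vector<float *> ptrs{buffer.data()};  // the wrapper takes a vector of buffer pointers

gloo::PcxAllreduceRing<float> allreduce(context, ptrs, count);
allreduce.run();  // on return, buffer holds the element-wise reduction across all ranks
```

The reduction defaults to `ReductionFunction<T>::sum`, as declared in the wrapper's constructor.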
Examples of collectives implemented with the low-level API of PCX can be seen here: