Introduce Completion Counters verbs#1701
Conversation
083d6ab to
189d11c
Compare
|
@jgunthorpe @rleon Any comments on this? I'd like to start pushing the implementation. |
|
I've been feeling a high reluctance lately to introduce new verbs that rely on a single HW implementations.. That has not gone well in the past. Consider using dv, but also consider the questions below if it should be a new verb: Can any existing HW implement this, or just one? |
|
As a general comment, completion counters were introduced into libfabric v1.0 (~2016), and the counter APIs are used by MPI. I think only the CXI provider has HW support, however. UEC defines mandatory support for counters as part of the AI Full and HPC profiles (UEC 2.2.5.3.7). The UET-verbs PR (which is very much out of date) only targeted the AI Base profile, so counters were excluded. I expect once UET-verbs expands to cover AI Full and HPC profiles, UEC will ask for completion counters as well. I will review the actual submission here, but my expectation is that the proposal is aligned with a libfabric implementation, which would also align with UEC. |
shefty
left a comment
There was a problem hiding this comment.
Comments are based on comparing to UEC spec requirements, libfabric API, and looking at the CXI implementation, which is the only other HW-based counter implementation I'm aware of.
| number of completed operations. This makes them well suited for applications | ||
| that need to know how many requests have completed without requiring | ||
| per-request details, such as credit based flow control or tracking responses | ||
| from remote peers. |
There was a problem hiding this comment.
libfabric and UEC call out 2 separate types of counters. One counts completed data transfer operations. The other counts the number of application bytes carried by a data transfer. For example, the number of bytes written into a RDMA write target buffer.
There was a problem hiding this comment.
Thanks for your feedback, replied in another comment below.
| *cc_attr*. On success, the returned **ibv_comp_cntr** structure contains | ||
| pointers to the completion and error count values. The maximum number of | ||
| completion counters a device supports is reported by the *max_comp_cntr* | ||
| field of **ibv_device_attr_ex**. |
There was a problem hiding this comment.
Returning direct pointers to the memory locations where the counters are stored is forcing a specific implementation. Counters could be local to the NIC and require specific commands to read. It is more generic and aligns with libfabric to provide function calls to read/write the counters.
From what I can tell looking at the CXI libfabric implementation, this is how counters work on CXI. Counters are a HW object. Reading a counter submits a command that copies the value back to host memory. A counter may also be associated with a threshold value. Once the counter reaches that value, it will also update host memory with the value.
There was a problem hiding this comment.
One of the use cases we want this API to support is direct polling of the counters from GPUs, what requires having counter memory addresses. In this case I actually expect counters to be created with external memory located in HBM so exposing addresses might be redundant.
I think we can go with your suggestion, add ibv_cntr_read verb and pave the path to kernel, then each provider can decide to optimize it by simple memory read when possible.
There was a problem hiding this comment.
One of the use cases we want this API to support is direct polling of the counters from GPUs, what requires having counter memory addresses.
Isn't this handled through dmabuf mechanism?
There was a problem hiding this comment.
Yes, right. In the scenario I described the buffer is allocated in GPU memory, exported as dmabuf which is then passed in when creating counters. GPU will directly read counter values from its memory.
For this case exposing counter VA isn't needed and it might not be available.
| in the returned **ibv_comp_cntr** will point to it. Using external memory | ||
| allows the counter values to reside in application-managed buffers or in | ||
| memory exported through DMA-BUF, enabling zero-copy observation of completion | ||
| progress by co-located processes or devices. |
There was a problem hiding this comment.
This deviates from libfabric and isn't called out by UEC. Letting the app specify where the counter should reside is forcing a very specific HW implementation. I agree with Jason here and would definitely want to have multiple NIC implementations available before adding this.
Can we simplify the API to meet more minimal requirements?
AFAICT, CXI can support an option such as this, but the buffer would be a write-back buffer, not the actual counter. This suggests that a call such as ibv_read_cntr(cntr, &value, flags) might be an option to direct the counter value to a specific memory location, versus the libfabric method of returning the value directly from the function.
There was a problem hiding this comment.
I added this to common verbs because I believe completion counters in GPU memory will be one of the main use cases for verbs completion counters. Device implementations can vary, but one of the options would be maintaining an internal counter and syncing it back to the user-supplied buffer, as you suggested CXI is doing. That said, I understand your concern that this may be difficult or impossible to implement on some devices that otherwise support the rest of the API.
We can either make IBV_COMP_CNTR_INIT_WITH_EXTERNAL_MEM support optional, or move the capability to create counters with external memory to direct verbs. @jgunthorpe what's your opinion on this?
| uint32_t flags; | ||
| struct ibv_memory_location comp_cntr_ext_mem; | ||
| struct ibv_memory_location err_cntr_ext_mem; | ||
| }; |
There was a problem hiding this comment.
libfabric attributes include an enum for events to count (transfers or bytes). UEC lists support for both as mandatory. Adding an enum here, even if only transfers are included to start with, would make this more extensible.
libfabric also supports blocking counter read calls. It's possible this might align with a verbs ibv_comp_channel (or ibv_cntr_channel?) object.
I don't see any statement in UEC regarding blocking counters (or CQs). However, since libfabric counters include blocking support, they may simply not have thought it important to call that support out separately. (I don't think MPI uses blocking counters.)
CXI implements blocking reads by setting a triggered threshold. HW writes the counter value to memory once it hits the trigger. SW in this case spins until the counter is triggered.
At least based on UEC 1.0, I would defer adding wait or trigger APIs until there's a larger discussion.
There was a problem hiding this comment.
Regarding events vs. bytes I think it's already easy to extend by adding another flag like IBV_COMP_CNTR_INIT_COUNT_BYTES.
For your comment about blocking counters, I would try keep comp_cntr and comp_channel separate in rdma-core. Upper layers (like Libfabric) can combine these constructs to implement more advanced behaviors. Anyway I think we would be able to add this support later if we see the need.
There was a problem hiding this comment.
I would still remove external memory support, particularly in the initial API submission.
I would also include the enum to select the counter type (bytes vs. completions). This isn't something that should be a flag, as only one is usable at a time. Even if the enum only has 1 option to start with, introducing it now makes compatibility much easier, both from the perspective of verbs, but also the user.
| uint32_t comp_mask; | ||
| uint32_t op_mask; | ||
| }; | ||
| ``` |
There was a problem hiding this comment.
This behavior seems aligned with libfabric, so it should work with UEC.
It depends on device implementation, EFA is going to support this in existing devices.
As @shefty mentioned Libfabric has a similar interface and UE is going to support it as well.
I think the bare minimum needed is the ability to map QP + opcode to a counter during completion handling and increment this counter. With @shefty's suggestion to add ibv_cntr_read, read can be implemented using an admin command. |
|
Updated according to suggestions above. |
| external memory allows the counter values to | ||
| reside in application-managed buffers or in memory exported through DMA-BUF, | ||
| enabling zero-copy observation of completion progress by co-located processes | ||
| or devices. |
There was a problem hiding this comment.
I still question trying to support external memory for counters as a standard API feature. This is similar to supporting external memory for a CQ. I don't see a benefit for supporting this for host memory. GPU/device memory could have a benefit, but, again, this is forcing a specific NIC implementation.
libfabric took a different approach for NIC to GPU interactions in the 1.x release (removed from 2.x because of a lack of implementation). That involved a trigger API, where once a counter hit a certain threshold, memory on the GPU would be updated. That's not the same as keeping the actual counter value on the GPU.
https://github.com/ofiwg/libfabric/blob/v1.22.x/include/rdma/fi_trigger.h
The advantage that approach had is that the trigger could come from the NIC or CPU, based on the vendor implementation. And even the counters could exist totally in SW.
I would remove external memory support and defer discussion for how to best support completions to a GPU.
| struct ibv_context *context; | ||
| uint32_t handle; | ||
| uint64_t comp_count_max_value; | ||
| uint64_t err_count_max_value; |
There was a problem hiding this comment.
I would move these to device attributes, not set per counter. An app may need to know this prior to created the counters, rather than after the fact.
| uint32_t flags; | ||
| struct ibv_memory_location comp_cntr_ext_mem; | ||
| struct ibv_memory_location err_cntr_ext_mem; | ||
| }; |
There was a problem hiding this comment.
I would still remove external memory support, particularly in the initial API submission.
I would also include the enum to select the counter type (bytes vs. completions). This isn't something that should be a flag, as only one is usable at a time. Even if the enum only has 1 option to start with, introducing it now makes compatibility much easier, both from the perspective of verbs, but also the user.
Extend verbs interface to support Completion Counters that can be seen as a light-weight alternative to polling CQ. A completion counter object separately counts successful and error completions, can be attached to multiple QPs and be configured to count completions of a subset of operation types. This is especially useful for batch or credit based workloads running on accelerators but can serve many other types of applications as well. Expose supported number of completion counters through query device extended verb. Reviewed-by: Daniel Kinsbursky <dkinsb@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com>
To commit: ?? ("RDMA/efa: Add Completion Counters support").
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Add kernel ioctl interface, provider ops, command helpers, and public API dispatch to enable providers to implement the Completion Counters verbs introduced in the previous commit. Reviewed-by: Daniel Kinsbursky <dkinsb@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Implement completion counters for the EFA provider. Each counter object has two event counters (success and error) with set, inc, and read operations. Counter values are accessed directly from userspace through mapped memory. Add efadv_create_comp_cntr() for extended creation with external memory options. Buffer memory is described to the kernel via efa_uverbs_buffer_desc struct passed as a driver-specific ioctl attribute. Three modes are supported per counter: user-supplied VA, user-supplied dmabuf, or provider-internal memory. Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Expose the Completion Counters verbs API through pyverbs. Add CompCntr class with set, set_err, inc, inc_err, read and read_err methods. Add CompCntrInitAttr and CompCntrAttachAttr helper classes. Add attach_comp_cntr method to the QP class. Add EfaCompCntr and EfaCompCntrInitAttr for efadv extended creation with external memory options. Signed-off-by: Michael Margolin <mrgolin@amazon.com>
|
@shefty I've moved external memory support to EFA direct verbs. |
ShacharKagan
left a comment
There was a problem hiding this comment.
This review covers only the pyverbs commit (pyverbs: Add Completion Counters support). I have not reviewed the libibverbs/efa changes in this PR.
| raise PyverbsRDMAErrno('Failed to create comp_cntr') | ||
| self.ctx = ctx | ||
|
|
||
| def close(self): |
There was a problem hiding this comment.
CompCntr needs a dealloc that calls close(). PyverbsCM doesn't provide one, so without it the C counter leaks whenever a CompCntr is garbage-collected outside an explicit close() / with block.
| self.comp_cntr = v.ibv_create_comp_cntr(ctx.context, &attr.attr) | ||
| if self.comp_cntr == NULL: | ||
| raise PyverbsRDMAErrno('Failed to create comp_cntr') | ||
| self.ctx = ctx |
There was a problem hiding this comment.
Register the new counter with its Context so teardown ordering works:
self.ctx = ctx
ctx.add_ref(self)
Also update pyverbs/device.pyx:
Add self.comp_cntrs = weakref.WeakSet() in Context.init.
Add an elif isinstance(obj, CompCntr): self.comp_cntrs.add(obj) branch in Context.add_ref.
|
|
||
| cdef class CompCntrInitAttr(PyverbsObject): | ||
| """Represents ibv_comp_cntr_init_attr struct.""" | ||
| def __init__(self, comp_mask=0, type=0, flags=0): |
There was a problem hiding this comment.
type shadows a Python builtin. Better to rename the parameter. Suggestion: cntr_type
|
|
||
| cdef class CompCntr(PyverbsCM): | ||
| """Represents a Completion Counter.""" | ||
| def __init__(self, Context ctx not None, CompCntrInitAttr attr not None): |
There was a problem hiding this comment.
Please add meaningful docstrings for the methods that wrap the verbs: init, close, set, set_err, inc, inc_err, read, read_err.
|
|
||
| cdef class EfaCompCntr(CompCntr): | ||
| """ | ||
| Initializes an EFA Completion Counter with EFA-specific options. |
There was a problem hiding this comment.
Class docstrings don't take :param: / :return:. Move the parameter block to init and replace the class docstring with a one-line description like "EFA-specific Completion Counter."
| """ | ||
| def __init__(self, Context ctx not None, CompCntrInitAttr attr not None, | ||
| EfaCompCntrInitAttr efa_attr=None): | ||
| if efa_attr is None: |
There was a problem hiding this comment.
EfaCompCntr.init skips super().init() (correct, since calling CompCntr.init would re-create the counter), but as a result PyverbsObject.init never runs and self.logger is never assigned. Any later use of the logger will raise AttributeError.
Suggestion: Initialize the parent chain explicitly, bypassing only CompCntr:
PyverbsCM.init(self)
| self.comp_cntr = dv.efadv_create_comp_cntr(ctx.context, &attr.attr, &efa_attr.attr, sizeof(efa_attr.attr)) | ||
| if self.comp_cntr == NULL: | ||
| raise PyverbsRDMAErrno('Failed to create EFA comp_cntr') | ||
| self.ctx = ctx |
There was a problem hiding this comment.
call ctx.add_ref(self) here once CompCntr is registered with Context (see the corresponding comment on comp_cntr.pyx)
| EfaCompCntrInitAttr efa_attr=None): | ||
| if efa_attr is None: | ||
| efa_attr = EfaCompCntrInitAttr() | ||
| self.comp_cntr = dv.efadv_create_comp_cntr(ctx.context, &attr.attr, &efa_attr.attr, sizeof(efa_attr.attr)) |
There was a problem hiding this comment.
The line is too long. Break it:
self.comp_cntr = dv.efadv_create_comp_cntr(ctx.context, &attr.attr,
&efa_attr.attr,
sizeof(efa_attr.attr))
| &attr.attr) | ||
| if rc != 0: | ||
| raise PyverbsRDMAError('Failed to attach comp cntr to QP', rc) | ||
|
|
There was a problem hiding this comment.
Track attached QPs on the counter and close them before destroying it.
A CompCntr cannot be destroyed while a QP is still attached to it — the QP must be destroyed first (as described in the man page). The patch attaches QPs to a counter but never tracks the relationship, so an explicit comp_cntr.close() while QPs are still alive will fail with EBUSY.
Extend verbs interface to support Completion Counters that can be seen as a light-weight alternative to polling CQ. A completion counter object separately counts successful and error completions, can be attached to multiple QPs and be configured to count completions of a subset of operation types. This is especially useful for batch or credit based workloads running on accelerators but can serve many other types of applications as well.
Kernel changes are here.
Thanks.