Skip to content

Introduce Completion Counters verbs#1701

Open
mrgolin wants to merge 5 commits into
linux-rdma:masterfrom
amzn:comp-counters
Open

Introduce Completion Counters verbs#1701
mrgolin wants to merge 5 commits into
linux-rdma:masterfrom
amzn:comp-counters

Conversation

@mrgolin
Copy link
Copy Markdown
Contributor

@mrgolin mrgolin commented Feb 16, 2026

Extend verbs interface to support Completion Counters that can be seen as a light-weight alternative to polling CQ. A completion counter object separately counts successful and error completions, can be attached to multiple QPs and be configured to count completions of a subset of operation types. This is especially useful for batch or credit based workloads running on accelerators but can serve many other types of applications as well.

Kernel changes are here.

Thanks.

@mrgolin
Copy link
Copy Markdown
Contributor Author

mrgolin commented Mar 3, 2026

@jgunthorpe @rleon Any comments on this? I'd like to start pushing the implementation.

@jgunthorpe
Copy link
Copy Markdown
Member

I've been feeling a high reluctance lately to introduce new verbs that rely on a single HW implementations.. That has not gone well in the past. Consider using dv, but also consider the questions below if it should be a new verb:

Can any existing HW implement this, or just one?
Are there any prior-art standards that already implement this idea, like UE or something?
What does hardware actually have to do to implement it?

@shefty
Copy link
Copy Markdown
Contributor

shefty commented Mar 13, 2026

As a general comment, completion counters were introduced into libfabric v1.0 (~2016), and the counter APIs are used by MPI. I think only the CXI provider has HW support, however.

UEC defines mandatory support for counters as part of the AI Full and HPC profiles (UEC 2.2.5.3.7). The UET-verbs PR (which is very much out of date) only targeted the AI Base profile, so counters were excluded. I expect once UET-verbs expands to cover AI Full and HPC profiles, UEC will ask for completion counters as well.

I will review the actual submission here, but my expectation is that the proposal is aligned with a libfabric implementation, which would also align with UEC.

@shijin-aws
Copy link
Copy Markdown
Contributor

shijin-aws commented Mar 13, 2026

@shefty Yes, the proposal is aimed to be aligned with Libfabric semantic after discussion between @mrgolin and us
libfabric folks in aws

Copy link
Copy Markdown
Contributor

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are based on comparing to UEC spec requirements, libfabric API, and looking at the CXI implementation, which is the only other HW-based counter implementation I'm aware of.

number of completed operations. This makes them well suited for applications
that need to know how many requests have completed without requiring
per-request details, such as credit based flow control or tracking responses
from remote peers.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libfabric and UEC call out 2 separate types of counters. One counts completed data transfer operations. The other counts the number of application bytes carried by a data transfer. For example, the number of bytes written into a RDMA write target buffer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback, replied in another comment below.

*cc_attr*. On success, the returned **ibv_comp_cntr** structure contains
pointers to the completion and error count values. The maximum number of
completion counters a device supports is reported by the *max_comp_cntr*
field of **ibv_device_attr_ex**.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning direct pointers to the memory locations where the counters are stored is forcing a specific implementation. Counters could be local to the NIC and require specific commands to read. It is more generic and aligns with libfabric to provide function calls to read/write the counters.

From what I can tell looking at the CXI libfabric implementation, this is how counters work on CXI. Counters are a HW object. Reading a counter submits a command that copies the value back to host memory. A counter may also be associated with a threshold value. Once the counter reaches that value, it will also update host memory with the value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the use cases we want this API to support is direct polling of the counters from GPUs, what requires having counter memory addresses. In this case I actually expect counters to be created with external memory located in HBM so exposing addresses might be redundant.
I think we can go with your suggestion, add ibv_cntr_read verb and pave the path to kernel, then each provider can decide to optimize it by simple memory read when possible.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the use cases we want this API to support is direct polling of the counters from GPUs, what requires having counter memory addresses.

Isn't this handled through dmabuf mechanism?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, right. In the scenario I described the buffer is allocated in GPU memory, exported as dmabuf which is then passed in when creating counters. GPU will directly read counter values from its memory.
For this case exposing counter VA isn't needed and it might not be available.

in the returned **ibv_comp_cntr** will point to it. Using external memory
allows the counter values to reside in application-managed buffers or in
memory exported through DMA-BUF, enabling zero-copy observation of completion
progress by co-located processes or devices.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deviates from libfabric and isn't called out by UEC. Letting the app specify where the counter should reside is forcing a very specific HW implementation. I agree with Jason here and would definitely want to have multiple NIC implementations available before adding this.

Can we simplify the API to meet more minimal requirements?

AFAICT, CXI can support an option such as this, but the buffer would be a write-back buffer, not the actual counter. This suggests that a call such as ibv_read_cntr(cntr, &value, flags) might be an option to direct the counter value to a specific memory location, versus the libfabric method of returning the value directly from the function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to common verbs because I believe completion counters in GPU memory will be one of the main use cases for verbs completion counters. Device implementations can vary, but one of the options would be maintaining an internal counter and syncing it back to the user-supplied buffer, as you suggested CXI is doing. That said, I understand your concern that this may be difficult or impossible to implement on some devices that otherwise support the rest of the API.

We can either make IBV_COMP_CNTR_INIT_WITH_EXTERNAL_MEM support optional, or move the capability to create counters with external memory to direct verbs. @jgunthorpe what's your opinion on this?

uint32_t flags;
struct ibv_memory_location comp_cntr_ext_mem;
struct ibv_memory_location err_cntr_ext_mem;
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libfabric attributes include an enum for events to count (transfers or bytes). UEC lists support for both as mandatory. Adding an enum here, even if only transfers are included to start with, would make this more extensible.

libfabric also supports blocking counter read calls. It's possible this might align with a verbs ibv_comp_channel (or ibv_cntr_channel?) object.

I don't see any statement in UEC regarding blocking counters (or CQs). However, since libfabric counters include blocking support, they may simply not have thought it important to call that support out separately. (I don't think MPI uses blocking counters.)

CXI implements blocking reads by setting a triggered threshold. HW writes the counter value to memory once it hits the trigger. SW in this case spins until the counter is triggered.

At least based on UEC 1.0, I would defer adding wait or trigger APIs until there's a larger discussion.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding events vs. bytes I think it's already easy to extend by adding another flag like IBV_COMP_CNTR_INIT_COUNT_BYTES.

For your comment about blocking counters, I would try keep comp_cntr and comp_channel separate in rdma-core. Upper layers (like Libfabric) can combine these constructs to implement more advanced behaviors. Anyway I think we would be able to add this support later if we see the need.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still remove external memory support, particularly in the initial API submission.

I would also include the enum to select the counter type (bytes vs. completions). This isn't something that should be a flag, as only one is usable at a time. Even if the enum only has 1 option to start with, introducing it now makes compatibility much easier, both from the perspective of verbs, but also the user.

uint32_t comp_mask;
uint32_t op_mask;
};
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior seems aligned with libfabric, so it should work with UEC.

@mrgolin
Copy link
Copy Markdown
Contributor Author

mrgolin commented Mar 18, 2026

@jgunthorpe

Can any existing HW implement this, or just one?

It depends on device implementation, EFA is going to support this in existing devices.

Are there any prior-art standards that already implement this idea, like UE or something?

As @shefty mentioned Libfabric has a similar interface and UE is going to support it as well.

What does hardware actually have to do to implement it?

I think the bare minimum needed is the ability to map QP + opcode to a counter during completion handling and increment this counter. With @shefty's suggestion to add ibv_cntr_read, read can be implemented using an admin command.

@mrgolin
Copy link
Copy Markdown
Contributor Author

mrgolin commented Mar 22, 2026

Updated according to suggestions above.
Looks there is some issue with the pipeline.

external memory allows the counter values to
reside in application-managed buffers or in memory exported through DMA-BUF,
enabling zero-copy observation of completion progress by co-located processes
or devices.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still question trying to support external memory for counters as a standard API feature. This is similar to supporting external memory for a CQ. I don't see a benefit for supporting this for host memory. GPU/device memory could have a benefit, but, again, this is forcing a specific NIC implementation.

libfabric took a different approach for NIC to GPU interactions in the 1.x release (removed from 2.x because of a lack of implementation). That involved a trigger API, where once a counter hit a certain threshold, memory on the GPU would be updated. That's not the same as keeping the actual counter value on the GPU.

https://github.com/ofiwg/libfabric/blob/v1.22.x/include/rdma/fi_trigger.h

The advantage that approach had is that the trigger could come from the NIC or CPU, based on the vendor implementation. And even the counters could exist totally in SW.

I would remove external memory support and defer discussion for how to best support completions to a GPU.

struct ibv_context *context;
uint32_t handle;
uint64_t comp_count_max_value;
uint64_t err_count_max_value;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move these to device attributes, not set per counter. An app may need to know this prior to created the counters, rather than after the fact.

uint32_t flags;
struct ibv_memory_location comp_cntr_ext_mem;
struct ibv_memory_location err_cntr_ext_mem;
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still remove external memory support, particularly in the initial API submission.

I would also include the enum to select the counter type (bytes vs. completions). This isn't something that should be a flag, as only one is usable at a time. Even if the enum only has 1 option to start with, introducing it now makes compatibility much easier, both from the perspective of verbs, but also the user.

mrgolin added 5 commits May 26, 2026 11:57
Extend verbs interface to support Completion Counters that can be seen
as a light-weight alternative to polling CQ. A completion counter object
separately counts successful and error completions, can be attached to
multiple QPs and be configured to count completions of a subset of
operation types. This is especially useful for batch or credit based
workloads running on accelerators but can serve many other types of
applications as well.

Expose supported number of completion counters through query device
extended verb.

Reviewed-by: Daniel Kinsbursky <dkinsb@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
To commit: ?? ("RDMA/efa: Add Completion Counters support").

Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Add kernel ioctl interface, provider ops, command helpers, and public
API dispatch to enable providers to implement the Completion Counters
verbs introduced in the previous commit.

Reviewed-by: Daniel Kinsbursky <dkinsb@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Implement completion counters for the EFA provider. Each counter
object has two event counters (success and error) with set, inc, and
read operations. Counter values are accessed directly from userspace
through mapped memory.

Add efadv_create_comp_cntr() for extended creation with external
memory options. Buffer memory is described to the kernel via
efa_uverbs_buffer_desc struct passed as a driver-specific ioctl
attribute. Three modes are supported per counter: user-supplied VA,
user-supplied dmabuf, or provider-internal memory.

Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Expose the Completion Counters verbs API through pyverbs.

Add CompCntr class with set, set_err, inc, inc_err, read and read_err
methods. Add CompCntrInitAttr and CompCntrAttachAttr helper classes.
Add attach_comp_cntr method to the QP class.

Add EfaCompCntr and EfaCompCntrInitAttr for efadv extended creation
with external memory options.

Signed-off-by: Michael Margolin <mrgolin@amazon.com>
@mrgolin mrgolin changed the title [RFC] Introduce Completion Counters verbs Introduce Completion Counters verbs May 26, 2026
@mrgolin
Copy link
Copy Markdown
Contributor Author

mrgolin commented May 26, 2026

@shefty I've moved external memory support to EFA direct verbs.

Copy link
Copy Markdown
Contributor

@ShacharKagan ShacharKagan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review covers only the pyverbs commit (pyverbs: Add Completion Counters support). I have not reviewed the libibverbs/efa changes in this PR.

Comment thread pyverbs/comp_cntr.pyx
raise PyverbsRDMAErrno('Failed to create comp_cntr')
self.ctx = ctx

def close(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompCntr needs a dealloc that calls close(). PyverbsCM doesn't provide one, so without it the C counter leaks whenever a CompCntr is garbage-collected outside an explicit close() / with block.

Comment thread pyverbs/comp_cntr.pyx
self.comp_cntr = v.ibv_create_comp_cntr(ctx.context, &attr.attr)
if self.comp_cntr == NULL:
raise PyverbsRDMAErrno('Failed to create comp_cntr')
self.ctx = ctx
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Register the new counter with its Context so teardown ordering works:
self.ctx = ctx
ctx.add_ref(self)

Also update pyverbs/device.pyx:
Add self.comp_cntrs = weakref.WeakSet() in Context.init.
Add an elif isinstance(obj, CompCntr): self.comp_cntrs.add(obj) branch in Context.add_ref.

Comment thread pyverbs/comp_cntr.pyx

cdef class CompCntrInitAttr(PyverbsObject):
"""Represents ibv_comp_cntr_init_attr struct."""
def __init__(self, comp_mask=0, type=0, flags=0):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type shadows a Python builtin. Better to rename the parameter. Suggestion: cntr_type

Comment thread pyverbs/comp_cntr.pyx

cdef class CompCntr(PyverbsCM):
"""Represents a Completion Counter."""
def __init__(self, Context ctx not None, CompCntrInitAttr attr not None):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add meaningful docstrings for the methods that wrap the verbs: init, close, set, set_err, inc, inc_err, read, read_err.


cdef class EfaCompCntr(CompCntr):
"""
Initializes an EFA Completion Counter with EFA-specific options.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class docstrings don't take :param: / :return:. Move the parameter block to init and replace the class docstring with a one-line description like "EFA-specific Completion Counter."

"""
def __init__(self, Context ctx not None, CompCntrInitAttr attr not None,
EfaCompCntrInitAttr efa_attr=None):
if efa_attr is None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EfaCompCntr.init skips super().init() (correct, since calling CompCntr.init would re-create the counter), but as a result PyverbsObject.init never runs and self.logger is never assigned. Any later use of the logger will raise AttributeError.

Suggestion: Initialize the parent chain explicitly, bypassing only CompCntr:
PyverbsCM.init(self)

self.comp_cntr = dv.efadv_create_comp_cntr(ctx.context, &attr.attr, &efa_attr.attr, sizeof(efa_attr.attr))
if self.comp_cntr == NULL:
raise PyverbsRDMAErrno('Failed to create EFA comp_cntr')
self.ctx = ctx
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call ctx.add_ref(self) here once CompCntr is registered with Context (see the corresponding comment on comp_cntr.pyx)

EfaCompCntrInitAttr efa_attr=None):
if efa_attr is None:
efa_attr = EfaCompCntrInitAttr()
self.comp_cntr = dv.efadv_create_comp_cntr(ctx.context, &attr.attr, &efa_attr.attr, sizeof(efa_attr.attr))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line is too long. Break it:
self.comp_cntr = dv.efadv_create_comp_cntr(ctx.context, &attr.attr,
&efa_attr.attr,
sizeof(efa_attr.attr))

Comment thread pyverbs/qp.pyx
&attr.attr)
if rc != 0:
raise PyverbsRDMAError('Failed to attach comp cntr to QP', rc)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Track attached QPs on the counter and close them before destroying it.
A CompCntr cannot be destroyed while a QP is still attached to it — the QP must be destroyed first (as described in the man page). The patch attaches QPs to a counter but never tracks the relationship, so an explicit comp_cntr.close() while QPs are still alive will fail with EBUSY.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants