Commit 6020f70

Merge pull request #126 from amd-vserbu/perf/pcie-transfer-performance
Pcie transfer performance
2 parents 1483dc5 + 5bcf449 commit 6020f70

56 files changed

Lines changed: 7752 additions & 1130 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -85,3 +85,6 @@ driver/kcompat/.scratch/
8585

8686
# Python test coverage
8787
.coverage
88+
89+
# Project-local scratch space
90+
/tmp/

docs/reference/kernel-abi/index.rst

Lines changed: 212 additions & 35 deletions
Large diffs are not rendered by default.

docs/reference/smi/commands.rst

Lines changed: 152 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -151,11 +151,27 @@ validate
151151
--------
152152

153153
Run memory integrity and bandwidth tests against a board's HBM and DDR
154-
subsystems.
154+
subsystems. For each memory path, bandwidth is reported as single-direction
155+
C2H read, single-direction H2C write, and simultaneous bidirectional
156+
throughput (read, write, and total). After the per-memory phases, a final
157+
parallel phase drives HBM and DDR simultaneously with ``2 * N`` buffers for
158+
single-direction tests and ``4 * N`` threads for bidirectional tests; this
159+
phase is skipped when ``--ddr-only`` or ``--hbm-only`` is given.
155160

156161
.. code-block:: text
157162
158-
v80-smi validate -d <BDF> [-j|--threads <N>]
163+
v80-smi validate -d <BDF> [-j|--threads <N>] [-R|--no-reset] [--mm-channel <spec>] [--buffer-size <size>] [--offset <size>] [--starting-offset <size>] [--raw-transfer-test | --use-qdma-driver] [--ddr-only | --hbm-only] [--channel-allocation <auto|paired>] [--channel-region-stride <size>] [--ring-size-index <0-15>] [--bandwidth-iterations <N>] [--bandwidth-duration <seconds>]
164+
165+
Requirements by mode:
166+
167+
* Default mode uses VRTD buffers, requires a running VRTD daemon, and resets
168+
the board unless ``--no-reset`` is given.
169+
* ``--raw-transfer-test`` bypasses VRTD for transfers and requires the SLASH
170+
QDMA driver device node for the board. It skips reset.
171+
* ``--use-qdma-driver`` bypasses both VRTD and SLASH for transfers and requires
172+
the stock ``qdma-pf`` driver to be bound to the board's QDMA PF. This backend
173+
is built only when ``SMI_ENABLE_QDMA_DRIVER_BACKEND`` is enabled at CMake
174+
configure time.
159175

160176
.. option:: -d, --device <BDF>
161177

@@ -164,6 +180,140 @@ subsystems.
164180
.. option:: -j, --threads <N>
165181

166182
Number of parallel buffers/threads for the validation test (1–64, default 8).
183+
Bidirectional phases use ``2 * N`` logical positions in each enabled memory
184+
space.
185+
186+
.. option:: --buffer-size <size>
187+
188+
Size of each test buffer. Values may be bare bytes or use ``k``/``K`` or
189+
``m``/``M`` suffixes. The default and maximum are ``512M``. Values must be
190+
4 KiB-aligned.
191+
192+
.. option:: --offset <size>
193+
194+
Distance between logical buffer positions. The default is ``512M``. Values
195+
may be bare bytes or use ``k``/``K`` or ``m``/``M`` suffixes, must be
196+
4 KiB-aligned, and must be at least ``--buffer-size`` so buffers do not
197+
overlap.
198+
199+
.. option:: --starting-offset <size>
200+
201+
Offset from each memory-space base for logical position 0. The default is
202+
``0``. Values may be bare bytes or use ``k``/``K`` or ``m``/``M`` suffixes
203+
and must be 4 KiB-aligned.
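
The size syntax shared by ``--buffer-size``, ``--offset``, and ``--starting-offset`` can be modelled with a short Python sketch. This is an illustration, not the smi parser: ``parse_size`` is a hypothetical helper, and it assumes that ``k`` and ``m`` are binary multiples (1024 and 1024 * 1024), which the 4 KiB alignment rule suggests but the text does not state outright.

```python
ALIGNMENT = 4096  # 4 KiB alignment required by the size options above

# Assumption: suffixes are binary multiples; the docs do not say so explicitly.
MULTIPLIERS = {"k": 1024, "m": 1024 ** 2}


def parse_size(text, max_bytes=None):
    """Parse a size such as '4096', '64k', or '512M' into a byte count."""
    digits = text.strip()
    multiplier = 1
    # Suffixes are case-insensitive: k/K and m/M.
    if digits and digits[-1].lower() in MULTIPLIERS:
        multiplier = MULTIPLIERS[digits[-1].lower()]
        digits = digits[:-1]
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"invalid size: {text!r}")
    value = int(digits) * multiplier
    if value % ALIGNMENT != 0:
        raise ValueError(f"size {text!r} is not 4 KiB-aligned")
    if max_bytes is not None and value > max_bytes:
        raise ValueError(f"size {text!r} exceeds the maximum of {max_bytes} bytes")
    return value
```

For instance, ``parse_size("512M", max_bytes=512 * 1024 ** 2)`` models the ``--buffer-size`` default together with its maximum, while ``parse_size("4097")`` is rejected for alignment.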
204+
205+
Buffers are placed at ``memory_base + starting_offset + position * offset``.
206+
Single-direction phases use positions ``0..N-1``. Bidirectional phases use
207+
positions ``0..2N-1`` with reads on even positions and writes on odd positions.
208+
The full range must remain inside the 64 x 512 MiB (32 GiB) DDR/HBM address space. If any
209+
placement option is specified in default VRTD mode, ``validate`` uses raw VRTD
210+
buffers so the exact addresses are honored; this requires raw memory access
211+
permission.
212+
213+
The largest phase maps up to ``4 * N * buffer-size`` of host buffers when both
214+
HBM and DDR are enabled, or ``2 * N * buffer-size`` with ``--ddr-only`` or
215+
``--hbm-only``; the command fails early if that exceeds currently available
216+
host memory.
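
The placement rule and the host-memory bound can be worked through in a small Python sketch. The function names are hypothetical, and the sketch assumes the per-memory address space is 64 x 512 MiB (reading the size quoted above in binary units); it only mirrors the arithmetic described in this section, not the actual smi implementation.

```python
SLOT = 512 * 1024 ** 2      # 512 MiB
ADDRESS_SPACE = 64 * SLOT   # assumed per-memory address space (64 slots)


def plan_positions(n_threads, buffer_size, offset, starting_offset=0,
                   bidirectional=False, memory_base=0):
    """Return (position, address, direction) tuples for one phase."""
    if offset < buffer_size:
        raise ValueError("--offset must be at least --buffer-size")
    # Single-direction phases use 0..N-1, bidirectional phases 0..2N-1.
    count = 2 * n_threads if bidirectional else n_threads
    end = starting_offset + (count - 1) * offset + buffer_size
    if end > ADDRESS_SPACE:
        raise ValueError("placement runs past the memory address space")
    plan = []
    for position in range(count):
        address = memory_base + starting_offset + position * offset
        direction = None
        if bidirectional:
            # Reads on even positions, writes on odd positions.
            direction = "read" if position % 2 == 0 else "write"
        plan.append((position, address, direction))
    return plan


def peak_host_bytes(n_threads, buffer_size, both_memories=True):
    """Largest host mapping: 4*N*size with HBM and DDR, else 2*N*size."""
    return (4 if both_memories else 2) * n_threads * buffer_size
```

With the defaults (``-j 8``, 512M buffers), ``peak_host_bytes(8, SLOT)`` comes to 16 GiB of host buffers. Under the same assumptions, ``-j 64`` with the default 512M offset fits a single-direction phase exactly but overruns the address space in a bidirectional phase.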
217+
218+
.. option:: -R, --no-reset
219+
220+
Skip the device reset step before running memory tests.
221+
222+
.. option:: --mm-channel <spec>
223+
224+
AXI-MM / NoC channel selection for each buffer's QDMA queue pair, in every
225+
mode. ``spec`` is either a single value applied to all buffers, or a
226+
comma-separated list giving one channel per logical buffer position
227+
(exactly ``2 x --threads`` entries; there is no repeating/wrap, and any
228+
other length is an error):
229+
230+
* ``auto`` (the default) lets the driver stripe queues across both channels
231+
by ``qid & 1``.
232+
* ``0`` / ``1`` pin the queue to that AXI-MM channel (and hence NoC channel).
233+
* For example, with ``-j 1`` the list ``0,1`` puts buffer position 0 on channel 0 and
234+
position 1 on channel 1. Bidirectional phases use positions ``0..2N-1``;
235+
single-direction phases use the first ``N`` entries.
236+
237+
This is independent of ``--channel-allocation`` (which controls the device
238+
address): ``--mm-channel`` controls the host-side NoC ingress (NMU) per
239+
queue. With ``--use-qdma-driver`` the selection maps to the stock driver's
240+
per-queue MM-channel attribute.
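
The length rule for ``--mm-channel`` lists is easy to get wrong, so here is a hedged Python sketch of how a spec expands to per-position channels. ``parse_mm_channel`` is a hypothetical helper; whether ``auto`` is accepted as an entry inside a comma-separated list is an assumption made here, not something the text confirms.

```python
VALID = ("auto", "0", "1")


def parse_mm_channel(spec, n_threads):
    """Expand a --mm-channel spec to one entry per logical position (2 * N)."""
    positions = 2 * n_threads
    entries = [entry.strip() for entry in spec.split(",")]
    for entry in entries:
        # Assumption: 'auto' is also allowed as a list entry.
        if entry not in VALID:
            raise ValueError(f"invalid channel {entry!r}")
    if len(entries) == 1:
        # A single value applies to every buffer position.
        return entries * positions
    if len(entries) != positions:
        # No repeating or wrapping: any other length is an error.
        raise ValueError(f"expected {positions} entries, got {len(entries)}")
    return entries


def single_direction_channels(channels, n_threads):
    """Single-direction phases use only the first N entries."""
    return channels[:n_threads]
```

For example, ``parse_mm_channel("0,1", 1)`` yields ``["0", "1"]``, while the same list with ``-j 2`` is rejected because four entries are required.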
241+
242+
.. option:: --raw-transfer-test
243+
244+
Use libslash raw QDMA transfers instead of VRTD buffers. This mode implies
245+
``--no-reset`` and requires the SLASH QDMA driver device to be present.
246+
247+
.. option:: --use-qdma-driver
248+
249+
Run the raw transfer test over the off-the-shelf Xilinx QDMA driver
250+
(``/dev/qdma<idx>-MM-<qid>``) instead of SLASH. smi provisions the queues
251+
itself: it raises the function's ``qmax`` via sysfs if needed, creates and
252+
starts bidirectional AXI-MM queue pairs over generic netlink (the same
253+
``xnl_pf`` interface ``dma-ctl`` uses), then transfers over the per-queue
254+
char devices. Queue pairs are spread round-robin across the function's MM
255+
engine channels (``channel = qid % mm_channel_max``); the CPM5 QDMA on the
256+
V80 exposes two, so the test exercises both. This mode implies
257+
``--no-reset`` and is mutually exclusive with ``--raw-transfer-test``. It
258+
requires the stock ``qdma-pf`` driver to be bound to the board's PF (it
259+
cannot be bound at the same time as the SLASH driver), and typically
260+
requires root to raise ``qmax`` and open the queue devices.
261+
262+
.. option:: --ddr-only
263+
264+
Run only the DDR memory tests and skip the HBM phase. Mutually exclusive
265+
with ``--hbm-only``.
266+
267+
.. option:: --hbm-only
268+
269+
Run only the HBM memory tests and skip the DDR phase. Mutually exclusive
270+
with ``--ddr-only``.
271+
272+
.. option:: --channel-allocation <auto|paired>
273+
274+
Raw-transfer-only (``--raw-transfer-test`` or ``--use-qdma-driver``) control
275+
over how QDMA MM/NoC channels map onto device memory. On CPM5 the host-side
276+
NoC ingress port (NMU) is chosen per queue by the SW-context
277+
mm-channel/host_id (SLASH uses ``qid & 1``), while the memory-side NoC egress
278+
endpoint (NSU / pseudo-channel) is chosen by the device address. Default
279+
``auto`` keeps the historical behaviour: channel ``qid & 1`` with linear
280+
addressing, so both NMUs can converge on a single NSU and bandwidth caps at
281+
one path. ``paired`` couples the two: even positions land in memory region 0
282+
on channel 0, odd positions in region 1 on channel 1 (one
283+
``--channel-region-stride`` apart), giving two independent NMU->NSU paths.
284+
This mirrors the off-the-shelf ``dma-perf`` ``offset_ch0``/``offset_ch1``
285+
knobs and is the placement that lets both NoC ports contribute bandwidth.
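
The difference between ``auto`` and ``paired`` can be illustrated with a Python sketch. This is a guess at the arithmetic, not the smi implementation: it treats the logical position as a stand-in for ``qid``, and it assumes that positions sharing a channel are packed ``--offset`` apart inside their region. Only the even/odd split and the one-stride separation come from the description above.

```python
GIB = 1024 ** 3


def auto_placement(position, offset, starting_offset=0, memory_base=0):
    """Historical behaviour: channel alternates, addresses stay linear."""
    channel = position & 1
    address = memory_base + starting_offset + position * offset
    return channel, address


def paired_placement(position, offset, region_stride=16 * GIB,
                     starting_offset=0, memory_base=0):
    """Even positions: region 0 on channel 0; odd: region 1 on channel 1."""
    channel = position & 1
    # Assumption: consecutive buffers on one channel sit `offset` apart.
    slot = position // 2
    address = (memory_base + channel * region_stride
               + starting_offset + slot * offset)
    return channel, address
```

Under these assumptions, positions 0 and 1 with a 512M offset land 512 MiB apart in ``auto`` mode but 16 GiB apart in ``paired`` mode, which is what keeps the two channels on separate memory regions.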
286+
287+
.. option:: --channel-region-stride <size>
288+
289+
In ``--channel-allocation paired`` mode, the byte distance between the two
290+
per-channel memory regions (the NSU / pseudo-channel stride). Default ``16G``
291+
(half the per-memory address space, matching the dma-perf HBM
292+
``offset_ch1 - offset_ch0`` spacing). Must be a non-zero multiple of 4 KiB.
293+
Accepts bare bytes or ``k``/``K``, ``m``/``M``, ``g``/``G`` suffixes.
294+
295+
.. option:: --ring-size-index <0-15>
296+
297+
Raw-transfer-only (``--raw-transfer-test`` or ``--use-qdma-driver``).
298+
Override the QDMA descriptor-ring size index used when creating SLASH raw
299+
queue pairs or starting stock-driver queues. When omitted, each backend keeps
300+
its existing default. Useful A/B values for 4 KiB descriptor throughput are
301+
``0``, ``11``, ``13``, and ``15``.
302+
303+
.. option:: --bandwidth-iterations <N>
304+
305+
Raw-transfer-only (``--raw-transfer-test`` or ``--use-qdma-driver``). Repeat
306+
each whole-buffer transfer in every bandwidth phase ``N`` times and report
307+
bandwidth over the sustained loop. The default is ``1``, which preserves the
308+
historical one-shot measurement.
309+
310+
.. option:: --bandwidth-duration <seconds>
311+
312+
Raw-transfer-only duration mode. When non-zero, each bandwidth phase repeats
313+
whole-buffer transfers until the requested wall-clock duration has elapsed
314+
and counts only completed transfers. This is useful for comparing SLASH's raw
315+
path against long-running tools such as ``dma-perf``. A value of ``0`` uses
316+
``--bandwidth-iterations`` instead.
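
How the two measurement modes relate can be sketched in Python. ``run_bandwidth_phase`` and its ``transfer`` callback are hypothetical stand-ins for one whole-buffer transfer; the sketch only models the loop structure described above and assumes bandwidth is computed as completed bytes over elapsed wall-clock time.

```python
import time


def run_bandwidth_phase(transfer, buffer_size, iterations=1, duration=0.0,
                        clock=time.monotonic):
    """Return (completed_transfers, bytes_per_second) for one phase."""
    start = clock()
    completed = 0
    if duration > 0:
        # Duration mode: keep issuing whole-buffer transfers until the
        # deadline passes, counting only transfers that finished.
        deadline = start + duration
        while clock() < deadline:
            transfer()
            completed += 1
    else:
        # Iteration mode: a duration of 0 falls back to a fixed count.
        for _ in range(iterations):
            transfer()
            completed += 1
    elapsed = clock() - start
    bandwidth = completed * buffer_size / elapsed if elapsed > 0 else 0.0
    return completed, bandwidth
```

With ``iterations=1`` and ``duration=0`` this reduces to the one-shot measurement. Longer loops amortize per-transfer setup cost, which is presumably why they compare better against long-running tools.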
167317

168318
debug
169319
-----

driver/Makefile

Lines changed: 98 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -42,15 +42,28 @@ else
4242
LIBQDMA_PATH := $(LIBQDMA_FALLBACK)
4343
endif
4444

45+
# SLASH carries a few local modifications to the pinned QDMA submodule's
46+
# libqdma sources (see $(LIBQDMA_PATCH_DIR)/). The submodule itself stays
47+
# pristine; the patches are applied to whichever libqdma tree is being built
48+
# (the DKMS-local ./libqdma or the in-tree submodule) by the libqdma-patches
49+
# target before the module is compiled. See that target for details.
50+
LIBQDMA_PATCH_DIR := patches
51+
4552
SLASH_QDMA_OP_DEBUG ?= 0
4653

54+
# Per-transfer timing instrumentation. Set to 1 to emit one dmesg line per
55+
# DMA transfer breaking down the kernel phases. Default off (zero overhead).
56+
SLASH_QDMA_TIMING ?= 0
57+
4758
# Kcompat feature flags. Defaults are "n"; the all: recipe runs
4859
# driver/kcompat/probe.sh against $(KDIR) to detect the actual values
4960
# and passes them into the kbuild recursion. Each pair (modern API +
5061
# legacy fallback) is covered by one probe — if the modern form is
5162
# absent, the legacy form is the unconditional fallback in slash_compat.h.
5263
SLASH_HAVE_VM_FLAGS_SET ?= n
5364
SLASH_HAVE_MODULE_IMPORT_NS_TOKEN ?= n
65+
SLASH_HAVE_URING_CMD ?= n
66+
SLASH_HAVE_URING_SQE_CMD ?= n
5467

5568
# Set GCOV=1 to instrument the module for kernel gcov coverage.
5669
# Not set by default — never enable this in production builds.
@@ -72,6 +85,7 @@ ccflags-y += \
7285
\
7386
-DTANDEM_BOOT_SUPPORTED=1 \
7487
-DSLASH_QDMA_OP_DEBUG=$(SLASH_QDMA_OP_DEBUG) \
88+
-DSLASH_QDMA_TIMING=$(SLASH_QDMA_TIMING) \
7589
-DSLASH_VERSION_STR=\"$(SLASH_VERSION)\"
7690

7791
ifeq ($(SLASH_HAVE_VM_FLAGS_SET),y)
@@ -82,6 +96,25 @@ ifeq ($(SLASH_HAVE_MODULE_IMPORT_NS_TOKEN),y)
8296
ccflags-y += -DSLASH_HAVE_MODULE_IMPORT_NS_TOKEN
8397
endif
8498

99+
# Optional io_uring uring_cmd async transfer path. Probed by kcompat; absent on
100+
# kernels without CONFIG_IO_URING or uring_cmd support (e.g. RHEL 9, Ubuntu
101+
# 22.04 GA), where the synchronous transfer ioctl remains the only path.
102+
ifeq ($(SLASH_HAVE_URING_CMD),y)
103+
ccflags-y += -DSLASH_HAVE_URING_CMD
104+
endif
105+
106+
# Selects the io_uring SQE payload accessor: io_uring_sqe_cmd(cmd->sqe) when
107+
# present (newer kernels + distro backports), else cmd->cmd. Only meaningful
108+
# when SLASH_HAVE_URING_CMD is also set.
109+
ifeq ($(SLASH_HAVE_URING_SQE_CMD),y)
110+
ccflags-y += -DSLASH_HAVE_URING_SQE_CMD
111+
endif
112+
113+
# Force-include the compat header into every TU (including the pinned libqdma
114+
# submodule sources we don't modify) so kernel-API shims such as from_timer()
115+
# reach third-party code too. Safe on all kernels: the shims are guarded.
116+
ccflags-y += -include $(src)/slash_compat.h
117+
85118

86119
LIBQDMA_OBJS := \
87120
$(LIBQDMA_PATH)/qdma_mbox.o \
@@ -120,18 +153,80 @@ $(MODULE)-objs += $(LIBQDMA_OBJS) $(QDMA_ACCESS_OBJS)
120153

121154
KCOMPAT := "$(SHELL)" "$(PWD)/kcompat/probe.sh"
122155

123-
all:
156+
all: libqdma-patches
124157
@flags="$$($(KCOMPAT) "$(KDIR)" | tr '\n' ' ')"; \
125158
echo "slash: kcompat: $$flags"; \
126159
$(MAKE) -C "$(KDIR)" M="$(PWD)" $$flags modules
127160

161+
# Apply SLASH's local libqdma patches ($(LIBQDMA_PATCH_DIR)/*.patch) to the
162+
# libqdma source tree in use, in filename order, right before building.
163+
#
164+
# The pinned submodule is not edited directly by commits: patches live in-tree
165+
# and are stamped onto the working copy here. Application is idempotent — each patch is first tested
166+
# for being already applied (reverse dry-run) and skipped if so — so repeated
167+
# `make` runs, incremental builds, and DKMS rebuilds are all safe. A patch that
168+
# neither applies cleanly nor is already present aborts the build.
169+
#
170+
# $(PWD) is the driver dir for both `make` (in-tree) and DKMS (MAKE[0] runs
171+
# `make -C driver ...`); ./libqdma is the DKMS-packaged copy, otherwise fall
172+
# back to the in-tree submodule path. Uses patch(1) so it is independent of
173+
# whether the libqdma tree lives inside a git checkout.
174+
libqdma-patches:
175+
@set -e; \
176+
patch_dir="$(PWD)/$(LIBQDMA_PATCH_DIR)"; \
177+
set -- "$$patch_dir"/*.patch; \
178+
if [ ! -e "$$1" ]; then exit 0; fi; \
179+
if [ -d "$(PWD)/libqdma" ]; then lq="$(PWD)/libqdma"; \
180+
else lq="$(PWD)/$(LIBQDMA_FALLBACK)"; fi; \
181+
if [ ! -d "$$lq" ]; then \
182+
echo "slash: ERROR libqdma sources not found at $$lq" >&2; \
183+
echo "slash: run 'git submodule update --init --recursive' first" >&2; \
184+
exit 1; \
185+
fi; \
186+
command -v patch >/dev/null 2>&1 || { \
187+
echo "slash: ERROR patch(1) not found; it is required to apply libqdma patches" >&2; \
188+
exit 1; }; \
189+
for p in "$$@"; do \
190+
name="$$(basename "$$p")"; \
191+
if patch -R -p1 -d "$$lq" --dry-run -f -s -i "$$p" >/dev/null 2>&1; then \
192+
echo "slash: libqdma patch already applied, skipping: $$name"; \
193+
elif patch -p1 -d "$$lq" --dry-run -f -s -i "$$p" >/dev/null 2>&1; then \
194+
echo "slash: applying libqdma patch: $$name"; \
195+
patch -p1 -d "$$lq" -f -s -i "$$p"; \
196+
else \
197+
echo "slash: ERROR libqdma patch does not apply cleanly: $$name" >&2; \
198+
echo "slash: (libqdma tree at $$lq is neither pristine nor already patched)" >&2; \
199+
exit 1; \
200+
fi; \
201+
done
202+
203+
# Best-effort revert of the libqdma patches, restoring the submodule working
204+
# copy to pristine. Useful when editing the patches themselves. Never fails the
205+
# build: patches that are not currently applied are simply skipped.
206+
unpatch-libqdma:
207+
@set -e; \
208+
patch_dir="$(PWD)/$(LIBQDMA_PATCH_DIR)"; \
209+
set -- "$$patch_dir"/*.patch; \
210+
if [ ! -e "$$1" ]; then exit 0; fi; \
211+
if [ -d "$(PWD)/libqdma" ]; then lq="$(PWD)/libqdma"; \
212+
else lq="$(PWD)/$(LIBQDMA_FALLBACK)"; fi; \
213+
[ -d "$$lq" ] || exit 0; \
214+
for p in $$(printf '%s\n' "$$@" | tac); do \
215+
name="$$(basename "$$p")"; \
216+
if patch -R -p1 -d "$$lq" --dry-run -f -s -i "$$p" >/dev/null 2>&1; then \
217+
echo "slash: reverting libqdma patch: $$name"; \
218+
patch -R -p1 -d "$$lq" -f -s -i "$$p"; \
219+
fi; \
220+
done
221+
128222
clean:
129-
$(MAKE) -C "$(KDIR)" M="$(PWD)" clean
223+
-$(MAKE) -C "$(KDIR)" M="$(PWD)" clean
130224
rm -rf "$(PWD)/kcompat/.scratch"
225+
$(MAKE) unpatch-libqdma
131226

132227
install: all
133228
sudo install -d -m 755 /lib/modules/$(shell uname -r)/extra
134229
sudo install -m 644 $(MODULE).ko /lib/modules/$(shell uname -r)/extra
135230
sudo depmod -a
136231

137-
.PHONY: all clean install
232+
.PHONY: all clean install libqdma-patches unpatch-libqdma

driver/README.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,58 @@
11
# SLASH kernel module
22

3+
## Module parameters
4+
5+
Exposed under `/sys/module/slash/parameters/` (all writable at runtime; see
6+
`modinfo slash.ko`):
7+
8+
| Parameter | Type | Default | Description |
9+
|-----------|------|---------|-------------|
10+
| `qdma_num_threads` | uint | 8 | Number of libqdma worker threads. |
11+
| `qdma_debugfs_path` | charp | disabled | debugfs mount path for libqdma. |
12+
13+
### A/B testing NoC channel bandwidth
14+
15+
The AXI-MM / NoC channel is chosen per queue pair when it is added (the
16+
`mm_channel` field of the qpair-add ioctl, `enum slash_qdma_mm_channel`):
17+
`auto` stripes queues across both channels by `qid & 1`, while `0` / `1` pin a
18+
queue to a single channel. Every queue creator carries this setting, so it can
19+
be driven per buffer to check whether both PCIe NMUs (NoC channels) actually
20+
contribute bandwidth. With `v80-smi validate`:
21+
22+
```sh
23+
# All queues on NoC channel 0 (NMU S00)
24+
sudo v80-smi validate -d <BDF> --raw-transfer-test --no-reset --mm-channel 0
25+
26+
# All queues on NoC channel 1 (NMU S01)
27+
sudo v80-smi validate -d <BDF> --raw-transfer-test --no-reset --mm-channel 1
28+
29+
# Split across both channels (qid & 1)
30+
sudo v80-smi validate -d <BDF> --raw-transfer-test --no-reset --mm-channel auto
31+
32+
# Explicit per-buffer split (even positions -> channel 0, odd -> channel 1)
33+
sudo v80-smi validate -d <BDF> --raw-transfer-test --no-reset --mm-channel 0,1
34+
```
35+
36+
Debug builds with `SLASH_QDMA_OP_DEBUG=1` log each queue's selected
37+
`mm_channel` when it is added. If the split run is no faster than a single
38+
forced channel, traffic is not being spread across both NMUs. The per-queue
39+
setting affects every queue created through this driver (both the VRTD buffer
40+
path and `--raw-transfer-test`); the off-the-shelf Xilinx QDMA driver path
41+
(`--use-qdma-driver`) honors `--mm-channel` through its own channel attribute.
42+
343
## Testing
444

545
The test suite requires a physical V80 to be present and the module to be
646
loaded into a running kernel.
747

48+
## Local libqdma patches
49+
50+
SLASH carries small patches for the pinned `libqdma` submodule under
51+
`driver/patches/`. The driver `Makefile` applies them before building, and
52+
`make clean` attempts to revert them so the submodule working copy returns to
53+
its pristine pinned state. DKMS packages include the same patch directory and
54+
depend on `patch(1)`.
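
The apply-once behaviour can be summarized in a short Python sketch. It is an illustrative model of the decision order in the `libqdma-patches` Makefile target (reverse dry-run, then forward dry-run, then fail), not a replacement for it. The function names are hypothetical, the `patch` flags are copied from the Makefile recipe, and the `run` parameter exists only so the logic can be exercised without `patch(1)` installed.

```python
import subprocess


def _dry_run_ok(patch_file, tree, reverse, run):
    """Return True if patch(1) would succeed, without touching the tree."""
    cmd = ["patch"] + (["-R"] if reverse else [])
    cmd += ["-p1", "-d", tree, "--dry-run", "-f", "-s", "-i", patch_file]
    result = run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def apply_patch_idempotently(patch_file, tree, run=subprocess.run):
    """Apply one patch unless it is already present; fail otherwise."""
    if _dry_run_ok(patch_file, tree, True, run):
        # The reverse dry-run succeeds only when the patch is already applied.
        return "skipped"
    if _dry_run_ok(patch_file, tree, False, run):
        run(["patch", "-p1", "-d", tree, "-f", "-s", "-i", patch_file],
            check=True)
        return "applied"
    # Neither pristine nor already patched: abort, as the build does.
    raise RuntimeError(f"patch does not apply cleanly: {patch_file}")
```

Checking the reverse direction first is what makes repeated `make` runs and DKMS rebuilds safe, since an already-patched tree is detected before any forward attempt.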
55+
856
### Prerequisites
957

1058
- A kernel built with `CONFIG_GCOV_KERNEL=y` (only needed for coverage runs).
