Commit 7fec692

#194 - Add hardware tunnel flow offloads (#194)
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
1 parent: da466ec. Commit: 7fec692

23 files changed

Lines changed: 2150 additions & 181 deletions

AGENTS.md

Lines changed: 5 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -33,8 +33,8 @@ There is no unit test suite. Verification is done via the benchmark executables
3333

3434
| Executable | Source | Typical config |
3535
|---|---|---|
36-
| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml`, `daqiri_bench_raw_tx_rx_spark_mq.yaml` (mq base; `run_spark_mq_bench.sh` derives the 4 cells via `scripts/gen_spark_mq_config.py`), `daqiri_bench_raw_tx_rx_pacing.yaml` (per-queue `pacing_mbps`; DPDK engine only) |
37-
| `daqiri_example_dynamic_rx_flow` | `dynamic_rx_flow_example.cpp` | `daqiri_example_dynamic_rx_flow.yaml` (queues-only `flow_isolation: true` startup followed by runtime RX flow add/delete) |
36+
| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_4q.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_{tx,rx}_spark_xhost.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml`, `daqiri_bench_raw_tx_rx_vxlan.yaml`, `daqiri_bench_raw_tx_rx_vlan.yaml`, `daqiri_bench_raw_tx_rx_gre.yaml`, `daqiri_bench_raw_tx_rx_nvgre.yaml`, `daqiri_bench_raw_tx_rx_spark_mq.yaml` (mq base; `run_spark_mq_bench.sh` derives the 4 cells via `scripts/gen_spark_mq_config.py`), `daqiri_bench_raw_tx_rx_pacing.yaml` (per-queue `pacing_mbps`; DPDK engine only) |
37+
| `daqiri_example_dynamic_rx_flow` | `dynamic_rx_flow_example.cpp` | `daqiri_example_dynamic_rx_flow.yaml` (`flow_isolation: true` startup followed by runtime RX queue-steering and raw-engine decap/pop flow add/delete) |
3838
| `daqiri_bench_raw_hds` | `raw_hds_bench.cpp` | `daqiri_bench_raw_tx_rx_hds.yaml` |
3939
| `daqiri_bench_raw_reorder_seq` | `raw_reorder_seq_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_seq_1024*.yaml`, `daqiri_bench_raw_rx_reorder_seq_*.yaml` |
4040
| `daqiri_bench_raw_reorder_quantize` | `raw_reorder_quantize_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml` |
@@ -94,9 +94,10 @@ Vendored under `third_party/` as submodules (`.gitmodules`): `yaml-cpp` (config
9494

9595
### Current limitations
9696
- TX header fill currently supports UDP only (see README).
97-
- Raw Ethernet RX flow `action.id` must match an `rx.queues` ID and flex-item flows must reference a valid `flex_item_id` on the same interface; `daqiri_init()` aborts if RX flow rules, send-to-kernel fallbacks (`flow_isolation: true`), or `tx_eth_src` offload rules cannot be programmed on the NIC.
97+
- Raw Ethernet RX flow legacy `action.id` or final `actions:` queue action must match an `rx.queues` ID, and flex-item flows must reference a valid `flex_item_id` on the same interface; `daqiri_init()` aborts if RX flow rules, send-to-kernel fallbacks (`flow_isolation: true`), transform flow actions, or `tx_eth_src` offload rules cannot be programmed on the NIC.
98+
- Raw Ethernet tunnel/VLAN transform flows are hardware-only on the DPDK and ibverbs raw engines. TX flows may contain only push/encap transform actions and RX transform flows must use pop/decap actions ending in a queue; socket/RDMA engines reject these actions instead of adding a software fallback.
9899
- Raw Ethernet RX flow steering: a single interface cannot mix standard (UDP/IP) and
99-
flex-item flows; `DpdkEngine::validate_config()` rejects mixed configs at init.
100+
flex-item flows, and flex-item flows cannot combine with tunnel/VLAN transform actions; `DpdkEngine::validate_config()` rejects mixed configs at init.
100101
- No CI yet — contributors and reviewers verify manually (CONTRIBUTING.md).
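
The transform-flow limitation listed above (hardware-only engines, TX limited to push/encap) can be sketched as a small check. This is a hedged illustration, not DAQIRI's validation code: the `Engine` and `Action` enums and all three function names are hypothetical stand-ins, and treating an empty action list as invalid is an assumption.

```cpp
#include <initializer_list>

// Hypothetical stand-ins for illustration; these are not DAQIRI types.
enum class Engine { DpdkRaw, IbverbsRaw, Socket, Rdma };
enum class Action { Queue, VlanPush, VlanPop, TunnelEncap, TunnelDecap };

// Transform actions are hardware-only: only the two raw engines accept them.
constexpr bool engine_accepts_transforms(Engine e) {
  return e == Engine::DpdkRaw || e == Engine::IbverbsRaw;
}

// A TX flow may contain only push/encap actions.
// Assumption: an empty action list is treated as invalid.
constexpr bool tx_actions_valid(std::initializer_list<Action> actions) {
  if (actions.size() == 0) return false;
  for (Action a : actions) {
    if (a != Action::VlanPush && a != Action::TunnelEncap) return false;
  }
  return true;
}

// Combined check: the flow is rejected outright, with no software fallback.
constexpr bool tx_flow_accepted(Engine e, std::initializer_list<Action> actions) {
  return engine_accepts_transforms(e) && tx_actions_valid(actions);
}
```

With these stand-ins, a `VlanPush` followed by `TunnelEncap` on `Engine::DpdkRaw` passes, while the same list on `Engine::Socket`, or any TX list containing `Action::Queue`, is rejected.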
101102

102103
## Documentation

README.md

Lines changed: 3 additions & 1 deletion
@@ -39,7 +39,9 @@ DAQIRI provides direct NIC hardware access in userspace, bypassing the Linux ker
3939
- **Flow Steering** — Configure the NIC's hardware flow engine to route packets by UDP
4040
source/destination port or flex-item payload fields. Raw RX flows can be configured
4141
statically in YAML or added/deleted dynamically after `daqiri_init()`. Per RX
42-
interface, use standard UDP/IP flows or flex-item flows, not both.
42+
interface, use standard UDP/IP flows or flex-item flows, not both. Raw DPDK and
43+
raw ibverbs flows can also use hardware-only VLAN push/pop and VXLAN, GRE, or
44+
NVGRE encap/decap actions; socket/RDMA streams reject those tunnel actions.
4345
- **RDMA** — RDMA verbs (READ, WRITE, SEND) over RoCE on Ethernet NICs or InfiniBand.
4446
- **Optional OpenTelemetry metrics** — Expose per-interface or per-queue packet,
4547
byte, and drop counters when built with `DAQIRI_ENABLE_OTEL_METRICS=ON`.

docs/api-reference/configuration.md

Lines changed: 46 additions & 8 deletions
@@ -194,19 +194,28 @@ engine.
194194

195195
`rx.flows:` — Static startup flow rules that steer packets to specific queues based on
196196
match criteria. This sequence may be omitted; a queues-only RX config can add DPDK RX
197-
flows later with the dynamic flow API.
197+
flows later with the dynamic flow API. For Raw Ethernet on the DPDK and ibverbs engines,
198+
RX flows can also perform hardware VLAN pop or tunnel decapsulation before queue delivery.
198199

199200
- **`name`**: Flow name.
200201
- type: `string`
201202
- **`id`**: Flow ID. Retrievable at runtime via `get_packet_flow_id()`.
202203
- type: `integer`
203-
- **`action`**: What to do with matched packets.
204-
- **`type`**: Action type. Only `queue` is currently supported.
205-
- type: `string`
206-
- **`id`**: Queue ID to steer matched packets to. Must match the `id` of an entry under
207-
`rx.queues` on the same interface. `daqiri_init()` rejects unknown queue IDs during
208-
config validation.
209-
- type: `integer`
204+
- **`action`**: Legacy single action map. Existing configs may keep using
205+
`action: {type: queue, id: ...}`.
206+
- **`actions`**: Ordered action list. Use this for tunnel/VLAN transforms.
207+
RX transform flows must end with `type: queue`.
208+
- **`type: queue`**: Steer matched packets to an RX queue.
209+
- **`id`**: Queue ID under `rx.queues` on the same interface.
210+
- **`type: vlan_pop`**: Pop one VLAN tag in hardware.
211+
- **`type: tunnel_decap`**: Decapsulate a hardware tunnel before queue delivery.
212+
- **`tunnel.type`**: `vxlan`, `gre`, or `nvgre`.
213+
- **`outer_eth_src` / `outer_eth_dst`**: Outer Ethernet addresses.
214+
- **`outer_ipv4_src` / `outer_ipv4_dst`**: Outer IPv4 addresses. IPv6 outer
215+
headers are not supported in v1.
216+
- **VXLAN fields**: `vni`, optional `outer_udp_src`, and `outer_udp_dst` (default `4789`).
217+
- **GRE fields**: optional `gre_protocol` (default `0x0800`).
218+
- **NVGRE fields**: `tni`, optional `flow_id`.
210219
- **`match`**: Criteria for matching packets.
211220
- **`udp_src`**: UDP source port or port range (e.g., `1000-1010`).
212221
- type: `integer` or `string`
@@ -230,6 +239,7 @@ created when `flow_isolation: true`, initialization fails with a critical log an
230239
A single RX interface must use either standard UDP/IP flows or flex-item flows, not both.
231240
Both classes install conflicting DPDK group-0 jump rules, so only one is reachable when mixed.
232241
`daqiri_init` rejects such configs with a clear error.
242+
Flex-item flows cannot be combined with VLAN/tunnel transform actions in v1.
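
The per-interface rules above (no mixing of standard and flex-item flows, and no transforms on flex-item flows) can be sketched as follows. This is a minimal illustration under stated assumptions: `MatchKind`, `FlowSummary`, and `interface_flows_compatible` are hypothetical names, not part of DAQIRI, and the real check in `daqiri_init` may be structured differently.

```cpp
#include <initializer_list>

// Hypothetical stand-ins for illustration; these are not DAQIRI's real types.
enum class MatchKind { Ipv4Udp, FlexItem };

struct FlowSummary {
  MatchKind match;
  bool has_transform;  // true if the flow carries a VLAN/tunnel transform action
};

// Returns true when the flows on one RX interface satisfy both v1 rules:
//  1. standard UDP/IP flows and flex-item flows are not mixed, and
//  2. no flex-item flow carries a transform action.
constexpr bool interface_flows_compatible(std::initializer_list<FlowSummary> flows) {
  bool saw_standard = false;
  bool saw_flex = false;
  for (const FlowSummary& f : flows) {
    if (f.match == MatchKind::FlexItem) {
      saw_flex = true;
      if (f.has_transform) return false;  // rule 2 violated
    } else {
      saw_standard = true;
    }
  }
  return !(saw_standard && saw_flex);  // rule 1
}
```

Standard flows may freely mix transform and plain queue-steering rules in this sketch; only the flex-item class is restricted.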
233243

234244
### Flow Isolation
235245

@@ -375,6 +385,34 @@ daqiri::set_reorder_cuda_stream("rx_port", "rx_reorder_0", stream);
375385
- type: `integer`
376386
- default: `0`
377387

388+
### Transmit Flows
389+
390+
`tx.flows:` — Raw Ethernet hardware transform rules for outgoing packets. Supported on
391+
the DPDK and ibverbs raw engines only. TX flows match the packet as supplied by the
392+
application, then push or encapsulate headers in hardware; the application buffer remains
393+
the pre-encap packet.
394+
395+
- **`name`** / **`id`**: Flow label and ID.
396+
- **`actions`**: Ordered transform action list. TX flows cannot contain `queue`.
397+
- **`type: vlan_push`**: Push one VLAN tag.
398+
- **`vlan_id`**: VLAN ID, `0..4095`.
399+
- **`pcp`**: Priority, `0..7`, default `0`.
400+
- **`dei`**: Drop eligible indicator, `0..1`, default `0`.
401+
- **`ethertype`**: VLAN TPID, default `0x8100`.
402+
- **`type: tunnel_encap`**: Encapsulate in `vxlan`, `gre`, or `nvgre`.
403+
- **`tunnel.type`**: `vxlan`, `gre`, or `nvgre`.
404+
- **`outer_eth_src` / `outer_eth_dst`** and **`outer_ipv4_src` /
405+
`outer_ipv4_dst`** are required.
406+
- **VXLAN fields**: `vni`, optional `outer_udp_src`, and `outer_udp_dst` (default `4789`).
407+
- **GRE fields**: optional `gre_protocol` (default `0x0800`).
408+
- **NVGRE fields**: `tni`, optional `flow_id`.
409+
- **`match`**: Same standard UDP/IP match keys as RX flows. Omit `match` for a
410+
catch-all TX transform.
411+
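
The `vlan_push` fields above map onto the standard 802.1Q Tag Control Information (TCI) word: 3 bits of PCP, 1 bit of DEI, and a 12-bit VLAN ID. The sketch below shows that packing and the documented ranges. `VlanTag`, `vlan_tag_valid`, and `vlan_tci` are hypothetical names for illustration and are not DAQIRI's `VlanActionConfig`; in DAQIRI the NIC builds the tag in hardware.

```cpp
#include <cstdint>

// Illustrative only: mirrors the documented vlan_push fields, not DAQIRI's API.
struct VlanTag {
  std::uint16_t vlan_id;  // documented range 0..4095
  std::uint8_t pcp;       // documented range 0..7
  std::uint8_t dei;       // documented range 0..1
};

// True when every field is inside its documented range.
constexpr bool vlan_tag_valid(const VlanTag& t) {
  return t.vlan_id <= 4095 && t.pcp <= 7 && t.dei <= 1;
}

// Pack PCP (bits 15..13), DEI (bit 12), and VLAN ID (bits 11..0) into the TCI.
constexpr std::uint16_t vlan_tci(const VlanTag& t) {
  return static_cast<std::uint16_t>((t.pcp << 13) | (t.dei << 12) |
                                    (t.vlan_id & 0x0FFF));
}
```

For example, `vlan_id` 100 with `pcp` 3 and `dei` 0 packs to `0x6064`. On the wire the TCI follows the TPID (the `ethertype` field, default `0x8100`).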
412+
DAQIRI validates tunnel overhead against the configured packet buffer size and
413+
the supported jumbo-frame bound. For RX decap/pop, MTU sizing accounts for the
414+
outer wire frame while packet buffers contain the post-decap frame.
415+
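
The overhead check described above comes down to simple header arithmetic. The sketch below uses the standard sizes for an untagged outer Ethernet header, an option-free outer IPv4 header, and the VXLAN, GRE, and NVGRE headers. It is an illustration under assumptions: the names are hypothetical, the jumbo-frame bound is passed in as a parameter because this page does not state its value, and DAQIRI's own accounting (for example, FCS or inner-header handling) may differ.

```cpp
#include <cstddef>

// Hypothetical enum for illustration; not DAQIRI's TunnelType.
enum class Tunnel { Vxlan, Gre, Nvgre };

constexpr std::size_t kEthHdr = 14;   // outer Ethernet, untagged
constexpr std::size_t kIpv4Hdr = 20;  // outer IPv4, no options
constexpr std::size_t kUdpHdr = 8;    // outer UDP (VXLAN only)
constexpr std::size_t kVxlanHdr = 8;  // VXLAN header carrying the 24-bit VNI
constexpr std::size_t kGreBase = 4;   // GRE flags/version + protocol type
constexpr std::size_t kGreKey = 4;    // GRE key field, present in NVGRE

// Bytes of outer headers the NIC places in front of the application's packet.
constexpr std::size_t outer_header_bytes(Tunnel t) {
  switch (t) {
    case Tunnel::Vxlan: return kEthHdr + kIpv4Hdr + kUdpHdr + kVxlanHdr;
    case Tunnel::Gre:   return kEthHdr + kIpv4Hdr + kGreBase;
    case Tunnel::Nvgre: return kEthHdr + kIpv4Hdr + kGreBase + kGreKey;
  }
  return 0;  // unreachable for valid enum values
}

// True when the encapsulated frame stays within an assumed wire-frame bound.
constexpr bool encap_fits(std::size_t inner_bytes, Tunnel t,
                          std::size_t max_wire_frame) {
  return inner_bytes + outer_header_bytes(t) <= max_wire_frame;
}
```

With these sizes, VXLAN adds 50 bytes, GRE 38, and NVGRE 42. The same arithmetic applies in reverse on RX: the wire frame is larger than the post-decap packet that lands in the application buffer.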
378416
### Accurate Send
379417

380418
`tx.accurate_send:` — Enable hardware-timed packet transmission using PTP timestamps. When

docs/api-reference/cpp.md

Lines changed: 34 additions & 1 deletion
@@ -136,6 +136,11 @@ Raw Ethernet RX flows can be added and deleted after `daqiri_init()` on the
136136
`dpdk` and raw `ibverbs` engines. This supports queues-only startup configs,
137137
including `rx.flow_isolation: true` with no initial `rx.flows`. Static YAML
138138
flows still use explicit configured IDs and are not deletable through this API.
139+
The legacy `FlowRuleConfig::action_` field remains the shorthand for a single
140+
queue action; `FlowRuleConfig::actions_` is the ordered form used when a dynamic
141+
RX rule needs hardware VLAN pop or tunnel decapsulation before queue delivery.
142+
Dynamic TX transform flows are not part of v1; configure TX encapsulation/push
143+
rules statically under `tx.flows`.
139144

140145
```cpp
141146
daqiri::FlowRuleConfig flow;
@@ -173,6 +178,33 @@ while (flow_id == 0) {
173178
}
174179
```
175180

181+
For a dynamic VXLAN decap rule, use ordered actions and make the final action
182+
the target queue:
183+
184+
```cpp
185+
daqiri::FlowRuleConfig decap;
186+
decap.name_ = "vxlan_decap_5000";
187+
188+
daqiri::FlowAction tunnel;
189+
tunnel.type_ = daqiri::FlowType::TUNNEL_DECAP;
190+
tunnel.tunnel_.type_ = daqiri::TunnelType::VXLAN;
191+
tunnel.tunnel_.outer_eth_src_ = "02:00:00:00:00:01";
192+
tunnel.tunnel_.outer_eth_dst_ = "02:00:00:00:00:02";
193+
tunnel.tunnel_.outer_ipv4_src_ = "192.0.2.1";
194+
tunnel.tunnel_.outer_ipv4_dst_ = "192.0.2.2";
195+
tunnel.tunnel_.outer_udp_dst_ = 4789;
196+
tunnel.tunnel_.vni_ = 100;
197+
decap.actions_.push_back(tunnel);
198+
199+
daqiri::FlowAction queue;
200+
queue.type_ = daqiri::FlowType::QUEUE;
201+
queue.id_ = 0;
202+
decap.actions_.push_back(queue);
203+
204+
decap.match_.type_ = daqiri::FlowMatchType::IPV4_UDP;
205+
decap.match_.udp_dst_ = 5000;
206+
```
207+
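
The ordering rule in the example above (transforms first, queue last) can be expressed as a standalone check. This is a sketch, not the library's validator: `Action` is a local stand-in for `daqiri::FlowType`, `rx_actions_valid` is a hypothetical helper, and rejecting an empty list is an assumption.

```cpp
#include <initializer_list>

// Local stand-in for illustration; DAQIRI's own enum is daqiri::FlowType.
enum class Action { Queue, VlanPush, VlanPop, TunnelEncap, TunnelDecap };

// RX rule: zero or more pop/decap actions, then exactly one queue action last.
// Assumption: an empty list is invalid because nothing would receive the packet.
constexpr bool rx_actions_valid(std::initializer_list<Action> actions) {
  if (actions.size() == 0) return false;
  const Action* last = actions.end() - 1;
  for (const Action* it = actions.begin(); it != last; ++it) {
    // Everything before the final action must be an RX-side transform.
    if (*it != Action::VlanPop && *it != Action::TunnelDecap) return false;
  }
  return *last == Action::Queue;
}
```

A plain `{Queue}` list passes, which matches the legacy single-action shorthand; a list that places the queue before the decap, or that uses a TX-side action such as `VlanPush`, does not.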
176208
Packets matching a dynamic rule are marked with the same `FlowId` returned by
177209
the add completion, so `get_packet_flow_id()` gives the handle to pass to
178210
`delete_flow_async()`. `poll_flow_op()` returns `Status::NOT_READY` when no flow
@@ -219,7 +251,8 @@ auto delete_status = daqiri::delete_flow_async(flow_id, &delete_op);
219251
```
220252

221253
Dynamic flow support is RX-only in v1. Socket, RDMA/RoCE, and software loopback
222-
engines return `NOT_SUPPORTED`.
254+
engines return `NOT_SUPPORTED`; tunnel/VLAN transform actions are accepted only
255+
by raw DPDK and raw ibverbs.
223256

224257
## Reordered RX Bursts
225258

docs/api-reference/index.md

Lines changed: 5 additions & 2 deletions
@@ -22,15 +22,18 @@ A DAQIRI application starts from a YAML configuration file (or an
2222
equivalent `NetworkConfig` struct built in code). The configuration
2323
defines the active stream type, optional engine, endpoint URIs, NIC interfaces, RX and TX
2424
queues, memory regions, flow steering rules, flow isolation,
25-
header-data split, and optional reorder plans. After initialization,
25+
hardware flow transform actions, header-data split, and optional reorder plans. After initialization,
2626
the language API operates on those configured ports, queues, buffers,
2727
and flows.
2828

2929
The language APIs do **not** discover queues, memory, or flow steering
3030
rules on their own. They are runtime handles over the topology declared
3131
in the configuration (YAML file or `NetworkConfig` struct). The
3232
configuration is the source of truth for queue IDs, memory placement,
33-
stream-type / engine / endpoint selection, and flow routing.
33+
stream-type / engine / endpoint selection, flow routing, and static TX
34+
tunnel/VLAN transforms. Dynamic RX flow APIs can add and delete runtime
35+
queue-steering rules and, on raw DPDK or raw ibverbs, runtime RX decap/pop
36+
rules using the same ordered action model.
3437

3538
The configuration schema lives in the
3639
[Configuration YAML Reference](configuration.md). For an annotated

docs/api-reference/python.md

Lines changed: 12 additions & 4 deletions
@@ -591,6 +591,11 @@ The workflow sections above show the common call order and ownership rules.
591591
| `rdma_get_port_queue(conn_id)` | Return `(Status, port, queue)`. |
592592
| `rdma_get_server_conn_id(server_addr, server_port)` | Return `(Status, conn_id)`. |
593593

594+
Dynamic flows are RX-only in v1. The `action` attribute remains the single queue-action
595+
shorthand; use ordered `actions` when a raw DPDK or raw ibverbs dynamic rule needs hardware
596+
VLAN pop or VXLAN/GRE/NVGRE decapsulation before the final queue action. Static TX
597+
encapsulation/push rules are configured in YAML under `tx.flows`.
598+
594599
## Constants
595600

596601
| Name | Description |
@@ -622,7 +627,8 @@ The workflow sections above show the common call order and ownership rules.
622627
| `RDMAMode` | `CLIENT`, `SERVER`, `INVALID` |
623628
| `RDMATransportMode` | `RC`, `UC`, `UD`, `INVALID` |
624629
| `SocketMode` | `CLIENT`, `SERVER`, `INVALID` |
625-
| `FlowType` | `QUEUE` |
630+
| `FlowType` | `QUEUE`, `VLAN_PUSH`, `VLAN_POP`, `TUNNEL_ENCAP`, `TUNNEL_DECAP` |
631+
| `TunnelType` | `NONE`, `VXLAN`, `GRE`, `NVGRE` |
626632
| `FlowMatchType` | `IPV4_UDP`, `FLEX_ITEM` |
627633
| `FlowOpType` | `ADD_RX`, `ADD_RX_BATCH`, `DELETE` |
628634
| `ReorderMethod` | `INVALID`, `SEQ_BATCH_NUMBER`, `SEQ_PACKETS_PER_BATCH` |
@@ -654,10 +660,12 @@ names that mostly omit the trailing underscore from the C++ member name (e.g.
654660
| `RxQueueConfig` | RX queue wrapper with common queue fields and timeout. |
655661
| `TxQueueConfig` | TX queue wrapper with common queue fields. |
656662
| `MemoryRegionConfig` | Memory region kind, affinity, access flags, sizes, counts, and ownership. |
657-
| `FlowAction` | Flow action type and target ID. |
663+
| `VlanActionConfig` | VLAN push parameters: VLAN ID, priority, DEI, and ethertype. |
664+
| `TunnelConfig` | VXLAN, GRE, or NVGRE tunnel template fields for hardware encap/decap actions. |
665+
| `FlowAction` | Flow action type, queue target ID, optional VLAN config, and optional tunnel config. |
658666
| `FlowMatch` | Flow match fields for UDP, IPv4, and flex item matching. |
659-
| `FlowConfig` | Named flow rule combining action and match. |
660-
| `FlowRuleConfig` | Dynamic flow rule match and action. |
667+
| `FlowConfig` | Static named flow rule combining legacy `action`, ordered `actions`, and match fields. |
668+
| `FlowRuleConfig` | Dynamic RX flow rule combining legacy `action`, ordered `actions`, and match fields. |
661669
| `FlowOpResult` | Dynamic flow operation completion. Batch adds return `flow_ids` in input order. |
662670
| `FlexItemConfig` | Flexible parser item configuration. |
663671
| `FlexItemMatch` | Flexible parser match value and mask. |

docs/benchmarks/raw_benchmarking.md

Lines changed: 14 additions & 0 deletions
@@ -110,6 +110,20 @@ The benchmark executables and example YAML configurations are located at:
110110

111111
The fields in the YAML configs are explained in more detail in [Understanding the Configuration File](../tutorials/configuration-walkthrough.md). For now, we'll modify only the minimum set of fields required to run the application as-is on your system.
112112

113+
### Hardware tunnel transform examples
114+
115+
Raw DPDK and raw ibverbs builds can program hardware flow actions that push or
116+
encapsulate on TX and pop or decapsulate on RX. The application packet buffers
117+
remain pre-encap on TX and post-decap on RX; DAQIRI accounts for outer-header
118+
overhead when sizing MTU/wire frames.
119+
120+
| Transform | YAML config | Binary |
121+
|---|---|---|
122+
| VXLAN encap + decap | [`daqiri_bench_raw_tx_rx_vxlan.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_vxlan.yaml) | `daqiri_bench_raw_gpudirect` |
123+
| VLAN push + pop | [`daqiri_bench_raw_tx_rx_vlan.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_vlan.yaml) | `daqiri_bench_raw_gpudirect` |
124+
| GRE encap + decap | [`daqiri_bench_raw_tx_rx_gre.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_gre.yaml) | `daqiri_bench_raw_gpudirect` |
125+
| NVGRE encap + decap | [`daqiri_bench_raw_tx_rx_nvgre.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_nvgre.yaml) | `daqiri_bench_raw_gpudirect` |
126+
113127
##### Identify your NIC's PCIe addresses
114128

115129
Retrieve the PCIe addresses of both ports of your NIC. We'll arbitrarily use the first for Tx and the second for Rx here:

docs/concepts.md

Lines changed: 17 additions & 10 deletions
@@ -242,18 +242,17 @@ buffers (CPU hugepages, GPU device memory, or pinned host memory).
242242

243243
### Flow
244244

245-
A **flow** is a match pattern paired with an action. The common action
246-
is to steer matching packets into a specific queue. For example, all
247-
UDP-destination-port-4096 packets can be routed into a queue backed by
248-
GPU memory. Matching and the resulting action both run entirely in NIC
249-
hardware.
245+
A **flow** is a match pattern paired with one or more actions. The
246+
common RX action is to steer matching packets into a specific queue. For
247+
example, all UDP-destination-port-4096 packets can be routed into a
248+
queue backed by GPU memory. Matching and the resulting actions both run
249+
entirely in NIC hardware.
250250

251251
Flow rules are only available in Raw Ethernet (`stream_type: "raw"`).
252252

253253
A flow's match can combine fields such as `udp_src`, `udp_dst`, and
254254
`ipv4_len`; multiple flows can target the same queue, and the matching
255-
flow's ID is available at runtime so the application can distinguish
256-
them.
255+
flow's ID is available at runtime so the application can distinguish them.
257256

258257
Flows can be static or dynamic. Static flows are configured under
259258
`rx.flows` in the YAML and keep their configured IDs for the process lifetime.
@@ -264,16 +263,24 @@ returned in the add completion, and used as the packet marks returned by
264263
flow IDs are in input order. Only dynamic flows can be deleted dynamically. TX
265264
dynamic flows are not part of v1.
266265

266+
Raw DPDK and raw ibverbs flows can also use ordered `actions:` for hardware VLAN
267+
pop/push and VXLAN, GRE, or NVGRE decap/encap. RX decap/pop actions deliver
268+
post-decap packets to application buffers; TX encap/push actions leave
269+
application buffers as pre-encap packets and change only the wire frame.
270+
Dynamic RX flows use the same ordered action model for runtime decap/pop rules,
271+
while TX transform flows remain static startup configuration.
272+
267273
### Flow Steering
268274

269275
**Flow steering** is the NIC-level mechanism that classifies an
270276
incoming packet against the configured flows and writes it into the
271277
matching queue's buffer, entirely in hardware. Multi-queue RX works by
272278
routing each flow to a separate queue for parallel processing.
273279

274-
For Raw Ethernet, flow steering is implemented on top of RTE Flow. Flow
275-
rules are programmed during `daqiri_init()`; initialization fails if the
276-
NIC rejects a rule. The YAML options are documented in
280+
For Raw Ethernet, flow steering is implemented on top of RTE Flow in the
281+
DPDK engine and mlx5 Direct Rules in the ibverbs engine. Flow rules are
282+
programmed during `daqiri_init()`; initialization fails if the NIC
283+
rejects a rule. The YAML options are documented in
277284
[Configuration YAML Reference → Flows](api-reference/configuration.md#flows).
278285

279286
## Memory Regions
