#65 - Add use-case decision tree to configuration tutorial for picking an example YAML

RamyaGuru · claude · RamyaGuru · commit badca6ce64b0 · 2026-05-08T13:42:03.000-04:00
The examples/ directory now holds 19 YAML configs across three backends and
several data-path specializations. Users had no roadmap for which one to
start from, and the tutorial flow did not map YAMLs to the binaries that
consume them.

Add a "Choosing an example config" section at the top of
docs/tutorials/configuration-walkthrough.md, structured as a top-down list
of 5 use-case questions phrased in the user's voice: baseline throughput
(with backend and hardware sub-questions), GPU packet reordering (with
algorithm, kernel-location, direction, and in-kernel type-conversion
sub-questions), header-data split, multi-queue flow routing, and recording
to disk (PCAP / GDS nested). Each leaf names both the YAML and the binary
that consumes it. The existing annotated walkthrough is preserved under a
new "Annotated walkthrough" H2.

Add a one-line cross-reference at the top of
docs/tutorials/benchmarking_examples.md pointing readers to the new
decision tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
diff --git a/docs/tutorials/benchmarking_examples.md b/docs/tutorials/benchmarking_examples.md
@@ -8,6 +8,8 @@ DAQIRI provides a benchmarking application named `daqiri_bench_raw_gpudirect` th
 
 Make sure to [build the DAQIRI library](../getting-started.md#build-the-daqiri-library) beforehand.
 
+**Not sure which YAML to start from?** See [Choosing an example config](configuration-walkthrough.md#choosing-an-example-config) in the configuration tutorial — a use-case-driven decision tree from "I just want to verify the build" through reorder, recording, RDMA, and sockets.
+
 !!! note "Prerequisites"
 
     Before running the benchmarking application, ensure your system has been fully configured per the [System Configuration](system_configuration.md) page.
diff --git a/docs/tutorials/configuration-walkthrough.md b/docs/tutorials/configuration-walkthrough.md
@@ -4,6 +4,97 @@ hide:
 ---
 # Understanding the Configuration File
 
+## Choosing an example config
+
+Read down the questions below and stop at the first one that matches what you're trying to do. Each section names the YAML, the binary that consumes it, and the build flags or hardware it requires. **Backend selection is a build-time choice via `DAQIRI_MGR`** — the default build enables all three backends (DPDK raw, kernel sockets, and RDMA).
+
+??? question "1. I want to measure baseline throughput"
+    Pick the backend that matches your stack, then the hardware or protocol variant.
+
+    **DPDK raw** — runs on `daqiri_bench_raw_gpudirect`. Highest performance, kernel bypass; requires a Mellanox-class NIC.
+
+    - **Generic discrete GPU** (template — replace `<placeholders>`) — `daqiri_bench_raw_tx_rx.yaml`. This is the file annotated line-by-line in the [walkthrough below](#annotated-walkthrough).
+    - **DGX Spark / GB10** (prefilled) — `daqiri_bench_raw_tx_rx_spark.yaml`. `kind: host_pinned` for the integrated GPU; cores, PCIe addresses, and IPs are prefilled. See the [Spark profile callout](benchmarking_examples.md#update-the-loopback-configuration) for run details.
+    - **No physical NIC available** — `daqiri_bench_raw_sw_loopback.yaml`. `loopback: "sw"`, no NIC required. Useful for first-time build verification, not representative of production performance.
+
+    **RDMA / RoCE** — runs on `daqiri_bench_rdma` (use `--mode {tx,rx,both}`). Low-latency interconnect; available in the default build (set `-DDAQIRI_MGR="dpdk socket rdma"` explicitly for clarity). Requires an RDMA-capable fabric. Configs use `kind: host_pinned` regardless of platform.
+
+    - **Generic** (template — replace IPs) — `daqiri_bench_rdma_tx_rx.yaml`.
+    - **DGX Spark** (prefilled) — `daqiri_bench_rdma_tx_rx_spark.yaml`. See the [Spark profile callout](benchmarking_examples.md#update-the-loopback-configuration) for run details.
+
+    **Kernel TCP/UDP sockets** — runs on `daqiri_bench_socket`. No NIC, no privileges, no special CMake flags. Useful as a comparison baseline against DPDK and RDMA. Both configs below bind to `127.0.0.1`.
+
+    - **UDP** — `daqiri_bench_socket_udp_tx_rx.yaml`.
+    - **TCP** — `daqiri_bench_socket_tcp_tx_rx.yaml`.
+
+??? question "2. I have out-of-order UDP packets that need to be reordered on the GPU"
+    DAQIRI's flagship pipeline: a CUDA kernel reads a sequence number from each packet's header and places packets at the correct offset in a GPU buffer, so a downstream consumer sees a fully ordered stream without a CPU touch. Configs run on `daqiri_bench_raw_reorder_seq` unless 2.4 applies. Sub-questions:
+
+    **2.1 Which algorithm matches how your packets encode batches?**
+
+    - *"My protocol sends a fixed N packets per logical batch; the seqno identifies position within the batch"* — `seq_packets_per_batch`.
+    - *"My protocol identifies the batch index in the seqno; packets-per-batch is fixed at the protocol level"* — `seq_batch_number`.
+
+    **2.2 Where should the reorder run?**
+
+    - GPU kernel (default, recommended) — `reorder_type: "gpu"`.
+    - CPU (throughput-bounded; comparison/baseline path) — `reorder_type: "cpu"`.
+
+    **2.3 Self-contained, or do you have a TX peer?**
+
+    - TX+RX — closed-loop in one process.
+    - RX-only — you'll generate traffic separately. **A standalone run of any `raw_rx_*` config exits cleanly with `0` packets if no traffic arrives — that is not a bug; you need a TX peer.**
+
+    **2.4 Do you also need an in-kernel payload type conversion?**
+
+    - No — pick a leaf from the table below.
+    - Yes — `daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml` (runs on `daqiri_bench_raw_reorder_quantize`, not `daqiri_bench_raw_reorder_seq`). Combines `seq_batch_number` reorder with an in-kernel payload type conversion; the `data_types` block sets the input and output types (the example uses int4 → fp32). Pick this when wire format and compute format differ.
+
+    Concrete leaves (without conversion):
+
+    | YAML | Algorithm | Kernel | Direction |
+    |---|---|---|---|
+    | `daqiri_bench_raw_tx_rx_reorder_seq_1024.yaml` | `seq_packets_per_batch` (1024) | GPU | TX+RX |
+    | `daqiri_bench_raw_tx_rx_reorder_seq_1024_cpu.yaml` | `seq_packets_per_batch` (1024) | CPU | TX+RX |
+    | `daqiri_bench_raw_rx_reorder_seq_ppb.yaml` | `seq_packets_per_batch` (128) | GPU | RX-only |
+    | `daqiri_bench_raw_rx_reorder_seq_batch.yaml` | `seq_batch_number` | GPU | RX-only |
+    | `daqiri_bench_raw_sw_loopback_reorder_seq_1024.yaml` | `seq_packets_per_batch` (1024) | CPU | TX+RX, no NIC |
+
+    *Requires: DPDK build + Mellanox-class NIC (or the SW-loopback variant for first-time validation).*
+
+??? question "3. I need to parse small per-packet metadata on the CPU while keeping payload on the GPU"
+    - `daqiri_bench_raw_tx_rx_hds.yaml` (runs on `daqiri_bench_raw_hds`).
+
+    Header-data split: segment 0 (CPU) holds the header, segment 1 (GPU) holds the payload via GPUDirect zero-copy. Pick this when the CPU needs to read small per-packet fields without ever touching the payload.
+
+    *Requires: DPDK build + Mellanox-class NIC.*
+
+??? question "4. I need flow-based load balancing across multiple RX queues"
+    - `daqiri_bench_raw_rx_multi_q.yaml` (runs on `daqiri_bench_raw_gpudirect`).
+
+    RX-only by design — drive traffic from a separate peer. Demonstrates flow-rule-based routing across multiple RX queues, each pinned to its own CPU core.
+
+    *Requires: DPDK build + Mellanox-class NIC + a separate TX traffic source.*
+
+??? question "5. I need to record packet data to disk"
+    Sub-question: **which output format?**
+
+    **5.1 Wireshark- / tcpdump-compatible PCAP** — runs on `daqiri_example_pcap_writer`. Default; works on any filesystem. Run shape: `daqiri_example_pcap_writer <yaml> <output.pcap> [--tx]` (omit `--tx` for an RX-only tcpdump-style capture).
+
+    - **Hardware loopback** — `daqiri_example_pcap_writer_tx_rx.yaml`.
+    - **No physical NIC available** — `daqiri_example_pcap_writer_sw_loopback.yaml`.
+
+    *Requires: DPDK build. No special CMake flag.*
+
+    **5.2 Zero-copy GPU → NVMe writes** (advanced) — runs on `daqiri_example_gds_write`. Pick this *only* if the GPU-to-disk zero-copy path is the specific subject of investigation; otherwise pick PCAP (5.1).
+
+    - **Hardware loopback** — `daqiri_example_gds_write_tx_rx.yaml`.
+    - **No physical NIC available** — `daqiri_example_gds_write_sw_loopback.yaml`.
+
+    *Requires: a build with `-DDAQIRI_ENABLE_GDS=ON`, NVMe-backed storage, a working cuFile / `nvidia_fs` stack, and `gdscheck.py -p` reporting `NVMe : Supported`.*
+
+## Annotated walkthrough
+
 This section walks through the YAML configuration used by the benchmark applications. The annotated example below is based on `daqiri_bench_raw_tx_rx.yaml`. Click on the :material-plus-circle: icons to expand explanations for each annotated line.
 
 Annotations are prefixed with a category: