|
| 1 | +# Bug Report: `reduce_ops` HLS Kernel Fails to Produce Output in Vivado 2024.2+ |
| 2 | + |
| 3 | +## Table of Contents |
| 4 | + |
| 5 | +1. [Overview](#overview) |
| 6 | +2. [Module Description](#module-description) |
| 7 | +3. [HLS Source](#hls-source) |
| 8 | +4. [Hardware Observation — ILA Capture](#hardware-observation--ila-capture) |
| 9 | +5. [Reproducing the Bug](#reproducing-the-bug) |
| 10 | +6. [Folder Structure](#folder-structure) |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Overview |
| 15 | + |
| 16 | +This repository documents a regression in Vitis HLS / Vivado. The `reduce_ops` kernel — a `do-while` loop with `#pragma HLS PIPELINE II=1 style=frp` and an `ap_ctrl_none` control interface — receives valid data on both AXI-Stream inputs but **never asserts `TVALID` on its output**, effectively hanging the datapath. The root cause has not been isolated (candidate constructs are `style=frp`, the `do-while` loop termination condition, `ap_ctrl_none`, or a combination thereof). The failure is silent — synthesis, implementation and DRC all complete without errors or relevant warnings. |
| 17 | + |
| 18 | +- **Working**: Vitis HLS / Vivado ≤ 2024.1 |
| 19 | +- **Broken**: Vitis HLS / Vivado ≥ 2024.2 (also confirmed in 2025.2, the latest release) |
| 20 | + |
| 21 | +Importantly, **C simulation (`csim`) and co-simulation (`cosim`) both pass** in all tested tool versions, making the regression invisible without hardware testing. |
| 22 | + |
| 23 | +This repository contains source code, logs, ILA captures, and design checkpoints from Vivado 2023.2 (working) and Vivado 2025.2 (broken; the latest available Vivado version). |
| 24 | + |
| 25 | +**N.B.:** This bug was first observed in ACCL: https://github.com/Xilinx/ACCL. Because building ACCL is generally more involved (for example, a larger design means longer synthesis and harder timing closure, and Vitis 2025.x is not supported due to the migration to vitis-run), the problematic module was isolated and added as an application to Coyote, the open-source FPGA shell. This folder contains the minimal working example: the problematic HLS code, the Verilog wrapper, a testbench, and some C++ code to run the test. All the relevant details are explained in this README. |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## Module Description |
| 30 | + |
| 31 | +The kernel under test is `reduce_ops`, taken from the open-source [ACCL project](https://github.com/Xilinx/ACCL). It performs element-wise reduction (addition or max) over two 512-bit AXI-Stream operands, selecting the data type from the `TDEST` field of the incoming stream: |
| 32 | + |
| 33 | +| `TDEST` | Operation | |
| 34 | +|---------|-----------| |
| 35 | +| 0 | fp32 add | |
| 36 | +| 1 | fp64 add | |
| 37 | +| 2 | int32 add | |
| 38 | +| 3 | int64 add | |
| 39 | +| 5 | fp32 max | |
| 40 | +| 6 | fp64 max | |
| 41 | +| 7 | int32 max | |
| 42 | +| 8 | int64 max | |
| 43 | + |
| 44 | +The kernel processes words in a `do-while` loop, terminating when `TLAST` is asserted. The loop body is pipelined with `#pragma HLS PIPELINE II=1 style=frp`: |
| 45 | + |
| 46 | +```cpp |
| 47 | +void reduce_ops(STREAM<stream_word> &in0, STREAM<stream_word> &in1, STREAM<stream_word> &out) { |
| 48 | +#pragma HLS INTERFACE axis register both port=in0 |
| 49 | +#pragma HLS INTERFACE axis register both port=in1 |
| 50 | +#pragma HLS INTERFACE axis register both port=out |
| 51 | +#pragma HLS INTERFACE ap_ctrl_none port=return |
| 52 | + stream_word op0, op1, wword; |
| 53 | + ap_uint<DATA_WIDTH> res; |
| 54 | + |
| 55 | + do { |
| 56 | +#pragma HLS PIPELINE II=1 style=frp |
| 57 | + op0 = STREAM_READ(in0); |
| 58 | + op1 = STREAM_READ(in1); |
| 59 | + |
| 60 | + if (op0.dest == 0) res = stream_add<DATA_WIDTH, DEST_WIDTH, float> (op0.data, op1.data); |
| 61 | + else if (op0.dest == 1) res = stream_add<DATA_WIDTH, DEST_WIDTH, double> (op0.data, op1.data); |
| 62 | + // ... (further cases omitted for brevity) |
| 63 | + else res = stream_add<DATA_WIDTH, DEST_WIDTH, float> (op0.data, op1.data); |
| 64 | + |
| 65 | + wword.data = res; |
| 66 | + wword.last = op0.last; |
| 67 | + wword.keep = op0.keep; |
| 68 | + wword.dest = 0; |
| 69 | + STREAM_WRITE(out, wword); |
| 70 | + |
| 71 | + } while(op0.last != 1); |
| 72 | +} |
| 73 | +``` |
| 74 | + |
| 75 | +The notable constructs in this kernel are the combination of `ap_ctrl_none`, a `do-while` loop whose exit condition reads a stream-derived signal (`op0.last`), and `style=frp` on the pipeline. The root cause of the regression has not been isolated — it is unclear which of these (or their combination) changed behaviour between tool versions. |
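
The intended loop behaviour can be approximated in plain C++ to make the expectation explicit. The sketch below is not the HLS kernel: `Beat`, `reduce_model`, and the use of `std::queue` are hypothetical stand-ins, and only one 32-bit lane is modelled. It assumes the semantics described above (one read per input and one write per iteration, exit after the beat with `last` set), so a single-beat transfer should yield exactly one output beat.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Hypothetical software stand-in for one AXI-Stream beat
// (not the HLS stream_word type; only one 32-bit lane is modelled).
struct Beat {
    std::uint32_t lane0;
    std::uint8_t dest;
    bool last;
};

// Behavioural model of the do-while loop: each iteration reads one beat
// from each input and writes one beat, and the loop exits after the beat
// whose `last` flag is set. Returns the number of beats written.
std::size_t reduce_model(std::queue<Beat> &in0, std::queue<Beat> &in1,
                         std::queue<Beat> &out) {
    std::size_t written = 0;
    Beat op0{};
    do {
        // The hardware would block on an empty stream; the model asserts.
        assert(!in0.empty() && !in1.empty());
        op0 = in0.front();
        in0.pop();
        const Beat op1 = in1.front();
        in1.pop();

        const Beat result{op0.lane0 + op1.lane0, 0, op0.last};
        out.push(result);
        ++written;
    } while (!op0.last);
    return written;
}
```

Under these assumptions, feeding one beat with `last` set into each input returns `1` and leaves one beat in `out`, which is the behaviour the 2023.2 build shows in hardware.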
| 76 | + |
| 77 | +In the test setup, `TDEST` is hardwired to `2` (int32 add) in the vFPGA top, and the host sends one 512-bit beat (16 × 32-bit integers) per operand stream. |
| 78 | + |
| 79 | +--- |
| 80 | + |
| 81 | +## HLS Source |
| 82 | + |
| 83 | +The full HLS source, testbench, and standalone simulation script are in `source/hw/src/hls/reduce_ops/`: |
| 84 | + |
| 85 | +| File | Description | |
| 86 | +|------|-------------| |
| 87 | +| `reduce_ops.h` | Types, constants (`DATA_WIDTH=512`, `DEST_WIDTH=8`), stream macros | |
| 88 | +| `reduce_ops.cpp` | Full kernel — template helpers `stream_add`/`stream_max` + `reduce_ops` top | |
| 89 | +| `reduce_ops_tb.cpp` | HLS C testbench: one beat of 16 int32 values, checks output | |
| 90 | +| `run_tb.tcl` | Standalone Vitis HLS script: runs csim → csynth → cosim | |
| 91 | +
|
| 92 | +To run the standalone HLS simulation (reproduces the csim/cosim pass): |
| 93 | + |
| 94 | +```bash |
| 95 | +cd source/hw/src/hls/reduce_ops |
| 96 | +vitis_hls -f run_tb.tcl # Vitis HLS 2022.x – 2024.x |
| 97 | +vitis-run --tcl run_tb.tcl --mode hls # Vitis HLS 2025.x+ |
| 98 | +``` |
| 99 | + |
| 100 | +The complete Coyote hardware and software projects are in `source/hw/` and `source/sw/` respectively. |
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## Hardware Observation — ILA Capture |
| 105 | + |
| 106 | +The ILA probes all three AXI-Stream interfaces at the vFPGA boundary (`axis_host_recv[0]`, `axis_host_recv[1]`, `axis_host_send[0]`). |
| 107 | + |
| 108 | +### Vivado 2023.2 — Working |
| 109 | + |
| 110 | +Both operand streams arrive, and the output stream fires correctly on the same transaction: |
| 111 | + |
| 112 | + |
| 113 | + |
| 114 | +- `axis_host_recv[0].tvalid` and `axis_host_recv[1].tvalid` pulse high as data is transferred. |
| 115 | +- `axis_host_send[0].tvalid` asserts in the same window, delivering the result. |
| 116 | +- Sample values confirm correct int32 addition: `recv[0]` carries odd integers (1 – 31), `recv[1]` carries even integers (2 – 32), and `send[0]` carries their element-wise sums (3 – 63). |
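
The sample values above can be recomputed with a small reference model. This is a sketch only: `Word`, `make_operand`, `lane_add`, and `check_against_capture` are hypothetical names, and the wrap-around on overflow is an assumption about fixed-width adders rather than something taken from the kernel source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One 512-bit beat viewed as 16 int32 lanes (512 / 32 = 16).
constexpr std::size_t kLanes = 512 / 32;
using Word = std::array<std::int32_t, kLanes>;

// Hypothetical helper: fills a beat with first, first + 2, first + 4, ...
Word make_operand(std::int32_t first) {
    Word w{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        w[i] = first + 2 * static_cast<std::int32_t>(i);
    }
    return w;
}

// Lane-wise add. The sum is formed in unsigned arithmetic so that overflow
// wraps instead of being undefined behaviour in C++ (assumed to match a
// fixed-width hardware adder).
Word lane_add(const Word &a, const Word &b) {
    Word r{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        r[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(a[i]) +
                                         static_cast<std::uint32_t>(b[i]));
    }
    return r;
}

// Recomputes the first and last lanes reported for the working capture.
void check_against_capture() {
    const Word sum = lane_add(make_operand(1), make_operand(2));
    assert(sum.front() == 3);
    assert(sum.back() == 63);
}
```

Here `make_operand(1)` yields the odd integers 1 to 31 and `make_operand(2)` the even integers 2 to 32, so the lane-wise sums run from 3 to 63 in steps of 4.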
| 117 | + |
| 118 | +### Vivado 2025.2 — Broken |
| 119 | + |
| 120 | +The operand streams arrive identically, but **the output stream never asserts `TVALID`**: |
| 121 | + |
| 122 | + |
| 123 | + |
| 124 | +- `axis_host_recv[0].tvalid` and `axis_host_recv[1].tvalid` pulse high as before — the kernel receives data correctly. |
| 125 | +- `axis_host_send[0].tvalid` **remains 0** throughout; `tdata` is all zeros. |
| 126 | +- The kernel consumes its inputs and stalls without producing any output. |
| 127 | + |
| 128 | +There are no AXI protocol violations, handshake errors, or DRC failures in either build. The regression is entirely in the kernel's output behaviour. |
| 129 | + |
| 130 | +--- |
| 131 | + |
| 132 | +## Reproducing the Bug |
| 133 | + |
| 134 | +### Prerequisites |
| 135 | + |
| 136 | +- Alveo U55C (or U250 / U280) |
| 137 | +- Vivado + Vitis HLS (test with ≥ 2024.2 to observe the bug; ≤ 2024.1 for the working reference) |
| 138 | +- [Coyote](https://github.com/fpgasystems/Coyote), checked out at the current branch |
| 139 | + |
| 140 | +Pre-built bitstreams, ILA probe files, routed checkpoints, and all build logs for both 2023.2 and 2025.2 are included in this repository so the bug can be observed without re-synthesising. |
| 141 | + |
| 142 | +### Using the pre-built bitstreams |
| 143 | + |
| 144 | +1. **Program the FPGA** using Vivado Hardware Manager, loading the bitstream and probe file from `bitstreams/<version>/`. |
| 145 | +2. **Rescan PCIe** or perform a warm reboot to re-enumerate the device. |
| 146 | +3. **Insert the Coyote driver**: |
| 147 | + ```bash |
| 148 | + sudo insmod coyote_driver.ko |
| 149 | + ``` |
| 150 | +4. **Build and run the software test**: |
| 151 | + ```bash |
| 152 | + cd source/sw && mkdir build && cd build |
| 153 | + cmake .. -DFDEV_NAME=u55c |
| 154 | + make && sudo ./test |
| 155 | + ``` |
| 156 | + On a working build the test prints the expected sums and exits with "Validation passed!". On a broken build the `checkCompleted` poll never returns, as no write completion is signalled. |
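
Because the completion poll never returns on a broken build, it can help to bound the wait so the test fails instead of hanging. The following is a minimal sketch of a generic bounded poll; `poll_with_timeout` and its parameters are hypothetical and not part of the Coyote API, and the `done` predicate is a placeholder for whatever completion check the host program performs.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Calls `done` repeatedly until it returns true or `timeout` elapses.
// Returns true if completion was observed, false on timeout.
bool poll_with_timeout(
    const std::function<bool()> &done,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
    assert(interval.count() > 0);  // a zero interval would busy-spin
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (done()) {
            return true;
        }
        std::this_thread::sleep_for(interval);
    }
    return done();  // one final check at the deadline
}
```

A host test could wrap its completion check in a lambda, pass it as `done`, and report a failure when the function returns `false`.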
| 157 | + |
| 158 | +### Re-synthesising from source |
| 159 | + |
| 160 | +```bash |
| 161 | +# Hardware synth |
| 162 | +cd source/hw && mkdir build && cd build |
| 163 | +cmake .. -DFDEV_NAME=u55c |
| 164 | +make project && make bitgen |
| 165 | + |
| 166 | +# Software compilation (run from the repository root) |
| 167 | +cd source/sw && mkdir build && cd build |
| 168 | +cmake .. |
| 169 | +make |
| 170 | +``` |
| 171 | +Then, follow the steps from above to program the FPGA, insert the driver and run the test. |
| 172 | + |
| 173 | +--- |
| 174 | + |
| 175 | +## Folder Structure |
| 176 | + |
| 177 | +Artifacts are split by `<vivado-version>` (`2023.2` or `2025.2`) for direct side-by-side comparison. |
| 178 | + |
| 179 | +``` |
| 180 | +. |
| 181 | +├── source/ |
| 182 | +│ ├── hw/ # Coyote hardware project |
| 183 | +│ │ ├── CMakeLists.txt |
| 184 | +│ │ └── src/ |
| 185 | +│ │ ├── vfpga_top.svh # vFPGA top: instantiates reduce_ops_hls_ip + ILA |
| 186 | +│ │ ├── init_ip.tcl # Creates ila_reduce Vivado IP |
| 187 | +│ │ └── hls/reduce_ops/ |
| 188 | +│ │ ├── reduce_ops.h # Types, constants, stream macros |
| 189 | +│ │ ├── reduce_ops.cpp # Kernel source (stream_add, stream_max, reduce_ops) |
| 190 | +│ │ ├── tb.cpp # C testbench |
| 191 | +│ │ └── tb_hls.tcl # Standalone HLS sim script (csim/csynth/cosim) |
| 192 | +│ └── sw/ # Coyote software project |
| 193 | +│ ├── CMakeLists.txt |
| 194 | +│ └── src/main.cpp # Host software: sends 16 int32 operands to the FPGA, waits for completion, checks result |
| 195 | +│ |
| 196 | +├── images/ |
| 197 | +│ ├── waveform_23_2.png # ILA capture — Vivado 2023.2 (working) |
| 198 | +│ └── waveform_25_2.png # ILA capture — Vivado 2025.2 (broken) |
| 199 | +│ |
| 200 | +├── bitstreams/ |
| 201 | +│ └── <vivado-version>/ |
| 202 | +│ ├── cyt_top_<ver>.bit # Loadable bitstream |
| 203 | +│ └── cyt_top_<ver>.ltx # ILA probe file |
| 204 | +│ |
| 205 | +├── checkpoints/ |
| 206 | +│ └── <vivado-version>/ |
| 207 | +│ └── shell_routed_<ver>.dcp # Post-PnR routed checkpoint |
| 208 | +│ |
| 209 | +├── ila-data/ |
| 210 | +│ └── <vivado-version>/ |
| 211 | +│ ├── ila_capture_<ver>.ila # Native Vivado ILA capture |
| 212 | +│ └── ila_csv_<ver>.csv # Exported ILA capture (CSV) |
| 213 | +│ |
| 214 | +├── logs/ |
| 215 | +│ └── <vivado-version>/ |
| 216 | +│ ├── vitis_hls_reduce_ops_<ver>.log # Vitis HLS synthesis log |
| 217 | +│ ├── vivado_synth_reduce_ops_<ver>.log # Vivado OOC synth log (reduce_ops IP) |
| 218 | +│ ├── vivado_synth_top_<ver>.log # Vivado top-level synthesis log (containing the reduce_ops) |
| 219 | +│ └── vivado_pnr_<ver>.log # Place-and-route log |
| 220 | +│ |
| 221 | +└── reports/ |
| 222 | + └── <vivado-version>/ |
| 223 | + ├── route_status_<ver>.rpt # Post-PnR route status |
| 224 | + ├── timing_summary_<ver>.rpt # Timing summary |
| 225 | + └── drc_bitstream_checks_<ver>.rpt # DRC checks (no critical errors) |
| 226 | +``` |