Description
When we run our workload on 1 or 2 FPGAs we do not have any issues, but when we run on 4 or 8 FPGAs
we usually get an indication of a shell PCI master timeout error in one of the FPGA slots during high-bandwidth DMA.
Our setup:
- F1.16xlarge (8 FPGAs running in parallel)
- Amazon Linux AMI
- Small shell version - 0x04182104
- Linux XDMA driver
From our internal debug, this is what we see:
Our PCI AXI master (CL) issues AXI write transactions to the shell with a typical burst size of 4 KB.
At some point the shell reports a timeout error on the W channel (i.e. pcim-axi-protocol-wchannel-error).
After debugging, we see that there is indeed a timeout violation between some WDATA transfers,
but the violation occurs because WREADY is de-asserted during this period (while WVALID is asserted).
As a result of the WREADY backpressure, the CL can’t complete the transaction within the timeout period.
Some time after the timeout occurs, all writes and reads from the FPGA towards PCI are stuck, including interrupts.
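
To make the W-channel observation concrete, here is a minimal Python sketch of a cycle-based timeout check. It is only an illustrative model, not the shell's actual logic: the function name `first_w_timeout`, the `only_cl_stalls` switch, and the 4-cycle threshold are all hypothetical. It shows that a checker which measures the gap between accepted WDATA beats will fire even though WVALID stays asserted, while a checker that only counts cycles in which the CL withholds data would not.

```python
def first_w_timeout(wvalid, wready, timeout_cycles, only_cl_stalls=False):
    """Return the cycle index at which the modeled timeout fires, or None.

    A data beat is accepted in a cycle where WVALID and WREADY are both high.
    By default every cycle without an accepted beat counts towards the timeout.
    With only_cl_stalls=True (hypothetical variant), only cycles in which the
    CL fails to offer data (WVALID low) are counted.
    """
    stalled = 0
    for cycle, (valid, ready) in enumerate(zip(wvalid, wready)):
        if valid and ready:
            stalled = 0  # beat transferred, restart the gap counter
            continue
        if only_cl_stalls and valid:
            continue  # CL is offering data; the stall is WREADY backpressure
        stalled += 1
        if stalled >= timeout_cycles:
            return cycle
    return None


# The CL holds WVALID high; WREADY drops for five cycles after two beats.
wvalid = [1] * 10
wready = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

# Hypothetical 4-cycle threshold; the real shell timeout is not modeled here.
print(first_w_timeout(wvalid, wready, timeout_cycles=4))
print(first_w_timeout(wvalid, wready, timeout_cycles=4, only_cl_stalls=True))
```

In this trace the gap-based check fires during the WREADY stall, while the variant that counts only CL-side stalls never fires, which matches the situation described above.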