When running on all FPGAs of an F1.16xlarge, PCIe access to one of them gets stuck #656

Open
@NoamDualBird

Description

When we run our workload on 1 or 2 FPGAs we have no issues, but when we run on 4 or 8 FPGAs we usually see a shell PCI master timeout error in one of the FPGA slots during high-bandwidth DMA.

Our setup:

  1. F1.16xlarge (8 FPGAs running in parallel)
  2. Amazon Linux AMI
  3. Small shell version - 0x04182104
  4. Linux XDMA driver

From our internal debugging, this is what we see:
Our PCI AXI master (CL) writes AXI transactions to the shell with a typical burst size of 4 KB.
At some point the shell reports a timeout error on the W channel (i.e. pcim-axi-protocol-wchannel-error).
Debugging shows that there is indeed a timeout violation between some WDATA transfers,
but the violation occurs because WREADY is de-asserted during this period (while WVALID is asserted).
Because of this WREADY backpressure, the CL cannot complete the transaction within the timeout period.

Some time after the timeout occurs, all writes and reads from the FPGA towards PCI get stuck, including interrupts.
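
To make the W-channel condition described above concrete, here is a minimal Python sketch of how such a stall could be spotted in a captured per-cycle trace of (WVALID, WREADY) samples. It is an illustration only: the function name `find_w_stall`, the trace format, and the `timeout_cycles` value are assumptions made for this example, not the shell's actual watchdog logic or its real timeout.

```python
from typing import Iterable, Optional, Tuple


def find_w_stall(
    trace: Iterable[Tuple[bool, bool]], timeout_cycles: int
) -> Optional[Tuple[int, int]]:
    """Scan (wvalid, wready) samples taken while a write burst is in flight.

    Returns (cycle, backpressured) for the first cycle at which the gap
    since the last accepted data beat reaches timeout_cycles, where
    backpressured counts the gap cycles that had WVALID=1 and WREADY=0.
    Returns None if no gap reaches the threshold.

    timeout_cycles is a hypothetical threshold for illustration only.
    """
    gap = 0            # cycles since the last WVALID && WREADY handshake
    backpressured = 0  # gap cycles where the master was ready but the slave was not
    for cycle, (wvalid, wready) in enumerate(trace):
        if wvalid and wready:
            # A data beat was accepted, so the gap counters restart.
            gap = 0
            backpressured = 0
            continue
        gap += 1
        if wvalid and not wready:
            backpressured += 1
        if gap >= timeout_cycles:
            return cycle, backpressured
    return None


# One accepted beat, five cycles of WREADY backpressure, then another beat.
example_trace = [(True, True)] + [(True, False)] * 5 + [(True, True)]
print(find_w_stall(example_trace, timeout_cycles=4))
```

If `backpressured` equals the full gap length, every stalled cycle had WVALID asserted with WREADY low, which matches what we observe: the CL is offering data the whole time and the delay comes from the shell side.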
