XDMA poor performance and high latency spikes explained and fixed #332

@dmitrym1

Description


In this issue I'll try to briefly explain the reasons behind the XDMA driver's poor performance and high latency spikes, and how to fix them, effectively increasing throughput on non-RT systems and stabilizing latencies on RT systems.
Don't focus on the absolute throughput numbers; I did not verify the correctness of my calculations for this demo. The time values are correct, though.

Test system:
i.MX8M Mini (quad-core ARM), Artix-7 with XDMA and a BRAM buffer as the AXI peripheral. MSI interrupts, RT kernel, Yocto Linux.
Important: the i.MX8M Mini does not support MSI-X in my setup, nor does it support steering the MSI IRQ to a specific CPU core. The results for interrupt mode could be different if it did.

Test procedures:

  1. For the throughput test, 100 reads of 8 KB each are performed, then the result is printed and the test restarts.
  2. For the latency test, a single read of 384 bytes is performed, then the result is printed and the test restarts.
  3. Writes show equivalent behavior and are thus excluded from this document.
  4. Legacy interrupts and MSI-X interrupts were not tested.

First, let me show you the test results.

Test results

  1. MSI interrupt (screenshots)

  2. poll_mode=1 (screenshots)

  3. poll_mode=15 but without proper userspace configuration (screenshots)

  4. poll_mode=15 with proper userspace configuration (screenshots)
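For reference, poll mode is selected when loading the driver via its `poll_mode` module parameter. The stock driver effectively treats it as on/off; a value like 15 only makes sense with the patched driver (my assumption is that the patch turns it into a mask, so check the patch for the exact semantics):

```shell
# Stock XDMA driver: enable poll mode at load time
sudo insmod xdma.ko poll_mode=1

# Patched driver from this issue (value semantics come from the patch,
# not from the upstream driver)
sudo insmod xdma.ko poll_mode=15
```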

As you can see, MSI interrupt mode was the worst: both performance and latencies were very unstable. According to the XDMA documentation, switching to poll mode should improve the numbers, and it did, though not by much. Applying my fix without the proper userspace configuration shows only a marginal improvement; that is what happens if you apply the patch and forget the userspace part. Applying my fix together with the proper userspace configuration shows an incredible result, many times better than what the original MSI interrupt mode could offer. Not only did it improve and stabilize the throughput numbers, it also made XDMA RT-capable, bringing latency down to decent numbers that are also very stable.

The problem is that the original code forces the driver to do a lot of context switching and core migration, which are not only expensive operations (~300 us each on my platform) but also a source of latency spikes. Given that my SoC is limited to 128-byte TLPs, it becomes obvious that such a small packet size leads to frequent context switching, which destroys performance.

This is what happens in the original code (cmpl_ is the poll thread, UIC is the userspace thread; note how the driver disperses the load across three CPU cores: there is a lot of switching and a lot of overhead loss):

(screenshot: scheduler trace, original code)

This is what happens after applying my fix (the poll thread and the userspace thread are on different CPU cores here just for this demo; you'll get better results with both threads on the same core; note that there is no switching and much less overhead loss):

(screenshot: scheduler trace, fixed code)

The pictures are not ideal, so take them with a grain of salt, but I hope they make my explanation clear.
