Description
In this issue I’ll briefly explain the reasons behind the XDMA driver’s poor performance and high latency spikes, and how to fix them, effectively increasing throughput on non-RT systems and stabilizing latencies on RT systems.
Don’t focus on the absolute throughput numbers; I did not verify the correctness of my calculations for this demo. The time values are correct, though.
Test system:
iMX8M Mini, quad-core ARM, Artix-7 with XDMA and a BRAM buffer as an AXI peripheral. MSI interrupts, RT kernel, Yocto Linux.
Important: the iMX8M Mini does not support MSI-X in my setup, and it also does not support steering the MSI IRQ to a specific CPU core. The results for interrupt mode could be different on a platform that does.
Test procedures:
- For the throughput test, 100 reads of 8 KB are performed, then the result is printed and the test restarts (see the sketch after this list).
- For the latency test, a single read of 384 bytes is performed, then the result is printed and the test restarts.
- Writes show equivalent behavior and are thus excluded from this document.
- Legacy interrupts and MSI-X interrupts were not tested.
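To make the procedure concrete, here is a minimal sketch of the throughput loop. It assumes the stock XDMA character device `/dev/xdma0_c2h_0`; the device name, read offset, and buffer alignment are illustrative, not taken from my actual test code.

```c
/* Minimal throughput-test sketch: 100 reads of 8 KB, print, repeat.
 * /dev/xdma0_c2h_0, offset 0 and 4 KB alignment are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define XFER_SIZE  (8 * 1024)  /* 8 KB per read */
#define ITERATIONS 100         /* reads per measurement */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *buf = aligned_alloc(4096, XFER_SIZE);
    int fd = open("/dev/xdma0_c2h_0", O_RDONLY);

    if (fd < 0 || !buf)
        return 1;

    for (;;) {
        double start = now_sec();

        for (int i = 0; i < ITERATIONS; i++)
            if (pread(fd, buf, XFER_SIZE, 0) != XFER_SIZE)
                return 1;

        printf("%.2f MB/s\n",
               ITERATIONS * XFER_SIZE / (now_sec() - start) / 1e6);
    }
}
```

The latency variant is the same loop timing a single 384-byte read per iteration.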
First, let me show you the test results.
Test results
- interrupt mode (default)
- poll_mode=1
- poll_mode=15 but without proper userspace configuration
- poll_mode=15 with proper userspace configuration
As you can see, MSI interrupt mode was the worst: both throughput and latency were very unstable. According to the XDMA documentation, switching to poll mode should improve the numbers, and it did, though not by much. Applying my fix without the proper userspace configuration shows only a marginal improvement; that is what you get if you apply the patch and forget the userspace part. Applying my fix together with the proper userspace configuration shows an incredible result, many times better than what the original MSI interrupt mode was able to offer. Not only did it improve and stabilize the throughput numbers, it also made XDMA RT-capable, bringing latency down to decent values that are also very stable.
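For clarity, "proper userspace configuration" here means pinning the userspace thread to a fixed CPU core and giving it a real-time priority. A minimal sketch follows; the core number and priority are illustrative, pick the core your poll thread is pinned to:

```c
/* Sketch: pin the calling thread to one CPU core and raise it to an
 * RT priority. The concrete core and priority values are examples. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_and_prioritize(int cpu, int prio)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = prio };

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* pid 0 = the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set))
        return -1;

    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}
```

The driver's cmpl_* poll thread can be pinned the same way from a shell (taskset -pc and chrt -f -p on its PID), which should give you the "both threads on one core" setup described below.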
The problem is that the original code forces the driver to do a lot of context switching and core migration, which are not only expensive operations (~300 us each on my platform) but also a source of latency spikes. Given that my SoC has a 128-byte TLP size limit, it becomes obvious that such a small packet size leads to frequent context switches, which destroys performance.
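To illustrate the kernel-side idea (this is not the actual patch; start_poll_thread and engine_poll_fn are placeholder names, not real XDMA driver symbols), the poll kthread can be created bound to a fixed CPU with kthread_bind(), so the scheduler never migrates it:

```c
/* Sketch: bind the completion poll thread to one CPU at creation. */
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

/* Stand-in for the driver's existing descriptor-completion poll loop. */
static int engine_poll_fn(void *data)
{
    while (!kthread_should_stop())
        cond_resched();  /* the real driver polls writeback status here */
    return 0;
}

static struct task_struct *start_poll_thread(void *data, int cpu)
{
    struct task_struct *t;

    t = kthread_create(engine_poll_fn, data, "cmpl_bound_%d", cpu);
    if (IS_ERR(t))
        return t;

    kthread_bind(t, cpu);  /* fixed affinity: no core migration */
    wake_up_process(t);
    return t;
}
```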
This is what happens in the original code (cmpl_ is the poll thread, UIC is the userspace thread; note how the driver disperses the load across 3 CPU cores: there is a lot of switching and a lot of overhead loss):
This is what happens after applying my fix (the poll thread and the userspace thread are on different CPU cores here, but that is just for this demo; you'll get better results with both threads on the same CPU core; note that there is no switching and much less overhead loss):
The pictures are not ideal, so take them with a grain of salt, but I hope they make the explanation clear.