Description
Preliminary Actions
- I have searched the existing issues and didn't find a duplicate.
- I have followed the AWS official troubleshoot documentation.
- I have followed the driver readme and best practices.
Driver Type
DPDK PMD for Elastic Network Adapter (ENA)
Driver Tag/Commit
DPDK 23.11
Custom Code
No
OS Platform and Distribution
Ubuntu Noble
Support request
This is a follow-up on issues #235 and #286, as I cannot re-open either of them. I believe there is a more general issue with missing packets when using the DPDK driver for ENA.
Background
- I am using a custom TCP stack on top of the ENA DPDK driver;
- The correctness of that stack is irrelevant, as the problem concerns L1/L2;
- When establishing connections to external hosts, two failure scenarios can occur:
a. The entire connection may disappear (ref [Support]: sudden disappearance of user-space TCP streams #286)
b. Streams of packets, of varying length, are lost (ref DPDK ENA PMD Silently Drops Packets on Rx #235)
Observations
- The ENA port never reports RX overruns, misses, or any sort of errors
- Our queue utilization never goes beyond 20% (we use the max RX queue depth of 8192)
a. More precisely, we ask to read 8192 buffers and never get more than 2000 back
b. Multiple consecutive reads may return as much, but never more
- We notice missing packets even at very small loads (20 TCP connections, 2 MB/s bandwidth)
- Beyond the odd missing packet, we experience waves of large dropped-packet streams
- These issues appear on all instance types we tested: c5*, c6i*, m6i*, and c7i
- We have not yet tested on metal instances
- Throwing more queues at the problem does not help (tested on c6in.8xlarge)
It is interesting to note that, for identical configurations, the kernel driver never loses a single packet (per tcpdump, assuming it captures packets pre-reassembly). I was once able to use traffic mirroring to verify the streams, but that is no longer possible, as no Nitro instance is supported.
Case analysis
In one instance, we ran 250 connections to multiple external hosts over 8 hours on a single port, with 7 queues assigned to TCP traffic (c6in instance). The results of the run are below:
In that picture, we show:
- on top, the per-minute bandwidth of the instance
- on the bottom, the per-5-minute throughput in packets/s
- the vertical lines mark the instances of large lost-packet streams
What stands out immediately is that, apart from the large number of logical connections, the bandwidth in use and the PPS throughput are very reasonable. You can also see that the lost streams do not coincide with any peak in anything (bytes/s or packets/s). Those lost streams are also very large: in the one at 01:50, we lost packets on 105 connections for a total of 3 MB.
I'm running out of ideas, as I can't use traffic mirroring to check what actually arrives on the wire. I have yet to test on metal instances and to benchmark the driver between two internal hosts to see if I can reproduce the issue locally. Any help or suggestions would be appreciated.
Contact Details
No response
