Description
Preliminary Actions
- I have searched the existing issues and didn't find a duplicate.
- I have followed the AWS official troubleshoot documentation.
- I have followed the driver readme and best practices.
Driver Type
DPDK PMD for Elastic Network Adapter (ENA)
Driver Tag/Commit
DPDK 23.11
Custom Code
No
OS Platform and Distribution
Ubuntu Noble
Support request
This is a follow-up on issues #235 and #286, as I cannot re-open either of them. I believe there is a more general issue with missing packets when using the DPDK driver for ENA.
Background
- I am using a custom TCP stack on top of the ENA DPDK driver;
- The correctness of that stack is irrelevant, as the problem concerns L1/L2;
- When establishing connections to external hosts, two failure scenarios can occur:
a. The entire connection may disappear (ref [Support]: sudden disappearance of user-space TCP streams #286)
b. Streams of packets, of varying length, are lost (ref DPDK ENA PMD Silently Drops Packets on Rx #235)
Observations
- The ENA port never reports RX overruns, misses, or any sort of errors
- Our queue utilization never goes beyond 20% (we use the max RX queue depth of 8192)
a. More precisely, we ask to read 8192 buffers and never get more than 2000 back
b. Multiple consecutive reads may return as much, but never more
- We notice missing packets even at very small loads (20 TCP connections, 2 MB/s bandwidth)
- Beyond the odd missing packet, we experience waves of large dropped-packet streams
- These issues appear on all instance types we tested: c5*, c6i*, m6i*, and c7i
- We have not yet tested on metal instances
- Throwing more queues at the problem does not help (tested on c6in.8xlarge)
It is interesting to note that, for identical configurations, the kernel driver never loses a single packet (per tcpdump, assuming it captures packets pre-reassembly). I was once able to use traffic mirroring to verify the streams, but that is no longer possible, as no Nitro instance is supported.
Case analysis
In one instance, we ran 250 connections to multiple external hosts over 8 hours on a single port, with 7 queues assigned to TCP traffic (c6in instance). The results of the run are below:
In that picture, we show:
- on top, the per-minute bandwidth of the instance
- on the bottom, the per-5-minute throughput in packets/s
- the vertical lines mark the instances of large lost-packet streams
What stands out immediately is that, apart from the large number of logical connections, the bandwidth in use and the PPS throughput are very reasonable. You can also see that the lost streams do not coincide with any peak in anything (bytes/s or packets/s). Those lost streams are also very large: in the one at 01:50, we lost packets on 105 connections for a total of 3 MB.
I'm running out of ideas, as I can't use traffic mirroring to check what actually arrives on the wire. I have yet to test on metal instances and to benchmark the driver between two internal hosts to see if I can reproduce the issue locally. Any help or suggestions would be appreciated.
Contact Details
No response
