
[Support]: General issue of missing packets #332

@xguerin

Description

Preliminary Actions

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

DPDK 23.11

Custom Code

No

OS Platform and Distribution

Ubuntu Noble

Support request

This is a follow-up to issues #235 and #286, as I cannot re-open either of them. I believe there is a more general issue of missing packets when using the DPDK driver for ENA.

Background

  1. I am using a custom TCP stack on top of the ENA DPDK driver;
  2. The correctness of that stack is irrelevant, as the problem concerns L1/L2;
  3. When establishing connections to external hosts, two failure scenarios may occur:
    a. An entire connection may disappear (ref [Support]: sudden disappearance of user-space TCP streams #286)
    b. Streams of packets, more or less long, are lost (ref DPDK ENA PMD Silently Drops Packets on Rx #235)

Observations

  1. The ENA port never reports RX overruns, misses, or any sort of errors
  2. Our queue utilization never goes beyond 20% (we use the max RX queue depth of 8192)
    a. More precisely, we ask to read 8192 buffers and never get more than 2000 back
    b. Multiple consecutive reads may each return as many, but never more
  3. We notice missing packets even at very small loads (20 TCP connections, 2 MB/s bandwidth)
  4. Beyond the odd missing packet, we experience waves of large dropped packet streams
  5. These issues appear on all instance types we tested: c5*, c6i*, m6i*, and c7i
  6. We have not yet tested on metal instances
  7. Throwing more queues at the problem does not help (tested on c6in.8xlarge)

It is interesting to note that, for identical configurations, the kernel driver never loses a single packet (as per tcpdump, assuming it captures packets pre-reassembly). At one point I was able to use traffic mirroring to verify the streams, but that is no longer possible as no Nitro instance is supported.

Case analysis

In one instance, we ran 250 connections to multiple external hosts over 8 hours on a single port, with 7 queues assigned to TCP traffic (c6in instance). The results of the run are below:

20241212 - BinanceUS packet losses

In that picture, we show:

  1. on top, the per-minute bandwidth of the instance
  2. on the bottom, the per-5-minutes throughput in packets/s
  3. the vertical lines mark the occurrences of large lost packet streams

What you can see immediately is that, except for the large number of logical connections, the bandwidth and PPS throughput in use are very reasonable. You can also see that the lost streams do not coincide with any peak in either bytes/s or packets/s. Those lost streams are also very large: in the one at 01:50, we lost packets on 105 connections, for a total of 3 MB.

I'm running out of ideas, as I can't use traffic mirroring to check what actually arrives on the wire. I have yet to test on metal instances and to benchmark the driver between two internal hosts to see if I can reproduce the problem locally. Any help/suggestion would be appreciated.

Contact Details

No response


Labels: DPDK driver, support, triage
