[BUG] Buggy network conditions cause permanent TCP connection exhaustion #13823

Open
@ryao

Description / Steps to reproduce the issue

We have a simple application running on NuttX on the RP2040 that allows network access to a serial port, and we were stress testing it. The NIC of the workstation used for stress testing has some kind of issue that causes connections to fail periodically. The motherboard was already replaced once without solving it, but that is off topic. After a period of such failures, NuttX got into a strange state where it would respond to pings, but attempts to connect to listening TCP sockets would fail. Additionally, ifconfig printed no output.

I attached OpenOCD and started debugging with gdb, and found that tcp_alloc() is returning 0x0, which causes the TCP packets to be dropped:

(gdb) bt
#0  tcp_alloc (domain=domain@entry=2 '\002') at tcp/tcp_conn.c:693
#1  0x10013f1c in tcp_alloc_accept (dev=dev@entry=0x20001710 <g_encx24j600+356>, tcp=0x0, tcp@entry=0x1003d210 <__stack_chk_guard>, listener=listener@entry=0x200040bc <g_tcp_connections+1240>) at tcp/tcp_conn.c:1083
#2  0x10016c3c in tcp_input (domain=2 '\002', iplen=<optimized out>, dev=0x20001710 <g_encx24j600+356>) at tcp/tcp_input.c:815
#3  tcp_ipv4_input (dev=dev@entry=0x20001710 <g_encx24j600+356>) at tcp/tcp_input.c:1761
#4  0x10015b6e in ipv4_in (dev=0x20001710 <g_encx24j600+356>) at devif/ipv4_input.c:401
#5  0x1001645c in netdev_input (dev=0x20001710 <g_encx24j600+356>, callback=0x20007188, callback@entry=0x10015acd <ipv4_in>, reply=reply@entry=true) at netdev/netdev_input.c:90
#6  0x10015bd2 in ipv4_input (dev=dev@entry=0x20001710 <g_encx24j600+356>) at devif/ipv4_input.c:510
#7  0x10006e0e in enc_rxdispatch (priv=0x200015ac <g_encx24j600>) at net/encx24j600.c:1437
#8  enc_pktif (priv=0x200015ac <g_encx24j600>) at net/encx24j600.c:1636
#9  enc_irqworker (arg=0x200015ac <g_encx24j600>) at net/encx24j600.c:1835
#10 0x1000312e in work_thread (argc=<optimized out>, argv=<optimized out>) at wqueue/kwork_thread.c:186
#11 0x10003f0c in nxtask_start () at task/task_start.c:107
#12 0x00000000 in ?? ()

Apparently, we ran out of tcp_conn_s connection structures:

(gdb) print g_free_tcp_connections
$36 = {head = 0x0, tail = 0x0}
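To illustrate why an empty free list leads to dropped connections, here is a minimal sketch of a fixed-size pool backed by a free list. It is not the NuttX implementation: `struct conn`, `pool_init`, `pool_alloc`, `pool_free`, and `POOL_SIZE` are hypothetical names, and the pool size of 8 is only chosen to match the number of entries in the dumps in this report.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 8  /* hypothetical; matches the 8 entries dumped in this report */

/* Hypothetical stand-in for a connection structure; not NuttX's tcp_conn_s. */
struct conn {
    struct conn *next;    /* free-list link */
    unsigned char crefs;  /* reference count */
};

static struct conn pool[POOL_SIZE];
static struct conn *free_head;

/* Put every pool entry on the free list. */
void pool_init(void)
{
    free_head = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        pool[i].crefs = 0;
        pool[i].next = free_head;
        free_head = &pool[i];
    }
}

/* Take one entry, or return NULL when the free list is empty. */
struct conn *pool_alloc(void)
{
    struct conn *c = free_head;
    if (c != NULL) {
        free_head = c->next;
        c->next = NULL;
        c->crefs = 1;
    }
    return c;
}

/* Return an entry to the free list once its owner releases it. */
void pool_free(struct conn *c)
{
    assert(c != NULL && c->crefs > 0);
    c->crefs = 0;
    c->next = free_head;
    free_head = c;
}
```

In this model, once all `POOL_SIZE` entries are handed out and none is ever passed back to `pool_free`, every later `pool_alloc` returns NULL, which is the same symptom as `tcp_alloc()` returning 0x0 above.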

I decided to look at the states of the TCP connections and found 5 are in TCP_CLOSED and 3 are in TCP_ALLOCATED:

(gdb) set $n = sizeof(g_tcp_connections) / sizeof(g_tcp_connections[0])
(gdb) set $i = 0
(gdb) while ($i < $n)
 >p g_tcp_connections[$i].tcpstateflags
 >set $i = $i + 1
 >end
$47 = 0 '\000'
$48 = 0 '\000'
$49 = 0 '\000'
$50 = 0 '\000'
$51 = 1 '\001'
$52 = 1 '\001'
$53 = 1 '\001'
$54 = 0 '\000'

We did not build with CONFIG_NET_SOLINGER (or NET_TCP_WRITE_BUFFERS/NET_UDP_WRITE_BUFFERS for that matter), so I wondered why the code for recycling TCP connections did not do anything. Apparently, all of the structures are marked as having references:

(gdb) while ($i < $n)
 >p g_tcp_connections[$i].crefs
 >set $i = $i + 1
 >end
$55 = 1 '\001'
$56 = 1 '\001'
$57 = 1 '\001'
$58 = 1 '\001'
$59 = 1 '\001'
$60 = 1 '\001'
$61 = 1 '\001'
$62 = 1 '\001'
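The effect of those reference counts can be modeled with a small sketch. It assumes, as a simplification of what I observed, that a recycling pass only reclaims an entry that is closed and has no references. The names `struct entry`, `find_reclaimable`, `ST_CLOSED`, and `ST_ALLOCATED` are hypothetical and this is not the NuttX code; the numeric state values 0 and 1 are taken from the tcpstateflags dump above.

```c
#include <stddef.h>

/* State values as seen in the tcpstateflags dump; names are hypothetical. */
enum { ST_CLOSED = 0, ST_ALLOCATED = 1 };

/* Hypothetical, reduced view of a connection entry. */
struct entry {
    unsigned char state;  /* connection state */
    unsigned char crefs;  /* reference count held by sockets */
};

/* Return the index of the first closed entry that nobody references,
 * or -1 if every entry is still referenced or not closed.
 */
int find_reclaimable(const struct entry *tbl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].state == ST_CLOSED && tbl[i].crefs == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

Fed with the eight entries dumped above (five closed, three allocated, all with crefs equal to 1), this model finds nothing to reclaim, even though five entries are closed. Under that assumption, the stuck reference counts are what keep the pool exhausted.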

We have three daemons running that have open sockets. The first is telnetd, and ps shows no open telnet sessions. The second is a really simple web server that accepts a connection, returns either a webpage or a 404 depending on the request, and then closes the connection. The third is the serial bridge, which only ever maintains 1 open connection and will close it if a new connection arrives. I do not understand how we got into this state.

I have not yet confirmed that the issue is reproducible on either the current master or the latest stable release, but I looked through the commits to net/ since our snapshot of master was taken and I do not see anything that would address this. Here is a copy of the build's .config:

config.txt

I have so far refrained from trying to reproduce it since I did not want to lose the ability to poke around the RP2040's memory to understand what is going wrong. Given that this was caused by flaky hardware at the client machine talking to NuttX over the network, I am not sure if I can reproduce the exact sequence that caused this. I do have a few ideas on how to produce similar conditions, which I will try after filing this to give others a heads up that there is an issue in the TCP stack. Also, we are using the ENCX24J600 driver on the RP2040, which is not yet supported on master. I have patches for enabling that which I plan to upstream after I am sure that I did not make any mistakes in them.

On which OS does this issue occur?

[OS: Linux]

What is the version of your OS?

Ubuntu 20.04

NuttX Version

09bfaa7

Issue Architecture

[Arch: arm]

Issue Area

[Area: Networking]

Verification

  • I have verified before submitting the report.
