recv() returns EOF and silently drops received data when the peer's data and FIN are coalesced #1367

@russellromney

What happens

When a peer sends a small reply and closes the connection in the same breath, the reply data and the connection close (FIN) arrive at the host socket together. Quark reports this to the application as a single "socket closed" event and returns EOF from recv() — skipping the data that is already sitting in the socket. The host kernel received and ACKed those bytes (confirmed by packet capture below), but the application never sees them.

In short: Quark treats "the peer closed" as "there is nothing left to read", when a close actually means "nothing more after what is already buffered". On short loopback connections this loses the reply roughly half the time (9 of 16 attempts in the results below).
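
For reference, the expected stream semantics can be checked on the host kernel without any container runtime. The sketch below is illustrative only: it uses `socket.socketpair()` as a stand-in for the loopback TCP connection (an assumption, chosen so that the data and the close are guaranteed to be pending together), and the function name `reply_then_eof` is hypothetical.

```python
import socket


def reply_then_eof():
    """Peer writes a reply and closes at once; the reader must still get the reply."""
    # Assumption: a connected stream socket pair stands in for the TCP connection.
    client, server = socket.socketpair()
    with client:
        server.sendall(b"echo:ping\n")  # reply is queued for the client
        server.close()                  # close follows immediately, like data + FIN
        first = client.recv(64)         # buffered data must be delivered first
        second = client.recv(64)        # only then does recv() report EOF
    return first, second


if __name__ == "__main__":
    data, eof = reply_then_eof()
    print("first recv :", data)
    print("second recv:", eof)
```

On a conforming stack the first `recv()` returns the reply and only the second returns an empty result. The failure described here is the first `recv()` returning the empty result instead.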

This is Quark-specific: runc (host-kernel TCP) and gVisor runsc (its own userspace netstack) are 100% correct on the same host, server, client, and image.

Environment

  • Host: DigitalOcean droplet (KVM), Debian 13 (trixie), kernel 6.12.88+deb13-amd64, x86_64.
  • Runtime: qvisor 0.6.0 built from commit 0992988b. (The build's only source changes are a #![feature(sync_unsafe_cell)] gate and a pub mod tee; declaration, both only needed to compile that commit — no networking changes.)
  • Config: EnableTsot:false, EnableRDMA:false, UringIO:true, AsyncAccept:true.
  • docker run --runtime=quark --network host, image alpine:latest (also reproduced with python:3.13-slim).

Reproduction

The reproduction uses a host echo server (one variant closes immediately after replying, the other sleeps 50 ms before closing) and a small static C client that connects, sends 5 bytes, and calls recv() once.

#!/usr/bin/env bash
set -eu
DIR=$(mktemp -d); cd "$DIR"
PORT_NOW=47020; PORT_DELAY=47021; N=16; IMG=alpine:latest

cat > cli.c <<'C'
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
int main(int c, char**v){
  int fd=socket(AF_INET,SOCK_STREAM,0);
  struct timeval tv={5,0}; setsockopt(fd,SOL_SOCKET,SO_RCVTIMEO,&tv,sizeof tv);
  struct sockaddr_in a; memset(&a,0,sizeof a);
  a.sin_family=AF_INET; a.sin_port=htons(atoi(v[1])); inet_pton(AF_INET,"127.0.0.1",&a.sin_addr);
  if(connect(fd,(struct sockaddr*)&a,sizeof a)){puts("CONNFAIL");return 0;}
  send(fd,"ping\n",5,0);
  char b[64]; ssize_t n=recv(fd,b,sizeof b-1,0);
  puts(n>0?"OK":"EMPTY");   /* EMPTY = data lost */
  return 0;
}
C
cc -static -O2 -o cli cli.c

cat > srv.py <<'PY'
import socket,sys,threading,time
port=int(sys.argv[1]); delay=float(sys.argv[2])
s=socket.socket(); s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1)
s.bind(("127.0.0.1",port)); s.listen(128)
def h(c):
    try:
        f=c.makefile("rwb"); line=f.readline()
        if line: f.write(b"echo:"+line); f.flush()
        if delay: time.sleep(delay)
    finally: c.close()
while True:
    c,_=s.accept(); threading.Thread(target=h,args=(c,),daemon=True).start()
PY
python3 srv.py $PORT_NOW   0    & python3 srv.py $PORT_DELAY 0.05 &
sleep 1

cell(){ ok=0; for i in $(seq 1 $N); do
  r=$(docker run --rm --runtime="$1" --network host --entrypoint "" -v "$DIR:/p" $IMG /p/cli "$2" 2>/dev/null)
  [ "$r" = OK ] && ok=$((ok+1)); done
  echo "  $1 close=$3: $ok/$N ok"; }

for rt in runc runsc quark; do
  echo "runtime: $rt"; cell "$rt" $PORT_NOW immediate; cell "$rt" $PORT_DELAY delay50ms
done

Results (n=16 per cell, identical host/server/client/image)

                  close=immediate   close=delay50ms
runc  (host TCP)    16/16 ok          16/16 ok
runsc (gVisor)      16/16 ok          16/16 ok
quark                7/16 ok          16/16 ok

Two points:

  1. The failure is Quark-specific: runc and gVisor's own userspace netstack are both 16/16 on the same host.
  2. The trigger is the coalesced data + FIN: making the server sleep 50 ms before closing (so the FIN is separated from the data) brings Quark to 16/16.

Packet capture (host lo, during a failing batch)

The wire is clean: every connection completes SYN / SYN-ACK / ACK / PSH(data) / ACK / PSH(reply) / ACK / FIN, with 0 RST and 0 retransmits, and the client ACKs the reply. So the host kernel received and ACKed the bytes — the loss is in Quark's delivery of that data to the application.

Likely cause

The reply is received correctly — an io_uring read completion returns the bytes, and a second completion returns 0 for the FIN. The loss is a race between the data and the close: the read-closed flag is set with a bare atomic store (rClosed.store(..., Release)), not under the lock that guards the receive buffer, while the data is produced under that lock. A receiver holding the buffer lock can therefore observe an empty buffer together with read-closed and return EOF, abandoning data that is produced a moment later. Reads are also armed from two places (the read-completion's resubmit and the consumer's re-arm after a consume), which allows the data and EOF completions to be processed concurrently.
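
The suspected interleaving can be modeled in a few lines. This is a simplified sketch of the hypothesis above, not Quark's code: the class `RacyRecvBuffer`, its methods, and the function `losing_interleaving` are all hypothetical names, and the "completions" are plain method calls made in the losing order on purpose.

```python
import threading
from collections import deque


class RacyRecvBuffer:
    """Model: data is guarded by a lock, the read-closed flag is not."""

    def __init__(self):
        self.lock = threading.Lock()
        self.chunks = deque()
        self.read_closed = False  # written without taking self.lock

    def on_data(self, data):
        with self.lock:
            self.chunks.append(data)

    def on_eof(self):
        # Bare store, analogous to the rClosed.store(..., Release) described above.
        self.read_closed = True

    def recv(self):
        with self.lock:
            if self.chunks:
                return self.chunks.popleft()
            if self.read_closed:
                return b""   # EOF reported to the application
            return None      # would block


def losing_interleaving():
    buf = RacyRecvBuffer()
    buf.on_eof()                 # the EOF completion is handled first
    seen = buf.recv()            # reader sees "empty + closed" and returns EOF
    buf.on_data(b"echo:ping\n")  # the data is produced a moment later
    return seen, len(buf.chunks)
```

In this model `losing_interleaving()` returns EOF to the caller while one chunk is left stranded in the buffer, which matches the observed symptom of an ACKed reply that the application never reads.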

A correct stream socket must never let the read path observe "closed" without also observing all data that arrived before the FIN. Applying and checking the read-closed state under the same lock that guards the receive buffer (and keeping socket reads ordered so data is always surfaced before EOF) would close the window.
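
A minimal sketch of that invariant follows, assuming a single producer that applies completions in arrival order. The class `OrderedRecvBuffer` and its method names are hypothetical and only illustrate the locking discipline. They are not a patch against Quark.

```python
import threading
from collections import deque


class OrderedRecvBuffer:
    """Buffer and read-closed flag share one lock; data is always drained before EOF."""

    def __init__(self):
        self._lock = threading.Lock()
        self._chunks = deque()
        self._read_closed = False

    def push(self, data):
        with self._lock:
            # Assumption: completions are applied in order, so data after EOF is a bug.
            if self._read_closed:
                raise RuntimeError("data completion applied after EOF")
            self._chunks.append(data)

    def close_read(self):
        with self._lock:
            self._read_closed = True

    def recv(self, n):
        with self._lock:
            if self._chunks:
                chunk = self._chunks.popleft()
                head, tail = chunk[:n], chunk[n:]
                if tail:
                    self._chunks.appendleft(tail)  # keep the unread remainder
                return head
            if self._read_closed:
                return b""   # EOF only once the buffer is empty
            return None      # would block: no data yet, peer still open
```

With this shape, a reader that takes the lock sees either the buffered data or, once the buffer is drained, the closed state. The ordering assumption matters just as much as the lock: if the EOF completion could still be applied before the data completion, `push` would raise rather than lose the bytes silently.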
