Skip to content

Connection reset during max-requests auto-restart with gthread #3038

@christianbundy

Description

@christianbundy

We're observing intermittent HTTP 502s in production, which seem to be correlated with the "autorestarting worker after current request" log line, and are less frequent as we increase max_requests. I've reproduced this on 21.2.0 and 20.1.0, but it doesn't seem to happen in 20.0.4.

I've produced a minimal reproduction case following the gunicorn.org example as closely as possible, but please let me know if there are other changes you'd recommend:

Application

def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    return iter([data])

Configuration

gunicorn --max-requests 10 --worker-class gthread myapp:app

Reproduction

#!/usr/bin/env bash

set -euo pipefail

# start gunicorn
gunicorn --max-requests 10 --log-level debug --worker-class gthread myapp:app &

# wait for gunicorn to be ready
sleep 5

# send 20 sequential HTTP requests with minimal output
for i in {1..20}; do
    curl -S -s -o /dev/null -w "[REPRO]: Received HTTP %{http_code} from GET /${i}\n" "127.0.0.1:8000/${i}" || true
done

Quickstart

For convenience, I've packaged this into a single command that consistently reproduces the problem on my machine. If you have Docker installed, this should Just Work™️.

cat <<DOCKER_EOF | docker build -t gunicorn-repro-max-requests-rst - && docker run --rm gunicorn-repro-max-requests-rst
FROM python:3.11

RUN pip install --disable-pip-version-check --no-cache gunicorn

RUN cat <<EOF > myapp.py
def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    return iter([data])
EOF

RUN cat <<EOF > repro.sh
#!/usr/bin/env bash

set -euo pipefail

# start gunicorn
gunicorn --max-requests 10 --log-level debug --worker-class gthread myapp:app &

# wait for gunicorn to be ready
sleep 5

# send 20 sequential HTTP requests with minimal output
for i in {1..20}; do
    curl -S -s -o /dev/null -w "[REPRO]: Received HTTP %{http_code} from GET /\\\${i}\n" "127.0.0.1:8000/\\\${i}" || true
done
EOF

RUN chmod +x repro.sh

CMD ["./repro.sh"]

DOCKER_EOF

Example

Logs

[2023-07-21 00:20:10 +0000] [6] [DEBUG] Current configuration:
  config: ./gunicorn.conf.py
  wsgi_app: None
  bind: ['127.0.0.1:8000']
  backlog: 2048
  workers: 1
  worker_class: gthread
  threads: 1
  worker_connections: 1000
  max_requests: 10
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  print_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 0
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  logconfig_json: None
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: myapp:app
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0xffff935672e0>
  on_reload: <function OnReload.on_reload at 0xffff93567420>
  when_ready: <function WhenReady.when_ready at 0xffff93567560>
  pre_fork: <function Prefork.pre_fork at 0xffff935676a0>
  post_fork: <function Postfork.post_fork at 0xffff935677e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0xffff93567920>
  worker_int: <function WorkerInt.worker_int at 0xffff93567a60>
  worker_abort: <function WorkerAbort.worker_abort at 0xffff93567ba0>
  pre_exec: <function PreExec.pre_exec at 0xffff93567ce0>
  pre_request: <function PreRequest.pre_request at 0xffff93567e20>
  post_request: <function PostRequest.post_request at 0xffff93567ec0>
  child_exit: <function ChildExit.child_exit at 0xffff92db8040>
  worker_exit: <function WorkerExit.worker_exit at 0xffff92db8180>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0xffff92db82c0>
  on_exit: <function OnExit.on_exit at 0xffff92db8400>
  ssl_context: <function NewSSLContext.ssl_context at 0xffff92db85e0>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2023-07-21 00:20:10 +0000] [6] [INFO] Starting gunicorn 21.2.0
[2023-07-21 00:20:10 +0000] [6] [DEBUG] Arbiter booted
[2023-07-21 00:20:10 +0000] [6] [INFO] Listening at: http://127.0.0.1:8000 (6)
[2023-07-21 00:20:10 +0000] [6] [INFO] Using worker: gthread
[2023-07-21 00:20:10 +0000] [8] [INFO] Booting worker with pid: 8
[2023-07-21 00:20:10 +0000] [6] [DEBUG] 1 workers
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /1
[REPRO]: Received HTTP 200 from GET /1
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /2
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /2
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /3
[REPRO]: Received HTTP 200 from GET /3
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /4
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /4
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /5
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /5
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /6
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /6
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /7
[REPRO]: Received HTTP 200 from GET /7
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /8
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /8
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /9
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /9
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /10
[2023-07-21 00:20:15 +0000] [8] [INFO] Autorestarting worker after current request.
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /10
curl: (56) Recv failure: Connection reset by peer
[2023-07-21 00:20:15 +0000] [8] [INFO] Worker exiting (pid: 8)
[REPRO]: Received HTTP 000 from GET /11
[2023-07-21 00:20:15 +0000] [22] [INFO] Booting worker with pid: 22
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /12
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /12
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /13
[REPRO]: Received HTTP 200 from GET /13
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /14
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /14
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /15
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /15
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /16
[REPRO]: Received HTTP 200 from GET /16
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /17
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /17
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /18
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /18
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /19
[REPRO]: Received HTTP 200 from GET /19
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /20
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /20

Expected

I'd expect to receive an HTTP 200 for each request, regardless of the max-requests restart. We should see [DEBUG] GET /11 when the worker handles the 11th request.

Actual

The reproduction script sends GET /11, but the worker never sees it, and we see a connection reset instead. The repro script reports a status code of 000, but that's just a quirk of libcurl. I've used tcpdump and can confirm the RST.

In case it's useful, I've also seen curl: (52) Empty reply from server, but it happens less frequently and I'm not 100% sure that it's the same problem.

Workaround

Increasing max-requests makes this happen less frequently, but the only way to resolve it is to disable max-requests (or maybe switch to a different worker type?). Increasing the number of workers or threads doesn't seem to resolve the problem either, from what I've seen.

Metadata

Metadata

Assignees

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions