WS reconnect loop burns 100% CPU: backoff resets on connect-then-immediate-RST, TLS certs re-parsed every attempt #327

@yongkangc

Description

Problem

When Polymarket's WS endpoint is returning 429s or immediately resetting connections (e.g. during Cloudflare rate limiting), the reconnect loop in ConnectionManager::connection_loop burns 100% CPU on a single core.

perf profile (5 second sample on a live process):

38.54%  rustls_pki_types::base64::decode_public
 8.13%  rustls_pki_types::pem::from_buf_inner
 1.11%  rustls_pki_types::pem::read
 1.58%  aws_lc_0_37_0_p384_montjdouble
 0.69%  aws_lc_0_37_0_sha512_block_data_order_avx

~48% of CPU is parsing PEM certificates. Every reconnect creates a new TLS connection via connect_async, which re-reads and re-parses the entire system root cert store from /etc/ssl/certs/.

strace (2 second sample):

26.34%  read       14,204 calls
16.53%  statx       6,598 calls  
16.03%  openat      6,528 calls
11.49%  close       6,547 calls

All filesystem I/O is cert file reads.

Root Cause

Two issues:

1. Backoff resets on "successful" connections that immediately die

In connection_loop (connection.rs):

match connect_async(&endpoint).await {
    Ok((ws_stream, _)) => {
        attempt = 0;
        backoff.reset();  // ← resets to initial_backoff (1s)
        if let Err(e) = Self::handle_connection(...).await {
            // Connection died immediately (RST, 429, etc.)
        }
    }
    ...
}
if let Some(duration) = backoff.next_backoff() {
    sleep(duration).await;  // ← always ~1s because backoff was just reset
}

When the TCP+TLS handshake succeeds but the server immediately sends an RST or closes the WebSocket, connect_async returns Ok while handle_connection errors instantly. The backoff has already been reset, so every retry starts from initial_backoff (1s) and the exponential growth never kicks in.

2. TLS root cert store not cached across connections

Each connect_async call constructs a new TLS connector, which reads and parses every PEM file in /etc/ssl/certs/. This turns a 1-second reconnect loop into a CPU-intensive operation.

Suggested Fix

  1. Don't reset the backoff unless the connection stayed alive for a minimum duration (e.g. >5s). If handle_connection returns sooner than that threshold, treat it the same as a connection failure for backoff purposes.
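A minimal sketch of fix 1, stdlib-only. The `Backoff`, `MIN_STABLE`, and `on_disconnect` names are illustrative, not the SDK's actual types; the point is that `reset()` is gated on how long the connection survived:

```rust
use std::time::{Duration, Instant};

/// Illustrative exponential backoff; the SDK's real type is assumed to
/// expose `reset()` and `next_backoff()` as in the snippet above.
struct Backoff {
    current: Duration,
    initial: Duration,
    max: Duration,
}

impl Backoff {
    fn new(initial: Duration, max: Duration) -> Self {
        Self { current: initial, initial, max }
    }
    fn reset(&mut self) {
        self.current = self.initial;
    }
    fn next_backoff(&mut self) -> Duration {
        let delay = self.current;
        self.current = (self.current * 2).min(self.max);
        delay
    }
}

/// Connections that die faster than this are treated as failures:
/// the backoff keeps escalating instead of snapping back to 1s.
const MIN_STABLE: Duration = Duration::from_secs(5);

fn on_disconnect(backoff: &mut Backoff, connected_at: Instant) -> Duration {
    if connected_at.elapsed() >= MIN_STABLE {
        backoff.reset(); // connection was genuinely healthy
    }
    backoff.next_backoff()
}

/// Simulate `cycles` connect-then-immediate-RST events and collect the delays.
fn simulate_flapping(cycles: usize) -> Vec<Duration> {
    let mut backoff = Backoff::new(Duration::from_secs(1), Duration::from_secs(60));
    (0..cycles)
        .map(|_| on_disconnect(&mut backoff, Instant::now()))
        .collect()
}

fn main() {
    // Three instant deaths now yield 1s -> 2s -> 4s instead of 1s -> 1s -> 1s.
    println!("{:?}", simulate_flapping(3));
}
```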

  2. Cache the rustls::RootCertStore (or equivalent TLS config) and reuse it across reconnections via a shared TlsConnector.
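The caching itself can be as simple as a process-wide `OnceLock`. The sketch below uses a stand-in struct in place of `rustls::ClientConfig` to stay dependency-free; in the actual fix the cached `Arc<rustls::ClientConfig>` would be handed to tokio-tungstenite (e.g. via `Connector::Rustls` with `connect_async_tls_with_config`) so every reconnect reuses the already-parsed root store:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Counts how many times the expensive build runs (for demonstration).
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for rustls::ClientConfig. Building the real one is the
/// expensive step: reading and parsing every PEM in /etc/ssl/certs/.
struct TlsConfig;

fn build_tls_config() -> Arc<TlsConfig> {
    // In the real fix: populate a rustls::RootCertStore here, once.
    BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
    Arc::new(TlsConfig)
}

static TLS_CONFIG: OnceLock<Arc<TlsConfig>> = OnceLock::new();

/// Every reconnect calls this; only the first call pays the parse cost.
fn shared_tls_config() -> Arc<TlsConfig> {
    Arc::clone(TLS_CONFIG.get_or_init(build_tls_config))
}

fn main() {
    // Simulate 100 reconnect attempts: the root store is built exactly once.
    for _ in 0..100 {
        let _cfg = shared_tls_config();
    }
    println!("root store built {} time(s)", BUILD_COUNT.load(Ordering::SeqCst));
}
```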

Impact

On a 2-core AWS instance running a news-taker bot, this caused:

  • 102% CPU on the pm-news-taker process
  • 1,719 minutes of CPU time in 28 hours
  • 138 threads, 210 open connections to Cloudflare
  • 2,272 error log lines per minute

Environment

  • polymarket-client-sdk 0.4.4
  • rustls (via tokio-tungstenite)
  • Linux 6.17, AWS EC2 (2 vCPU)
