Problem
When Polymarket's WS endpoint is returning 429s or immediately resetting connections (e.g. during Cloudflare rate limiting), the reconnect loop in ConnectionManager::connection_loop burns 100% CPU on a single core.
perf profile (5 second sample on a live process):
```
38.54%  rustls_pki_types::base64::decode_public
 8.13%  rustls_pki_types::pem::from_buf_inner
 1.11%  rustls_pki_types::pem::read
 1.58%  aws_lc_0_37_0_p384_montjdouble
 0.69%  aws_lc_0_37_0_sha512_block_data_order_avx
```
~48% of CPU is parsing PEM certificates. Every reconnect creates a new TLS connection via connect_async, which re-reads and re-parses the entire system root cert store from /etc/ssl/certs/.
strace (2 second sample):
```
26.34%  read    14,204 calls
16.53%  statx    6,598 calls
16.03%  openat   6,528 calls
11.49%  close    6,547 calls
```
All filesystem I/O is cert file reads.
Root Cause
Two issues:
1. Backoff resets on "successful" connections that immediately die
In connection_loop (connection.rs):
```rust
match connect_async(&endpoint).await {
    Ok((ws_stream, _)) => {
        attempt = 0;
        backoff.reset(); // ← resets to initial_backoff (1s)
        if let Err(e) = Self::handle_connection(...).await {
            // Connection died immediately (RST, 429, etc.)
        }
    }
    ...
}
if let Some(duration) = backoff.next_backoff() {
    sleep(duration).await; // ← always ~1s because backoff was just reset
}
```
When the TCP+TLS handshake succeeds but the server immediately sends RST or closes the WS frame, connect_async returns Ok but handle_connection errors instantly. The backoff was already reset, so the retry starts from initial_backoff (1s) every time. The exponential growth never kicks in.
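A minimal, stdlib-only sketch of the gating logic proposed below (Backoff, MIN_STABLE, and should_reset_backoff are hypothetical names for illustration, not the SDK's API): the reset is driven by measured uptime, not by connect_async returning Ok.

```rust
use std::time::Duration;

/// Minimal exponential backoff; stands in for whatever the SDK uses.
struct Backoff {
    current: Duration,
    initial: Duration,
    max: Duration,
}

impl Backoff {
    fn new() -> Self {
        let initial = Duration::from_secs(1);
        Self { current: initial, initial, max: Duration::from_secs(60) }
    }

    fn reset(&mut self) {
        self.current = self.initial;
    }

    fn next_backoff(&mut self) -> Duration {
        let d = self.current;
        self.current = (self.current * 2).min(self.max); // exponential growth, capped
        d
    }
}

/// Hypothetical threshold: a connection must stay alive this long before
/// it counts as "stable" enough to reset the backoff.
const MIN_STABLE: Duration = Duration::from_secs(5);

/// Gate the reset on how long the connection actually lived.
fn should_reset_backoff(uptime: Duration) -> bool {
    uptime >= MIN_STABLE
}
```

In connection_loop this would mean recording Instant::now() once connect_async succeeds, and only calling backoff.reset() after handle_connection returns, if the elapsed time clears the threshold; an instantly-dying connection then keeps the exponential growth going.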
2. TLS root cert store not cached across connections
Each connect_async call constructs a new TLS connector, which reads and parses every PEM file in /etc/ssl/certs/. This turns a 1-second reconnect loop into a CPU-intensive operation.
Suggested Fix
- Don't reset backoff unless the connection was alive for a minimum duration (e.g. >5s). If handle_connection returns in <1s, treat it the same as a connection failure for backoff purposes.
- Cache the rustls::RootCertStore (or equivalent TLS config) and reuse it across reconnections via a shared TlsConnector.
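The caching half of the fix boils down to "build the TLS config once, share it behind an Arc". A stdlib-only sketch of that pattern (TlsConfig is a stand-in for the real Arc<rustls::ClientConfig>; the BUILDS counter exists only to show the expensive step runs once):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

// Counts how many times the expensive build runs (stands in for reading
// and parsing every PEM file under /etc/ssl/certs/).
static BUILDS: AtomicUsize = AtomicUsize::new(0);

// Stand-in for rustls::ClientConfig built from the system root store.
struct TlsConfig;

static TLS_CONFIG: OnceLock<Arc<TlsConfig>> = OnceLock::new();

/// Returns the shared TLS config, building it at most once per process.
fn tls_config() -> Arc<TlsConfig> {
    TLS_CONFIG
        .get_or_init(|| {
            BUILDS.fetch_add(1, Ordering::SeqCst);
            Arc::new(TlsConfig) // expensive: load + parse root certs (once)
        })
        .clone()
}
```

With tokio-tungstenite, the cached config can then be supplied on each reconnect via connect_async_tls_with_config with a prebuilt Connector (e.g. Connector::Rustls), so no retry touches /etc/ssl/certs/ again; the exact call depends on the TLS feature flags the SDK enables.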
Impact
On a 2-core AWS instance running a news-taker bot, this caused:
- 102% CPU on the pm-news-taker process
- 1,719 minutes of CPU time in 28 hours
- 138 threads, 210 open connections to Cloudflare
- 2,272 error log lines per minute
Environment
- polymarket-client-sdk 0.4.4
- rustls (via tokio-tungstenite)
- Linux 6.17, AWS EC2 (2 vCPU)