WS reconnect loop burns 100% CPU: backoff resets on connect-then-immediate-RST, TLS certs re-parsed every attempt #327

@yongkangc

Description

Problem

When Polymarket's WS endpoint is returning 429s or immediately resetting connections (e.g. during Cloudflare rate limiting), the reconnect loop in ConnectionManager::connection_loop burns 100% CPU on a single core.

perf profile (5 second sample on a live process):

38.54%  rustls_pki_types::base64::decode_public
 8.13%  rustls_pki_types::pem::from_buf_inner
 1.11%  rustls_pki_types::pem::read
 1.58%  aws_lc_0_37_0_p384_montjdouble
 0.69%  aws_lc_0_37_0_sha512_block_data_order_avx

~48% of CPU is parsing PEM certificates. Every reconnect creates a new TLS connection via connect_async, which re-reads and re-parses the entire system root cert store from /etc/ssl/certs/.

strace (2 second sample):

26.34%  read       14,204 calls
16.53%  statx       6,598 calls  
16.03%  openat      6,528 calls
11.49%  close       6,547 calls

All filesystem I/O is cert file reads.

Root Cause

Two issues:

1. Backoff resets on "successful" connections that immediately die

In connection_loop (connection.rs):

match connect_async(&endpoint).await {
    Ok((ws_stream, _)) => {
        attempt = 0;
        backoff.reset();  // ← resets to initial_backoff (1s)
        if let Err(e) = Self::handle_connection(...).await {
            // Connection died immediately (RST, 429, etc.)
        }
    }
    ...
}
if let Some(duration) = backoff.next_backoff() {
    sleep(duration).await;  // ← always ~1s because backoff was just reset
}

When the TCP+TLS handshake succeeds but the server immediately sends an RST or closes the WebSocket, connect_async returns Ok while handle_connection errors instantly. The backoff has already been reset, so every retry starts from initial_backoff (1s) and the exponential growth never kicks in.

2. TLS root cert store not cached across connections

Each connect_async call constructs a new TLS connector, which reads and parses every PEM file in /etc/ssl/certs/. This turns a 1-second reconnect loop into a CPU-intensive operation.

Suggested Fix

  1. Don't reset the backoff unless the connection stayed alive for a minimum duration (e.g. >5s). If handle_connection returns sooner than that threshold, treat it the same as a connection failure for backoff purposes.
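A minimal sketch of fix 1, stdlib-only. The `Backoff`, `MIN_STABLE`, and `on_disconnect` names are illustrative, not the SDK's actual types; the point is that `reset()` is gated on how long the connection survived:

```rust
use std::time::{Duration, Instant};

/// Illustrative exponential backoff; the SDK's real type is assumed to
/// expose `reset()` and `next_backoff()` as in the snippet above.
struct Backoff {
    current: Duration,
    initial: Duration,
    max: Duration,
}

impl Backoff {
    fn new(initial: Duration, max: Duration) -> Self {
        Self { current: initial, initial, max }
    }
    fn reset(&mut self) {
        self.current = self.initial;
    }
    fn next_backoff(&mut self) -> Duration {
        let delay = self.current;
        self.current = (self.current * 2).min(self.max);
        delay
    }
}

/// Connections that die faster than this are treated as failures:
/// the backoff keeps escalating instead of snapping back to 1s.
const MIN_STABLE: Duration = Duration::from_secs(5);

fn on_disconnect(backoff: &mut Backoff, connected_at: Instant) -> Duration {
    if connected_at.elapsed() >= MIN_STABLE {
        backoff.reset(); // connection was genuinely healthy
    }
    backoff.next_backoff()
}

/// Simulate `cycles` connect-then-immediate-RST events and collect the delays.
fn simulate_flapping(cycles: usize) -> Vec<Duration> {
    let mut backoff = Backoff::new(Duration::from_secs(1), Duration::from_secs(60));
    (0..cycles)
        .map(|_| on_disconnect(&mut backoff, Instant::now()))
        .collect()
}

fn main() {
    // Three instant deaths now yield 1s -> 2s -> 4s instead of 1s -> 1s -> 1s.
    println!("{:?}", simulate_flapping(3));
}
```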

  2. Cache the rustls::RootCertStore (or equivalent TLS config) and reuse it across reconnections via a shared TlsConnector.
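The caching itself can be as simple as a process-wide `OnceLock`. The sketch below uses a stand-in struct in place of `rustls::ClientConfig` to stay dependency-free; in the actual fix the cached `Arc<rustls::ClientConfig>` would be handed to tokio-tungstenite (e.g. via `Connector::Rustls` with `connect_async_tls_with_config`) so every reconnect reuses the already-parsed root store:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Counts how many times the expensive build runs (for demonstration).
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for rustls::ClientConfig. Building the real one is the
/// expensive step: reading and parsing every PEM in /etc/ssl/certs/.
struct TlsConfig;

fn build_tls_config() -> Arc<TlsConfig> {
    // In the real fix: populate a rustls::RootCertStore here, once.
    BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
    Arc::new(TlsConfig)
}

static TLS_CONFIG: OnceLock<Arc<TlsConfig>> = OnceLock::new();

/// Every reconnect calls this; only the first call pays the parse cost.
fn shared_tls_config() -> Arc<TlsConfig> {
    Arc::clone(TLS_CONFIG.get_or_init(build_tls_config))
}

fn main() {
    // Simulate 100 reconnect attempts: the root store is built exactly once.
    for _ in 0..100 {
        let _cfg = shared_tls_config();
    }
    println!("root store built {} time(s)", BUILD_COUNT.load(Ordering::SeqCst));
}
```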

Impact

On a 2-core AWS instance running a news-taker bot, this caused:

  • 102% CPU on the pm-news-taker process
  • 1,719 minutes of CPU time in 28 hours
  • 138 threads, 210 open connections to Cloudflare
  • 2,272 error log lines per minute

Environment

  • polymarket-client-sdk 0.4.4
  • rustls (via tokio-tungstenite)
  • Linux 6.17, AWS EC2 (2 vCPU)
