Description
Product
BAML
Problem Statement / Use Case
In engine/baml-runtime/src/request/mod.rs, the reqwest client is currently hardcoded to disable connection pooling:

// Current implementation in baml-runtime
.pool_max_idle_per_host(0)
.pool_idle_timeout(std::time::Duration::from_nanos(1))

While I understand this is a defensive measure to prevent deadlocks when using Python's multiprocessing with the fork() start method (as noted in the comments referencing issues like reqwest#600 and hyper#2312), it imposes a significant performance penalty on users running in pure asyncio environments or those using spawn as their multiprocessing start method.
In my testing (macOS, asyncio), every single BAML call initiates a fresh TCP handshake and TLS negotiation. For cloud-based LLM endpoints (e.g., OpenAI), this adds significant unnecessary latency (~100ms-300ms depending on RTT) per request because HTTP Keep-Alive is effectively disabled.
The command line I'm using is:
sudo tcpdump -i any -n 'host <my-llm-provider> and tcp[tcpflags] & (tcp-syn) != 0'
When sending 10 requests with asyncio.Semaphore(2) controlling concurrency, I observed 10 unique SYN packets, confirming that a new connection is created for every single request:
# tcpdump snippet showing multiple SYN flags for a single batch of requests
14:32:21.180253 IP 192.168.x.x.56333 > 8.152.x.x.443: Flags [S], ...
14:32:21.205122 IP 192.168.x.x.56334 > 8.152.x.x.443: Flags [S], ...
In contrast, when sending the same 10 concurrent requests using Python's aiohttp library, only 2 SYN packets (handshakes) are observed, as the connections are properly pooled and reused.
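The effect is also easy to reproduce without tcpdump by counting accepted TCP connections on a local HTTP/1.1 server. The following stdlib-only sketch (an illustration of pooled vs. unpooled behavior, not BAML's actual HTTP stack) contrasts reusing one keep-alive connection with opening a fresh connection per request, which is what pool_max_idle_per_host(0) effectively forces:

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables keep-alive on the server side
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

class CountingServer(http.server.ThreadingHTTPServer):
    connections = 0  # each accept() == one TCP handshake
    def get_request(self):
        CountingServer.connections += 1
        return super().get_request()

server = CountingServer(("127.0.0.1", 0), Handler)
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

# Pooled: one persistent connection reused for all 10 requests.
conn = http.client.HTTPConnection(host, port)
for _ in range(10):
    conn.request("GET", "/")
    conn.getresponse().read()
conn.close()
pooled = CountingServer.connections

# Unpooled: a fresh connection (new handshake) for every request.
CountingServer.connections = 0
for _ in range(10):
    c = http.client.HTTPConnection(host, port)
    c.request("GET", "/")
    c.getresponse().read()
    c.close()
unpooled = CountingServer.connections
server.shutdown()

print(f"pooled: {pooled} handshake(s), unpooled: {unpooled}")
# pooled: 1 handshake(s), unpooled: 10
```

Against a remote TLS endpoint each of those extra handshakes also carries a full TLS negotiation, which is where the ~100ms-300ms per-request penalty comes from.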
Enabling pooling would drastically improve throughput and reduce the "time to first token" for high-frequency agentic workflows that rely on multiple sequential or parallel LLM calls.
Proposed Solution
I would like the ability to enable or configure connection pooling within the BAML client definition.
Ideally, the client block could accept a configuration property to opt-in to pooling:
// Example BAML suggestion
client<llm> MyLLM {
  provider "openai" // or "anthropic", "vertex", etc.
  options {
    model "gpt-4o"
    api_key env.OPENAI_API_KEY
  }
  // Allow users to opt-in to connection reuse
  enable_pooling true
}
Alternative Solutions
No response
Additional Context
No response