
[feat] Make Connection Pooling configurable #3072

@flowjzh

Description


Product

BAML

Problem Statement / Use Case

In engine/baml-runtime/src/request/mod.rs, the reqwest client is currently hardcoded to disable connection pooling:

// Current implementation in baml-runtime
.pool_max_idle_per_host(0)
.pool_idle_timeout(std::time::Duration::from_nanos(1))

While I understand this is a defensive measure to prevent deadlocks when using Python's multiprocessing with the fork() start method (as noted in the comments referencing issues like reqwest#600 and hyper#2312), it imposes a significant performance penalty on users running in pure asyncio environments or those using spawn as their multiprocessing start method.
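To make the "spawn" case concrete: a process started via a spawn context gets a fresh interpreter rather than an fd/lock snapshot of the parent, so the fork-related deadlock the comments guard against does not apply. A minimal stdlib sketch (illustrative only, not BAML code):

```python
import multiprocessing as mp

# A "spawn" context starts child processes with a fresh interpreter
# instead of fork()ing, so no locks held inside the parent's HTTP
# runtime (the reqwest#600 / hyper#2312 scenario) are inherited.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())  # spawn
```

Users in this situation (or in pure asyncio, with no subprocesses at all) pay the pooling penalty without getting any safety in return.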

In my testing (macOS, asyncio), every single BAML call initiates a fresh TCP handshake and TLS negotiation. For cloud-based LLM endpoints (e.g., OpenAI), this adds significant unnecessary latency (~100ms-300ms depending on RTT) per request because HTTP Keep-Alive is effectively disabled.
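As a back-of-the-envelope check on that claim (the RTT value and TLS round-trip count below are assumptions for illustration, not measurements):

```python
def added_latency_ms(n_requests: int, rtt_ms: float, tls_round_trips: int = 2) -> float:
    """Pure handshake overhead when every request opens a new connection.

    Each fresh connection pays ~1 RTT for the TCP handshake plus
    1-2 RTTs for TLS (TLS 1.3 vs. older versions / no resumption).
    A pooled (keep-alive) connection pays this cost only once.
    """
    per_handshake_ms = rtt_ms * (1 + tls_round_trips)
    return n_requests * per_handshake_ms

# 10 sequential calls at an assumed 50 ms RTT:
print(added_latency_ms(10, 50.0))  # 1500.0 ms of avoidable handshake time
```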

The command line I'm using is:

sudo tcpdump -i any -n 'host <my-llm-provider> and tcp[tcpflags] & (tcp-syn) != 0'

When sending 10 requests with asyncio.Semaphore(2) to control concurrency, I observed 10 unique SYN packets, confirming that a new connection is created for every single request:

# tcpdump snippet showing multiple SYN flags for a single batch of requests
14:32:21.180253 IP 192.168.x.x.56333 > 8.152.x.x.443: Flags [S], ...
14:32:21.205122 IP 192.168.x.x.56334 > 8.152.x.x.443: Flags [S], ...

In contrast, when sending the same 10 concurrent requests using Python's aiohttp library, only 2 SYN packets (handshakes) are observed, as the connections are properly pooled and reused.
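The two observations fit a simple model: without pooling, handshakes scale with request count; with pooling, they are bounded by the concurrency limit. A toy counter capturing this (my mental model of the tcpdump results, not BAML code):

```python
def handshakes(n_requests: int, concurrency: int, pooled: bool) -> int:
    """Count TCP/TLS handshakes for n_requests issued under a
    concurrency cap (e.g. asyncio.Semaphore(concurrency)).

    Unpooled: every request opens a fresh connection.
    Pooled:   at most `concurrency` connections exist at once and
              idle ones are reused via HTTP keep-alive.
    """
    if pooled:
        return min(n_requests, concurrency)
    return n_requests

print(handshakes(10, 2, pooled=False))  # 10 SYNs, as with the current client
print(handshakes(10, 2, pooled=True))   # 2 SYNs, as with aiohttp
```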

Enabling pooling would drastically improve throughput and reduce the "time to first token" for high-frequency agentic workflows that rely on multiple sequential or parallel LLM calls.

Proposed Solution

I would like the ability to enable or configure connection pooling within the BAML client definition.

Ideally, the client block could accept a configuration property to opt-in to pooling:

// Example BAML suggestion
client<llm> MyLLM {
  provider "openai" // or "anthropic", "vertex", etc.
  options {
    model "gpt-4o"
    api_key env.OPENAI_API_KEY
  }
  // Allow users to opt-in to connection reuse
  enable_pooling true
}

Alternative Solutions

No response

Additional Context

No response
