Description
Product
BAML
Problem Statement / Use Case
In engine/baml-runtime/src/request/mod.rs, the reqwest client is currently hardcoded to disable connection pooling:

// Current implementation in baml-runtime
.pool_max_idle_per_host(0)
.pool_idle_timeout(std::time::Duration::from_nanos(1))

While I understand this is a defensive measure to prevent deadlocks when using Python's multiprocessing with the fork() start method (as noted in the comments referencing issues like reqwest#600 and hyper#2312), it imposes a significant performance penalty on users running in pure asyncio environments or those using spawn as their multiprocessing start method.
In my testing (macOS, asyncio), every single BAML call initiates a fresh TCP handshake and TLS negotiation. For cloud-based LLM endpoints (e.g., OpenAI), this adds significant unnecessary latency (~100ms-300ms depending on RTT) per request because HTTP Keep-Alive is effectively disabled.
The command line I'm using is:
sudo tcpdump -i any -n 'host <my-llm-provider> and tcp[tcpflags] & (tcp-syn) != 0'
When sending 10 requests with asyncio.Semaphore(2) controlling concurrency, I observed 10 unique SYN packets, confirming that a new connection is created for every single request:
# tcpdump snippet showing multiple SYN flags for a single batch of requests
14:32:21.180253 IP 192.168.x.x.56333 > 8.152.x.x.443: Flags [S], ...
14:32:21.205122 IP 192.168.x.x.56334 > 8.152.x.x.443: Flags [S], ...
In contrast, when sending the same 10 concurrent requests using Python's aiohttp library, only 2 SYN packets (handshakes) are observed, as the connections are properly pooled and reused.
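The effect is also easy to reproduce without tcpdump by counting accepted TCP connections on a local HTTP/1.1 server. The following stdlib-only sketch (an illustration of pooled vs. unpooled behavior, not BAML's actual HTTP stack) contrasts reusing one keep-alive connection with opening a fresh connection per request, which is what pool_max_idle_per_host(0) effectively forces:

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables keep-alive on the server side
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

class CountingServer(http.server.ThreadingHTTPServer):
    connections = 0  # each accept() == one TCP handshake
    def get_request(self):
        CountingServer.connections += 1
        return super().get_request()

server = CountingServer(("127.0.0.1", 0), Handler)
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

# Pooled: one persistent connection reused for all 10 requests.
conn = http.client.HTTPConnection(host, port)
for _ in range(10):
    conn.request("GET", "/")
    conn.getresponse().read()
conn.close()
pooled = CountingServer.connections

# Unpooled: a fresh connection (new handshake) for every request.
CountingServer.connections = 0
for _ in range(10):
    c = http.client.HTTPConnection(host, port)
    c.request("GET", "/")
    c.getresponse().read()
    c.close()
unpooled = CountingServer.connections
server.shutdown()

print(f"pooled: {pooled} handshake(s), unpooled: {unpooled}")
# pooled: 1 handshake(s), unpooled: 10
```

Against a remote TLS endpoint each of those extra handshakes also carries a full TLS negotiation, which is where the ~100ms-300ms per-request penalty comes from.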
Enabling pooling would drastically improve throughput and reduce the "time to first token" for high-frequency agentic workflows that rely on multiple sequential or parallel LLM calls.
Proposed Solution
I would like the ability to enable or configure connection pooling within the BAML client definition.
Ideally, the client block could accept a configuration property to opt-in to pooling:
// Example BAML suggestion
client<llm> MyLLM {
  provider "openai" // or "anthropic", "vertex", etc.
  options {
    model "gpt-4o"
    api_key env.OPENAI_API_KEY
  }
  // Allow users to opt-in to connection reuse
  enable_pooling true
}
Alternative Solutions
No response
Additional Context
No response