A precise profiling tool to benchmark Standard Tool Loading vs. Anthropic's Deferred Tool Search (Beta).
Building agents with 50+ tools introduces a critical trade-off: Context Bloat vs. Latency. This repo reverse-engineers the "Double Pass" behavior of server-side tool search to help engineers make data-driven architecture decisions.
Through empirical testing, this profiler confirms that Deferred Loading triggers a server-side loop:
- Pass 1: Model receives the user prompt + the `tool_search` tool only.
- Server Action: Anthropic executes a search against your deferred tool definitions.
- Pass 2: Model receives the original prompt + only the relevant tools found.
The Trade-off:
- Standard: Higher Cost (Input Tokens), Lower Latency.
- Deferred: Lower Cost (Tokens), Higher Latency (~1.5x - 2x).
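
To make the trade-off concrete, here is a minimal sketch of how one arm can be measured: wall-clock time around a single `messages.create` call plus the `usage` block the API returns. The `profile_standard` helper, the model string, and the argument names are illustrative, not this repo's actual code.

```python
# Minimal measurement sketch (illustrative; not the repo's actual profiler code).
import time

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def profile_standard(tools: list[dict], user_message: str) -> dict:
    """Send every tool definition up front and record latency + token usage."""
    start = time.perf_counter()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}],
        tools=tools,  # full definitions are billed as input tokens on every call
    )
    return {
        "latency_s": time.perf_counter() - start,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
```

The deferred arm is timed the same way; only the request shape differs (shown further down). The steps below install the profiler and run both arms side by side.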
- Clone and Install

  ```bash
  git clone https://github.com/yourusername/llm-tool-performance-audit.git
  cd llm-tool-performance-audit
  make install
  ```

- Configure Environment

  Create a `.env` file in the root directory (the sketch after this list shows how the key is picked up):

  ```
  ANTHROPIC_API_KEY=sk-ant-api03-...
  ```

- Run the Audit

  ```bash
  make run
  ```

  Optional: simulate a heavier load with 100 tools:

  ```bash
  make run-heavy
  ```
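
The Anthropic SDK reads `ANTHROPIC_API_KEY` from the environment, so the profiler only needs the variable exported or loaded from `.env`. A minimal loading sketch, assuming `python-dotenv` is installed (the repo's actual bootstrap may differ):

```python
# Sketch: load the API key from .env before constructing the client.
# Assumes python-dotenv; the repo's actual bootstrap may differ.
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()          # copies ANTHROPIC_API_KEY from .env into the process env
client = Anthropic()   # the SDK picks the key up from ANTHROPIC_API_KEY
```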
This project leverages the undocumented behavior of the `advanced-tool-use` beta. Here is how we implement the Deferred Profiler:

We must explicitly opt in to the 2025-11-20 beta and use the `tool_search_tool_bm25` tool type.
```python
# src/profilers/deferred.py
from anthropic import Anthropic

client = Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    # ⚠️ CRITICAL: the beta flag enables the server-side search loop
    betas=["advanced-tool-use-2025-11-20"],
    max_tokens=1024,
    messages=[{"role": "user", "content": user_message}],
    tools=[
        {
            "type": "tool_search_tool_bm25_20251119",  # the searcher
            "name": "tool_search",
        }
    ] + deferred_tools,  # tools marked with defer_loading=True
)
```
To get realistic metrics, we generate "heavy" tool definitions that mimic complex enterprise schemas.

```python
# src/utils.py
def generate_dummy_tools(count=50):
    """Generate `count` deferred tool definitions for the audit."""
    return [{
        "name": f"get_metric_{i}",
        "description": "Complex retrieval tool for specific user segments...",
        # A valid tool definition needs an input_schema; this minimal placeholder
        # stands in for the heavier enterprise-style schemas used in the audit.
        "input_schema": {
            "type": "object",
            "properties": {"segment_id": {"type": "string"}},
        },
        "defer_loading": True,  # <-- this triggers the search behavior
    } for i in range(count)]
```

The CLI uses `rich` to visualize the delta between architectures (a rendering sketch follows the sample output below):
```
                  Performance Comparison
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Metric       ┃ Standard ┃ Deferred (Search) ┃ Delta   ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Latency (s)  │ 2.105s   │ 3.850s            │ +1.745s │
│ Input Tokens │ 18,450   │ 1,240             │ -17,210 │
└──────────────┴──────────┴───────────────────┴─────────┘
```
Analysis: Deferred loading reduced token consumption by 93%, but increased latency by 82% due to the sequential inference passes.
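
For reference, here is a minimal sketch of how a table like the one above can be rendered with `rich`; the `render_comparison` function and its input dicts are illustrative, not the repo's actual CLI code.

```python
# Illustrative rich rendering sketch; the repo's actual CLI may differ.
from rich.console import Console
from rich.table import Table

def render_comparison(standard: dict, deferred: dict) -> None:
    """Print a Standard vs. Deferred comparison table to the terminal."""
    table = Table(title="Performance Comparison")
    table.add_column("Metric")
    table.add_column("Standard")
    table.add_column("Deferred (Search)")
    table.add_column("Delta")

    table.add_row(
        "Latency (s)",
        f"{standard['latency_s']:.3f}s",
        f"{deferred['latency_s']:.3f}s",
        f"+{deferred['latency_s'] - standard['latency_s']:.3f}s",
    )
    table.add_row(
        "Input Tokens",
        f"{standard['input_tokens']:,}",
        f"{deferred['input_tokens']:,}",
        f"{deferred['input_tokens'] - standard['input_tokens']:+,}",
    )
    Console().print(table)

# Example with the numbers from the sample run above:
render_comparison(
    {"latency_s": 2.105, "input_tokens": 18_450},
    {"latency_s": 3.850, "input_tokens": 1_240},
)
```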
PRs are welcome! Specifically looking for:
- Profiling for the Regex search tool variant.
- Cost estimation calculator based on current token prices.
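
As a rough starting point for the cost calculator, here is a sketch that converts the profiler's token counts into dollars. The per-million-token prices are placeholders, not current Anthropic pricing, and would need to be kept up to date.

```python
# Rough cost-estimation sketch. PRICES are placeholders, not current pricing;
# substitute the published per-million-token rates for the model you profile.
PRICES = {
    "input_per_mtok": 3.00,    # placeholder USD per 1M input tokens
    "output_per_mtok": 15.00,  # placeholder USD per 1M output tokens
}

def estimate_cost(input_tokens: int, output_tokens: int, prices: dict = PRICES) -> float:
    """Estimate the USD cost of a single profiled request."""
    return (input_tokens * prices["input_per_mtok"]
            + output_tokens * prices["output_per_mtok"]) / 1_000_000

# Example: the Standard run from the table above (output token count illustrative).
print(f"${estimate_cost(18_450, 300):.4f}")
```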
MIT © Alex