A comprehensive benchmark for evaluating large language models' ability to discover, retrieve, and use tools effectively in complex multi-step reasoning tasks. This repository contains the code and infrastructure for our research on tool-use capabilities in LLMs.
HOHW addresses the challenge of evaluating how well large language models can work with tools in realistic scenarios. Rather than providing all available tools upfront, our benchmark requires models to:
- Search and discover relevant tools from a large corpus
- Understand tool documentation and functionality
- Compose multiple tool calls to solve complex tasks
- Handle tool failures and adversarial conditions
The benchmark is built on database querying tasks derived from the Spider dataset, where each database operation is abstracted into callable functions with natural language documentation.
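For illustration, a tool in the corpus looks roughly like the sketch below. The function name, signature, SQL, and database path are hypothetical, chosen only to show the shape of an abstracted, documented database operation; the actual generated tools live in `sandbox_utils.py` and differ in detail.

```python
# Hypothetical example of an abstracted database tool. The name, parameters,
# schema, and database path are illustrative assumptions, not dataset content.
import sqlite3

def get_movies_by_director(director_name: str) -> list[tuple]:
    """Return the title and release year of every movie directed by the given director."""
    conn = sqlite3.connect("spider_data/database/film/film.sqlite")  # hypothetical path
    try:
        cur = conn.execute(
            "SELECT title, release_year FROM movie WHERE director = ?",
            (director_name,),
        )
        return cur.fetchall()
    finally:
        conn.close()
```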
- Dynamic Tool Discovery: Models must search for and retrieve relevant tools from a corpus of 1000+ functions
- Realistic Tool Documentation: Tools are documented with natural language descriptions, not raw SQL
- Multi-step Reasoning: Tasks require composing multiple tool calls in sequence
- Adversarial Evaluation: Optional tool failure simulation to test robustness
- Multiple LLM Support: Compatible with OpenAI, Google Gemini, and local vLLM deployments
- Comprehensive Metrics: Tracks success rates, tool usage patterns, and error analysis
- CodeAct agent:
  - Multi-turn conversational agent
  - Code generation and execution capabilities
  - Error handling and recovery mechanisms
  - Support for multiple LLM backends
- Tool retrieval:
  - Semantic similarity search using sentence transformers
  - Configurable retrieval models (GTE-Qwen, NV-Embed)
  - Dynamic corpus filtering and ranking
- Adversarial evaluation:
  - Configurable tool failure simulation
  - Multiple failure strategies (block_first, block_simple)
  - Robustness testing under degraded conditions
- Human evaluation interface:
  - Interactive evaluation environment
  - Side-by-side comparison capabilities
  - Manual verification of model outputs
The system consists of several key components:
- Agent: The LLM being evaluated (CodeAct framework)
- Search Engine: Tool retrieval system using semantic embedding search
- Sandbox Environment: Secure execution environment for generated code
- Controller: Manages database access and query execution
- Adversary: Optional component for simulating tool failures
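To make the Search Engine concrete, here is a minimal retrieval sketch using sentence-transformers with the default retriever from `configs/defaults.yaml`. The mini-corpus, prompt handling, and ranking details are simplified assumptions; the benchmark's actual search logic lives in `sandbox_server/search_api.py`.

```python
# Simplified sketch of semantic tool retrieval: embed tool documentation,
# embed the query, and return the top-k most similar tools.
from sentence_transformers import SentenceTransformer, util

# Hypothetical mini-corpus of tool descriptions (the real corpus has 1000+ entries)
tool_docs = {
    "function_101": "Return all movies released after a given year.",
    "function_102": "Count the number of students enrolled in each course.",
    "function_103": "List all movies directed by a given director.",
}

# Default retriever from the config; trust_remote_code is needed for this model
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

corpus_embeddings = model.encode(list(tool_docs.values()), convert_to_tensor=True)
query_embedding = model.encode("find movies by director", convert_to_tensor=True)

# Rank tools by cosine similarity and keep the top k (k=9 in the default config)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    name = list(tool_docs)[hit["corpus_id"]]
    print(name, round(hit["score"], 3))
```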
- Python 3.8+
- CUDA-compatible GPU (recommended for retrieval models)
- Spider dataset
- Clone the repository:

```bash
git clone <repository-url>
cd hell-or-high-water
```

- Install dependencies:

```bash
# TODO: Create requirements.txt for easier installation
pip install torch transformers datasets sentence-transformers
pip install sqlglot flask requests openai google-generativeai
pip install pyyaml tqdm jinja2
pip install jupyter_kernel_gateway
```

- Install the local retriever package:

```bash
cd retriever_wrapper
pip install --editable .
cd ..
```

- Download the Spider dataset and place it in the appropriate directory structure.
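Optionally, run a quick sanity check that the core packages import and a GPU is visible (a minimal sketch; nothing here is specific to the benchmark):

```python
# Quick environment sanity check: core imports and GPU visibility.
import torch
import transformers
import sentence_transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
```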
Run an evaluation on the benchmark:

```bash
python -m codeact.run --config_file configs/your_config.yaml
```

For different model types:

```bash
# OpenAI models
python -m codeact.run_openai --config_file configs/openai_config.yaml

# Google Gemini models
python -m codeact.run_google --config_file configs/gemini_config.yaml

# Human evaluation interface
python -m human.run --config_file configs/human_config.yaml
```

All runtime configuration is specified in YAML config files. See `configs/defaults.yaml` for a template:
```yaml
data:
  dataset_path:          # Path to processed dataset
  split: train/validation/test
  corpus_path:           # Path to tool corpus embeddings
  documents_path:        # Path to tool documentation JSON

variables:
  turns: 20              # Max function calls allowed
  k: 9                   # Number of tools to retrieve per search
  use_gold: false        # Whether to use oracle tool retrieval

search:
  retriever_path: Alibaba-NLP/gte-Qwen2-1.5B-instruct
  retriever_device: cuda

generation_configs:
  model_name:            # Model identifier
  temperature: 0.0
  max_tokens: 1024
  random_state: 42

llm_service:
  url:                   # Inference server endpoint
  openai_api_key:        # API key for OpenAI/compatible services

controller:
  url: localhost
  port: 8000
  db_base_path: spider_data/database

sandbox:
  url: localhost
  port: 5000

adversary:
  enabled: false         # Enable adversarial evaluation
  strategy: block_first  # Tool failure strategy
```

Use the config generator for quick setup:
```bash
python make_config.py configs/defaults.yaml configs/my_config.yaml \
    --model_name gpt-4 \
    --variables_k 5 \
    --adversary_enabled
```

The benchmark includes tools for processing the Spider dataset into the tool-use format:
Convert Spider queries into parameterized tool functions:

```bash
cd dataset
python preprocess.py --config_file configs/defaults.yaml
```

Create the executable simulation environment:

```bash
cd dataset
python make_simulation.py --config_file configs/defaults.yaml
```

Filter and augment the dataset:

```bash
cd dataset
python postprocessing.py --config_file configs/defaults.yaml
```

The evaluation system uses the CodeAct sandbox architecture and requires three main servers to operate:
- Controller Server (`api.py`) - Orchestrates evaluation sessions and manages communication (runs locally)
- Jupyter Kernel Gateway Server - Provides isolated code execution environments (runs on a remote VM)
- Sandbox Server (`sandbox_server/search_api.py`) - Handles tool search and documentation retrieval (runs on a remote VM)
The controller server orchestrates the evaluation process and manages communication between components.
- Session Management: Handles multiple concurrent evaluation sessions with automatic cleanup
- Jupyter Kernel Coordination: Creates and manages connections to Jupyter kernel gateway
- Communication Hub: Routes messages between agents and execution environments
- Lifecycle Management: Handles kernel creation, cleanup, and timeout management
Key files:

```
api.py              # Main controller server
jupyter_helper.py   # Jupyter kernel management utilities
environment.py      # Game environment and execution logic
```
```bash
# Start the controller server on port 8000
python api.py --port 8000

# Or pick a different port (the default is 8088)
python api.py --port 9000
```

The controller server accepts these parameters:

- `--port`: Port number for the server (default: 8088)

Environment variables:

- `DEBUG=True`: Enable debug logging
- `CLEANUP_TIMEOUT_MS=60000`: Kernel cleanup interval in milliseconds
The kernel gateway server provides the execution environment for agent-generated code using Jupyter's kernel gateway. This server runs in a remote VM to provide isolation and security for code execution.
- Jupyter Kernel Management: Creates and manages isolated Python execution environments per conversation
- Code Execution: Executes agent-generated Python code safely in sandboxed kernels
- Database Access: Provides access to Spider databases for SQL query execution
- WebSocket Communication: Handles real-time communication with evaluation clients
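The evaluation client (`jupyter_helper.py`) manages kernels for you, but you can sanity-check a running gateway with the standard Jupyter REST API it exposes. A minimal sketch, assuming the gateway is reachable at `<remote-vm-ip>:8000`:

```python
# Connectivity check against the kernel gateway's standard Jupyter REST API.
# The host below is a placeholder; substitute your remote VM's address.
import requests

BASE = "http://<remote-vm-ip>:8000"

# Create a Python kernel and read back its id
resp = requests.post(f"{BASE}/api/kernels", json={"name": "python3"})
resp.raise_for_status()
kernel_id = resp.json()["id"]
print("Created kernel:", kernel_id)

# Code execution itself happens over the WebSocket channel at
# ws://<remote-vm-ip>:8000/api/kernels/<kernel_id>/channels (handled by jupyter_helper.py).

# Clean up the test kernel
requests.delete(f"{BASE}/api/kernels/{kernel_id}")
```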
The gateway's working directory on the remote VM needs:

```
sandbox_utils.py       # Generated tool functions (from dataset processing)
spider_data/database/  # Spider dataset databases
```
- Install Jupyter Kernel Gateway (on the remote VM):

```bash
pip install jupyter_kernel_gateway
```

- Start the Kernel Gateway Server (on the remote VM):

```bash
jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket --port=8000
```

The kernel gateway server accepts these parameters:

- `--port`: Port number for the server (default: 8888)
- `--KernelGatewayApp.api=kernel_gateway.jupyter_websocket`: Use the WebSocket API for communication
- Additional Jupyter configuration options can be set via environment variables or config files
The sandbox server provides tool search and documentation retrieval capabilities. This server runs in a remote VM alongside the Jupyter Kernel Gateway to provide secure tool access.
- Tool Search Engine: Semantic search over tool corpus using embedding models
- Documentation Lookup: Retrieves detailed tool documentation and usage information
- Embedding Cache: Pre-computed embeddings for fast tool retrieval
- Multiple Retriever Support: Compatible with GTE-Qwen and NV-Embed models
```
sandbox_server/
├── search_api.py              # Main search server
├── embeddings.pkl             # Pre-computed tool embeddings
└── documentation_lookup.json  # Tool documentation database
```
- Prepare Required Files (generate locally, then transfer to the remote VM):

```bash
# Generate embeddings and documentation (after dataset processing)
cd dataset
python make_simulation.py --config_file configs/defaults.yaml

# This creates:
# - embeddings.pkl (tool corpus embeddings)
# - documentation_lookup.json (tool documentation)
# - sandbox_utils.py (executable tools)
```

- Start the Sandbox Server (on the remote VM):

```bash
cd sandbox_server
python search_api.py
```

The server runs on port 5000 by default and exposes these endpoints:

- `POST /search`: Search for relevant tools, e.g. `{"query": "find movies by director"}`
- `POST /info`: Get tool documentation, e.g. `{"tool_name": "function_123"}`
- `POST /debug`: Debug endpoint for testing queries
Both servers require:

```bash
# Core dependencies
pip install tornado flask requests
pip install sentence-transformers torch
pip install jupyter-client ipykernel

# For the kernel gateway server
pip install jupyter_kernel_gateway

# For the sandbox server specifically
pip install sentence-transformers
```
- Process Dataset (creates tools and documentation; run locally):

```bash
cd dataset
python preprocess.py --config_file configs/defaults.yaml
python make_simulation.py --config_file configs/defaults.yaml
```

- Transfer Files to the Remote VM:

```bash
# Transfer generated files to the remote VM
scp sandbox_utils.py user@remote-vm:/path/to/workdir/
scp embeddings.pkl user@remote-vm:/path/to/workdir/sandbox_server/
scp documentation_lookup.json user@remote-vm:/path/to/workdir/sandbox_server/
scp -r spider_data/ user@remote-vm:/path/to/workdir/
```

- Start the Jupyter Kernel Gateway Server (on the remote VM):

```bash
jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket --port=8000
```

- Start the Sandbox Server (on the remote VM):

```bash
cd sandbox_server
python search_api.py
```

- Start the Controller Server (locally):

```bash
python api.py --port 8088
```

- Run the Evaluation (locally):

```bash
python -m codeact.run --config_file configs/your_config.yaml
```
Update your evaluation config to point to the running servers. Note that the controller runs locally while the sandbox and Jupyter kernel gateway servers run on the remote VM:

```yaml
controller:
  url: <remote-vm-ip>   # IP address of the remote VM
  port: 8000
  db_base_path: spider_data/database

sandbox:
  url: <remote-vm-ip>   # IP address of the remote VM
  port: 5000
```

Controller Server Issues:
- Ensure Spider databases are in `spider_data/database/`
- Check that Jupyter kernels can be created: `jupyter kernelspec list`
- Verify port availability: `lsof -i :8088`
- Check that tornado and the other dependencies are installed
Jupyter Kernel Gateway Issues (on remote VM):
- Ensure `sandbox_utils.py` exists (created by dataset processing)
- Ensure Spider databases are in `spider_data/database/`
- Check that Jupyter kernels can be created: `jupyter kernelspec list`
- Verify port availability: `lsof -i :8000`
- Check the kernel gateway installation: `jupyter kernelgateway --help`
- Ensure network connectivity between the local machine and the remote VM
Sandbox Server Issues (on remote VM):
- Ensure `embeddings.pkl` and `documentation_lookup.json` exist
- Check the retriever model download: models are cached in `~/.cache/huggingface/`
- Verify embedding model compatibility (GTE-Qwen or NV-Embed)
- Ensure network connectivity between the local machine and the remote VM
