A comprehensive benchmark for evaluating large language models' ability to discover, retrieve, and use tools effectively in complex multi-step reasoning tasks. This repository contains the code and infrastructure for our research on tool-use capabilities in LLMs.
HOHW addresses the challenge of evaluating how well large language models can work with tools in realistic scenarios. Rather than providing all available tools upfront, our benchmark requires models to:
- Search and discover relevant tools from a large corpus
- Understand tool documentation and functionality
- Compose multiple tool calls to solve complex tasks
- Handle tool failures and adversarial conditions
The benchmark is built on database querying tasks derived from the Spider dataset, where each database operation is abstracted into callable functions with natural language documentation.
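For illustration, a tool in the corpus looks roughly like the sketch below. The function name, signature, SQL, and database path are hypothetical, chosen only to show the shape of an abstracted, documented database operation; the actual generated tools live in `sandbox_utils.py` and differ in detail.

```python
# Hypothetical example of an abstracted database tool. The name, parameters,
# schema, and database path are illustrative assumptions, not dataset content.
import sqlite3

def get_movies_by_director(director_name: str) -> list[tuple]:
    """Return the title and release year of every movie directed by the given director."""
    conn = sqlite3.connect("spider_data/database/film/film.sqlite")  # hypothetical path
    try:
        cur = conn.execute(
            "SELECT title, release_year FROM movie WHERE director = ?",
            (director_name,),
        )
        return cur.fetchall()
    finally:
        conn.close()
```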
- Dynamic Tool Discovery: Models must search for and retrieve relevant tools from a corpus of 1000+ functions
- Realistic Tool Documentation: Tools are documented with natural language descriptions, not raw SQL
- Multi-step Reasoning: Tasks require composing multiple tool calls in sequence
- Adversarial Evaluation: Optional tool failure simulation to test robustness
- Multiple LLM Support: Compatible with OpenAI, Google Gemini, and local vLLM deployments
- Comprehensive Metrics: Tracks success rates, tool usage patterns, and error analysis
- CodeAct agent:
  - Multi-turn conversational agent
  - Code generation and execution capabilities
  - Error handling and recovery mechanisms
  - Support for multiple LLM backends
- Tool retrieval:
  - Semantic similarity search using sentence transformers
  - Configurable retrieval models (GTE-Qwen, NV-Embed)
  - Dynamic corpus filtering and ranking
- Adversarial evaluation:
  - Configurable tool failure simulation
  - Multiple failure strategies (block_first, block_simple)
  - Robustness testing under degraded conditions
- Human evaluation interface:
  - Interactive evaluation environment
  - Side-by-side comparison capabilities
  - Manual verification of model outputs
The system consists of several key components:
- Agent: The LLM being evaluated (CodeAct framework)
- Search Engine: Tool retrieval system using semantic embedding search
- Sandbox Environment: Secure execution environment for generated code
- Controller: Manages database access and query execution
- Adversary: Optional component for simulating tool failures
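To make the Search Engine concrete, here is a minimal retrieval sketch using sentence-transformers with the default retriever from `configs/defaults.yaml`. The mini-corpus, prompt handling, and ranking details are simplified assumptions; the benchmark's actual search logic lives in `sandbox_server/search_api.py`.

```python
# Simplified sketch of semantic tool retrieval: embed tool documentation,
# embed the query, and return the top-k most similar tools.
from sentence_transformers import SentenceTransformer, util

# Hypothetical mini-corpus of tool descriptions (the real corpus has 1000+ entries)
tool_docs = {
    "function_101": "Return all movies released after a given year.",
    "function_102": "Count the number of students enrolled in each course.",
    "function_103": "List all movies directed by a given director.",
}

# Default retriever from the config; trust_remote_code is needed for this model
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

corpus_embeddings = model.encode(list(tool_docs.values()), convert_to_tensor=True)
query_embedding = model.encode("find movies by director", convert_to_tensor=True)

# Rank tools by cosine similarity and keep the top k (k=9 in the default config)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    name = list(tool_docs)[hit["corpus_id"]]
    print(name, round(hit["score"], 3))
```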
- Python 3.8+
- CUDA-compatible GPU (recommended for retrieval models)
- Spider dataset
- Clone the repository:

```bash
git clone <repository-url>
cd hell-or-high-water
```

- Install dependencies:

```bash
# TODO: Create requirements.txt for easier installation
pip install torch transformers datasets sentence-transformers
pip install sqlglot flask requests openai google-generativeai
pip install pyyaml tqdm jinja2
pip install jupyter_kernel_gateway
```

- Install the local retriever package:

```bash
cd retriever_wrapper
pip install --editable .
cd ..
```

- Download the Spider dataset and place it in the appropriate directory structure.
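Optionally, run a quick sanity check that the core packages import and a GPU is visible (a minimal sketch; nothing here is specific to the benchmark):

```python
# Quick environment sanity check: core imports and GPU visibility.
import torch
import transformers
import sentence_transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
```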
Run an evaluation on the benchmark:

```bash
python -m codeact.run --config_file configs/your_config.yaml
```

For different model types:

```bash
# OpenAI models
python -m codeact.run_openai --config_file configs/openai_config.yaml

# Google Gemini models
python -m codeact.run_google --config_file configs/gemini_config.yaml

# Human evaluation interface
python -m human.run --config_file configs/human_config.yaml
```

All runtime configuration is specified in YAML config files. See `configs/defaults.yaml` for a template:
```yaml
data:
  dataset_path:          # Path to processed dataset
  split: train/validation/test
  corpus_path:           # Path to tool corpus embeddings
  documents_path:        # Path to tool documentation JSON

variables:
  turns: 20              # Max function calls allowed
  k: 9                   # Number of tools to retrieve per search
  use_gold: false        # Whether to use oracle tool retrieval

search:
  retriever_path: Alibaba-NLP/gte-Qwen2-1.5B-instruct
  retriever_device: cuda

generation_configs:
  model_name:            # Model identifier
  temperature: 0.0
  max_tokens: 1024
  random_state: 42

llm_service:
  url:                   # Inference server endpoint
  openai_api_key:        # API key for OpenAI/compatible services

controller:
  url: localhost
  port: 8000
  db_base_path: spider_data/database

sandbox:
  url: localhost
  port: 5000

adversary:
  enabled: false         # Enable adversarial evaluation
  strategy: block_first  # Tool failure strategy
```

Use the config generator for quick setup:
```bash
python make_config.py configs/defaults.yaml configs/my_config.yaml \
    --model_name gpt-4 \
    --variables_k 5 \
    --adversary_enabled
```

The benchmark includes tools for processing the Spider dataset into the tool-use format:
Convert Spider queries into parameterized tool functions:

```bash
cd dataset
python preprocess.py --config_file configs/defaults.yaml
```

Create the executable simulation environment:

```bash
cd dataset
python make_simulation.py --config_file configs/defaults.yaml
```

Filter and augment the dataset:

```bash
cd dataset
python postprocessing.py --config_file configs/defaults.yaml
```

The evaluation system uses the CodeAct sandbox architecture and requires three main servers to operate:
- Controller Server (`api.py`) - Orchestrates evaluation sessions and manages communication (runs locally)
- Jupyter Kernel Gateway Server - Provides isolated code execution environments (runs on a remote VM)
- Sandbox Server (`sandbox_server/search_api.py`) - Handles tool search and documentation retrieval (runs on a remote VM)
The controller server orchestrates the evaluation process and manages communication between components.
- Session Management: Handles multiple concurrent evaluation sessions with automatic cleanup
- Jupyter Kernel Coordination: Creates and manages connections to Jupyter kernel gateway
- Communication Hub: Routes messages between agents and execution environments
- Lifecycle Management: Handles kernel creation, cleanup, and timeout management
Key files:

```
api.py              # Main controller server
jupyter_helper.py   # Jupyter kernel management utilities
environment.py      # Game environment and execution logic
```
```bash
# Start the controller server on port 8000
python api.py --port 8000

# Or pick a different port (the default is 8088)
python api.py --port 9000
```

The controller server accepts these parameters:

- `--port`: Port number for the server (default: 8088)

Environment variables:

- `DEBUG=True`: Enable debug logging
- `CLEANUP_TIMEOUT_MS=60000`: Kernel cleanup interval in milliseconds
The kernel gateway server provides the execution environment for agent-generated code using Jupyter's kernel gateway. This server runs in a remote VM to provide isolation and security for code execution.
- Jupyter Kernel Management: Creates and manages isolated Python execution environments per conversation
- Code Execution: Executes agent-generated Python code safely in sandboxed kernels
- Database Access: Provides access to Spider databases for SQL query execution
- WebSocket Communication: Handles real-time communication with evaluation clients
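The evaluation client (`jupyter_helper.py`) manages kernels for you, but you can sanity-check a running gateway with the standard Jupyter REST API it exposes. A minimal sketch, assuming the gateway is reachable at `<remote-vm-ip>:8000`:

```python
# Connectivity check against the kernel gateway's standard Jupyter REST API.
# The host below is a placeholder; substitute your remote VM's address.
import requests

BASE = "http://<remote-vm-ip>:8000"

# Create a Python kernel and read back its id
resp = requests.post(f"{BASE}/api/kernels", json={"name": "python3"})
resp.raise_for_status()
kernel_id = resp.json()["id"]
print("Created kernel:", kernel_id)

# Code execution itself happens over the WebSocket channel at
# ws://<remote-vm-ip>:8000/api/kernels/<kernel_id>/channels (handled by jupyter_helper.py).

# Clean up the test kernel
requests.delete(f"{BASE}/api/kernels/{kernel_id}")
```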
The gateway's working directory on the remote VM needs:

```
sandbox_utils.py       # Generated tool functions (from dataset processing)
spider_data/database/  # Spider dataset databases
```
- Install Jupyter Kernel Gateway (on the remote VM):

```bash
pip install jupyter_kernel_gateway
```

- Start the Kernel Gateway Server (on the remote VM):

```bash
jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket --port=8000
```

The kernel gateway server accepts these parameters:

- `--port`: Port number for the server (default: 8888)
- `--KernelGatewayApp.api=kernel_gateway.jupyter_websocket`: Use the WebSocket API for communication
- Additional Jupyter configuration options can be set via environment variables or config files
The sandbox server provides tool search and documentation retrieval capabilities. This server runs in a remote VM alongside the Jupyter Kernel Gateway to provide secure tool access.
- Tool Search Engine: Semantic search over tool corpus using embedding models
- Documentation Lookup: Retrieves detailed tool documentation and usage information
- Embedding Cache: Pre-computed embeddings for fast tool retrieval
- Multiple Retriever Support: Compatible with GTE-Qwen and NV-Embed models
```
sandbox_server/
├── search_api.py              # Main search server
├── embeddings.pkl             # Pre-computed tool embeddings
└── documentation_lookup.json  # Tool documentation database
```
- Prepare Required Files (generate locally, then transfer to the remote VM):

```bash
# Generate embeddings and documentation (after dataset processing)
cd dataset
python make_simulation.py --config_file configs/defaults.yaml

# This creates:
# - embeddings.pkl (tool corpus embeddings)
# - documentation_lookup.json (tool documentation)
# - sandbox_utils.py (executable tools)
```

- Start the Sandbox Server (on the remote VM):

```bash
cd sandbox_server
python search_api.py
```

The server runs on port 5000 by default and exposes these endpoints:

- `POST /search`: Search for relevant tools, e.g. `{"query": "find movies by director"}`
- `POST /info`: Get tool documentation, e.g. `{"tool_name": "function_123"}`
- `POST /debug`: Debug endpoint for testing queries
Both servers require:

```bash
# Core dependencies
pip install tornado flask requests
pip install sentence-transformers torch
pip install jupyter-client ipykernel

# For the kernel gateway server
pip install jupyter_kernel_gateway

# For the sandbox server specifically
pip install sentence-transformers
```
- Process Dataset (creates tools and documentation; run locally):

```bash
cd dataset
python preprocess.py --config_file configs/defaults.yaml
python make_simulation.py --config_file configs/defaults.yaml
```

- Transfer Files to the Remote VM:

```bash
# Transfer generated files to the remote VM
scp sandbox_utils.py user@remote-vm:/path/to/workdir/
scp embeddings.pkl user@remote-vm:/path/to/workdir/sandbox_server/
scp documentation_lookup.json user@remote-vm:/path/to/workdir/sandbox_server/
scp -r spider_data/ user@remote-vm:/path/to/workdir/
```

- Start the Jupyter Kernel Gateway Server (on the remote VM):

```bash
jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket --port=8000
```

- Start the Sandbox Server (on the remote VM):

```bash
cd sandbox_server
python search_api.py
```

- Start the Controller Server (locally):

```bash
python api.py --port 8088
```

- Run the Evaluation (locally):

```bash
python -m codeact.run --config_file configs/your_config.yaml
```
Update your evaluation config to point to the running servers. Note that the controller runs locally while the sandbox and Jupyter kernel gateway servers run on the remote VM:

```yaml
controller:
  url: <remote-vm-ip>   # IP address of the remote VM
  port: 8000
  db_base_path: spider_data/database

sandbox:
  url: <remote-vm-ip>   # IP address of the remote VM
  port: 5000
```

Controller Server Issues:
- Ensure Spider databases are in `spider_data/database/`
- Check that Jupyter kernels can be created: `jupyter kernelspec list`
- Verify port availability: `lsof -i :8088`
- Check that tornado and the other dependencies are installed
Jupyter Kernel Gateway Issues (on remote VM):
- Ensure `sandbox_utils.py` exists (created by dataset processing)
- Ensure Spider databases are in `spider_data/database/`
- Check that Jupyter kernels can be created: `jupyter kernelspec list`
- Verify port availability: `lsof -i :8000`
- Check the kernel gateway installation: `jupyter kernelgateway --help`
- Ensure network connectivity between the local machine and the remote VM
Sandbox Server Issues (on remote VM):
- Ensure `embeddings.pkl` and `documentation_lookup.json` exist
- Check the retriever model download: models are cached in `~/.cache/huggingface/`
- Verify embedding model compatibility (GTE-Qwen or NV-Embed)
- Ensure network connectivity between the local machine and the remote VM
