Hell or High Water (HOHW): Evaluating Agentic Recovery from External Failures

A comprehensive benchmark for evaluating large language models' ability to discover, retrieve, and use tools effectively in complex multi-step reasoning tasks. This repository contains the code and infrastructure for our research on tool-use capabilities in LLMs.

Overview

HOHW addresses the challenge of evaluating how well large language models can work with tools in realistic scenarios. Rather than providing all available tools upfront, our benchmark requires models to:

  1. Search and discover relevant tools from a large corpus
  2. Understand tool documentation and functionality
  3. Compose multiple tool calls to solve complex tasks
  4. Handle tool failures and adversarial conditions

The benchmark is built on database querying tasks derived from the Spider dataset, where each database operation is abstracted into callable functions with natural language documentation.
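
For intuition, a hypothetical tool of this kind might look like the function below; the table, column, and function names are invented for illustration, and the real functions are generated into sandbox_utils.py during dataset processing.

import sqlite3

# Hypothetical example of an abstracted database tool; the actual functions
# are generated from Spider queries and documented in natural language.
def get_movies_by_director(director_name):
    """Return the title and release year of every movie by the given director."""
    conn = sqlite3.connect("spider_data/database/movie_1/movie_1.sqlite")
    try:
        cursor = conn.execute(
            "SELECT title, year FROM movie WHERE director = ?",
            (director_name,),
        )
        return cursor.fetchall()
    finally:
        conn.close()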

Key Features

  • Dynamic Tool Discovery: Models must search for and retrieve relevant tools from a corpus of 1000+ functions
  • Realistic Tool Documentation: Tools are documented with natural language descriptions, not raw SQL
  • Multi-step Reasoning: Tasks require composing multiple tool calls in sequence
  • Adversarial Evaluation: Optional tool failure simulation to test robustness
  • Multiple LLM Support: Compatible with OpenAI, Google Gemini, and local vLLM deployments
  • Comprehensive Metrics: Tracks success rates, tool usage patterns, and error analysis

Key Components

Agent Framework (CodeAct)

  • Multi-turn conversational agent
  • Code generation and execution capabilities
  • Error handling and recovery mechanisms
  • Support for multiple LLM backends
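
The interaction loop can be sketched roughly as follows; this is purely illustrative (the actual agent in the codeact package uses its own prompts, parsing, and backends), and llm_generate / sandbox_execute are placeholder callables:

# Illustrative outline only: generate code, run it in the sandbox, feed the
# observation back, and repeat until the agent stops or the turn budget
# (20 in the default config) is exhausted.
def run_episode(llm_generate, sandbox_execute, task, max_turns=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        code = llm_generate(history)          # agent proposes Python code
        observation = sandbox_execute(code)   # result or error from the kernel
        history.append({"role": "assistant", "content": code})
        history.append({"role": "user", "content": observation})
        if "FINAL ANSWER" in observation:     # placeholder stopping condition
            return observation
    return None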

Tool Search and Retrieval

  • Semantic similarity search using sentence transformers
  • Configurable retrieval models (GTE-Qwen, NV-Embed)
  • Dynamic corpus filtering and ranking
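
A minimal sketch of this retrieval step, assuming the Alibaba-NLP/gte-Qwen2-1.5B-instruct retriever named in configs/defaults.yaml and a toy in-memory corpus (the real corpus is the pre-computed embeddings.pkl served by the sandbox server):

# Minimal retrieval sketch; the benchmark itself loads pre-computed
# embeddings rather than encoding the corpus on the fly.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

tool_docs = [
    "get_movies_by_director: return every movie made by a given director",
    "count_students_by_major: count students enrolled in each major",
]
corpus_embeddings = model.encode(tool_docs, convert_to_tensor=True)

query_embedding = model.encode("find movies by director", convert_to_tensor=True)

# Return the top-k most similar tool descriptions (k=9 in the default config)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(tool_docs[hit["corpus_id"]], hit["score"])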

Adversarial Evaluation

  • Configurable tool failure simulation
  • Multiple failure strategies (block_first, block_simple)
  • Robustness testing under degraded conditions
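
As a purely illustrative sketch of the idea (this is not the repository's adversary; the exact semantics of block_first and block_simple are defined in the code and enabled via the adversary config section), a tool failure can be simulated by wrapping a tool so that a call is rejected:

# Illustrative only: make a tool fail on its first invocation so the agent
# must notice the error and recover.
def fail_first_call(tool):
    state = {"called": False}
    def wrapped(*args, **kwargs):
        if not state["called"]:
            state["called"] = True
            raise RuntimeError(f"Tool '{tool.__name__}' is temporarily unavailable")
        return tool(*args, **kwargs)
    return wrapped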

Human Evaluation Interface

  • Interactive evaluation environment
  • Side-by-side comparison capabilities
  • Manual verification of model outputs

Architecture

Infrastructure

The system consists of several key components:

  • Agent: The LLM being evaluated (CodeAct framework)
  • Search Engine: Tool retrieval system using semantic embedding search
  • Sandbox Environment: Secure execution environment for generated code
  • Controller: Manages database access and query execution
  • Adversary: Optional component for simulating tool failures

Installation

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (recommended for retrieval models)
  • Spider dataset

Setup

  1. Clone the repository:
git clone <repository-url>
cd hell-or-high-water
  2. Install dependencies:
# TODO: Create requirements.txt for easier installation
pip install torch transformers datasets sentence-transformers
pip install sqlglot flask requests openai google-generativeai
pip install pyyaml tqdm jinja2
pip install jupyter_kernel_gateway
  3. Install the local retriever package:
cd retriever_wrapper
pip install --editable .
cd ..
  4. Download the Spider dataset and place it in the appropriate directory structure.

Usage

Basic Evaluation

Run an evaluation on the benchmark:

python -m codeact.run --config_file configs/your_config.yaml

For different model types:

# OpenAI models
python -m codeact.run_openai --config_file configs/openai_config.yaml

# Google Gemini models  
python -m codeact.run_google --config_file configs/gemini_config.yaml

# Human evaluation interface
python -m human.run --config_file configs/human_config.yaml

Configuration

All runtime configurations are specified in YAML config files. See configs/defaults.yaml for a template:

data:
  dataset_path: # Path to processed dataset
  split: train/validation/test
  corpus_path: # Path to tool corpus embeddings
  documents_path: # Path to tool documentation JSON

variables:
  turns: 20 # Max function calls allowed
  k: 9 # Number of tools to retrieve per search
  use_gold: false # Whether to use oracle tool retrieval

search:
  retriever_path: Alibaba-NLP/gte-Qwen2-1.5B-instruct
  retriever_device: cuda

generation_configs:
  model_name: # Model identifier
  temperature: 0.0
  max_tokens: 1024
  random_state: 42

llm_service:
  url: # Inference server endpoint
  openai_api_key: # API key for OpenAI/compatible services

controller:
  url: localhost
  port: 8000
  db_base_path: spider_data/database

sandbox:
  url: localhost
  port: 5000

adversary:
  enabled: false # Enable adversarial evaluation
  strategy: block_first # Tool failure strategy

Creating Custom Configurations

Use the config generator for quick setup:

python make_config.py configs/defaults.yaml configs/my_config.yaml \
  --model_name gpt-4 \
  --variables_k 5 \
  --adversary_enabled

Dataset Processing

The benchmark includes tools for processing the Spider dataset into the tool-use format:

1. Preprocessing

Convert Spider queries into parameterized tool functions:

cd dataset
python preprocess.py --config_file configs/defaults.yaml

2. Simulation Environment

Create the executable simulation environment:

cd dataset
python make_simulation.py --config_file configs/defaults.yaml

3. Postprocessing and Augmentation

Filter and augment the dataset:

cd dataset
python postprocessing.py --config_file configs/defaults.yaml

Server Infrastructure

The evaluation system uses the CodeAct sandbox architecture and requires three main servers to operate:

  1. Controller Server (api.py) - Orchestrates evaluation sessions and manages communication (runs locally)
  2. Jupyter Kernel Gateway Server - Provides isolated code execution environments (runs in remote VM)
  3. Sandbox Server (sandbox_server/search_api.py) - Handles tool search and documentation retrieval (runs in remote VM)

Controller Server

The controller server orchestrates the evaluation process and manages communication between components.

Components

  • Session Management: Handles multiple concurrent evaluation sessions with automatic cleanup
  • Jupyter Kernel Coordination: Creates and manages connections to Jupyter kernel gateway
  • Communication Hub: Routes messages between agents and execution environments
  • Lifecycle Management: Handles kernel creation, cleanup, and timeout management

Required Files

api.py                    # Main controller server
jupyter_helper.py         # Jupyter kernel management utilities
environment.py           # Game environment and execution logic

Setup and Running

# Start the controller server (default port 8088)
python api.py

# Or specify a custom port
python api.py --port 8000

Configuration

The controller server accepts these parameters:

  • --port: Port number for the server (default: 8088)
  • Environment variables:
    • DEBUG=True: Enable debug logging
    • CLEANUP_TIMEOUT_MS=60000: Kernel cleanup interval in milliseconds

Jupyter Kernel Gateway Server

The kernel gateway server provides the execution environment for agent-generated code using Jupyter's kernel gateway. This server runs in a remote VM to provide isolation and security for code execution.

Components

  • Jupyter Kernel Management: Creates and manages isolated Python execution environments per conversation
  • Code Execution: Executes agent-generated Python code safely in sandboxed kernels
  • Database Access: Provides access to Spider databases for SQL query execution
  • WebSocket Communication: Handles real-time communication with evaluation clients

Required Files

sandbox_utils.py    # Generated tool functions (from dataset processing)
spider_data/database/       # Spider dataset databases

Setup and Running

  1. Install Jupyter Kernel Gateway (on remote VM):

    pip install jupyter_kernel_gateway
  2. Start the Kernel Gateway Server (on remote VM):

    jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket --port=8000

Configuration

The kernel gateway server accepts these parameters:

  • --port: Port number for the server (default: 8888)
  • --KernelGatewayApp.api=kernel_gateway.jupyter_websocket: Use WebSocket API for communication
  • Additional Jupyter configuration options can be set via environment variables or config files

Sandbox Server

The sandbox server provides tool search and documentation retrieval capabilities. This server runs in a remote VM alongside the Jupyter Kernel Gateway to provide secure tool access.

Components

  • Tool Search Engine: Semantic search over tool corpus using embedding models
  • Documentation Lookup: Retrieves detailed tool documentation and usage information
  • Embedding Cache: Pre-computed embeddings for fast tool retrieval
  • Multiple Retriever Support: Compatible with GTE-Qwen and NV-Embed models

Required Files

sandbox_server/
├── search_api.py           # Main search server
├── embeddings.pkl          # Pre-computed tool embeddings
└── documentation_lookup.json  # Tool documentation database

Setup and Running

  1. Prepare Required Files (generate locally, then transfer to remote VM):

    # Generate embeddings and documentation (after dataset processing)
    cd dataset
    python make_simulation.py --config_file configs/defaults.yaml
    
    # This creates:
    # - embeddings.pkl (tool corpus embeddings)
    # - documentation_lookup.json (tool documentation)
    # - sandbox_utils.py (executable tools)
  2. Start the Sandbox Server (on remote VM):

    cd sandbox_server
    python search_api.py

    The server runs on port 5000 by default.

API Endpoints

  • POST /search: Search for relevant tools
    {"query": "find movies by director"}
  • POST /info: Get tool documentation
    {"tool_name": "function_123"}
  • POST /debug: Debug endpoint for testing queries
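
Assuming the server is running on its default port 5000 (substitute the remote VM's address when applicable), the endpoints can be exercised with requests; the tool name below is illustrative:

import requests

base = "http://localhost:5000"

# Search for relevant tools
print(requests.post(f"{base}/search", json={"query": "find movies by director"}).json())

# Look up documentation for a specific tool
print(requests.post(f"{base}/info", json={"tool_name": "function_123"}).json())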

Server Dependencies

Both remote-VM servers (the kernel gateway and the sandbox server) require:

# Core dependencies
pip install tornado flask requests
pip install sentence-transformers torch
pip install jupyter-client ipykernel

# For kernel gateway server
pip install jupyter_kernel_gateway

# For sandbox server specifically  
pip install sentence-transformers

Complete Server Setup Workflow

  1. Process Dataset (creates tools and documentation - run locally):

    cd dataset
    python preprocess.py --config_file configs/defaults.yaml
    python make_simulation.py --config_file configs/defaults.yaml
  2. Transfer Files to Remote VM:

    # Transfer generated files to remote VM
    scp sandbox_utils.py user@remote-vm:/path/to/workdir/
    scp embeddings.pkl user@remote-vm:/path/to/workdir/sandbox_server/
    scp documentation_lookup.json user@remote-vm:/path/to/workdir/sandbox_server/
    scp -r spider_data/ user@remote-vm:/path/to/workdir/
  3. Start Jupyter Kernel Gateway Server (on remote VM):

    jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket --port=8000
  4. Start Sandbox Server (on remote VM):

    cd sandbox_server
    python search_api.py
  5. Start Controller Server (locally):

    python api.py --port 8088
  6. Run Evaluation (locally):

    python -m codeact.run --config_file configs/your_config.yaml

Server Configuration in Config Files

Update your evaluation config to point to the running servers. Note that the controller runs locally while the sandbox and jupyter servers run on a remote VM:

controller:
  url: <remote-vm-ip>  # IP address of remote VM
  port: 8000
  db_base_path: spider_data/database

sandbox:
  url: <remote-vm-ip>  # IP address of remote VM
  port: 5000

Troubleshooting

Controller Server Issues:

  • Ensure Spider databases are in spider_data/database/
  • Check that Jupyter kernels can be created: jupyter kernelspec list
  • Verify port availability: lsof -i :8088
  • Check tornado and other dependencies are installed

Jupyter Kernel Gateway Issues (on remote VM):

  • Ensure sandbox_utils.py exists (created by dataset processing)
  • Ensure Spider databases are in spider_data/database/
  • Check that Jupyter kernels can be created: jupyter kernelspec list
  • Verify port availability: lsof -i :8000
  • Check kernel gateway installation: jupyter kernelgateway --help
  • Ensure network connectivity between local machine and remote VM

Sandbox Server Issues (on remote VM):

  • Ensure embeddings.pkl and documentation_lookup.json exist
  • Check retriever model download: models are cached in ~/.cache/huggingface/
  • Verify embedding model compatibility (GTE-Qwen or NV-Embed)
  • Ensure network connectivity between local machine and remote VM

About

Code and data for the paper: "Hell or High Water: Evaluating Agentic Recovery from External Failures"
