Evaluates OpenSymbolicAI on the TravelPlanner benchmark (ICML 2024) — a challenging test of real-world planning where agents must produce complete multi-day travel itineraries satisfying budget, transportation, cuisine, and accommodation constraints.
Even GPT-4 achieves only a 0.6% final pass rate on this benchmark. The task requires gathering information from multiple sources, tracking costs, respecting constraints, and assembling a coherent day-by-day plan — exactly the kind of structured, multi-step reasoning that the GoalSeeking pattern is designed for.
OpenSymbolicAI achieves a 100% final pass rate on train, 99.4% on validation, and 97.9% on the full 1,000-task test set — near-perfect scores on every commonsense and hard constraint check, with zero errors and a 100% delivery rate.
Same model, same tools, same evaluation — only the framework differs. Full analysis in COMPARISON.md. For multi-model results across 11 LLMs and 4 providers, see MODEL-LANDSCAPE.md.
| Metric | OpenSymbolicAI | LangChain | CrewAI |
|---|---|---|---|
| Pass Rate | 100% | 77.8% | 73.3% |
| Tokens / Task | 13,936 | 43,801 (3.1x) | 81,331 (5.8x) |
| LLM Calls / Task | 2.3 | 13.5 (5.9x) | 39.6 (17x) |
| Cost / Passing Task | $0.013 | $0.051 (4.1x) | $0.100 (8x) |
| Latency | 47s | 73s (1.5x) | 124s (2.6x) |
Multipliers show how much more each framework consumes relative to OpenSymbolicAI. Measured on 45 train tasks (15 easy + 15 medium + 15 hard) with `gpt-oss-120b` via Fireworks AI.
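The multipliers are simple per-task ratios against the OpenSymbolicAI column. A quick sanity check using the rounded values from the table (the cost row can differ slightly, since the displayed dollar figures are themselves rounded):

```python
# Consumption ratios relative to OpenSymbolicAI, from the table's rounded values.
osa_tokens, osa_calls = 13_936, 2.3

langchain_tokens_x = 43_801 / osa_tokens   # ~3.1x
crewai_tokens_x    = 81_331 / osa_tokens   # ~5.8x
langchain_calls_x  = 13.5 / osa_calls      # ~5.9x
crewai_calls_x     = 39.6 / osa_calls      # ~17x

print(round(langchain_tokens_x, 1), round(crewai_tokens_x, 1))  # 3.1 5.8
```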
Train split (45 tasks):

| Level | Tasks | Delivery | Commonsense | Hard Constraints | Final Pass | Avg Time |
|---|---|---|---|---|---|---|
| Easy | 15 | 100% | 100% | 100% | 100% | — |
| Medium | 15 | 100% | 100% | 100% | 100% | — |
| Hard | 15 | 100% | 100% | 100% | 100% | — |
| All | 45 | 100% | 100% | 100% | 100% | 52.6s |
Validation split (180 tasks):

| Level | Tasks | Delivery | Commonsense | Hard Constraints | Final Pass | Avg Time |
|---|---|---|---|---|---|---|
| Easy | 60 | 100% | 100% | 100% | 100% | — |
| Medium | 60 | 100% | 98.3% | 100% | 98.3% | — |
| Hard | 60 | 100% | 100% | 100% | 100% | — |
| All | 180 | 100% | 99.4% | 100% | 99.4% | 55.5s |
Test split (1,000 tasks):

| Level | Tasks | Delivery | Commonsense | Hard Constraints | Final Pass | Avg Time |
|---|---|---|---|---|---|---|
| Easy | 348 | 100% | 98.6% | 100% | 98.6% | — |
| Medium | 333 | 100% | 97.0% | 100% | 97.0% | — |
| Hard | 319 | 100% | 98.1% | 100% | 98.1% | — |
| All | 1,000 | 100% | 97.9% | 100% | 97.9% | 52.4s |
All hard constraint checks pass at 100% across all splits. Commonsense micro averages 99.7% on the full test set:
| Constraint Category | Check | Train | Validation | Test (1,000) |
|---|---|---|---|---|
| Commonsense | Within Sandbox (entities exist in DB) | 100% | 100% | 100% |
| Commonsense | Complete Information (no missing fields) | 100% | 100% | 99.9% |
| Commonsense | Within Current City (activities match city) | 100% | 100% | 99.6% |
| Commonsense | Reasonable City Route (origin → dest → origin) | 100% | 100% | 99.5% |
| Commonsense | Diverse Restaurants (no duplicates) | 100% | 100% | 98.9% |
| Commonsense | Diverse Attractions (no duplicates) | 100% | 100% | 100% |
| Commonsense | Non-Conflicting Transport (single mode) | 100% | 100% | 100% |
| Commonsense | Valid Accommodation (minimum nights) | 100% | 100% | 100% |
| Hard | Budget (total cost within limit) | 100% | 100% | 100% |
| Hard | Room Rule (house rules compliance) | 100% | 100% | 100% |
| Hard | Room Type (entire home/private/shared) | 100% | 100% | 100% |
| Hard | Cuisine (all required cuisines covered) | 100% | 100% | 100% |
| Hard | Transportation (forbidden mode not used) | 100% | 100% | 100% |
Results from the TravelPlanner paper (ICML 2024) on the validation split:
| Method | Delivery | Commonsense | Hard | Final Pass |
|---|---|---|---|---|
| GPT-3.5-Turbo | 100% | 2.9% | 1.7% | 0.6% |
| GPT-4 | 100% | 6.4% | 3.7% | 0.6% |
| GPT-4-Turbo | 99.4% | 11.7% | 4.6% | 4.4% |
| Gemini 1.5 Pro | 98.3% | 7.8% | 4.5% | 3.9% |
| OpenSymbolicAI (ours) | 100% | 99.4% | 100% | 99.4% |
Model: `gpt-oss-120b` via Fireworks AI. Each task uses 1 retrieval iteration + 1 assembly iteration (2 LLM calls total). No retries needed. All splits are full: train (45), validation (180), test (1,000). The 0.6% validation miss is a single commonsense constraint (`within_current_city`) on one medium task. Hard constraints are 100% across all 1,225 tasks.
TravelPlanner gives the agent a natural language travel request and asks it to produce a complete itinerary:
Query: Plan a 3-day trip from Sarasota to Chicago for 1 person with a budget of $1,900, from March 22nd to March 24th, 2022.
The agent must:
- Search for flights, restaurants, accommodations, and attractions
- Plan a day-by-day itinerary with transportation, meals, sightseeing, and lodging
- Satisfy all explicit constraints (budget, cuisine, room type, etc.)
- Respect commonsense rules (no duplicate restaurants, valid city routes, etc.)
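The deliverable is a structured day-by-day plan. A sketch of a single day's entry, with field names modeled on the TravelPlanner plan schema (treat the exact keys and venue names as illustrative):

```python
# One day of a delivered itinerary (illustrative field names and values).
day_entry = {
    "days": 1,
    "current_city": "from Sarasota to Chicago",
    "transportation": "Flight Number: F0123, from Sarasota to Chicago",  # hypothetical flight
    "breakfast": "-",                       # "-" marks a field intentionally left empty
    "lunch": "Cafe A, Chicago",             # hypothetical restaurant
    "dinner": "Bistro B, Chicago",          # hypothetical restaurant
    "attraction": "Millennium Park, Chicago;",
    "accommodation": "Downtown loft, Chicago",  # hypothetical listing
}

assert set(day_entry) >= {"days", "current_city", "transportation", "accommodation"}
```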
| Level | Description | Constraints |
|---|---|---|
| Easy | Single city, 1 person | Budget only |
| Medium | Single/multi city, 2-8 people | Budget + 1 constraint (cuisine, room type, or room rule) |
| Hard | Multi-city, variable group | Budget + 3 constraints (cuisine + room type/rule + transportation) |
| Split | Size | Purpose |
|---|---|---|
| Train | 45 | Human-annotated reference plans |
| Validation | 180 | Evaluation with ground truth |
| Test | 1,000 | Blind test set |
Source: HuggingFace osunlp/TravelPlanner
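Each dataset record bundles the natural language query with machine-readable constraints. A minimal parsing sketch — field names such as `local_constraint` follow the public dataset, but verify against the actual schema before relying on them:

```python
import json

# A trimmed record in the shape the loader expects (illustrative values).
record = {
    "org": "Sarasota",
    "dest": "Chicago",
    "days": 3,
    "people_number": 1,
    "budget": 1900,
    "level": "easy",
    "local_constraint": json.dumps({"house rule": None, "cuisine": None,
                                    "room type": None, "transportation": None}),
}

# Keep only the constraints that are actually active for this task.
constraints = json.loads(record["local_constraint"])
active = {k: v for k, v in constraints.items() if v is not None}
print(active)  # {} — easy tasks carry only the budget constraint
```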
The benchmark evaluates two constraint categories, commonsense and hard, with 13 individual checks in total:
Commonsense checks:

| Check | Description |
|---|---|
| Within Sandbox | All entities (flights, restaurants, hotels, attractions) exist in the database |
| Complete Information | No excessive missing fields in the itinerary |
| Within Current City | Daily activities match the designated city |
| Reasonable City Route | Starts from origin, returns to origin, logical sequence |
| Diverse Restaurants | No restaurant visited more than once |
| Diverse Attractions | No attraction visited more than once |
| Non-Conflicting Transport | No mixing self-driving with flights |
| Valid Accommodation | Meets minimum-nights requirements |
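Checks like Diverse Restaurants reduce to simple set logic over the delivered plan. A sketch, assuming meals are plain strings and `"-"` marks a skipped meal (the real checker lives in `evaluation.py`):

```python
def diverse_restaurants(plan: list[dict]) -> bool:
    """Pass if no restaurant appears more than once across all meals."""
    meals = [
        day[slot]
        for day in plan
        for slot in ("breakfast", "lunch", "dinner")
        if day.get(slot, "-") != "-"   # "-" means no meal planned
    ]
    return len(meals) == len(set(meals))

plan = [
    {"breakfast": "-", "lunch": "Cafe A", "dinner": "Bistro B"},
    {"breakfast": "Cafe C", "lunch": "Cafe A", "dinner": "Bistro D"},
]
print(diverse_restaurants(plan))  # False: "Cafe A" is visited twice
```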
Hard constraint checks:

| Check | Description |
|---|---|
| Budget | Total cost (flights + meals + accommodation) within stated budget |
| Room Rule | Accommodation complies with house rules (no smoking, no parties, etc.) |
| Room Type | Correct room type (entire home, private room, shared room) |
| Cuisine | All required cuisines represented in meals |
| Transportation | Forbidden transport mode not used (e.g., "no flights") |
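The Budget check is a straight sum over the itinerary. A simplified sketch — the real checker in `evaluation.py` also prices flights and per-person meals from the reference database, which this stand-in omits:

```python
def within_budget(costs: list[float], budget: float) -> bool:
    """Pass if the summed cost of transport, meals, and lodging fits the budget."""
    return sum(costs) <= budget

# Hypothetical 3-day trip: outbound flight, meals, two hotel nights, return flight.
costs = [450.0, 35.0, 60.0, 180.0, 180.0, 450.0]
print(within_budget(costs, budget=1900))  # True: total is 1355.0
```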
- Delivery Rate — Did the agent produce a parseable plan?
- Commonsense Macro — Fraction of plans passing ALL 8 commonsense checks
- Hard Macro — Fraction of plans passing ALL applicable hard checks
- Final Pass Rate — Plans passing both commonsense AND hard constraints (headline metric)
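Macro rates only credit a plan when every check in the category passes, while micro rates pool individual checks across tasks. A sketch of the aggregation (per-task results are hypothetical):

```python
def macro_rate(results: list[dict[str, bool]]) -> float:
    """Fraction of tasks where ALL checks in the category passed."""
    return sum(all(r.values()) for r in results) / len(results)

def micro_rate(results: list[dict[str, bool]]) -> float:
    """Fraction of individual checks passed, pooled across tasks."""
    checks = [passed for r in results for passed in r.values()]
    return sum(checks) / len(checks)

# Two tasks, two commonsense checks each (illustrative).
commonsense = [
    {"diverse_restaurants": True, "reasonable_route": True},
    {"diverse_restaurants": True, "reasonable_route": False},
]
print(macro_rate(commonsense), micro_rate(commonsense))  # 0.5 0.75
```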
- Python 3.12+
- uv — fast Python package manager
- opensymbolicai-core — the core GoalSeeking framework
- An API key for at least one LLM provider
```bash
git clone https://github.com/OpenSymbolicAI/benchmark-py-TravelPlanner.git
cd benchmark-py-TravelPlanner
uv sync
```

Create a `.env` file in the project root:
```bash
# LLM provider (pick one based on --provider flag)
FIREWORKS_API_KEY=...         # Default provider - https://fireworks.ai
OPENAI_API_KEY=sk-...         # For --provider openai
ANTHROPIC_API_KEY=sk-ant-...  # For --provider anthropic
GROQ_API_KEY=gsk_...          # For --provider groq
```

```bash
# Run 5 easy validation tasks with GPT-4o
uv run travelplanner-bench --model gpt-4o --provider openai --level easy -n 5

# Run 5 easy tasks with Fireworks
uv run travelplanner-bench --model gpt-oss-120b --provider fireworks --level easy -n 5
```

Filter by difficulty level:

```bash
# Easy tasks only (budget constraint, single city)
uv run travelplanner-bench --model gpt-4o --provider openai --level easy

# Medium tasks (budget + 1 constraint)
uv run travelplanner-bench --model gpt-4o --provider openai --level medium

# Hard tasks (budget + 3 constraints, multi-city)
uv run travelplanner-bench --model gpt-4o --provider openai --level hard
```

Run tasks in parallel:

```bash
uv run travelplanner-bench --model gpt-4o --provider openai --parallel 5
```

Other providers:

```bash
# Anthropic Claude
uv run travelplanner-bench --model claude-sonnet-4-20250514 --provider anthropic -n 10

# Fireworks (open-source models)
uv run travelplanner-bench --model gpt-oss-120b --provider fireworks -n 10

# Local Ollama
uv run travelplanner-bench --model llama3 --provider ollama -n 5
```

Send structured traces to the local observability stack for per-span inspection of planning, execution, and goal-seeking iterations:

```bash
# Start the observability stack (collector + dashboard)
cd /path/to/OpenSymbolicAI/observability && docker compose up

# Run with tracing enabled
uv run travelplanner-bench --model gpt-oss-120b --provider fireworks -n 5 --observe
```

Traces are sent to http://localhost:8100/events and viewable in the dashboard at http://localhost:8101. Each task emits traces for all three agent layers:
- TravelPlannerAgent — top-level GoalSeeking orchestrator spans
- RetrievalAgent — GoalSeeking iteration spans (search primitives, evaluations)
- PlanAssemblerAgent — DesignExecute spans (plan generation, execution steps)
```
uv run travelplanner-bench --model MODEL --provider PROVIDER [OPTIONS]

Required:
  --model MODEL            Model name/ID (e.g., gpt-4o, gpt-oss-120b)
  --provider {ollama,openai,anthropic,fireworks,groq}
                           LLM provider

Options:
  --split {train,validation,test}   Dataset split (default: validation)
  -l, --level {easy,medium,hard}    Filter by difficulty level
  -n, --num NUM            Number of tasks (default: all)
  --max-iterations N       Max agent iterations per task (default: 10)
  -p, --parallel N         Parallel workers (default: 3)
  --shuffle                Shuffle tasks
  --seed SEED              Random seed (default: 42)
  --observe                Enable observability traces (http://localhost:8100)
```
Each run creates a timestamped directory under `logs/`:

```
logs/<timestamp>_<model>/
  summary.json           # Aggregate metrics (delivery rate, constraint scores, final pass rate)
  results.json           # Per-task results
  task_0001_tp_0000.md   # Detailed per-task log with plan and constraint results
  agent_debug.log        # Full agent iteration trace
```
`summary.json` contains the delivery rate, commonsense/hard constraint micro/macro scores, final pass rate, per-level breakdown, and timing.
Each per-task log contains:
- Original query and constraints
- Plan delivered (yes/no) and final pass (yes/no)
- All 8 commonsense + 5 hard constraint results
- The generated JSON itinerary
- Error details if the agent failed
```
Travel Query + Constraints
        |
        v
[Stage 1: Information Gathering]
        |
        v
Plan LLM Call --> Python code using search primitives
        |
        v
Execute: search_flights, search_restaurants,
         search_accommodations, search_attractions,
         get_distance, search_cities
        |
        v
Introspect into TravelPlanContext
(flights_found, restaurants_found, ...)
        |
        v
Evaluate: enough data gathered?
      |          |
      No         Yes
      |          |
next iteration   v
[Stage 2: Plan Assembly]
        |
        v
Plan LLM Call --> Python code building itinerary
        |
        v
Execute: set_plan([day1, day2, ...])
        |
        v
Evaluate: plan submitted? --> Done
```
- 1 LLM call per iteration generates Python code with multiple primitive calls
- ReferenceDatabase indexes each task's pre-collected data (flights, restaurants, etc.) for tool queries
- Symbolic firewall: raw search results stay in app memory; the LLM sees structured context
- set_plan() as terminal primitive: the agent explicitly decides when the plan is complete
- Evaluation against same database: constraint checks validate against the same reference data the tools provide
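The two stages above reduce to a short control loop. A structural sketch with stubbed LLM and primitives — names like `llm_generate_code` are placeholders for illustration, not the framework's actual API:

```python
# Structural sketch of the two-stage flow; every callable here is a stub.
def llm_generate_code(stage: str, context: dict) -> str:
    # Placeholder for the single LLM call per iteration.
    return "search_flights(...)" if stage == "gather" else "set_plan([...])"

def run_task(query: str, max_iterations: int = 10) -> dict:
    context = {"flights_found": 0, "plan": None}

    # Stage 1: gather until the evaluation decides enough data exists.
    for _ in range(max_iterations):
        llm_generate_code("gather", context)   # 1 LLM call per iteration
        context["flights_found"] += 1          # stand-in for executing primitives
        if context["flights_found"] >= 1:      # stand-in evaluation step
            break

    # Stage 2: one assembly call submits the itinerary via set_plan().
    llm_generate_code("assemble", context)
    context["plan"] = [{"days": 1}]            # stand-in for set_plan([...])
    return context

result = run_task("3-day trip from Sarasota to Chicago")
print(result["plan"] is not None)  # True: plan submitted after 2 LLM calls
```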
| Primitive | Stage | Description |
|---|---|---|
| `search_flights(origin, dest, date)` | Gather | Find flights between cities on a date |
| `search_restaurants(city)` | Gather | Find restaurants in a city |
| `search_accommodations(city)` | Gather | Find hotels/apartments in a city |
| `search_attractions(city)` | Gather | Find attractions in a city |
| `get_distance(origin, dest, mode)` | Gather | Get driving/taxi distance and cost |
| `search_cities(state)` | Gather | List cities in a state (for multi-city trips) |
| `set_plan(plan)` | Build | Submit the final day-by-day itinerary |
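In Stage 1 the model emits ordinary Python against these primitives. A hypothetical snippet of the kind it might generate, with the primitives stubbed so it runs standalone (real implementations query the `ReferenceDatabase`):

```python
# Stubbed primitives standing in for the real ReferenceDatabase-backed tools.
def search_flights(origin: str, dest: str, date: str) -> list[dict]:
    return [{"flight": "F0001", "price": 450}]          # hypothetical row

def search_restaurants(city: str) -> list[dict]:
    return [{"name": "Cafe A", "cuisine": "French"}]    # hypothetical row

def search_accommodations(city: str) -> list[dict]:
    return [{"name": "Downtown loft", "price": 180}]    # hypothetical row

# The kind of code one gather iteration emits: several primitive calls
# whose results land in the TravelPlanContext.
context = {
    "flights_found": search_flights("Sarasota", "Chicago", "2022-03-22"),
    "restaurants_found": search_restaurants("Chicago"),
    "accommodations_found": search_accommodations("Chicago"),
}
print(len(context["flights_found"]))  # 1
```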
Compare OpenSymbolicAI against LangChain and CrewAI on the same tasks, same model, side by side. Measures token utilization and reliability across frameworks.
```bash
# LangChain only
uv add --optional langchain "langchain-core>=0.3.0" "langchain-openai>=0.3.0" "langgraph>=0.2.0"

# CrewAI only
uv add --optional crewai "crewai>=0.80.0"

# Both (for full comparison)
uv sync --extra langchain --extra crewai
```

```bash
# All 3 frameworks on the train split (45 tasks)
uv run travelplanner-compare \
    --frameworks opensymbolicai,langchain,crewai \
    --model gpt-4o --provider openai

# Quick 5-task test with LangChain
uv run travelplanner-compare \
    --frameworks langchain --model gpt-4o --provider openai --num 5

# Two-framework head-to-head on easy tasks
uv run travelplanner-compare \
    --frameworks opensymbolicai,langchain --level easy \
    --model gpt-4o --provider openai

# Specific tasks only
uv run travelplanner-compare \
    --frameworks opensymbolicai,crewai \
    --model gpt-4o --provider openai \
    --task-ids tp_0003,tp_0010,tp_0015
```

```
uv run travelplanner-compare [OPTIONS]

Required:
  --model MODEL            LLM model (same for all frameworks)
  --provider {openai,anthropic,fireworks,groq,ollama}
                           LLM provider

Options:
  -f, --frameworks LIST    Comma-separated frameworks (default: all three)
  --split {train,validation,test}   Dataset split (default: train)
  -l, --level {easy,medium,hard}    Filter by difficulty
  -n, --num NUM            Number of tasks (default: all)
  --max-iterations N       Max agent iterations per task (default: 10)
  -p, --parallel N         Parallel workers per framework (default: 1)
  --task-ids IDS           Comma-separated task IDs to run
```
Each run creates a directory under logs/:
```
logs/compare_<timestamp>/
  comparison_report.md     # Side-by-side Markdown tables
  comparison_summary.json  # Machine-readable metrics
  opensymbolicai/
    results.json           # Per-task results
    task_0001_tp_0000.md   # Detailed per-task logs
  langchain/
    results.json
    task_0001_tp_0000.md
  crewai/
    results.json
    task_0001_tp_0000.md
```
- Reliability — delivery rate, final pass rate, commonsense/hard constraint macro rates, error rate
- Token Efficiency — total tokens, avg tokens/task, retrieval vs assembly split, LLM calls/task, estimated cost (USD), cost per passing task
- Timing — avg/p50/p95 wall time per task
| Framework | Retrieval Phase | Assembly Phase | Post-processing |
|---|---|---|---|
| OpenSymbolicAI | GoalSeeking agent iteratively calls search primitives via LLM-generated Python code | DesignExecute agent generates plan via LLM-generated Python code | Shared 8-phase `_fill_missing_fields` |
| LangChain | `create_react_agent` (langgraph) with 6 search tools | Single structured LLM call to generate plan JSON | Same shared post-processing |
| CrewAI | Sequential Crew: Researcher agent with search tools | Planner agent receives research context, outputs plan JSON | Same shared post-processing |
All three frameworks share the same ReferenceDatabase, search primitives, evaluation pipeline, and deterministic post-processing. This isolates the comparison to framework overhead and LLM interaction patterns.
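The fair comparison hinges on a common backend interface. A sketch of what `backend.py`'s `AgentBackend` protocol plausibly looks like — the method and field names here are illustrative, not the repo's actual signatures:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    llm_calls: int = 0

@dataclass
class BackendResult:
    plan: list[dict]                              # delivered day-by-day itinerary
    usage: TokenUsage = field(default_factory=TokenUsage)

class AgentBackend(Protocol):
    name: str
    def run_task(self, task: dict) -> BackendResult: ...

# A trivial conforming backend, used only to show the shape.
class EchoBackend:
    name = "echo"
    def run_task(self, task: dict) -> BackendResult:
        return BackendResult(plan=[{"days": 1}], usage=TokenUsage(llm_calls=2))

result = EchoBackend().run_task({"query": "demo"})
print(result.usage.llm_calls)  # 2
```

Because each backend returns the same result shape, the comparison runner can feed every framework's output through the identical evaluation pipeline.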
```
benchmark-py-TravelPlanner/
  travelplanner_bench/
    __init__.py            # Package exports
    models.py              # TravelPlannerTask, TravelPlanContext, TravelPlannerResult
    data.py                # HuggingFace dataset loader + JSON parsing
    tools.py               # ReferenceDatabase + 6 search tool functions
    agent.py               # TravelPlannerAgent (GoalSeeking)
    evaluation.py          # 8 commonsense + 5 hard constraint checks
    runner.py              # CLI benchmark runner (single framework)
    backend.py             # AgentBackend protocol, TokenUsage, BackendResult
    token_tracking.py      # Model pricing + token extraction helpers
    tool_wrappers.py       # LangChain/CrewAI tool adapters
    comparison_runner.py   # Multi-framework comparison CLI
    comparison_report.py   # Side-by-side Markdown + JSON report generator
    backends/
      __init__.py               # Backend registry
      opensymbolicai_backend.py # Wraps existing TravelPlannerAgent
      langchain_backend.py      # LangChain ReAct agent
      crewai_backend.py         # CrewAI Crew with 2 agents
  tests/
    test_models.py         # Model creation and serialization tests
    test_data.py           # Data parsing tests
    test_tools.py          # ReferenceDatabase and search function tests
    test_evaluation.py     # All 13 constraint checker tests
  logs/                    # Per-run logs
  main.py                  # Entry point
  pyproject.toml
```
```bash
# Run unit tests
uv run pytest

# With coverage
uv run pytest --cov=travelplanner_bench

# Verbose output
uv run pytest -v
```