RUCAIBox/ClawGym-Bench

About ClawGym-Bench

ClawGym-Bench is a diagnostic benchmark of 200 instances for Claw-style agents. Each task contains a user instruction, mock workspace resources, and a task-specific verifier.

  • 156 tasks use code-based verification.
  • 44 tasks use hybrid verification, combining code checks with rubric-based judgment.
  • Hybrid scoring uses 0.7 weight for code-based verification and 0.3 weight for rubric-based verification.
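The weighting in the list above can be illustrated with a minimal sketch. The function name `task_score` and the assumption that both component scores lie in [0, 1] are hypothetical and not taken from the repository; only the 0.7 and 0.3 weights come from the description.

```python
from typing import Optional

# Weights stated in the benchmark description for hybrid tasks.
CODE_WEIGHT = 0.7
RUBRIC_WEIGHT = 0.3


def task_score(code_score: float, rubric_score: Optional[float] = None) -> float:
    """Combine verifier outputs into one task score (hypothetical helper).

    Assumes both scores are already normalized to [0, 1].
    Code-only tasks pass no rubric score and keep the code score unchanged.
    """
    if rubric_score is None:
        return code_score
    return CODE_WEIGHT * code_score + RUBRIC_WEIGHT * rubric_score
```

Under these assumptions, a hybrid task that passes every code check but earns half of the rubric credit would score 0.7 * 1.0 + 0.3 * 0.5 = 0.85.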

The benchmark is selected through difficulty-aware filtering and human-LLM review. It covers six workspace-grounded categories:


Category              # Tasks
Product. & Collab.    44
Systems & Auto.       42
Analysis & Reason.    35
Content & Domain      28
Planning & Knowl.     26
Software Dev.         25

Benchmark data is available in ClawGym-Bench.

Performance on ClawGym-Bench

[Figure: overall performance on ClawGym-Bench]

Evaluation Usage

# step 1: install OpenClaw
curl -fsSL https://openclaw.ai/install.sh | bash

# step 2: configure the model
# Deploy a model in OpenClaw. If you use a local model, you can serve it with SGLang or vLLM.

# step 3: set the parameters and run the script. Benchmark data is in data/benchmark_data.jsonl
cd ClawGym-Bench/evaluation/localclawbench
bash scripts/eval.sh

# Note: the code checker for each task is provided in "input_files" under the file path "reward/test.py",
# but it is not exposed in the workspace during task execution. It is run only after the model finishes
# the task, as a post-hoc evaluation, to avoid reward hacking.
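
The separation described in the note above can be sketched in Python. This is an illustration under stated assumptions, not the repository's implementation: it assumes that "input_files" is a mapping from relative file path to file content, and the helper name `split_input_files` is hypothetical. Only the field name "input_files" and the path "reward/test.py" come from the note.

```python
from typing import Dict, Tuple

# The note places the checker at "reward/test.py"; treating everything under
# "reward/" as hidden is an assumption made for this sketch.
HIDDEN_PREFIX = "reward/"


def split_input_files(
    input_files: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split a task's files into agent-visible and held-back sets.

    Assumes input_files maps relative paths to file contents (hypothetical schema).
    Returns (workspace_files, hidden_files).
    """
    workspace: Dict[str, str] = {}
    hidden: Dict[str, str] = {}
    for path, content in input_files.items():
        target = hidden if path.startswith(HIDDEN_PREFIX) else workspace
        target[path] = content
    return workspace, hidden


# Usage with a made-up task record: only workspace_files would be written to
# the agent's workspace; hidden_files would be used after the run finishes.
example_files = {"notes.md": "meeting notes", "reward/test.py": "assert True"}
workspace_files, hidden_files = split_input_files(example_files)
```

Keeping the checker out of the workspace means the agent cannot read or modify it during the task, which is the property the note relies on.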
