GitHub - RUCAIBox/ClawGym-Bench

About ClawGym-Bench

ClawGym-Bench is a diagnostic benchmark of 200 instances for Claw-style agents. Each task contains a user instruction, mock workspace resources, and a task-specific verifier.

156 tasks use code-based verification.
44 tasks use hybrid verification, combining code checks with rubric-based judgment.
Hybrid scoring uses 0.7 weight for code-based verification and 0.3 weight for rubric-based verification.
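
The weighting above can be illustrated with a minimal sketch. It assumes both component scores are normalized to the range [0, 1]; the function and constant names are hypothetical and are not taken from the ClawGym-Bench evaluation code.

```python
# Hypothetical sketch of the hybrid scoring rule described above.
# Assumption: both component scores are normalized to [0, 1].
CODE_WEIGHT = 0.7    # weight for code-based verification (from the README)
RUBRIC_WEIGHT = 0.3  # weight for rubric-based verification (from the README)


def hybrid_score(code_score: float, rubric_score: float) -> float:
    """Return the weighted combination of the two verification scores."""
    for name, value in (("code_score", code_score), ("rubric_score", rubric_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return CODE_WEIGHT * code_score + RUBRIC_WEIGHT * rubric_score


# A task that passes every code check but earns half of the rubric credit:
print(round(hybrid_score(1.0, 0.5), 2))  # 0.85
```

Under this reading, a hybrid task that fails every code check can score at most 0.3.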

The benchmark tasks are selected through difficulty-aware filtering and human-LLM review. They cover six workspace-grounded categories:

Category	Product. & Collab.	Systems & Auto.	Analysis & Reason.	Content & Domain	Planning & Knowl.	Software Dev.
# Tasks	44	42	35	28	26	25

Benchmark data is available in ClawGym-Bench.

Performance on ClawGym-Bench

Evaluation Usage

# step 1: install OpenClaw
curl -fsSL https://openclaw.ai/install.sh | bash

# step 2: configure the model
# Deploy a model in OpenClaw. If you use a local model, you can serve it with SGLang or vLLM.

# step 3: set the parameters and run the script. Benchmark data is in data/benchmark_data.jsonl
cd ClawGym-Bench/evaluation/localclawbench
bash scripts/eval.sh

# Note: the code checker is provided in "input_files" under the file path "reward/test.py", but it is not exposed in the workspace during task execution. It is run only after the model finishes the task, for post-hoc evaluation, to avoid reward hacking.
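
The separation described in the note can be sketched as follows. This is an illustration only: it assumes that "input_files" is a mapping from relative file path to file content, which the README does not confirm, and the helper name split_input_files is hypothetical.

```python
# Hypothetical sketch: withhold the checker from the agent's workspace.
# Assumption: "input_files" maps relative paths to file contents; the real
# schema of data/benchmark_data.jsonl may differ.
import json
from typing import Dict, Optional, Tuple

CHECKER_PATH = "reward/test.py"  # checker path named in the README


def split_input_files(
    input_files: Dict[str, str],
) -> Tuple[Dict[str, str], Optional[str]]:
    """Return (files visible to the agent, checker source held back for later)."""
    workspace = {path: body for path, body in input_files.items() if path != CHECKER_PATH}
    checker = input_files.get(CHECKER_PATH)
    return workspace, checker


# Made-up record, used only to show the split.
record = json.loads(
    '{"input_files": {"notes.md": "hello", "reward/test.py": "assert True"}}'
)
workspace, checker = split_input_files(record["input_files"])
print(sorted(workspace))  # ['notes.md']
```

Only the workspace files would be materialized before the agent runs. The checker source would be written out and executed after the run ends.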

