@ZixuanZZhang
Summary

Adds environment support for Terminal-Bench 2.0, a benchmark for evaluating AI agents on real-world terminal tasks. Each task runs in an isolated Docker container, and the agent interacts with it via shell commands. The same environment can also be used for the Terminal-Bench benchmark.

Changes (new environment: TerminalBenchEnv):

  • Docker-based environment using Harbor for container management
  • Provides execute_command tool for shell interaction within containers
  • Automatic container lifecycle management (build, start, cleanup)
  • Binary reward function based on test script execution
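
To illustrate the shape of the features listed above, here is a minimal sketch, not the code in this PR. It assumes a hypothetical `LocalSandbox` class that runs commands in a throwaway temporary directory through `subprocess` instead of a Harbor-managed Docker container, so it provides no real isolation. The names `LocalSandbox`, `CommandResult`, and `binary_reward` are illustrative and do not come from TerminalBenchEnv or Harbor; only the `execute_command` name is taken from the description.

```python
import shutil
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class CommandResult:
    """Exit code and combined stdout/stderr of one shell command."""
    exit_code: int
    output: str


class LocalSandbox:
    """Hypothetical stand-in for a task container.

    Assumption: a temporary directory replaces the Docker container,
    so this only mimics the lifecycle (start, use, cleanup).
    """

    def __enter__(self):
        # "Start": create a fresh working directory for the task.
        self.workdir = tempfile.mkdtemp(prefix="tb_sketch_")
        return self

    def __exit__(self, exc_type, exc, tb):
        # "Cleanup": always remove the working directory, even on errors.
        shutil.rmtree(self.workdir, ignore_errors=True)
        return False

    def execute_command(self, command: str, timeout: float = 30.0) -> CommandResult:
        """Run one shell command inside the sandbox directory."""
        try:
            proc = subprocess.run(
                command,
                shell=True,
                cwd=self.workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # 124 follows the convention of the coreutils `timeout` tool.
            return CommandResult(exit_code=124, output="command timed out")
        return CommandResult(exit_code=proc.returncode, output=proc.stdout + proc.stderr)


def binary_reward(sandbox: LocalSandbox, test_command: str) -> float:
    """Return 1.0 if the test command exits with status 0, else 0.0."""
    return 1.0 if sandbox.execute_command(test_command).exit_code == 0 else 0.0


if __name__ == "__main__":
    with LocalSandbox() as box:
        # The "agent" acts through shell commands only.
        box.execute_command("echo done > result.txt")
        # The "test script" here is a single shell check.
        print(binary_reward(box, "test -f result.txt"))
```

The context manager mirrors the automatic cleanup described in the list: the working directory is removed whether or not the episode raises. The reward is all-or-nothing by construction, since it depends only on the exit status of the test command.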

Test results:

All existing unit tests pass (102 passed in 4.15s)

Additional dependencies:

Requires Harbor for Docker environment management.

Contributor

@Lawhy Lawhy left a comment

LGTM!

@Lawhy Lawhy merged commit c637e83 into horizon-rl:main Feb 11, 2026
4 checks passed
2 participants