An open-source benchmark framework for running agents against real GitHub software releases, letting agents explore the live environment, discover latent bugs, and receive verifier-backed QA scores.
The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for LLMs. A GBQA task points to a real GitHub software release, defines how that software should run in an isolated sandbox, exposes supported interaction modes, and provides verifier-owned ground truth for scoring.
GBQA requires Python 3.12 or newer.
pip install -e .
Create a root .env file from the template:
cp .env.example .env
Fill in the required runtime fields:
DAYTONA_API_KEY=
API_KEY=
BASE_URL=https://zenmux.ai/api/v1
MODEL_NAME=
GITHUB_TOKEN=
Run the default GBQA Harbor agent against a task package in a remote Daytona sandbox:
python -m gbqa.cli.harbor_run run \
-p gbqa/tasks/<task-id> \
-e daytona \
--agent-import-path gbqa.harbor.agent:GBQAHarborAgent \
--ak interaction_mode=api \
--ak max_steps=10
Use browser interaction by switching the interaction mode:
python -m gbqa.cli.harbor_run run \
-p gbqa/tasks/<task-id> \
-e daytona \
--agent-import-path gbqa.harbor.agent:GBQAHarborAgent \
--ak interaction_mode=browser \
--ak max_steps=10
Warning for computer_use: computer-use (experimental) needs a separate GUI/Cua environment image, so we recommend using python -m gbqa.cli.harbor_run run for stable execution. Plain harbor run cannot handle environment image selection and may raise errors.
GBQA's gbqa.cli.harbor_run wrapper loads the root .env and forwards all arguments to Harbor. When a local path or registered dataset contains many task packages, Harbor can launch multiple Daytona sandboxes at the same time and run one evaluation per task environment.
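Because the wrapper reads the root .env before handing off to Harbor, it can help to confirm that file is complete before launching sandboxes. The following is a minimal sketch of such a check, not the wrapper's actual loading code: `parse_env_text` and `missing_fields` are hypothetical helper names, and the parser handles only simple KEY=VALUE lines.

```python
# Field names come from the .env template shown earlier in this README.
REQUIRED_FIELDS = (
    "DAYTONA_API_KEY",
    "API_KEY",
    "BASE_URL",
    "MODEL_NAME",
    "GITHUB_TOKEN",
)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so URLs and tokens containing "=" survive.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_fields(values: dict[str, str]) -> list[str]:
    """Return the required fields that are absent or empty, in template order."""
    return [name for name in REQUIRED_FIELDS if not values.get(name)]
```

Calling `missing_fields(parse_env_text(Path(".env").read_text()))` from the repository root (with `from pathlib import Path`) would list any fields still left blank.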
For example, once gbqa/tasks contains many verified task packages, run up to 100 task evaluations concurrently:
python -m gbqa.cli.harbor_run run \
-p gbqa/tasks \
-e daytona \
--agent-import-path gbqa.harbor.agent:GBQAHarborAgent \
--ak interaction_mode=api \
--ak max_steps=10 \
--n-tasks 100 \
--n-concurrent 100
Here --n-concurrent controls how many Harbor trials can run at once. In the Daytona path, that means many independent remote sandboxes can be active in parallel. It is not intended to create multiple concurrent agents inside the same task sandbox.
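The concurrency limit described above can be pictured as a semaphore around independent trials. This is an illustrative sketch only, not Harbor's scheduler: `run_trials` is a hypothetical function, and the `asyncio.sleep` call stands in for starting a sandbox, running the agent, and running the verifier.

```python
import asyncio


async def run_trials(task_ids: list[str], n_concurrent: int) -> tuple[list[str], int]:
    """Run one placeholder trial per task, at most n_concurrent at a time.

    Returns the finished task ids (in input order) and the peak number of
    trials that were in flight simultaneously.
    """
    semaphore = asyncio.Semaphore(n_concurrent)
    in_flight = 0
    peak = 0

    async def run_one(task_id: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once n_concurrent trials are active
            in_flight += 1
            peak = max(peak, in_flight)
            # Placeholder for one isolated trial in its own sandbox.
            await asyncio.sleep(0.01)
            in_flight -= 1
            return task_id

    finished = await asyncio.gather(*(run_one(t) for t in task_ids))
    return list(finished), peak
```

Each trial maps to exactly one task id, which mirrors the point that the limit bounds the number of sandboxes running in parallel and does not place several agents inside one sandbox.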
In Harbor benchmark runs, evaluation is performed automatically by the verifier phase after the agent writes normalized artifacts.
- Agent artifacts: /logs/agent/gbqa/run.json, /logs/agent/gbqa/bugs.json, /logs/agent/gbqa/steps.jsonl
- Harbor reward outputs: /logs/verifier/reward.txt, /logs/verifier/reward.json
- Full GBQA evaluation payload: /logs/verifier/gbqa_result.json
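Once a run has finished, the verifier files listed above can be inspected with a few lines of Python. This sketch rests on two assumptions that are not stated in this README: that reward.txt holds a single numeric value, and that the sandbox's /logs directory has been copied to a local directory (the hypothetical `logs_root` argument). The schema of gbqa_result.json is not described here, so the payload is returned as parsed JSON without further interpretation.

```python
import json
from pathlib import Path


def read_verifier_outputs(logs_root: Path) -> dict:
    """Collect the verifier outputs from a local copy of one trial's /logs."""
    verifier_dir = logs_root / "verifier"
    reward_text = (verifier_dir / "reward.txt").read_text(encoding="utf-8").strip()
    result_path = verifier_dir / "gbqa_result.json"
    payload = None
    if result_path.exists():
        payload = json.loads(result_path.read_text(encoding="utf-8"))
    return {
        # Assumption: reward.txt contains one number and nothing else.
        "reward": float(reward_text),
        # Parsed as generic JSON; no particular keys are assumed.
        "gbqa_result": payload,
    }
```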
Each benchmark task is a Harbor-compatible package under gbqa/tasks/<task-id>. The task package defines the GitHub software release, sandbox runtime assets, interaction modes, verifier entrypoint, ground-truth bug file, and artifact contract.
Environment discovery and preparation live outside the runtime package in environment/. This offline toolchain searches GitHub repositories, detects deployable sub-environments, filters and ranks candidates, runs optional Daytona deployment verification, supports human review, and exports approved task packages into gbqa/tasks.
python -m environment.sourcing.cli run \
--provider github \
--query "archived:false fork:false stars:>=10 mirror:false" \
--limit 500 \
--top-k 100 \
--output-dir environment/catalog/runs/dev
python -m environment.export.cli generate \
--input environment/catalog/runs/dev/approved_task_seeds.jsonl \
--output gbqa/tasks
- GBQAHarborAgent as the default custom QA agent wrapper.
- Example real GitHub software environment: Dark Castle in a remote Daytona sandbox.
- API and browser interaction modes for the example task.
- Harbor-compatible verifier and reward outputs.
- Add CodexHarborAgent and ClaudeCodeHarborAgent.
- Support local sandbox + colocated agent.
- Support local agent + remote sandbox.
- Add more verified benchmark environments and task manifests.
- Scale parallel evaluation in Daytona sandboxes.
- Support API, browser, computer-use, and mixed interaction methods.
- Extend sandbox support from Linux toward Windows and macOS.
- Run broader LLM evaluation experiments and release a leaderboard.
- Collect trajectory data.
- Standardize reward signals.
- Support RL training and optimization workflows.
Contributions are welcome. The highest-priority areas are new Harbor-compatible task packages, additional agent harness adapters, verifier improvements, and sandbox/runtime robustness.