This repository documents the customized frontier-evals setup used inside
AiScientist, with a focus on running PaperBench from a vendored directory:
AiScientist/
└── benchmark/
    └── frontier-evals/
This copy is not intended to be used as a standalone Git repository. It is
designed to live under AiScientist/benchmark/frontier-evals, and all Git and
Git LFS operations should be executed from the AiScientist repository root.
This integration keeps a narrow, opinionated runtime surface:
- Solver families:
  - AiScientist
  - BasicAgent
  - IterativeAgent
- Supported solver backends:
  - glm-5
  - gemini-3-flash-preview
- Judge backend:
  - gpt-5.4
All preserved runner scripts write their outputs under:
AiScientist/output_dir/run/paper_bench/
The commands in this README assume the following structure:
AiScientist/
├── benchmark/
│   └── frontier-evals/
│       └── project/
│           └── paperbench/
└── output_dir/
Define the following variables before running any commands:
export AISCIENTIST_ROOT="/path/to/AiScientist"
export FE_ROOT="${AISCIENTIST_ROOT}/benchmark/frontier-evals"
export PB_ROOT="${FE_ROOT}/project/paperbench"
You should have the following installed on the host machine:
- git
- git-lfs
- uv
- python support required by uv sync --python=3.11
- docker, if you plan to prebuild runtime images or use container-based flows
Optional but commonly needed:
- A Hugging Face token for tasks that download gated assets
- Access to the model providers used by the solver and judge
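
Before going further, it can help to confirm that the host tools listed above are on PATH. The helper below is a minimal sketch, not part of the repository; the function name missing_tools is hypothetical.

```shell
# Print each requested command that is not found on PATH.
# Returns non-zero if at least one command is missing.
# The body runs in a subshell, so its variables do not leak into the caller.
missing_tools() (
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  exit "$status"
)

# Usage (add docker only if you plan to use container-based flows):
#   missing_tools git git-lfs uv
```
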
Because frontier-evals is vendored into AiScientist, run all LFS commands
from the AiScientist repository root rather than from
benchmark/frontier-evals.
cd "${AISCIENTIST_ROOT}"
git lfs install
git lfs pull --include="benchmark/frontier-evals/project/paperbench/**"
git lfs checkout
This restores the large assets used by PaperBench, including materials under:
- benchmark/frontier-evals/project/paperbench/data/papers/**
- benchmark/frontier-evals/project/paperbench/data/judge_eval/**
- selected LFS-tracked experiment artifacts under project/paperbench/experiments/**
If you see files that still contain Git LFS pointer text such as:
version https://git-lfs.github.com/spec/v1
then the LFS assets have not been fully restored yet.
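
One way to spot unrestored files is to test whether a file begins with the pointer header shown above. This is a sketch under that single assumption; is_lfs_pointer is a hypothetical helper, not a Git LFS command.

```shell
# Succeed if the file starts with the Git LFS pointer header.
# Only the first 42 bytes are read (the length of the header line),
# so large binary assets are not loaded in full.
is_lfs_pointer() {
  [ -f "$1" ] || return 1
  [ "$(head -c 42 "$1" | tr -d '\0')" = "version https://git-lfs.github.com/spec/v1" ]
}

# Usage: list files under the data directory that are still pointers.
#   find "${PB_ROOT}/data" -type f | while IFS= read -r f; do
#     if is_lfs_pointer "$f"; then echo "$f"; fi
#   done
```

If the loop prints any paths, rerun the LFS commands from the AiScientist repository root.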
Move into the PaperBench project directory and create the local environment:
cd "${PB_ROOT}"
uv sync --python=3.11
source .venv/bin/activate
For the current custom tooling stack used by this vendored setup, the following extra packages are also recommended:
uv pip install protobuf==3.20.3
uv pip install omegaconf
uv pip install -U 'volcengine-python-sdk[ark]'
If your internal or private environment requires additional packages, install
them after the base uv sync step.
At minimum, export the following variables:
export AISCIENTIST_ROOT="/path/to/AiScientist"
export FE_ROOT="${AISCIENTIST_ROOT}/benchmark/frontier-evals"
export PB_ROOT="${FE_ROOT}/project/paperbench"
export PAPERBENCH_DATA_DIR="${PB_ROOT}/data"
export HF_TOKEN="<optional_huggingface_token>"
export PB_TOOL_USER="<optional_tool_user>"
Notes:
- PAPERBENCH_DATA_DIR should point to ${PB_ROOT}/data
- HF_TOKEN is optional, but some papers require it for model or dataset access
- PB_TOOL_USER is optional and only needed if your tooling stack uses it
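
A quick sanity check for the data directory variable can catch a wrong path before a long run starts. The function below is an illustrative sketch; check_data_dir is a hypothetical name and is not provided by the repository.

```shell
# Verify that PAPERBENCH_DATA_DIR is set and points to an existing directory.
check_data_dir() {
  if [ -z "${PAPERBENCH_DATA_DIR:-}" ]; then
    echo "PAPERBENCH_DATA_DIR is not set"
    return 1
  fi
  if [ ! -d "${PAPERBENCH_DATA_DIR}" ]; then
    echo "PAPERBENCH_DATA_DIR is not a directory: ${PAPERBENCH_DATA_DIR}"
    return 1
  fi
  echo "PAPERBENCH_DATA_DIR ok: ${PAPERBENCH_DATA_DIR}"
}

# Usage:
#   check_data_dir || echo "fix the export before running any runner script"
```
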
For GLM-5 runs:
export PB_GLM5_AZURE_OPENAI_ENDPOINT="<your_azure_compatible_endpoint>"
export PB_GLM5_AZURE_OPENAI_API_KEY="<your_glm5_key>"
export PB_GLM5_OPENAI_BASE_URL="<your_openai_compatible_base_url>"
export PB_JUDGE_OPENAI_API_KEY="<your_gpt54_judge_key>"
For Gemini runs:
export PB_GEMINI_API_KEY="<your_gemini_key>"
export PB_JUDGE_OPENAI_API_KEY="<your_gpt54_judge_key>"
The Gemini runner defaults to the Google OpenAI-compatible endpoint:
https://generativelanguage.googleapis.com/v1beta/openai/
If you need to override it:
export PB_GEMINI_OPENAI_BASE_URL="<your_gemini_openai_compatible_base_url>"
The following runtime controls are optional, but they are the most useful ones:
export PAPER_SPLIT="all" # e.g. all, lite, debug, dev, subset1, subset2, compare
export RUN_TIME="86400" # per-task wall clock limit, in seconds
export GPU_CANDIDATE_IDS="0,1,2,3"
export GPU_COUNT="1"
export GPU_AUTO_ALLOCATE="true"
export GPU_SHARE_MODE="true"
export RESUME_RUN_GROUP_ID=""
export MAX_RESPONSE_TOKENS="32768"
export JUDGE_MAX_TOKENS="16384"
For AiScientist runs, you can also choose the subagent profile:
export SUBAGENT_CONFIG_PROFILE="default"
If you want to prepare the runtime images in advance:
cd "${PB_ROOT}"
bash paperbench/scripts/build-docker-images.sh
This builds:
- pb-env
- pb-reproducer
If your network requires a proxy, set http_proxy and https_proxy before
running the build script. The script forwards those values into docker build.
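
If you prefer not to export the proxy variables for the whole shell session, they can be scoped to a single command. The wrapper below is a sketch; with_proxy is a hypothetical helper and the proxy address in the usage comment is a placeholder.

```shell
# Run a command with http_proxy and https_proxy set only for that command.
# The body runs in a subshell, so the caller's environment is left unchanged.
with_proxy() (
  proxy_url="$1"
  shift
  export http_proxy="$proxy_url" https_proxy="$proxy_url"
  "$@"
)

# Usage (placeholder proxy address, replace with your own):
#   cd "${PB_ROOT}"
#   with_proxy "http://proxy.example.com:8080" bash paperbench/scripts/build-docker-images.sh
```
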
All commands below should be executed from the PaperBench root:
cd "${PB_ROOT}"
AiScientist with GLM-5:
bash scripts/aiScientist/aisci_glm5.sh
AiScientist with Gemini:
bash scripts/aiScientist/aisci_gemini3.sh
BasicAgent with GLM-5:
bash scripts/basicAgent/basic_all_run_glm5.sh
BasicAgent with Gemini:
bash scripts/basicAgent/basic_all_run_gemini3_boe.sh
IterativeAgent with GLM-5:
bash scripts/iterativeAgent/iterative_run_glm5.sh
IterativeAgent with Gemini:
bash scripts/iterativeAgent/iterative_run_gemini3_boe.sh
All preserved runner scripts write to:
${AISCIENTIST_ROOT}/output_dir/run/paper_bench/
The high-level structure is:
output_dir/run/paper_bench/
└── <paper_split>_run/
├── aiscientist/
├── basicagent/
└── iterativeagent/
Within each family, runs are grouped by model configuration and time limit. Log files are typically written to:
.../log/run_<timestamp>.log
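
To follow a run in progress, you can pick the most recently modified log file in a log directory. This is a small sketch that assumes only the run_<timestamp>.log naming shown above; latest_log is a hypothetical helper.

```shell
# Print the most recently modified run_*.log in the given log directory.
# Prints nothing if the directory has no matching files.
latest_log() {
  ls -t "$1"/run_*.log 2>/dev/null | head -n 1
}

# Usage (replace <run-dir> with the directory of the run you want to watch):
#   tail -f "$(latest_log "<run-dir>/log")"
```
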
The scripts also generate:
${PB_ROOT}/paperbench/solvers/agent.env
This file is created automatically from the current shell environment and normally does not need to be edited by hand.
To resume an interrupted run group, set:
export RESUME_RUN_GROUP_ID="<existing_run_group_id>"
Then rerun the same script that created the original run.
This repository includes a score aggregation helper:
${PB_ROOT}/scripts/eval/paper_bench.py
It:
- reads one or more run-group directories
- detects re_grade/ automatically
- prefers regraded scores when available
- writes all_result.json into each run-group directory
cd "${PB_ROOT}"
./.venv/bin/python scripts/eval/paper_bench.py \
  "${AISCIENTIST_ROOT}/output_dir/run/paper_bench/all_run/aiscientist/default_glm-5_gpt-5.4_86400/<run-group-id>"
Example for AiScientist + GLM-5:
cd "${PB_ROOT}"
RUN_BASE="${AISCIENTIST_ROOT}/output_dir/run/paper_bench/all_run/aiscientist/default_glm-5_gpt-5.4_86400"
LATEST_RUN_GROUP="$(ls -dt "${RUN_BASE}"/*run-group* | head -n 1)"
./.venv/bin/python scripts/eval/paper_bench.py "${LATEST_RUN_GROUP}"
The resulting summary file is written to:
<run-group-dir>/all_result.json
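
To aggregate every run group under one configuration rather than only the latest, the directories can be listed and passed to the helper one at a time. This is a sketch: list_run_groups is a hypothetical function, and it assumes the same *run-group* naming pattern used in the example above.

```shell
# Print every run-group directory under a base directory, newest first.
# Prints nothing if no directory matches.
list_run_groups() {
  ls -dt "$1"/*run-group* 2>/dev/null
}

# Usage: aggregate each run group in turn, from ${PB_ROOT}.
#   list_run_groups "${RUN_BASE}" | while IFS= read -r group; do
#     ./.venv/bin/python scripts/eval/paper_bench.py "${group}"
#   done
```

Each invocation writes its own all_result.json into the run-group directory it was given.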
If you want the smallest working example for AiScientist + glm-5 + gpt-5.4:
export AISCIENTIST_ROOT="/path/to/AiScientist"
export FE_ROOT="${AISCIENTIST_ROOT}/benchmark/frontier-evals"
export PB_ROOT="${FE_ROOT}/project/paperbench"
cd "${AISCIENTIST_ROOT}"
git lfs install
git lfs pull --include="benchmark/frontier-evals/project/paperbench/**"
git lfs checkout
cd "${PB_ROOT}"
uv sync --python=3.11
source .venv/bin/activate
export PAPERBENCH_DATA_DIR="${PB_ROOT}/data"
export PB_GLM5_AZURE_OPENAI_ENDPOINT="<your_endpoint>"
export PB_GLM5_AZURE_OPENAI_API_KEY="<your_glm5_key>"
export PB_GLM5_OPENAI_BASE_URL="<your_base_url>"
export PB_JUDGE_OPENAI_API_KEY="<your_gpt54_judge_key>"
export PAPER_SPLIT="all"
bash scripts/aiScientist/aisci_glm5.sh
Troubleshooting:
- If PAPERBENCH_DATA_DIR is reported missing, verify that it points to ${PB_ROOT}/data
- If a file still looks like a Git LFS pointer, rerun git lfs pull and git lfs checkout from the AiScientist root
- If custom web or tool-related modules fail to import, your environment is missing extra dependencies beyond the base uv sync
- If the judge fails, inspect the corresponding run-group logs and grader artifacts under the output directory
- If Docker builds are slow or failing, verify your network and proxy settings