Commit 2775e2a

Prepare v0.0.42 benchmark tool updates
1 parent b22fe26 commit 2775e2a

19 files changed

Lines changed: 526 additions & 107 deletions

README.md

Lines changed: 12 additions & 0 deletions
@@ -86,6 +86,10 @@ If you are new to the project, the recommended reading order is:

 ## 📰 News

+🚩 **Update** (2026-05-22) Added a real API smoke test that runs the five SGI benchmark README server commands and OpenAI SDK examples, then validates each expected final-answer format.
+
+🚩 **Update** (2026-05-22) CLI and API deployments can now expose an explicit complete tool set with repeatable `--tool NAME`, useful when a run needs a smaller or benchmark-specific tool surface.
+
 🚩 **Update** (2026-05-21) ResearchHarness is packaged for one-command installation with `pip install researchharness`. The existing source-tree commands remain compatible, and releases can publish to PyPI automatically from GitHub Releases.

 🚩 **Update** (2026-05-21) The Python import API now exposes the same core runtime controls as CLI mode: default workspace, role prompt strings/files, image inputs, explicit tool sets, optional extra tools, and decorated custom function tools.
@@ -736,6 +740,10 @@ deployment, and QA/VQA benchmark deployment. Advanced users can still combine
 `--role-prompt-file`, `--input-wrapper`, and `--output-wrapper` manually when a
 custom application needs only part of the benchmark behavior.

+For benchmark deployments that need a smaller tool surface, pass repeatable
+`--tool NAME` flags. This defines the complete exposed tool set for each run
+and cannot be combined with `--extra-tool`.
+
 ### API Concurrency

 The API endpoint remains synchronous from the client's perspective, but long
@@ -1064,6 +1072,9 @@ repository for local images.

 More detailed tool documentation lives in [agent_base/tools/README.md](agent_base/tools/README.md).

+Tool-use requests should use the native tool calling interface. User-required
+final answer formats remain ordinary final-answer text.
+
 Tool calls follow a single-request contract: `WebSearch.query`,
 `ScholarSearch.query`, and `WebFetch.url` each accept one string, not a list.
 When the model needs multiple independent searches, page fetches, file reads,
@@ -1146,6 +1157,7 @@ RESEARCHHARNESS_TEST_PYTHON="/path/to/your/python"
 | Optional extra-tool checks | `python3 tests/test_extra_tools.py` |
 | Python import API and custom-tool checks | `python3 tests/test_python_api_tools.py` |
 | OpenAI-compatible API checks | `python3 tests/test_openai_api_checks.py` |
+| SGI benchmark README server/example smoke test | `python3 tests/test_sgi_benchmark_readmes.py` |
 | Local frontend checks | `python3 tests/test_frontend_checks.py` |
 | End-to-end multi-tool test | `python3 tests/test_end_to_end_multitool.py` |
 | End-to-end local file discovery test | `python3 tests/test_end_to_end_glob_grep.py` |
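
The README text in this diff describes a repeatable `--tool NAME` flag that cannot be combined with `--extra-tool`. A minimal sketch of that flag pattern with `argparse` is shown below. The helper name `parse_tool_flags` is hypothetical, and the sketch covers only these two flags, not the project's full CLI.

```python
import argparse


def parse_tool_flags(argv):
    """Hypothetical helper: parse repeatable --tool / --extra-tool flags."""
    parser = argparse.ArgumentParser(description="Sketch of the two tool flags only.")
    # action="append" collects every occurrence of the flag into one list.
    parser.add_argument("--tool", action="append", default=[], dest="tool_names", metavar="NAME")
    parser.add_argument("--extra-tool", action="append", default=[], dest="extra_tools", metavar="NAME")
    args = parser.parse_args(argv)
    # An explicit tool set is complete by definition, so extras are rejected.
    if args.tool_names and args.extra_tools:
        raise ValueError("--tool defines the complete tool set and cannot be combined with --extra-tool.")
    return args.tool_names, args.extra_tools


print(parse_tool_flags(["--tool", "Read", "--tool", "Bash"]))  # (['Read', 'Bash'], [])
```

Passing both `--tool` and `--extra-tool` to this sketch raises `ValueError`, mirroring the rule stated in the README text.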

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-v0.0.41
+v0.0.42

agent_base/prompts/system_base.md

Lines changed: 2 additions & 2 deletions
@@ -66,7 +66,8 @@ You are a capable all-purpose AI assistant. You do far more than simple question

 ## Native Tool Calling Contract

-- Use the API's native tool calling interface when tools are needed. Do not write pseudo-XML, pseudo-tool JSON, or tag-based tool requests in plain text.
+- Use the API's native tool calling interface when tools are needed.
+- If the user explicitly requires a special final-answer format, follow that format as ordinary answer text.
 - If a turn includes native tool calls, that turn is a tool-use turn. Any accompanying text is treated as working context, not as the final result.
 - Multiple tool calls in one turn are allowed only when they are independent.
 - If tool B depends on the output of tool A, do not request them in the same turn. Wait for tool A's result first.
@@ -75,7 +76,6 @@ You are a capable all-purpose AI assistant. You do far more than simple question
 - Keep tool turns structured. Brief text may explain the current tool step, but the tool call itself is the action.
 - When no more tools are needed, return the final result as plain text.
 - If the user requires a strict format such as JSON, output only that payload as the plain final result text.
-- Do not emit legacy protocol tags such as `<tool_call>`, `<tool_response>`, `<think>`, or `<answer>`.

 ## Tool Selection And Routing

agent_base/react_agent.py

Lines changed: 36 additions & 48 deletions
@@ -378,19 +378,6 @@ def compaction_trace_payload(
     }


-def legacy_protocol_error(content: str) -> Optional[str]:
-    stripped = content.lstrip()
-    if stripped.startswith("<tool_call>"):
-        return "assistant emitted deprecated text <tool_call> protocol"
-    if stripped.startswith("<tool_response>"):
-        return "assistant emitted deprecated text <tool_response> protocol"
-    if stripped.startswith("<think>"):
-        return "assistant emitted deprecated text <think> protocol"
-    if stripped.startswith("<answer>"):
-        return "assistant emitted deprecated text <answer> protocol"
-    return None
-
-
 def tool_schema(tool: Any) -> dict[str, Any]:
     return {
         "type": "function",
@@ -476,6 +463,13 @@ def resolve_extra_tool_names(extra_tools: Optional[Sequence[str]]) -> list[str]:
     return resolved


+def validate_named_tools(tool_names: Optional[Sequence[str]]) -> list[str]:
+    resolved = resolved_tool_names(tool_names)
+    if tool_names is not None:
+        available_tool_schemas(resolved)
+    return resolved
+
+
 def default_tool_names(*, include_ask_user: bool = True, extra_tools: Optional[Sequence[str]] = None) -> list[str]:
     names = [name for name in AVAILABLE_TOOL_MAP if include_ask_user or name != "AskUser"]
     for name in resolve_extra_tool_names(extra_tools):
@@ -1266,39 +1260,6 @@ def finalize_interrupted() -> dict[str, Any]:
                     termination = "llm api error"
                 return finalize(result_text, termination, error=result_text)

-            deprecated_protocol = legacy_protocol_error(assistant_text)
-            if deprecated_protocol is not None:
-                trace_writer.append(
-                    role="assistant",
-                    text=assistant_text.strip(),
-                    turn_index=round_index,
-                    tool_call_ids=assistant_tool_call_ids,
-                    tool_names=assistant_tool_names,
-                    tool_arguments=assistant_tool_arguments,
-                    finish_reason=finish_reason,
-                    error=deprecated_protocol,
-                )
-                retry_assistant_message = assistant_retry_history_message(
-                    content=assistant_content,
-                    reasoning_content=assistant_reasoning,
-                )
-                if retry_assistant_message is not None:
-                    messages.append(retry_assistant_message)
-                correction_text = (
-                    "Error: The previous assistant turn used the deprecated text-tag protocol. "
-                    "Do not emit <tool_call>, <tool_response>, <think>, or <answer> in plain text. "
-                    "Use only the native tool calling interface when tools are needed, or plain final result text when no more tools are needed."
-                )
-                messages.append(
-                    {
-                        "role": "user",
-                        "content": correction_text,
-                    }
-                )
-                trace_writer.append(role="user", text=correction_text, turn_index=round_index)
-                persist_state(error=deprecated_protocol)
-                continue
-
             if finish_reason == "length" and assistant_tool_calls:
                 protocol_error = "assistant tool call turn was truncated by output limit"
                 trace_writer.append(
@@ -1528,7 +1489,7 @@ def resolve_agent_class_for_role_prompt_files(role_prompt_files: Sequence[str])
     return MultiTurnReactAgent


-def _parse_cli_args(argv: list[str]) -> tuple[str, Optional[str], Optional[str], str, list[str], list[str], Optional[bool], list[str]]:
+def _parse_cli_args(argv: list[str]) -> tuple[str, Optional[str], Optional[str], str, list[str], list[str], Optional[bool], list[str], list[str]]:
     parser = argparse.ArgumentParser(description="Run the local agent directly from agent_base.react_agent.")
     parser.add_argument("prompt", nargs="*", help="Prompt text.")
     parser.add_argument("--prompt-file", help="Optional UTF-8 text file containing the prompt.")
@@ -1568,7 +1529,17 @@ def _parse_cli_args(argv: list[str]) -> tuple[str, Optional[str], Optional[str],
         metavar="NAME",
         help="Enable one optional extra tool for this run. Currently supported: str_replace_editor. May be passed multiple times.",
     )
+    parser.add_argument(
+        "--tool",
+        action="append",
+        default=[],
+        dest="tool_names",
+        metavar="NAME",
+        help="Expose an explicit complete tool set for this run. May be passed multiple times. Cannot be combined with --extra-tool.",
+    )
     args = parser.parse_args(argv)
+    if args.tool_names and args.extra_tools:
+        raise ValueError("--tool defines the complete tool set and cannot be combined with --extra-tool.")

     prompt_text = ""
     if args.prompt_file:
@@ -1587,6 +1558,7 @@ def _parse_cli_args(argv: list[str]) -> tuple[str, Optional[str], Optional[str],
         list(args.role_prompt_files),
         [path for group in args.image_paths for path in group],
         args.chat,
+        validate_named_tools(args.tool_names) if args.tool_names else [],
         resolve_extra_tool_names(args.extra_tools),
     )

@@ -1595,11 +1567,27 @@ def main(argv: Optional[list[str]] = None) -> int:
     load_default_dotenvs()
     try:
         require_required_env("ResearchHarness agent")
-        prompt_text, trace_dir, workspace_root, role_prompt, role_prompt_files, image_paths, chat_arg, extra_tools = _parse_cli_args(argv or sys.argv[1:])
+        (
+            prompt_text,
+            trace_dir,
+            workspace_root,
+            role_prompt,
+            role_prompt_files,
+            image_paths,
+            chat_arg,
+            tool_names,
+            extra_tools,
+        ) = _parse_cli_args(argv or sys.argv[1:])
         agent_cls = resolve_agent_class_for_role_prompt_files(role_prompt_files)
         forbidden_tools = set(getattr(agent_cls, "forbidden_tool_names", set()))
+        forbidden_requested_tools = sorted(set(tool_names) & forbidden_tools)
+        if forbidden_requested_tools:
+            raise ValueError(f"Tools are not allowed in this run: {forbidden_requested_tools}")
         agent = agent_cls(
             function_list=(
+                tool_names
+                if tool_names
+                else
                 default_tool_names(include_ask_user="AskUser" not in forbidden_tools, extra_tools=extra_tools)
                 if extra_tools
                 else None
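
The `main` changes above choose between an explicit tool list and the default set, and reject tools that the selected agent class forbids. A simplified sketch of that decision follows. The registries `CORE_TOOLS` and `EXTRA_TOOLS` and the function `select_tools` are hypothetical stand-ins. Unlike the diff, which passes `None` when neither flag is given, this sketch always returns a concrete list.

```python
from typing import Optional, Sequence

# Hypothetical registries; the real harness keeps its own tool maps.
CORE_TOOLS = ("Read", "Bash", "WebSearch", "AskUser")
EXTRA_TOOLS = ("str_replace_editor",)


def select_tools(
    tool_names: Optional[Sequence[str]] = None,
    extra_tools: Optional[Sequence[str]] = None,
    forbidden: frozenset = frozenset(),
) -> list:
    """Return the tool list for one run (illustrative, not the project's code)."""
    explicit = list(tool_names or [])
    extras = list(extra_tools or [])
    if explicit and extras:
        raise ValueError("an explicit tool set cannot be combined with extra tools")
    if explicit:
        # Assumption: explicit names may come from either registry.
        unknown = sorted(set(explicit) - set(CORE_TOOLS) - set(EXTRA_TOOLS))
        if unknown:
            raise ValueError(f"Unknown tools: {unknown}")
        blocked = sorted(set(explicit) & forbidden)
        if blocked:
            raise ValueError(f"Tools are not allowed in this run: {blocked}")
        return explicit
    # Default path: every core tool that is not forbidden, then the extras.
    names = [name for name in CORE_TOOLS if name not in forbidden]
    for name in extras:
        if name not in names:
            names.append(name)
    return names


print(select_tools(["Read", "Bash"]))
print(select_tools(extra_tools=["str_replace_editor"], forbidden=frozenset({"AskUser"})))
```

The first call prints the explicit list unchanged. The second prints the default set without `AskUser`, followed by the extra tool.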

agent_base/tools/README.md

Lines changed: 9 additions & 0 deletions
@@ -578,6 +578,15 @@ python3 run_server.py --api-runs-dir ./api_runs --extra-tool str_replace_editor
 python3 run_frontend.py --extra-tool str_replace_editor
 ```

+If you need to shrink the exposed tool surface instead of appending optional
+tools, use repeatable `--tool NAME` flags in CLI/API mode. This defines the
+complete tool set and cannot be combined with `--extra-tool`:
+
+```bash
+python3 run_agent.py "..." --workspace-root ./workspace --tool Read --tool Bash
+python3 run_server.py --api-runs-dir ./api_runs --tool Read --tool Bash
+```
+
 Behavior:

 - Requires absolute paths inside the active workspace.

api/openai_server.py

Lines changed: 15 additions & 2 deletions
@@ -19,6 +19,7 @@

 from agent_base.react_agent import (
     MultiTurnReactAgent,
+    available_tool_schemas,
     assistant_text_content,
     default_tool_names,
     default_llm_config,
@@ -108,9 +109,15 @@ class ServerConfig:
     output_wrapper: bool = False
     max_concurrent_runs: int = DEFAULT_MAX_CONCURRENT_RUNS
     extra_tools: tuple[str, ...] = ()
+    tool_names: tuple[str, ...] = ()

     def __post_init__(self) -> None:
         self.max_concurrent_runs = positive_int(self.max_concurrent_runs, "max_concurrent_runs")
+        self.tool_names = tuple(str(name).strip() for name in self.tool_names if str(name).strip())
+        if self.tool_names and self.extra_tools:
+            raise ValueError("tool_names defines the complete tool set and cannot be combined with extra_tools.")
+        if self.tool_names:
+            available_tool_schemas(self.tool_names)
         self.extra_tools = tuple(resolve_extra_tool_names(self.extra_tools))


@@ -483,9 +490,12 @@ def run_chat_completion(payload: dict[str, Any], config: ServerConfig) -> dict[s
             f"Backend model {backend_model!r} does not support image content parts.",
         )

-    tool_names = default_tool_names(include_ask_user=False, extra_tools=config.extra_tools)
     agent = MultiTurnReactAgent(
-        function_list=tool_names,
+        function_list=(
+            list(config.tool_names)
+            if config.tool_names
+            else default_tool_names(include_ask_user=False, extra_tools=config.extra_tools)
+        ),
         llm=llm_config,
         trace_dir=str(trace_dir),
         role_prompt=config.role_prompt or None,
@@ -635,6 +645,7 @@ async def health() -> dict[str, Any]:
             "output_wrapper": config.output_wrapper,
             "max_concurrent_runs": config.max_concurrent_runs,
             "extra_tools": list(config.extra_tools),
+            "tool_names": list(config.tool_names),
         }

     @app.post("/v1/chat/completions")
@@ -659,6 +670,7 @@ def serve(
     output_wrapper: bool = False,
     max_concurrent_runs: int = DEFAULT_MAX_CONCURRENT_RUNS,
     extra_tools: Optional[list[str]] = None,
+    tool_names: Optional[list[str]] = None,
 ) -> None:
     root = normalize_workspace_root(api_runs_dir)
     role_prompt = read_role_prompt_files(role_prompt_files or [])
@@ -671,6 +683,7 @@ def serve(
         output_wrapper=output_wrapper,
         max_concurrent_runs=max_concurrent_runs,
         extra_tools=tuple(extra_tools or ()),
+        tool_names=tuple(tool_names or ()),
     )
     app = create_app(config)
     uvicorn.run(app, host=host, port=port)
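
The `ServerConfig.__post_init__` change above normalizes `tool_names` and validates it once at construction time, so a bad configuration fails before the server starts. A minimal sketch of the same pattern is below. The class `ToolConfig` and the registry `KNOWN_TOOLS` are hypothetical, and the membership check only stands in for the project's `available_tool_schemas` call, whose behavior is not shown in this diff.

```python
from dataclasses import dataclass

# Hypothetical registry used only by this sketch.
KNOWN_TOOLS = frozenset({"Read", "Bash", "WebSearch"})


@dataclass
class ToolConfig:
    extra_tools: tuple = ()
    tool_names: tuple = ()

    def __post_init__(self) -> None:
        # Drop blank entries and strip surrounding whitespace.
        self.tool_names = tuple(
            str(name).strip() for name in self.tool_names if str(name).strip()
        )
        if self.tool_names and self.extra_tools:
            raise ValueError(
                "tool_names defines the complete tool set and cannot be combined with extra_tools."
            )
        # Assumed stand-in for schema validation: reject unknown names early.
        unknown = [name for name in self.tool_names if name not in KNOWN_TOOLS]
        if unknown:
            raise ValueError(f"Unknown tools: {unknown}")


config = ToolConfig(tool_names=(" Read ", "", "Bash"))
print(config.tool_names)  # ('Read', 'Bash')
```

Validating in `__post_init__` keeps every construction path, CLI or programmatic, under the same rule.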

benchmarks/SGI-DeepResearch/role_prompt.md

Lines changed: 8 additions & 5 deletions
@@ -10,16 +10,17 @@ Behavior:
 - Treat the original user prompt as authoritative.
 - Do not ask follow-up questions.
 - Do not stop with only a plan.
-- Search for relevant papers, technical reports, datasets, or official sources
-  when the answer depends on scientific background not fully contained in the
-  prompt.
+- Use external search when the answer depends on scientific background not
+  fully contained in the prompt, but keep the investigation bounded and
+  task-directed.
 - Prefer primary literature, review papers, official documentation, and
   reproducible data over unsourced webpages.
 - Reason from the collected evidence. If the task needs a calculation, write
   and run a small local calculation to verify units, constants, assumptions,
   and rounding.
-- Keep the investigation bounded. Do not drift into unrelated literature once
-  enough evidence exists to answer the prompt.
+- Stay focused on the requested deliverable. Do not drift into unrelated
+  research, broad surveys, optional side analyses, or extra outputs not required
+  by the prompt.

 Recommended working pattern:
 - Parse the exact question, target quantity, requested units, and requested
@@ -50,6 +51,8 @@ Final answer requirements:
 - Do not say "see notes" or rely on a workspace file as the answer.
 - Before the final response, re-read the prompt's requested answer format and
   make the final text comply with it.
+- Treat the required final-answer format as part of the benchmark contract; a
+  missing or malformed final answer can make an otherwise correct solution fail.

 Output example:

benchmarks/SGI-DryExperiment/README.md

Lines changed: 5 additions & 0 deletions
@@ -18,6 +18,11 @@ python3 run_server.py \
   --no-output-wrapper
 ```

+This benchmark is code completion from the prompt-provided `data_en.py` and
+`main_en.py`. The role prompt keeps any external search bounded and
+task-directed; use the standard ResearchHarness API tool set, which excludes
+`AskUser` by default, and rely on the benchmark overlay for task discipline.
+
 ## OpenAI Test Example

 The example below embeds the first real `SGI-DryExperiment` test item directly

benchmarks/SGI-DryExperiment/role_prompt.md

Lines changed: 20 additions & 6 deletions
@@ -10,7 +10,15 @@ Behavior:
   required final-output behavior.
 - Do not ask follow-up questions.
 - Do not stop with only a plan.
-- Use local files and tools when they help verify the solution.
+- Use external search only when it is genuinely needed to resolve an ambiguity
+  or missing background not contained in the prompt. Keep any search bounded
+  and task-directed; do not perform open-ended browsing or broad literature
+  review.
+- Use local files and tools only when they help understand or validate the
+  provided code.
+- Stay focused on the requested deliverable. Do not drift into unrelated
+  research, broad surveys, optional side analyses, or extra outputs not required
+  by the prompt.
 - Preserve existing public function names, signatures, imports, constants,
   printed output conventions, and the `[Final Output]` behavior implied by the
   provided code.
@@ -23,7 +31,6 @@ Required working process before the final answer:
 - Use tools to reconstruct the provided files in the workspace:
   - write the data-generation code to `data_en.py`
   - write the incomplete analysis code to `main_en.py`
-  - create a small scratch test runner only if it helps verification
 - Do not skip the local file reconstruction step. The benchmark answer is still
   the final text, but the local files are the working surface for analysis,
   execution, and debugging.
@@ -37,8 +44,12 @@ Required working process before the final answer:
 - Run `main_en.py` and check that it executes successfully, reaches
   `[Final Output]`, and produces a value or result shape consistent with the
   task description and code intent.
-- When useful, change random seeds or regenerate local data units to test that
-  the completed functions are robust rather than overfit to one run.
+- Do not validate mainly on toy arrays or self-invented simplified fixtures.
+  Those tests can give false confidence and do not match the benchmark unit
+  test. If extra validation is useful, derive it from the provided `data_en.py`
+  and the provided `main_en.py` structure, for example by rerunning the provided
+  data generator or varying an explicit seed/configuration already present in
+  the prompt.
 - Debug syntax errors, runtime errors, numerical instability, shape mismatches,
   and inconsistent units before finishing.
 - Only after the local code is coherent and validated, finish with the
@@ -51,8 +62,9 @@ Final answer requirements:
   completion request.
 - Prefer returning only the completed Python function definitions unless the
   original prompt explicitly asks for a different output shape.
-- When compatible with the original prompt, wrap the completed function
-  definitions in one `<answer>...</answer>` block.
+- Always wrap the completed function definitions in exactly one
+  `<answer>...</answer>` block unless the original prompt explicitly forbids
+  that tag.
 - The returned code must preserve the original function names, signatures,
   indentation style, and required return behavior.
 - Include every incomplete function from the prompt. Do not omit a function
@@ -63,6 +75,8 @@ Final answer requirements:
 - Do not include unrelated explanation if the benchmark asks for code only.
 - Before the final response, re-read the prompt's requested answer format and
   make the final text comply with it.
+- Treat the required final-answer format as part of the benchmark contract; a
+  missing or malformed final answer can make an otherwise correct solution fail.

 Output example:
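
The role prompt above requires exactly one `<answer>...</answer>` block and warns that a malformed final answer can fail an otherwise correct solution. A format check along those lines might look like the sketch below. The function `extract_single_answer` is hypothetical and is not the benchmark's grader or the repository's smoke test.

```python
import re

# Non-greedy match so two separate blocks are counted as two, not merged.
ANSWER_BLOCK = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_single_answer(final_text: str) -> str:
    """Return the body of the only <answer> block, or raise ValueError."""
    blocks = ANSWER_BLOCK.findall(final_text)
    if len(blocks) != 1:
        raise ValueError(f"expected exactly one <answer> block, found {len(blocks)}")
    body = blocks[0].strip()
    if not body:
        raise ValueError("the <answer> block is empty")
    return body


sample = "<answer>\ndef f():\n    return 1\n</answer>"
print(extract_single_answer(sample))
```

Running such a check on the final text before returning it is one way to catch a missing or duplicated block early.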
