InternScience
diff --git a/README.md b/README.md
Lines changed: 12 additions & 0 deletions
diff --git a/VERSION b/VERSION
Lines changed: 1 addition & 1 deletion
diff --git a/agent_base/prompts/system_base.md b/agent_base/prompts/system_base.md
Lines changed: 2 additions & 2 deletions
diff --git a/agent_base/react_agent.py b/agent_base/react_agent.py
Lines changed: 36 additions & 48 deletions
diff --git a/agent_base/tools/README.md b/agent_base/tools/README.md
Lines changed: 9 additions & 0 deletions
diff --git a/api/openai_server.py b/api/openai_server.py
Lines changed: 15 additions & 2 deletions
diff --git a/benchmarks/SGI-DeepResearch/role_prompt.md b/benchmarks/SGI-DeepResearch/role_prompt.md
Lines changed: 8 additions & 5 deletions
diff --git a/benchmarks/SGI-DryExperiment/README.md b/benchmarks/SGI-DryExperiment/README.md
Lines changed: 5 additions & 0 deletions
diff --git a/benchmarks/SGI-DryExperiment/role_prompt.md b/benchmarks/SGI-DryExperiment/role_prompt.md
Lines changed: 20 additions & 6 deletions
@@ -86,6 +86,10 @@ If you are new to the project, the recommended reading order is:
 
 ## 📰 News
 
+🚩 **Update** (2026-05-22) Added a real API smoke test that runs the five SGI benchmark README server commands and OpenAI SDK examples, then validates each expected final-answer format.
+
+🚩 **Update** (2026-05-22) CLI and API deployments can now expose an explicit complete tool set with repeatable `--tool NAME`, useful when a run needs a smaller or benchmark-specific tool surface.
+
 🚩 **Update** (2026-05-21) ResearchHarness is packaged for one-command installation with `pip install researchharness`. The existing source-tree commands remain compatible, and releases can publish to PyPI automatically from GitHub Releases.
 
 🚩 **Update** (2026-05-21) The Python import API now exposes the same core runtime controls as CLI mode: default workspace, role prompt strings/files, image inputs, explicit tool sets, optional extra tools, and decorated custom function tools.
@@ -736,6 +740,10 @@ deployment, and QA/VQA benchmark deployment. Advanced users can still combine
 `--role-prompt-file`, `--input-wrapper`, and `--output-wrapper` manually when a
 custom application needs only part of the benchmark behavior.
 
+For benchmark deployments that need a smaller tool surface, pass repeatable
+`--tool NAME` flags. This defines the complete exposed tool set for each run
+and cannot be combined with `--extra-tool`.
+
 ### API Concurrency
 
 The API endpoint remains synchronous from the client's perspective, but long
@@ -1064,6 +1072,9 @@ repository for local images.
 
 More detailed tool documentation lives in [agent_base/tools/README.md](agent_base/tools/README.md).
 
+Tool-use requests should use the native tool calling interface. Final-answer
+formats required by the user remain ordinary final-answer text.
+
 Tool calls follow a single-request contract: `WebSearch.query`,
 `ScholarSearch.query`, and `WebFetch.url` each accept one string, not a list.
 When the model needs multiple independent searches, page fetches, file reads,
@@ -1146,6 +1157,7 @@ RESEARCHHARNESS_TEST_PYTHON="/path/to/your/python"
 | Optional extra-tool checks | `python3 tests/test_extra_tools.py` |
 | Python import API and custom-tool checks | `python3 tests/test_python_api_tools.py` |
 | OpenAI-compatible API checks | `python3 tests/test_openai_api_checks.py` |
+| SGI benchmark README server/example smoke test | `python3 tests/test_sgi_benchmark_readmes.py` |
 | Local frontend checks | `python3 tests/test_frontend_checks.py` |
 | End-to-end multi-tool test | `python3 tests/test_end_to_end_multitool.py` |
 | End-to-end local file discovery test | `python3 tests/test_end_to_end_glob_grep.py` |
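
Note on the single-request contract in the README hunk above: `WebSearch.query`, `ScholarSearch.query`, and `WebFetch.url` each take one string. The sketch below shows one way to turn several independent queries into separate native tool calls. It assumes the OpenAI-compatible `tool_calls` message shape; the helper name `build_search_calls` and the `call_N` id scheme are hypothetical and do not come from the repository.

```python
import json


def build_search_calls(queries):
    """Build one native tool call per query instead of one call carrying a list."""
    calls = []
    for index, query in enumerate(queries):
        if not isinstance(query, str):
            # The contract allows exactly one string per call.
            raise TypeError("WebSearch.query must be a single string")
        calls.append(
            {
                "id": f"call_{index}",  # hypothetical id scheme
                "type": "function",
                "function": {
                    "name": "WebSearch",
                    # Arguments are serialized as a JSON string.
                    "arguments": json.dumps({"query": query}),
                },
            }
        )
    return calls


calls = build_search_calls(["perovskite band gap", "CsPbBr3 lattice constant"])
```

Each entry carries its own `query` string, so two independent searches become two calls in the same turn rather than one call with a list argument.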
 
@@ -1 +1 @@
-v0.0.41
+v0.0.42
@@ -66,7 +66,8 @@ You are a capable all-purpose AI assistant. You do far more than simple question
 
 ## Native Tool Calling Contract
 
-- Use the API's native tool calling interface when tools are needed. Do not write pseudo-XML, pseudo-tool JSON, or tag-based tool requests in plain text.
+- Use the API's native tool calling interface when tools are needed.
+- If the user explicitly requires a special final-answer format, follow that format as ordinary answer text.
 - If a turn includes native tool calls, that turn is a tool-use turn. Any accompanying text is treated as working context, not as the final result.
 - Multiple tool calls in one turn are allowed only when they are independent.
 - If tool B depends on the output of tool A, do not request them in the same turn. Wait for tool A's result first.
@@ -75,7 +76,6 @@ You are a capable all-purpose AI assistant. You do far more than simple question
 - Keep tool turns structured. Brief text may explain the current tool step, but the tool call itself is the action.
 - When no more tools are needed, return the final result as plain text.
 - If the user requires a strict format such as JSON, output only that payload as the plain final result text.
-- Do not emit legacy protocol tags such as `<tool_call>`, `<tool_response>`, `<think>`, or `<answer>`.
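
Note on the contract above: with native tool calling, whether a turn is a tool-use turn can be decided from the presence of tool calls rather than from tags in the text. The sketch below illustrates that reading. The function name `classify_turn` and the message dictionaries are hypothetical, and this is not the agent's actual loop.

```python
def classify_turn(message):
    """Return 'tool_use' when the message carries native tool calls, else 'final'."""
    if message.get("tool_calls"):
        # Accompanying text is working context, not the final result.
        return "tool_use"
    return "final"


tool_turn = {
    "role": "assistant",
    "content": "Checking the file first.",
    "tool_calls": [
        {
            "id": "call_0",  # hypothetical id
            "type": "function",
            "function": {"name": "Read", "arguments": "{}"},  # placeholder arguments
        }
    ],
}
# A tagged payload with no tool calls is ordinary final-answer text.
final_turn = {"role": "assistant", "content": "<answer>42</answer>"}

turn_kinds = [classify_turn(tool_turn), classify_turn(final_turn)]
```

Under this reading, a final answer that happens to contain a tag such as `<answer>` is plain text and needs no special handling.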
 
 ## Tool Selection And Routing
 
 
@@ -378,19 +378,6 @@ def compaction_trace_payload(
     }
 
 
-def legacy_protocol_error(content: str) -> Optional[str]:
-    stripped = content.lstrip()
-    if stripped.startswith("<tool_call>"):
-        return "assistant emitted deprecated text <tool_call> protocol"
-    if stripped.startswith("<tool_response>"):
-        return "assistant emitted deprecated text <tool_response> protocol"
-    if stripped.startswith("<think>"):
-        return "assistant emitted deprecated text <think> protocol"
-    if stripped.startswith("<answer>"):
-        return "assistant emitted deprecated text <answer> protocol"
-    return None
-
-
 def tool_schema(tool: Any) -> dict[str, Any]:
     return {
         "type": "function",
@@ -476,6 +463,13 @@ def resolve_extra_tool_names(extra_tools: Optional[Sequence[str]]) -> list[str]:
     return resolved
 
 
+def validate_named_tools(tool_names: Optional[Sequence[str]]) -> list[str]:
+    resolved = resolved_tool_names(tool_names)
+    if tool_names is not None:
+        available_tool_schemas(resolved)
+    return resolved
+
+
 def default_tool_names(*, include_ask_user: bool = True, extra_tools: Optional[Sequence[str]] = None) -> list[str]:
     names = [name for name in AVAILABLE_TOOL_MAP if include_ask_user or name != "AskUser"]
     for name in resolve_extra_tool_names(extra_tools):
@@ -1266,39 +1260,6 @@ def finalize_interrupted() -> dict[str, Any]:
                 termination = "llm api error"
                 return finalize(result_text, termination, error=result_text)
 
-            deprecated_protocol = legacy_protocol_error(assistant_text)
-            if deprecated_protocol is not None:
-                trace_writer.append(
-                    role="assistant",
-                    text=assistant_text.strip(),
-                    turn_index=round_index,
-                    tool_call_ids=assistant_tool_call_ids,
-                    tool_names=assistant_tool_names,
-                    tool_arguments=assistant_tool_arguments,
-                    finish_reason=finish_reason,
-                    error=deprecated_protocol,
-                )
-                retry_assistant_message = assistant_retry_history_message(
-                    content=assistant_content,
-                    reasoning_content=assistant_reasoning,
-                )
-                if retry_assistant_message is not None:
-                    messages.append(retry_assistant_message)
-                correction_text = (
-                    "Error: The previous assistant turn used the deprecated text-tag protocol. "
-                    "Do not emit <tool_call>, <tool_response>, <think>, or <answer> in plain text. "
-                    "Use only the native tool calling interface when tools are needed, or plain final result text when no more tools are needed."
-                )
-                messages.append(
-                    {
-                        "role": "user",
-                        "content": correction_text,
-                    }
-                )
-                trace_writer.append(role="user", text=correction_text, turn_index=round_index)
-                persist_state(error=deprecated_protocol)
-                continue
-
             if finish_reason == "length" and assistant_tool_calls:
                 protocol_error = "assistant tool call turn was truncated by output limit"
                 trace_writer.append(
@@ -1528,7 +1489,7 @@ def resolve_agent_class_for_role_prompt_files(role_prompt_files: Sequence[str])
     return MultiTurnReactAgent
 
 
-def _parse_cli_args(argv: list[str]) -> tuple[str, Optional[str], Optional[str], str, list[str], list[str], Optional[bool], list[str]]:
+def _parse_cli_args(argv: list[str]) -> tuple[str, Optional[str], Optional[str], str, list[str], list[str], Optional[bool], list[str], list[str]]:
     parser = argparse.ArgumentParser(description="Run the local agent directly from agent_base.react_agent.")
     parser.add_argument("prompt", nargs="*", help="Prompt text.")
     parser.add_argument("--prompt-file", help="Optional UTF-8 text file containing the prompt.")
@@ -1568,7 +1529,17 @@ def _parse_cli_args(argv: list[str]) -> tuple[str, Optional[str], Optional[str],
         metavar="NAME",
         help="Enable one optional extra tool for this run. Currently supported: str_replace_editor. May be passed multiple times.",
     )
+    parser.add_argument(
+        "--tool",
+        action="append",
+        default=[],
+        dest="tool_names",
+        metavar="NAME",
+        help="Expose an explicit complete tool set for this run. May be passed multiple times. Cannot be combined with --extra-tool.",
+    )
     args = parser.parse_args(argv)
+    if args.tool_names and args.extra_tools:
+        raise ValueError("--tool defines the complete tool set and cannot be combined with --extra-tool.")
 
     prompt_text = ""
     if args.prompt_file:
@@ -1587,6 +1558,7 @@ def _parse_cli_args(argv: list[str]) -> tuple[str, Optional[str], Optional[str],
         list(args.role_prompt_files),
         [path for group in args.image_paths for path in group],
         args.chat,
+        validate_named_tools(args.tool_names) if args.tool_names else [],
         resolve_extra_tool_names(args.extra_tools),
     )
 
@@ -1595,11 +1567,27 @@ def main(argv: Optional[list[str]] = None) -> int:
     load_default_dotenvs()
     try:
         require_required_env("ResearchHarness agent")
-        prompt_text, trace_dir, workspace_root, role_prompt, role_prompt_files, image_paths, chat_arg, extra_tools = _parse_cli_args(argv or sys.argv[1:])
+        (
+            prompt_text,
+            trace_dir,
+            workspace_root,
+            role_prompt,
+            role_prompt_files,
+            image_paths,
+            chat_arg,
+            tool_names,
+            extra_tools,
+        ) = _parse_cli_args(argv or sys.argv[1:])
         agent_cls = resolve_agent_class_for_role_prompt_files(role_prompt_files)
         forbidden_tools = set(getattr(agent_cls, "forbidden_tool_names", set()))
+        forbidden_requested_tools = sorted(set(tool_names) & forbidden_tools)
+        if forbidden_requested_tools:
+            raise ValueError(f"Tools are not allowed in this run: {forbidden_requested_tools}")
         agent = agent_cls(
             function_list=(
+                tool_names
+                if tool_names
+                else
                 default_tool_names(include_ask_user="AskUser" not in forbidden_tools, extra_tools=extra_tools)
                 if extra_tools
                 else None
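
Note on the `--tool` handling in the hunks above: the three checks (repeatable flag, exclusivity with `--extra-tool`, forbidden-tool rejection) can be sketched in isolation as follows. This is a minimal standalone approximation, not the code in `agent_base/react_agent.py`. The function name `parse_tool_flags` and the `forbidden_tools` parameter are hypothetical, and name validation against the real tool registry is omitted.

```python
import argparse


def parse_tool_flags(argv, forbidden_tools=frozenset()):
    """Parse repeatable --tool / --extra-tool flags and enforce their exclusivity."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tool", action="append", default=[], dest="tool_names", metavar="NAME")
    parser.add_argument("--extra-tool", action="append", default=[], dest="extra_tools", metavar="NAME")
    args = parser.parse_args(argv)

    # --tool is the complete tool set, so appending extras is contradictory.
    if args.tool_names and args.extra_tools:
        raise ValueError("--tool defines the complete tool set and cannot be combined with --extra-tool.")

    # Reject tools that the selected agent class does not allow.
    forbidden_requested = sorted(set(args.tool_names) & set(forbidden_tools))
    if forbidden_requested:
        raise ValueError(f"Tools are not allowed in this run: {forbidden_requested}")

    return args.tool_names, args.extra_tools


tool_names, extra_tools = parse_tool_flags(["--tool", "Read", "--tool", "Bash"])
```

When no `--tool` flag is given, `tool_names` is an empty list and the caller falls back to the default tool set, matching the conditional in the `main` hunk.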
 
@@ -578,6 +578,15 @@ python3 run_server.py --api-runs-dir ./api_runs --extra-tool str_replace_editor
 python3 run_frontend.py --extra-tool str_replace_editor
 ```
 
+If you need to shrink the exposed tool surface instead of appending optional
+tools, use repeatable `--tool NAME` flags in CLI/API mode. This defines the
+complete tool set and cannot be combined with `--extra-tool`:
+
+```bash
+python3 run_agent.py "..." --workspace-root ./workspace --tool Read --tool Bash
+python3 run_server.py --api-runs-dir ./api_runs --tool Read --tool Bash
+```
+
 Behavior:
 
 - Requires absolute paths inside the active workspace.
 
@@ -19,6 +19,7 @@
 
 from agent_base.react_agent import (
     MultiTurnReactAgent,
+    available_tool_schemas,
     assistant_text_content,
     default_tool_names,
     default_llm_config,
@@ -108,9 +109,15 @@ class ServerConfig:
     output_wrapper: bool = False
     max_concurrent_runs: int = DEFAULT_MAX_CONCURRENT_RUNS
     extra_tools: tuple[str, ...] = ()
+    tool_names: tuple[str, ...] = ()
 
     def __post_init__(self) -> None:
         self.max_concurrent_runs = positive_int(self.max_concurrent_runs, "max_concurrent_runs")
+        self.tool_names = tuple(str(name).strip() for name in self.tool_names if str(name).strip())
+        if self.tool_names and self.extra_tools:
+            raise ValueError("tool_names defines the complete tool set and cannot be combined with extra_tools.")
+        if self.tool_names:
+            available_tool_schemas(self.tool_names)
         self.extra_tools = tuple(resolve_extra_tool_names(self.extra_tools))
 
 
@@ -483,9 +490,12 @@ def run_chat_completion(payload: dict[str, Any], config: ServerConfig) -> dict[s
             f"Backend model {backend_model!r} does not support image content parts.",
         )
 
-    tool_names = default_tool_names(include_ask_user=False, extra_tools=config.extra_tools)
     agent = MultiTurnReactAgent(
-        function_list=tool_names,
+        function_list=(
+            list(config.tool_names)
+            if config.tool_names
+            else default_tool_names(include_ask_user=False, extra_tools=config.extra_tools)
+        ),
         llm=llm_config,
         trace_dir=str(trace_dir),
         role_prompt=config.role_prompt or None,
@@ -635,6 +645,7 @@ async def health() -> dict[str, Any]:
             "output_wrapper": config.output_wrapper,
             "max_concurrent_runs": config.max_concurrent_runs,
             "extra_tools": list(config.extra_tools),
+            "tool_names": list(config.tool_names),
         }
 
     @app.post("/v1/chat/completions")
@@ -659,6 +670,7 @@ def serve(
     output_wrapper: bool = False,
     max_concurrent_runs: int = DEFAULT_MAX_CONCURRENT_RUNS,
     extra_tools: Optional[list[str]] = None,
+    tool_names: Optional[list[str]] = None,
 ) -> None:
     root = normalize_workspace_root(api_runs_dir)
     role_prompt = read_role_prompt_files(role_prompt_files or [])
@@ -671,6 +683,7 @@ def serve(
         output_wrapper=output_wrapper,
         max_concurrent_runs=max_concurrent_runs,
         extra_tools=tuple(extra_tools or ()),
+        tool_names=tuple(tool_names or ()),
     )
     app = create_app(config)
     uvicorn.run(app, host=host, port=port)
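
Note on the `ServerConfig` changes above: normalization, the exclusivity check, and the per-request tool selection can be sketched with a small dataclass. `ToolConfig`, `KNOWN_TOOLS`, and `DEFAULT_TOOLS` are hypothetical stand-ins for `ServerConfig`, the real tool registry, and `default_tool_names(...)`. The tool names in the registry are illustrative only.

```python
from dataclasses import dataclass

KNOWN_TOOLS = {"Read", "Bash", "WebSearch"}  # hypothetical registry
DEFAULT_TOOLS = ("Read", "Bash", "WebSearch")  # hypothetical default set


@dataclass
class ToolConfig:
    extra_tools: tuple = ()
    tool_names: tuple = ()

    def __post_init__(self):
        # Strip whitespace and drop empty names before validating.
        self.tool_names = tuple(str(n).strip() for n in self.tool_names if str(n).strip())
        if self.tool_names and self.extra_tools:
            raise ValueError("tool_names defines the complete tool set and cannot be combined with extra_tools.")
        unknown = [n for n in self.tool_names if n not in KNOWN_TOOLS]
        if unknown:
            raise ValueError(f"Unknown tools: {unknown}")

    def function_list(self):
        # An explicit tool set wins; otherwise use defaults plus extras.
        if self.tool_names:
            return list(self.tool_names)
        return list(DEFAULT_TOOLS) + list(self.extra_tools)


explicit = ToolConfig(tool_names=(" Read ", "", "Bash")).function_list()
fallback = ToolConfig().function_list()
```

Validating in `__post_init__` means a bad tool name fails at server start instead of on the first request.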
@@ -10,16 +10,17 @@ Behavior:
 - Treat the original user prompt as authoritative.
 - Do not ask follow-up questions.
 - Do not stop with only a plan.
-- Search for relevant papers, technical reports, datasets, or official sources
-  when the answer depends on scientific background not fully contained in the
-  prompt.
+- Use external search when the answer depends on scientific background not
+  fully contained in the prompt, but keep the investigation bounded and
+  task-directed.
 - Prefer primary literature, review papers, official documentation, and
   reproducible data over unsourced webpages.
 - Reason from the collected evidence. If the task needs a calculation, write
   and run a small local calculation to verify units, constants, assumptions,
   and rounding.
-- Keep the investigation bounded. Do not drift into unrelated literature once
-  enough evidence exists to answer the prompt.
+- Stay focused on the requested deliverable. Do not drift into unrelated
+  research, broad surveys, optional side analyses, or extra outputs not required
+  by the prompt.
 
 Recommended working pattern:
 - Parse the exact question, target quantity, requested units, and requested
@@ -50,6 +51,8 @@ Final answer requirements:
 - Do not say "see notes" or rely on a workspace file as the answer.
 - Before the final response, re-read the prompt's requested answer format and
   make the final text comply with it.
+- Treat the required final-answer format as part of the benchmark contract; a
+  missing or malformed final answer can make an otherwise correct solution fail.
 
 Output example:
 
 
@@ -18,6 +18,11 @@ python3 run_server.py \
   --no-output-wrapper
 ```
 
+This benchmark is a code-completion task over the prompt-provided `data_en.py`
+and `main_en.py`. The role prompt keeps any external search bounded and
+task-directed; use the standard ResearchHarness API tool set, which excludes
+`AskUser` by default, and rely on the benchmark overlay for task discipline.
+
 ## OpenAI Test Example
 
 The example below embeds the first real `SGI-DryExperiment` test item directly
 
@@ -10,7 +10,15 @@ Behavior:
   required final-output behavior.
 - Do not ask follow-up questions.
 - Do not stop with only a plan.
-- Use local files and tools when they help verify the solution.
+- Use external search only when it is genuinely needed to resolve an ambiguity
+  or missing background not contained in the prompt. Keep any search bounded
+  and task-directed; do not perform open-ended browsing or broad literature
+  review.
+- Use local files and tools only when they help understand or validate the
+  provided code.
+- Stay focused on the requested deliverable. Do not drift into unrelated
+  research, broad surveys, optional side analyses, or extra outputs not required
+  by the prompt.
 - Preserve existing public function names, signatures, imports, constants,
   printed output conventions, and the `[Final Output]` behavior implied by the
   provided code.
@@ -23,7 +31,6 @@ Required working process before the final answer:
 - Use tools to reconstruct the provided files in the workspace:
   - write the data-generation code to `data_en.py`
   - write the incomplete analysis code to `main_en.py`
-  - create a small scratch test runner only if it helps verification
 - Do not skip the local file reconstruction step. The benchmark answer is still
   the final text, but the local files are the working surface for analysis,
   execution, and debugging.
@@ -37,8 +44,12 @@ Required working process before the final answer:
 - Run `main_en.py` and check that it executes successfully, reaches
   `[Final Output]`, and produces a value or result shape consistent with the
   task description and code intent.
-- When useful, change random seeds or regenerate local data units to test that
-  the completed functions are robust rather than overfit to one run.
+- Do not validate mainly on toy arrays or self-invented simplified fixtures.
+  Those tests can give false confidence and do not match the benchmark unit
+  test. If extra validation is useful, derive it from the provided `data_en.py`
+  and the provided `main_en.py` structure, for example by rerunning the provided
+  data generator or varying an explicit seed/configuration already present in
+  the prompt.
 - Debug syntax errors, runtime errors, numerical instability, shape mismatches,
   and inconsistent units before finishing.
 - Only after the local code is coherent and validated, finish with the
@@ -51,8 +62,9 @@ Final answer requirements:
   completion request.
 - Prefer returning only the completed Python function definitions unless the
   original prompt explicitly asks for a different output shape.
-- When compatible with the original prompt, wrap the completed function
-  definitions in one `<answer>...</answer>` block.
+- Always wrap the completed function definitions in exactly one
+  `<answer>...</answer>` block unless the original prompt explicitly forbids
+  that tag.
 - The returned code must preserve the original function names, signatures,
   indentation style, and required return behavior.
 - Include every incomplete function from the prompt. Do not omit a function
@@ -63,6 +75,8 @@ Final answer requirements:
 - Do not include unrelated explanation if the benchmark asks for code only.
 - Before the final response, re-read the prompt's requested answer format and
   make the final text comply with it.
+- Treat the required final-answer format as part of the benchmark contract; a
+  missing or malformed final answer can make an otherwise correct solution fail.
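
Note on the final-answer requirements above: a format check for the "exactly one `<answer>...</answer>` block" rule might look like the sketch below. It is an illustration only, not the benchmark's grader or the repository's smoke test, and `extract_single_answer` is a hypothetical helper name.

```python
import re

# Non-greedy match so two separate blocks are counted as two.
ANSWER_BLOCK = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_single_answer(final_text):
    """Return the body of the only <answer> block, or raise if the count is not one."""
    blocks = ANSWER_BLOCK.findall(final_text)
    if len(blocks) != 1:
        raise ValueError(f"expected exactly one <answer> block, found {len(blocks)}")
    return blocks[0].strip()


body = extract_single_answer("<answer>\ndef f():\n    return 1\n</answer>")
```

A missing block and a duplicated block both raise, which mirrors the point that a malformed final answer can fail an otherwise correct solution.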
 
 Output example: