feat: CodeMode sandbox tool for custom one-off tools, and multi-step tool chaining #726

@kingpanther13

Summary

Add an optional sandboxed code execution tool that lets LLMs create custom one-off tools on the fly when no existing tool covers the user's request, using FastMCP 3.1's CodeMode sandbox (pydantic-monty).

Problem

The ha-mcp tool catalog — no matter how large — will never cover every possible Home Assistant operation. Today, when a user asks for something there's no dedicated tool for (e.g., "edit entity categories", "pull debug logs", "interact with a third-party add-on"), the LLM fails in one of several ways:

  • Gives up entirely: "Sorry, I don't have a tool for that" — even though the operation is technically possible through existing lower-level APIs.
  • Confabulates: Claims the operation is impossible or unsupported by Home Assistant, when it's really just unsupported by the current tool catalog.
  • Thrashes through similar tools: Tries a series of related-but-wrong tools, failing repeatedly, wasting tokens and round-trips before ultimately giving up anyway.

But the LLM could accomplish many of these tasks by writing custom code that calls the existing lower-level tools or HA APIs in the right combination. The gap isn't always capability — it's that the LLM has no way to run custom code to bridge the gaps.

Proposal

Add an ha_execute_code tool that accepts Python code with access to call_tool(name, args). The code runs in Monty — a sandboxed Python interpreter (Rust-based, no filesystem/network access, configurable resource limits).

Primary use case: On-the-fly custom tools

The core value is giving the LLM a general-purpose fallback. When no existing tool matches the user's request, the LLM writes custom Python code on the spot to accomplish it:

User: "Move my kitchen sensors to the 'Climate' category"
LLM:  No dedicated tool exists for editing entity categories.
      → Writes Python that calls the HA REST API via available tools to modify categories
      → Returns the result

User: "Show me the debug logs for the Zigbee integration"
LLM:  No tool exposes debug logs directly.
      → Writes Python to access the debug log endpoint and filter for the relevant integration
      → Returns the result

User: "Check the status of my AdGuard add-on"
LLM:  No tool covers third-party add-on interaction.
      → Writes Python to query the Supervisor add-on API
      → Returns the result

The code isn't persistent — it runs once, returns a result, and is gone. The LLM essentially gains the ability to improvise rather than being limited to a fixed set of pre-built tools. This dramatically expands what ha-mcp can handle without needing to anticipate and build a dedicated tool for every possible request.
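To make the flow concrete, here is a sketch of the kind of one-off code an LLM might submit to `ha_execute_code` for the AdGuard example above. The tool name `ha_rest_api` and the response shape are hypothetical stand-ins; inside the sandbox only `call_tool(name, args)` is guaranteed, so a stub is included here to make the sketch runnable outside Monty:

```python
import asyncio

# Stub so this sketch runs outside the sandbox; inside Monty,
# call_tool is injected as an external function by ha_execute_code.
async def call_tool(name: str, args: dict) -> dict:
    return {"data": {"state": "started", "version": "5.1.3"}}

async def main():
    # One round-trip: query the (hypothetical) Supervisor add-on
    # endpoint and return only the fields the user asked about.
    resp = await call_tool("ha_rest_api", {
        "method": "GET",
        "path": "/hassio/addons/a0d7b954_adguard/info",
    })
    info = resp["data"]
    return {"state": info["state"], "version": info["version"]}

result = asyncio.run(main())
```

The key point is that the glue logic (endpoint choice, response filtering) lives in throwaway code rather than in a dedicated tool that someone had to anticipate and build.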

Secondary use case: Multi-step chaining

As a bonus, the sandbox also lets LLMs chain multiple tool calls in a single round-trip (e.g., "find all lights on for more than 2 hours and turn them off") without burning tokens on intermediate results flowing back through the context window.
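A sketch of that chaining example as sandbox code: the intermediate entity list never leaves the sandbox, so the model pays tokens only for the final result. The tool names `get_states` and `call_service` are hypothetical, and a stub `call_tool` stands in for the injected one:

```python
import asyncio
from datetime import datetime, timedelta, timezone

# Stub standing in for the sandbox's injected call_tool.
async def call_tool(name: str, args: dict):
    now = datetime.now(timezone.utc)
    if name == "get_states":
        return [
            {"entity_id": "light.kitchen", "state": "on",
             "last_changed": (now - timedelta(hours=3)).isoformat()},
            {"entity_id": "light.hall", "state": "on",
             "last_changed": (now - timedelta(minutes=10)).isoformat()},
        ]
    return {"ok": True}

async def main():
    # Step 1: fetch all light states (stays inside the sandbox).
    states = await call_tool("get_states", {"domain": "light"})
    cutoff = datetime.now(timezone.utc) - timedelta(hours=2)
    stale = [s["entity_id"] for s in states
             if s["state"] == "on"
             and datetime.fromisoformat(s["last_changed"]) < cutoff]
    # Step 2: one service call per light that has been on too long.
    for entity_id in stale:
        await call_tool("call_service", {
            "domain": "light", "service": "turn_off",
            "data": {"entity_id": entity_id},
        })
    return stale

turned_off = asyncio.run(main())
```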

Two implementation options:

Option A: Custom standalone tool (simpler)

Register ha_execute_code as a regular MCP tool using pydantic-monty directly (already a dependency via fastmcp[code-mode]). Can be pinned or placed behind a write proxy. No transform needed.

```python
from typing import Any

import pydantic_monty
# Import paths below are approximate -- adjust to the installed
# fastmcp / pydantic-monty package layout.
from fastmcp import Context
from mcp.types import ToolAnnotations
from pydantic_monty import ResourceLimits, run_monty_async

@mcp.tool(annotations=ToolAnnotations(destructiveHint=True))
async def ha_execute_code(code: str, ctx: Context) -> Any:
    """Run sandboxed Python that can chain tool calls via call_tool(name, args)."""
    # The only I/O surface exposed to the sandboxed code.
    async def call_tool(tool_name: str, arguments: dict) -> Any:
        return await ctx.fastmcp.call_tool(tool_name, arguments)

    # Declare call_tool as external so Monty resolves it at runtime.
    m = pydantic_monty.Monty(code, external_functions=["call_tool"])
    return await run_monty_async(
        m,
        external_functions={"call_tool": call_tool},
        limits=ResourceLimits(max_duration_secs=30.0,
                              max_memory=10 * 1024 * 1024),
    )
```

Option B: Separate CodeMode provider (more integrated)

Mount a sub-server with the CodeMode transform that runs concurrently with the main tool system. Provides the full CodeMode discovery flow (search → get_schema → execute) scoped to its own provider.

Sandbox constraints (security features)

  • No class definitions, no match statements
  • No third-party imports (only sys, os, typing, asyncio, re)
  • No filesystem or network access — all I/O through call_tool() only
  • Configurable limits: time, memory, allocations, recursion depth
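As an illustration of the import allowlist (not the pydantic-monty implementation, which enforces this at the interpreter level in Rust), a plain-Python pre-check might look like this:

```python
import ast

# The allowlist from the constraints above.
ALLOWED_MODULES = {"sys", "os", "typing", "asyncio", "re"}

def check_imports(code: str) -> list[str]:
    """Return the names of disallowed modules imported by `code`."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [a.name.split(".")[0] for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        violations += [n for n in names if n not in ALLOWED_MODULES]
    return violations

print(check_imports("import re\nimport requests"))  # → ['requests']
```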

Configuration

  • Toggle via config: ENABLE_CODE_MODE: bool = False
  • Resource limits configurable via settings
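A minimal sketch of the proposed settings as a plain dataclass; the actual project would presumably wire this into its existing settings/env-var machinery, and the limit fields shown are assumptions mirroring the defaults in Option A:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class CodeModeSettings:
    enable_code_mode: bool = False             # ENABLE_CODE_MODE toggle, off by default
    max_duration_secs: float = 30.0            # wall-clock limit per execution
    max_memory_bytes: int = 10 * 1024 * 1024   # 10 MiB sandbox memory cap

    @classmethod
    def from_env(cls) -> "CodeModeSettings":
        # Only the toggle is read from the environment in this sketch.
        return cls(
            enable_code_mode=os.getenv("ENABLE_CODE_MODE", "false").lower()
            in ("1", "true", "yes"),
        )
```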

Gotchas & Guardrails

This tool is powerful but comes with a significant risk: LLMs may reach for it when they shouldn't.

  • Lazy fallback: If the LLM can't immediately find the right tool, it may jump straight to ha_execute_code instead of trying harder to discover the correct existing tool. The hand-written code will almost always do a worse job than a purpose-built tool would have.
  • Hallucinated necessity: The LLM may convince itself that a custom tool is needed when a perfectly good existing tool is available — it just didn't search for it properly, or misunderstood what an existing tool can do.
  • Quality gap: Purpose-built tools have proper error handling, validation, and tested behavior. One-off code written by an LLM in a sandbox will be more fragile and less reliable.

Guardrails to consider:

  • Mark the tool as destructiveHint=True so MCP clients that respect annotations will prompt the user before execution.
  • The tool description should explicitly instruct the LLM to only use this as a last resort after confirming no existing tool can accomplish the task.
  • Consider requiring the LLM to provide a justification parameter explaining why no existing tool suffices, which could be shown to the user for approval.
  • Consider rate limiting or requiring user confirmation before each execution.
  • Logging/auditing of all code executed through this tool is essential.
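The justification and auditing guardrails above could be combined in a thin pre-execution gate; this sketch uses illustrative names and an in-memory log where the real implementation would use the server's logging/audit pipeline:

```python
import logging
from datetime import datetime, timezone

audit_log: list[dict] = []
logger = logging.getLogger("ha_execute_code")

def guard_execution(code: str, justification: str) -> None:
    """Record every submission; refuse to proceed without a justification."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "justification": justification,
        "code": code,
    }
    audit_log.append(entry)  # audit even rejected attempts
    logger.info("ha_execute_code requested: %s", justification)
    if not justification.strip():
        raise ValueError(
            "A justification explaining why no existing tool "
            "suffices is required."
        )

guard_execution("result = await call_tool('x', {})",
                "No dedicated tool edits entity categories.")
```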

Context

This pairs well with the search-based tool discovery system being explored. The code sandbox gives LLMs an escape hatch for requests that fall outside the tool catalog, while the search system handles discovery of tools that do exist.
