feat: CodeMode sandbox tool for custom one-off tools, and multi-step tool chaining #726

@kingpanther13

Summary

Add an optional sandboxed code execution tool that lets LLMs create custom one-off tools on the fly when no existing tool covers the user's request, using FastMCP 3.1's CodeMode sandbox (pydantic-monty).

Problem

The ha-mcp tool catalog — no matter how large — will never cover every possible Home Assistant operation. Today, when a user asks for something there's no dedicated tool for (e.g., "edit entity categories", "pull debug logs", "interact with a third-party add-on"), the LLM fails in one of several ways:

  • Gives up entirely: "Sorry, I don't have a tool for that" — even though the operation is technically possible through existing lower-level APIs.
  • Confabulates: Claims the operation is impossible or unsupported by Home Assistant, when it's really just unsupported by the current tool catalog.
  • Thrashes through similar tools: Tries a series of related-but-wrong tools, failing repeatedly, wasting tokens and round-trips before ultimately giving up anyway.

But the LLM could accomplish many of these tasks by writing custom code that calls the existing lower-level tools or HA APIs in the right combination. The gap isn't always capability — it's that the LLM has no way to run custom code to bridge the gaps.

Proposal

Add an ha_execute_code tool that accepts Python code with access to call_tool(name, args). The code runs in Monty — a sandboxed Python interpreter (Rust-based, no filesystem/network access, configurable resource limits).

Primary use case: On-the-fly custom tools

The core value is giving the LLM a general-purpose fallback. When no existing tool matches the user's request, the LLM writes custom Python code on the spot to accomplish it:

User: "Move my kitchen sensors to the 'Climate' category"
LLM:  No dedicated tool exists for editing entity categories.
      → Writes Python that calls the HA REST API via available tools to modify categories
      → Returns the result

User: "Show me the debug logs for the Zigbee integration"
LLM:  No tool exposes debug logs directly.
      → Writes Python to access the debug log endpoint and filter for the relevant integration
      → Returns the result

User: "Check the status of my AdGuard add-on"
LLM:  No tool covers third-party add-on interaction.
      → Writes Python to query the Supervisor add-on API
      → Returns the result

The code isn't persistent — it runs once, returns a result, and is gone. The LLM essentially gains the ability to improvise rather than being limited to a fixed set of pre-built tools. This dramatically expands what ha-mcp can handle without needing to anticipate and build a dedicated tool for every possible request.
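To make the flow concrete, here is a sketch of the kind of one-off code an LLM might submit to `ha_execute_code` for the AdGuard example above. The tool name `ha_rest_api` and the response shape are hypothetical stand-ins; inside the sandbox only `call_tool(name, args)` is guaranteed, so a stub is included here to make the sketch runnable outside Monty:

```python
import asyncio

# Stub so this sketch runs outside the sandbox; inside Monty,
# call_tool is injected as an external function by ha_execute_code.
async def call_tool(name: str, args: dict) -> dict:
    return {"data": {"state": "started", "version": "5.1.3"}}

async def main():
    # One round-trip: query the (hypothetical) Supervisor add-on
    # endpoint and return only the fields the user asked about.
    resp = await call_tool("ha_rest_api", {
        "method": "GET",
        "path": "/hassio/addons/a0d7b954_adguard/info",
    })
    info = resp["data"]
    return {"state": info["state"], "version": info["version"]}

result = asyncio.run(main())
```

The key point is that the glue logic (endpoint choice, response filtering) lives in throwaway code rather than in a dedicated tool that someone had to anticipate and build.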

Secondary use case: Multi-step chaining

As a bonus, the sandbox also lets LLMs chain multiple tool calls in a single round-trip (e.g., "find all lights on for more than 2 hours and turn them off") without burning tokens on intermediate results flowing back through the context window.
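A sketch of that chaining example as sandbox code: the intermediate entity list never leaves the sandbox, so the model pays tokens only for the final result. The tool names `get_states` and `call_service` are hypothetical, and a stub `call_tool` stands in for the injected one:

```python
import asyncio
from datetime import datetime, timedelta, timezone

# Stub standing in for the sandbox's injected call_tool.
async def call_tool(name: str, args: dict):
    now = datetime.now(timezone.utc)
    if name == "get_states":
        return [
            {"entity_id": "light.kitchen", "state": "on",
             "last_changed": (now - timedelta(hours=3)).isoformat()},
            {"entity_id": "light.hall", "state": "on",
             "last_changed": (now - timedelta(minutes=10)).isoformat()},
        ]
    return {"ok": True}

async def main():
    # Step 1: fetch all light states (stays inside the sandbox).
    states = await call_tool("get_states", {"domain": "light"})
    cutoff = datetime.now(timezone.utc) - timedelta(hours=2)
    stale = [s["entity_id"] for s in states
             if s["state"] == "on"
             and datetime.fromisoformat(s["last_changed"]) < cutoff]
    # Step 2: one service call per light that has been on too long.
    for entity_id in stale:
        await call_tool("call_service", {
            "domain": "light", "service": "turn_off",
            "data": {"entity_id": entity_id},
        })
    return stale

turned_off = asyncio.run(main())
```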

Two implementation options:

Option A: Custom standalone tool (simpler)

Register ha_execute_code as a regular MCP tool using pydantic-monty directly (already a dependency via fastmcp[code-mode]). Can be pinned or placed behind a write proxy. No transform needed.

```python
from typing import Any

import pydantic_monty
# Import paths below are approximate -- adjust to the installed
# fastmcp / pydantic-monty package layout.
from fastmcp import Context
from mcp.types import ToolAnnotations
from pydantic_monty import ResourceLimits, run_monty_async

@mcp.tool(annotations=ToolAnnotations(destructiveHint=True))
async def ha_execute_code(code: str, ctx: Context) -> Any:
    """Run sandboxed Python that can chain tool calls via call_tool(name, args)."""
    # The only I/O surface exposed to the sandboxed code.
    async def call_tool(tool_name: str, arguments: dict) -> Any:
        return await ctx.fastmcp.call_tool(tool_name, arguments)

    # Declare call_tool as external so Monty resolves it at runtime.
    m = pydantic_monty.Monty(code, external_functions=["call_tool"])
    return await run_monty_async(
        m,
        external_functions={"call_tool": call_tool},
        limits=ResourceLimits(max_duration_secs=30.0,
                              max_memory=10 * 1024 * 1024),
    )
```

Option B: Separate CodeMode provider (more integrated)

Mount a sub-server with the CodeMode transform that runs concurrently with the main tool system. Provides the full CodeMode discovery flow (search → get_schema → execute) scoped to its own provider.

Sandbox constraints (security features)

  • No class definitions, no match statements
  • No third-party imports (only sys, os, typing, asyncio, re)
  • No filesystem or network access — all I/O through call_tool() only
  • Configurable limits: time, memory, allocations, recursion depth
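As an illustration of the import allowlist (not the pydantic-monty implementation, which enforces this at the interpreter level in Rust), a plain-Python pre-check might look like this:

```python
import ast

# The allowlist from the constraints above.
ALLOWED_MODULES = {"sys", "os", "typing", "asyncio", "re"}

def check_imports(code: str) -> list[str]:
    """Return the names of disallowed modules imported by `code`."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [a.name.split(".")[0] for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        violations += [n for n in names if n not in ALLOWED_MODULES]
    return violations

print(check_imports("import re\nimport requests"))  # → ['requests']
```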

Configuration

  • Toggle via config: ENABLE_CODE_MODE: bool = False
  • Resource limits configurable via settings
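A minimal sketch of the proposed settings as a plain dataclass; the actual project would presumably wire this into its existing settings/env-var machinery, and the limit fields shown are assumptions mirroring the defaults in Option A:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class CodeModeSettings:
    enable_code_mode: bool = False             # ENABLE_CODE_MODE toggle, off by default
    max_duration_secs: float = 30.0            # wall-clock limit per execution
    max_memory_bytes: int = 10 * 1024 * 1024   # 10 MiB sandbox memory cap

    @classmethod
    def from_env(cls) -> "CodeModeSettings":
        # Only the toggle is read from the environment in this sketch.
        return cls(
            enable_code_mode=os.getenv("ENABLE_CODE_MODE", "false").lower()
            in ("1", "true", "yes"),
        )
```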

Gotchas & Guardrails

This tool is powerful but comes with a significant risk: LLMs may reach for it when they shouldn't.

  • Lazy fallback: If the LLM can't immediately find the right tool, it may jump straight to ha_execute_code instead of trying harder to discover the correct existing tool. The hand-written code will almost always do a worse job than a purpose-built tool would have.
  • Hallucinated necessity: The LLM may convince itself that a custom tool is needed when a perfectly good existing tool is available — it just didn't search for it properly, or misunderstood what an existing tool can do.
  • Quality gap: Purpose-built tools have proper error handling, validation, and tested behavior. One-off code written by an LLM in a sandbox will be more fragile and less reliable.

Guardrails to consider:

  • Mark the tool as destructiveHint=True so MCP clients that respect annotations will prompt the user before execution.
  • The tool description should explicitly instruct the LLM to only use this as a last resort after confirming no existing tool can accomplish the task.
  • Consider requiring the LLM to provide a justification parameter explaining why no existing tool suffices, which could be shown to the user for approval.
  • Consider rate limiting or requiring user confirmation before each execution.
  • Logging/auditing of all code executed through this tool is essential.
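The justification and auditing guardrails above could be combined in a thin pre-execution gate; this sketch uses illustrative names and an in-memory log where the real implementation would use the server's logging/audit pipeline:

```python
import logging
from datetime import datetime, timezone

audit_log: list[dict] = []
logger = logging.getLogger("ha_execute_code")

def guard_execution(code: str, justification: str) -> None:
    """Record every submission; refuse to proceed without a justification."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "justification": justification,
        "code": code,
    }
    audit_log.append(entry)  # audit even rejected attempts
    logger.info("ha_execute_code requested: %s", justification)
    if not justification.strip():
        raise ValueError(
            "A justification explaining why no existing tool "
            "suffices is required."
        )

guard_execution("result = await call_tool('x', {})",
                "No dedicated tool edits entity categories.")
```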

Context

This pairs well with the search-based tool discovery system being explored. The code sandbox gives LLMs an escape hatch for requests that fall outside the tool catalog, while the search system handles discovery of tools that do exist.
