Description
Enable AI agents to autonomously execute terminal/shell commands, similar to tools like Claude Code and Codex CLI. This would provide AI agents with an "ultimate" tool with near-limitless options.
Note: This issue is about giving agents access to the host machine's terminal/shell. It is not about giving agents access to Terminal widgets of Theia which are driven by the user.
Architecture Considerations
Tool Levels
Terminal access can be separated into different levels of power:
- Level 1 - Read-only tools: Specialized commands for read-only operations, e.g., grep, find, cat, which are directly executed without any shell
- Level 2 - Write-enabled tools: Specialized commands that can modify local data or communicate with external services, e.g., git, npm, which are directly executed without any shell
- Level 3 - Generic exec: A generic exec command, i.e., exec <command> <args...>, which spawns the respective command without any shell
- Level 4 - Shell execution: Full shell execution, e.g., bash "npm run build | tee log.txt", with full shell support including pipes, variable expansion, etc.
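One way to picture the difference between the levels is by what would finally be handed to the operating system. The following is only a sketch: ToolRequest, SpawnPlan and toSpawnPlan are hypothetical names used for illustration, not existing Theia APIs, and it assumes that Level 4 is realized via bash -c.

```typescript
// Hypothetical request shapes, one per level; none of these names exist in Theia.
type ToolRequest =
    | { level: 1 | 2; tool: string; args: string[] }   // specialized tool, e.g. grep or git
    | { level: 3; command: string; args: string[] }    // generic exec
    | { level: 4; script: string };                    // full shell

interface SpawnPlan {
    file: string;       // executable to start
    args: string[];     // argument vector, passed through without interpretation
    usesShell: boolean; // true only if a shell parses the input
}

function toSpawnPlan(request: ToolRequest): SpawnPlan {
    switch (request.level) {
        case 1:
        case 2:
            // The tool itself decides which executable it wraps.
            return { file: request.tool, args: request.args, usesShell: false };
        case 3:
            // Arbitrary executable, but still no shell: "&&" or "|" are plain arguments.
            return { file: request.command, args: request.args, usesShell: false };
        case 4:
            // The whole script is parsed by bash, so pipes and expansion are active.
            return { file: 'bash', args: ['-c', request.script], usesShell: true };
    }
}
```

In this picture, only Level 4 lets a shell interpret the input, which is why chaining and expansion attacks are specific to that level.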
Existing Theia Tools (Level 1 Equivalent)
Theia AI already provides several read-only tools implemented in @theia/ai-ide, for example: getFileContent, searchInWorkspace and findFilesByPattern.
These tools are implemented against Theia's services (FileService, SearchInWorkspaceService), not terminal commands. This makes them much more portable than terminal-based equivalents, as they work in browser-only mode, remote container contexts, etc. and additionally respect workspace boundaries.
Therefore, we do not need special terminal-based implementations for basic file reading and searching in the short-term.
However, terminal commands like grep and find can be used in a more flexible way (e.g., complex regex patterns, specific flags). These capabilities become available automatically when Level 4 (shell execution) is implemented, although specialized support might still be beneficial in the long term.
Security Concept
There are multiple layers of security that can be applied for terminal access, balancing between strictness and user convenience.
Permissions
The permission system controls which commands can be executed and under what conditions. Permissions can be configured at different granularities.
Disapprove by Default
By default, every terminal interaction must be approved by the user, showing the exact command and all arguments to be executed.
Conceptually, this will allow all malicious terminal interactions to be prevented, at the cost of convenience.
Level 1 Read-Only Tools
Level 1 tools wrap specific commands (like grep, find, cat) for which there is special support and knowledge within Theia AI. These are limited to read-only operations that do not modify local data or communicate with external services.
Level 1 "read-only" tools should be configurable to be always allowed, as this likely matches user intent, especially if access is limited to files in the current workspace.
Level 2 Write-Enabled Tools
Level 2 "write-enabled" tools, like git and npm, can:
- Query data (e.g., git log, npm list)
- Modify local data (e.g., git commit, npm install)
- Communicate with external services (e.g., git push, npm publish)
Such a tool could be split into separate "read-only" and "write-enabled" tools, or a single tool could work with separate whitelists for each kind of operation.
Tools like Claude Code allow defining patterns like git log *, allowing all git log invocations without allowing git push.
These patterns/whitelists could be restricted to Level 1 and Level 2 tools to avoid Level 4 bypasses, e.g., the pattern echo * would also allow echo 'test' && rm -rf /.
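To make the bypass concrete, here is a minimal sketch of a naive wildcard matcher applied to the full command line. matchesPattern is a hypothetical helper, and the wildcard semantics (a * matches any run of characters) are an assumption, not a description of how Claude Code or Theia actually match patterns.

```typescript
// Hypothetical matcher: "*" matches any run of characters, everything else is literal.
function matchesPattern(pattern: string, commandLine: string): boolean {
    const escaped = pattern
        .split('*')
        .map(part => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')) // escape regex metacharacters
        .join('.*');
    return new RegExp(`^${escaped}$`).test(commandLine);
}

matchesPattern('git log *', 'git log --oneline -5');   // true
matchesPattern('git log *', 'git push origin main');   // false
matchesPattern('echo *', "echo 'test' && rm -rf /");   // true: the chained rm slips through
```

The last call shows why such patterns are only safe when no shell interprets the command, or when the command line has been checked for chaining beforehand.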
Level 3 Tools
Level 3 tools execute arbitrary commands for which there is no special support in Theia AI, i.e., we don't know their behavior, whether they can modify local data, or communicate with external services.
These tools should either be handled in the same way as Level 4 tools, or users/adopters can configure "safe" usage of them, allowing these configured tools to be part of the "read-only" permission tier.
Level 4 Tools
Level 4 tools give full access to a shell, e.g., bash, which poses a significant security risk.
This risk can be partially mitigated but, besides disapproving by default, never fully avoided.
Mitigation techniques include:
- Sandboxing shell invocation
- Analysis of the command via patterns or tree-sitter to detect chaining issues
- Analysis of the command via LLMs
Workspace Access
It is good practice, even if Level 1 read-only tools are configured to be allowed, to distinguish between read access to the current workspace (expected by the user) and read access to the rest of the system.
Depending on the available security features, this can be securely achieved via sandboxing and containerization, or fall back to argument analysis.
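The argument-analysis fallback could look roughly like the sketch below. It assumes POSIX-style paths and purely lexical resolution; resolvePosix and isInsideWorkspace are hypothetical helpers. It does not follow symlinks or expand ~ and environment variables, which is one reason why sandboxing is the more secure option.

```typescript
// Resolve a POSIX-style path against a base directory, collapsing "." and "..".
function resolvePosix(base: string, target: string): string {
    const start = target.startsWith('/') ? [] : base.split('/');
    const segments: string[] = [];
    for (const part of [...start, ...target.split('/')]) {
        if (part === '' || part === '.') {
            continue;
        }
        if (part === '..') {
            segments.pop(); // popping at the root is a no-op, like "cd /.."
        } else {
            segments.push(part);
        }
    }
    return '/' + segments.join('/');
}

// True if pathArg, interpreted relative to cwd, stays within workspaceRoot.
function isInsideWorkspace(workspaceRoot: string, cwd: string, pathArg: string): boolean {
    const root = resolvePosix('/', workspaceRoot);
    const resolved = resolvePosix(cwd, pathArg);
    if (root === '/') {
        return true;
    }
    return resolved === root || resolved.startsWith(root + '/');
}

isInsideWorkspace('/home/u/ws', '/home/u/ws', 'src/index.ts');      // true
isInsideWorkspace('/home/u/ws', '/home/u/ws', '../../etc/passwd');  // false
```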
Isolation
Sandboxing
Linux and macOS systems come with convenient sandboxing tools that could be leveraged by Theia AI, e.g., Bubblewrap on Linux and Seatbelt on macOS.
These tools allow restricting filesystem and network access for commands of all levels, significantly reducing the attack surface of malicious commands.
Windows limitation: Native sandboxing tools are not available on Windows. On Windows, the implementation will fall back to the permission system only, meaning users must rely on approval of commands rather than sandboxed execution.
Containerization
By running the terminal execution (and possibly the overall Theia backend) in an isolated container, the attack surface on the host can be greatly reduced, and issues like deleting the whole filesystem can be avoided.
Technical Considerations
Terminal Infrastructure
It must be checked whether we want to reuse Theia's terminal infrastructure and/or implement our own integration with the Node backend.
Shell Support
The initial implementation should focus on bash as the primary shell for Level 4 execution due to its ubiquity and well-documented behavior.
Generalizing to multiple shells could be a follow-up enhancement.
Long-Running Commands
Commands can run for extended periods (e.g., e2e tests taking minutes) or indefinitely (e.g., watch commands).
Aspects to consider:
- Configurable per-command timeout with sensible defaults (e.g., 2 minutes for generic commands, less/more time for known short/long-running commands)
- Allow commands to run in background with output streaming
  - Needs some UI specialization
  - Needs a concept on how the LLM can check the output
- Provide mechanism for user or agent to cancel running commands (without canceling the chat request/session)
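The per-command timeout from the first bullet could be resolved along the following lines. This is a sketch under assumptions: TimeoutConfig and resolveTimeoutMs are hypothetical names, matching by command prefix is just one possible strategy, and the override values are purely illustrative (only the 2-minute default comes from the list above).

```typescript
interface TimeoutConfig {
    defaultMs: number;
    // Keys are command prefixes; the longest matching prefix wins.
    overridesMs: Record<string, number>;
}

const exampleConfig: TimeoutConfig = {
    defaultMs: 2 * 60 * 1000,                // 2 minutes for generic commands
    overridesMs: {
        'git status': 10 * 1000,             // illustrative: known short-running command
        'npm run test:e2e': 15 * 60 * 1000   // illustrative: known long-running command
    }
};

function resolveTimeoutMs(commandLine: string, config: TimeoutConfig): number {
    let bestPrefix = '';
    let timeoutMs = config.defaultMs;
    for (const prefix of Object.keys(config.overridesMs)) {
        // Match whole words only, so "git status" does not match "git statusx".
        const matches = commandLine === prefix || commandLine.startsWith(prefix + ' ');
        if (matches && prefix.length > bestPrefix.length) {
            bestPrefix = prefix;
            timeoutMs = config.overridesMs[prefix];
        }
    }
    return timeoutMs;
}

resolveTimeoutMs('npm run build', exampleConfig);        // 120000
resolveTimeoutMs('git status --short', exampleConfig);   // 10000
```

Indefinitely running commands such as watch tasks would not fit this model and would still need the background execution and cancellation mechanisms listed above.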
Output Handling
Commands can produce large amounts of output that would consume excessive context.
The LLM can be given options to limit the output size.
Mitigation strategies to consider:
- Give the LLM the option to configure the maximum output size, with reasonable default limits
- Similarly, we could show only the first N and last M lines of large outputs
- For Level 4, the agent can use shell features like | head -n 100 and grep to self-limit
- Optionally use another LLM to summarize large outputs
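The "first N and last M lines" strategy could be sketched as follows. truncateOutput and the wording of the omission marker are hypothetical choices for illustration.

```typescript
// Keep the first headLines and last tailLines lines and replace the middle with a marker.
function truncateOutput(output: string, headLines: number, tailLines: number): string {
    const lines = output.split('\n');
    if (lines.length <= headLines + tailLines) {
        return output; // small enough, return unchanged
    }
    const omitted = lines.length - headLines - tailLines;
    return [
        ...lines.slice(0, headLines),
        `[... ${omitted} lines omitted ...]`,
        ...lines.slice(lines.length - tailLines)
    ].join('\n');
}

const sample = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'].join('\n');
truncateOutput(sample, 2, 2);
// "1\n2\n[... 6 lines omitted ...]\n9\n10"
```

Telling the LLM how many lines were omitted lets it decide whether to re-run the command with a narrower filter.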
Sandboxing Integration
Sandboxing should be:
- Optional but recommended for all tool levels (Level 1-4)
- Configurable
- Gracefully degraded when sandboxing utilities are unavailable (fall back to permission-only security)
Other Considerations
- Persistence of command history
- Integration in remote development scenarios
Implementation Plan
MVP - Shell Command Execution Tool
Goal: Provide full shell access with essential security measures.
1.1 Core Infrastructure
- Define base interfaces for terminal tools (e.g. TerminalTool, TerminalToolResult)
- Evaluate Terminal reuse or implement shell execution service
- Add output capture and truncation handling (configurable max characters, head/tail options)
- Implement timeout handling with configurable defaults
- Handle "endless" commands
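As a starting point for discussion, the base interfaces might look like the sketch below. Only the names TerminalTool and TerminalToolResult come from the list above; TerminalToolRequest, DryRunTerminalTool and all field names are assumptions made for illustration.

```typescript
// Field names are assumptions; only the interface names TerminalTool and
// TerminalToolResult are taken from the plan.
interface TerminalToolResult {
    exitCode: number | undefined; // undefined if the command was killed or timed out
    output: string;               // captured, possibly truncated, output
    truncated: boolean;
    timedOut: boolean;
}

// Hypothetical request shape.
interface TerminalToolRequest {
    command: string;
    cwd?: string;
    timeoutMs?: number;
    maxOutputChars?: number;
}

interface TerminalTool {
    readonly id: string;
    execute(request: TerminalToolRequest): Promise<TerminalToolResult>;
}

// Stand-in implementation that runs nothing; it only echoes the command back,
// which can be handy for wiring up approval UI and rendering before real execution exists.
class DryRunTerminalTool implements TerminalTool {
    readonly id = 'shell-dry-run';

    async execute(request: TerminalToolRequest): Promise<TerminalToolResult> {
        const text = `[dry run] ${request.command}`;
        const limit = request.maxOutputChars ?? text.length;
        return {
            exitCode: 0,
            output: text.slice(0, limit),
            truncated: text.length > limit,
            timedOut: false
        };
    }
}
```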
1.2 Security Foundation
- Basic permission system: all bash commands require approval by default. Reuse or enhance current approval UI for tools
- Implement workspace path detection in commands. Warn users for access outside of workspace bounds.
- Add logging for all executed commands
1.3 Sandboxing Integration (Linux)
- Integrate with Bubblewrap for Linux
- Configure sandbox to restrict filesystem access to workspace by default
- Add option to enable/disable network access in sandbox
- Gracefully fall back to non-sandboxed execution when Bubblewrap is unavailable (with warning)
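A possible shape for the Bubblewrap invocation is sketched below as a pure function that only builds the argument list for bwrap. SandboxOptions and buildBwrapArgs are hypothetical names. Note the assumption that the whole filesystem is mounted read-only and only the workspace is writable: this restricts writes, not reads. Restricting read access to the workspace would require binding only the system directories a command needs, which differs between distributions.

```typescript
interface SandboxOptions {
    workspaceRoot: string;
    allowNetwork: boolean;
}

// Returns the arguments to pass to the bwrap executable for running a bash script.
function buildBwrapArgs(options: SandboxOptions, script: string): string[] {
    const args = [
        '--ro-bind', '/', '/',   // host filesystem visible but read-only
        '--dev', '/dev',         // minimal /dev
        '--proc', '/proc',       // fresh /proc
        '--tmpfs', '/tmp',       // private scratch space
        '--bind', options.workspaceRoot, options.workspaceRoot, // workspace stays writable
        '--chdir', options.workspaceRoot,
        '--die-with-parent'      // kill the sandboxed process if the backend exits
    ];
    if (!options.allowNetwork) {
        args.push('--unshare-net'); // new network namespace without external connectivity
    }
    return [...args, 'bash', '-c', script];
}

buildBwrapArgs({ workspaceRoot: '/home/user/project', allowNetwork: false }, 'npm test');
```

The workspace bind is placed after the tmpfs mount so that a workspace located under /tmp is not hidden by it.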
1.4 Agent Integration
- Register bash/shell tool
- Integrate with Coder Agent via new prompt variant (or create dedicated agent)
- Specialize tool results rendering in chat UI
Enhanced Security & Permissions
Goal: Add permission patterns and command analysis for safer autonomous operation.
This could be part of the MVP or delivered as a follow-up.
2.1 Permission Patterns
- Implement pattern-based whitelisting (e.g., git log *, npm test)
- Add per-command approval mechanism ("always allow make")
- Add session-scoped permissions
- Create permission management UI in settings
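The interplay of persistent ("always allow") and session-scoped permissions could be sketched as follows. PermissionStore, Decision and the simplified pattern syntax (a trailing " *" allows any arguments, anything else must match exactly) are assumptions for illustration only. Any such check would have to run after compound commands have been ruled out, as described in 2.2.

```typescript
type Decision = 'allow' | 'ask';

// Simplified pattern syntax: "make *" allows "make" with any arguments, "npm test" only matches exactly.
function matches(pattern: string, commandLine: string): boolean {
    if (pattern.endsWith(' *')) {
        const prefix = pattern.slice(0, -2);
        return commandLine === prefix || commandLine.startsWith(prefix + ' ');
    }
    return commandLine === pattern;
}

class PermissionStore {
    private readonly persistent: string[];
    private readonly bySession = new Map<string, string[]>();

    constructor(persistentPatterns: string[]) {
        this.persistent = [...persistentPatterns];
    }

    // Grant a pattern for one chat session only.
    allowForSession(sessionId: string, pattern: string): void {
        const patterns = this.bySession.get(sessionId) ?? [];
        patterns.push(pattern);
        this.bySession.set(sessionId, patterns);
    }

    // Session grants are dropped when the session ends.
    endSession(sessionId: string): void {
        this.bySession.delete(sessionId);
    }

    // Anything not explicitly allowed falls back to asking the user.
    decide(sessionId: string, commandLine: string): Decision {
        const candidates = [...this.persistent, ...(this.bySession.get(sessionId) ?? [])];
        return candidates.some(pattern => matches(pattern, commandLine)) ? 'allow' : 'ask';
    }
}

const store = new PermissionStore(['git log *', 'npm test']);
store.allowForSession('session-1', 'make *');
store.decide('session-1', 'make all');   // 'allow'
store.decide('session-2', 'make all');   // 'ask'
```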
2.2 Command Analysis
- Integrate tree-sitter-bash for command parsing
- Implement compound command detection (&&, ||, ;, pipes)
- Detect dangerous patterns (rm -rf, eval, command substitution)
- Warn user or prevent completely when pattern whitelist could be bypassed via shell features
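Until a real parser is integrated, compound command detection can be approximated by a small quote-aware scanner. The sketch below is a deliberately conservative stand-in and not a replacement for tree-sitter-bash: findShellConstructs and ShellFinding are hypothetical names, and constructs such as heredocs or $'...' quoting are not handled. It covers only chaining, redirection and command substitution, not the detection of dangerous commands like rm -rf or eval.

```typescript
interface ShellFinding {
    index: number;     // position in the command line
    construct: string; // e.g. "&&", "|", "$("
}

// Reports shell operators and command substitutions that appear outside of quotes.
function findShellConstructs(commandLine: string): ShellFinding[] {
    const findings: ShellFinding[] = [];
    let quote: string | undefined;
    for (let i = 0; i < commandLine.length; i++) {
        const ch = commandLine.charAt(i);
        const two = commandLine.slice(i, i + 2);
        if (quote === "'") {                    // single quotes: everything is literal
            if (ch === "'") { quote = undefined; }
            continue;
        }
        if (ch === '\\') { i++; continue; }     // skip the escaped character
        if (two === '$(' || ch === '`') {       // substitution is active even inside double quotes
            findings.push({ index: i, construct: two === '$(' ? two : ch });
            if (two === '$(') { i++; }
            continue;
        }
        if (quote === '"') {
            if (ch === '"') { quote = undefined; }
            continue;
        }
        if (ch === '"' || ch === "'") { quote = ch; continue; }
        if (two === '&&' || two === '||') {
            findings.push({ index: i, construct: two });
            i++;
            continue;
        }
        if (ch === ';' || ch === '|' || ch === '&' || ch === '>' || ch === '<' || ch === '\n') {
            findings.push({ index: i, construct: ch });
        }
    }
    return findings;
}

findShellConstructs("echo 'test' && rm -rf /");  // [{ index: 12, construct: '&&' }]
findShellConstructs("echo 'a && b'");            // []
```

A command with no findings could be matched against the pattern whitelist; any finding would trigger the warning or rejection described in the last bullet.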
2.3 Extended Sandboxing
- Implement Seatbelt integration for macOS
- Make the sandbox configuration adjustable by users
Potential followup: Level 3 Generic Exec
Goal: Allow execution of arbitrary commands without shell interpretation.
Because commands are spawned directly with an argument list, shell-based attacks such as command chaining are not possible for them. This reduces the risk of malicious interactions and lets users whitelist commands more safely.
- Reuse/Implement exec tool that spawns commands + arguments directly (no shell)
- Add command + arguments allow list configuration
This support can be reused to support Level 1 commands out of the box with a known whitelist of commands + argument combinations.
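A command + arguments allow list for the exec tool could be checked roughly as follows. AllowRule and isAllowed are hypothetical names, and prefix matching on the argument vector is just one conceivable scheme.

```typescript
interface AllowRule {
    command: string;
    // Each entry is an allowed argument prefix, e.g. ['log'] permits "git log <anything>".
    argPrefixes: string[][];
}

// True if some rule names the command and one of its prefixes matches the leading arguments.
function isAllowed(rules: AllowRule[], command: string, args: string[]): boolean {
    return rules.some(rule =>
        rule.command === command &&
        rule.argPrefixes.some(prefix =>
            prefix.length <= args.length &&
            prefix.every((value, i) => args[i] === value)
        )
    );
}

const rules: AllowRule[] = [
    { command: 'git', argPrefixes: [['log'], ['status']] }
];

isAllowed(rules, 'git', ['log', '--oneline']);               // true
isAllowed(rules, 'git', ['push']);                           // false
isAllowed(rules, 'git', ['log', '&&', 'rm', '-rf', '/']);    // true, but harmless without a shell
```

In the last call, && is handed to git as a literal argument rather than being interpreted, so nothing is chained. Prefix rules still need per-command knowledge, though: find, for example, accepts -delete and -exec, so allowing it with arbitrary arguments would not be read-only.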
Other potential followups
Enhanced UI
- Command history view
- Security status indicators (sandboxed, approved, etc.)
Advanced Security
- LLM-based command analysis
- Anomaly detection (unusual command patterns)
Remote/Container Integration
- Run commands in (dev-)container on demand or by default