Sandbox Environments for AI Agent Workflows

Security model: The sandbox (Docker/E2B) provides the security boundary. Inside the sandbox, Claude runs with full permissions because the container itself is isolated.

Security philosophy:

"It's not if it gets popped, it's when it gets popped. And what is the blast radius?"

Run on dedicated VMs or local Docker sandboxes. Restrict network connectivity, provide only necessary credentials, and ensure no access to private data beyond what the task requires.
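One way to act on "provide only necessary credentials" is to build the sandbox's environment from an explicit allowlist rather than inheriting the host's. The following is a minimal sketch; the function name `minimal_env` and the example variable names are hypothetical, not part of any sandbox SDK.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def minimal_env(allowed: Iterable[str],
                source: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return only the allowlisted variables from `source` (default: os.environ)."""
    env = os.environ if source is None else source
    allowed_names = set(allowed)
    return {name: value for name, value in env.items() if name in allowed_names}


# Hypothetical host environment, used only for illustration.
host_env = {
    "ANTHROPIC_API_KEY": "sk-example",
    "AWS_SECRET_ACCESS_KEY": "must-not-reach-the-sandbox",
    "PATH": "/usr/bin",
}
sandbox_env = minimal_env(["ANTHROPIC_API_KEY", "PATH"], host_env)
print(sorted(sandbox_env))  # the AWS secret is not forwarded
```

The resulting dictionary can be handed to whatever launches the sandboxed process, so anything not named in the allowlist never enters the blast radius.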


Options

Sprites (Fly.io)

  • Persistent Linux environments that survive between executions indefinitely
  • Firecracker VM isolation with up to 8 vCPUs and 8GB RAM
  • Fast checkpoint/restore (~300ms create, <1s restore)
  • Auto-sleep after 30 seconds of inactivity
  • Unique HTTPS URL per Sprite for webhooks, APIs, public access
  • Layer 3 network policies for egress control (whitelist domains or use default LLM-friendly list)
  • CLI, REST API, JavaScript SDK, Go SDK (Python and Elixir coming soon)
  • Pre-installed tools: Claude Code, Codex CLI, Gemini CLI, Python 3.13, Node.js 22.20
  • $30 free credits to start (~500 Sprites worth)

Philosophy: Fly.io argues that "ephemeral sandboxes are obsolete" and that AI agents need persistent computers, not disposable containers. Sprites treat sandboxes as "actual computers" where data, packages, and services persist across executions on ext4 NVMe storage—no need to rebuild environments repeatedly. As they put it: "Claude doesn't want a stateless container."

Unique Features:

  • Stateful persistence: Files, packages, databases survive between runs indefinitely
  • Transactional snapshots: Copy-on-write checkpoints capture entire disk state; stores last 5 checkpoints
  • Idle cost optimization: Auto-sleep when inactive (30s timeout), resume on request (<1s wake)
  • Cold start: Creation in 1-2 seconds, restore under 1 second
  • Claude integration: Pre-installed skills teach Claude how to use Sprites (port forwarding, etc.)
  • Storage billing: Pay only for blocks written, not allocated space; TRIM-friendly
  • No time limits: Unlike ephemeral sandboxes (typically 15-minute limits), Sprites support long-running workloads
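The "stores last 5 checkpoints" behavior can be pictured as a bounded history. The sketch below is a generic illustration, assuming the oldest checkpoint is the one dropped when a sixth is taken (the source does not state the eviction order); `CheckpointHistory` is a hypothetical name, not a Sprites API.

```python
from collections import deque
from typing import Deque, List


class CheckpointHistory:
    """Keeps only the most recent `keep` checkpoint IDs."""

    def __init__(self, keep: int = 5) -> None:
        self._ids: Deque[str] = deque(maxlen=keep)

    def add(self, checkpoint_id: str) -> None:
        # A deque with maxlen drops the oldest entry once it is full.
        self._ids.append(checkpoint_id)

    def available(self) -> List[str]:
        return list(self._ids)


history = CheckpointHistory()
for n in range(1, 8):
    history.add(f"cp-{n}")
print(history.available())
```

Under this assumption, an agent that checkpoints often should treat anything older than the last five restore points as gone.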

Pricing:

| Resource | Cost | Minimum |
|---|---|---|
| CPU | $0.07/CPU-hour | 6.25% utilization/s |
| Memory | $0.04375/GB-hour | 250MB per second |
| Storage | $0.00068/GB-hour | Actual blocks only |
  • Free trial: $30 in credits (~500 Sprites)
  • Plan: $20/month includes monthly credits; overages at published rates
  • Example costs: 4-hour coding session ~$0.46, web app with 30 active hours ~$4/month

Specs:

| Spec | Value |
|---|---|
| Isolation | Firecracker microVM (hardware-isolated) |
| Resources | Up to 8 vCPUs, 8GB RAM per execution (fixed, not configurable) |
| Storage | 100GB initial ext4 partition on NVMe, auto-scaling capacity |
| Cold Start | <1 second restore, 1-2 seconds creation |
| Timeout | None (persistent); auto-sleeps after 30 seconds inactivity |
| Active Limit | 10 simultaneous active Sprites on base plan; unlimited cold |
| Network | Port 8080 proxied for HTTP services; isolated networks |

Limitations:

  • Resource caps (8 vCPU, 8GB RAM, 100GB storage) not configurable yet
  • 30-second idle timeout not configurable
  • Region selection not available (auto-assigned based on geographic location)
  • Maximum 10 active sprites on base plan (unlimited cold/inactive sprites allowed)
  • Best for personal/organizational tools; not designed for million-user scale apps



E2B

  • Purpose-built for AI agents and LLM workflows
  • Pre-built template anthropic-claude-code ships with Claude Code CLI ready
  • Single-line SDK calls in Python or JavaScript (v1.5.1+)
  • Full filesystem + git for progress.txt, prd.json, and repo operations
  • 24-hour session limits on Pro plan (1 hour on Hobby)
  • Native access to 200+ MCP tools via Docker partnership (GitHub, Notion, Stripe, etc.)
  • Configurable compute: 1-8 vCPU, 512MB-8GB RAM

Philosophy: E2B believes AI agents need transient, immutable workloads with hardware-level kernel isolation. Each sandbox runs in its own Firecracker microVM, providing the same isolation as AWS Lambda. The focus is on developer experience—one SDK call to create a sandbox.

Unique Features:

  • Fast cold start: ~150-200ms via Firecracker microVMs
  • Pre-built Claude template: Zero-setup Claude Code integration
  • Docker MCP Partnership: Native access to 200+ MCP tools from Docker's catalog
  • Pause/Resume (Beta): Save full VM state including memory (~4s per 1GB to pause, ~1s to resume, state persists up to 30 days)
  • Network controls: allowInternetAccess toggle, network.allowOut/network.denyOut for granular CIDR/domain filtering
  • Domain filtering: Works for HTTP (port 80) and TLS (port 443) via SNI inspection
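To make the allow/deny idea concrete, here is a generic sketch of CIDR-based egress evaluation using Python's standard `ipaddress` module. It is not E2B's implementation: the function name `egress_allowed` is hypothetical, and the precedence used here (an allow match wins over a deny match, and unmatched traffic is permitted) is an assumption that should be checked against the provider's documentation. Domain filtering via SNI is not modeled.

```python
import ipaddress
from typing import Sequence


def egress_allowed(dest_ip: str,
                   allow_out: Sequence[str],
                   deny_out: Sequence[str]) -> bool:
    """Decide whether traffic to dest_ip may leave the sandbox."""
    ip = ipaddress.ip_address(dest_ip)

    def matches(cidrs: Sequence[str]) -> bool:
        return any(ip in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)

    if matches(allow_out):   # assumption: explicit allow takes precedence
        return True
    if matches(deny_out):
        return False
    return True              # assumption: default is open unless denied


# Deny everything except one resolver address.
print(egress_allowed("1.1.1.1", ["1.1.1.1/32"], ["0.0.0.0/0"]))
print(egress_allowed("8.8.8.8", ["1.1.1.1/32"], ["0.0.0.0/0"]))
```

Pairing a deny-all rule with a short allowlist gives default-deny egress, which matches the "restrict network connectivity" guidance at the top of this document.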

Pricing:

| Plan | Monthly Fee | Session Limit | Notes |
|---|---|---|---|
| Hobby | $0 | 1 hour | + $100 one-time credit |
| Pro | $150 | 24 hours | + usage costs |
| Enterprise | Custom | Custom | SSO, SLA, dedicated support |

Usage Rates (per second):

| Resource | Rate (per second) |
|---|---|
| 2 vCPU | $0.000028/s |
| Memory | $0.0000045/GiB |
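The per-second rates are easier to compare once converted to a session cost. This sketch assumes the CPU rate scales linearly with vCPU count (the table only lists the 2 vCPU rate) and ignores plan fees and credits; `e2b_usage_cost` is a hypothetical helper name.

```python
CPU_RATE_2VCPU_PER_S = 0.000028    # from the table: 2 vCPU, per second
MEM_RATE_PER_GIB_S = 0.0000045     # from the table: per GiB, per second


def e2b_usage_cost(seconds: float, vcpus: float = 2, mem_gib: float = 1.0) -> float:
    """Estimated usage cost in USD, assuming CPU price is linear in vCPUs."""
    cpu = CPU_RATE_2VCPU_PER_S * (vcpus / 2) * seconds
    mem = MEM_RATE_PER_GIB_S * mem_gib * seconds
    return cpu + mem


one_hour = 3600
print(f"2 vCPU, 1 GiB, 1 hour: ${e2b_usage_cost(one_hour):.4f}")
print(f"1 vCPU, 0.5 GiB, 1 hour: ${e2b_usage_cost(one_hour, vcpus=1, mem_gib=0.5):.4f}")
```

Under the linearity assumption, the CPU portion for one vCPU comes to roughly $0.05 per hour before memory is added.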

Specs:

| Spec | Value |
|---|---|
| Isolation | Firecracker microVM |
| Cold Start | ~150-200ms |
| Timeout | 1 hour (Hobby), 24 hours (Pro) |
| Compute | 1-8 vCPU, 512MB-8GB RAM (configurable) |
| Filesystem | Full Linux with git support |
| Pre-installed | Node.js, curl, ripgrep, Claude Code |

Limitations:

  • No native sandbox clone/fork functionality
  • No bulk file reading API
  • Domain filtering limited to HTTP/HTTPS ports (UDP/QUIC not supported)
  • Self-hosted version lacks built-in network policies
  • Occasional 502 timeout errors on long operations
  • Sandbox "not found" errors near timeout boundaries



Modal

Modal Sandboxes are the Modal primitive for safely running untrusted code from LLMs, users, or third-party sources. Built on Modal's serverless container fabric with gVisor isolation.

Key Features:

  • Pure Python SDK for defining sandboxes with one line of code (also JS/Go SDKs)
  • Execute arbitrary commands with sandbox.exec() and stream output
  • Autoscale from zero to 10,000+ concurrent sandboxes
  • Dynamic image definition at runtime from model output
  • Built-in tunneling for HTTP/WebSocket connections to sandbox servers
  • Granular egress policies via CIDR allowlists
  • Named sandboxes for persistent reference and pooling
  • Production-proven: Lovable and Quora run millions of code executions daily

Philosophy: Modal treats sandboxes as secure, ephemeral compute units that inherit its serverless fabric. The focus is on Python-first AI/ML workloads with aggressive cost optimization through scale-to-zero, trading cold start latency for resource efficiency.

Unique Features:

  • Sandbox Connect Tokens: Authenticated HTTP/WebSocket access with unspoofable X-Verified-User-Data headers for access control
  • Memory Snapshots: Capture container memory state to reduce cold starts to <3s even with large dependencies like PyTorch
  • Idle Timeout: Auto-terminate sandboxes after configurable inactivity period
  • Filesystem Snapshots: Preserve state across sandbox instances for 24+ hour workflows
  • No pre-provisioning: Sandboxes created on-demand without capacity planning

Pricing (as of late 2025, after 65% price reduction):

| Plan | Monthly Fee | Credits Included | Seats | Container Limits |
|---|---|---|---|---|
| Starter | $0 | $30/month | 3 | 100 containers, 10 GPU |
| Team | $250 | $100/month | | 1,000 containers, 50 GPU |
| Enterprise | Custom | Volume discounts | | Custom limits, HIPAA, SSO, etc. |

Compute Rates (per second):

| Resource | Rate | Notes |
|---|---|---|
| Sandbox/Notebook CPU | $0.00003942/core | Per physical core (= 2 vCPU) |
| Standard Function CPU | $0.0000131/core | Per physical core |
| Memory | $0.00000222/GiB | Pay for actual usage |
| GPU (A10G) | $0.000306/s | ~$1.10/hr |
| GPU (A100 40GB) | $0.000583/s | ~$2.10/hr |
| GPU (H100) | $0.001097/s | ~$3.95/hr |

Special Credits: Startups up to $25k, Academics up to $10k free compute

Specs:

| Spec | Value |
|---|---|
| Isolation | gVisor (Google's container runtime) |
| Cold Start | ~1s container boot, 2-5s typical with imports |
| With Snapshots | <3s even with large dependencies |
| Default Timeout | 5 minutes |
| Max Timeout | 24 hours (use snapshots for longer) |
| Idle Timeout | Configurable auto-termination |
| Filesystem | Ephemeral (use Volumes for persistence) |
| Network Default | Secure-by-default, no incoming connections |
| Egress Control | block_network=True or cidr_allowlist |
| Concurrent Scaling | 10,000+ sandboxes |

Volumes (Persistent Storage):

  • High-performance distributed filesystem (up to 2.5 GB/s bandwidth)
  • Volumes v2 (beta): No file count limit, 1 TiB max file size, HIPAA-compliant deletion
  • Explicit commit() required to persist changes
  • Last-write-wins for concurrent modifications to same file
  • Best for model weights, checkpoints, and datasets
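The commit and last-write-wins semantics above are easy to misread, so here is a toy in-memory model of them. It is an illustration only, not Modal's implementation or API: `ToyVolume` and `ToyMount` are hypothetical names, and the model assumes a mount sees other containers' commits only after an explicit reload.

```python
class ToyVolume:
    """Shared committed state, standing in for the remote volume."""

    def __init__(self):
        self.committed = {}


class ToyMount:
    """One container's view of the volume."""

    def __init__(self, volume):
        self.volume = volume
        self.view = dict(volume.committed)  # snapshot taken at mount time
        self.pending = {}                   # local writes, not yet persisted

    def write(self, path, data):
        self.pending[path] = data

    def read(self, path):
        return self.pending.get(path, self.view.get(path))

    def commit(self):
        # Persist local writes; a later commit to the same path overwrites.
        self.volume.committed.update(self.pending)
        self.view.update(self.pending)
        self.pending.clear()

    def reload(self):
        # Pick up whatever other mounts have committed since.
        self.view = dict(self.volume.committed)


vol = ToyVolume()
a, b = ToyMount(vol), ToyMount(vol)
a.write("ckpt.bin", "from-a")
a.commit()
print(b.read("ckpt.bin"))  # b has not reloaded yet
b.reload()
print(b.read("ckpt.bin"))
```

In this model, two agents writing the same checkpoint path will silently overwrite each other, so per-agent paths are the safer layout.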

Limitations:

  • Cold start penalties when containers spin down (2-5s typical)
  • No on-premises deployment option
  • Sandboxes cannot access other Modal workspace resources by default
  • Single-language focus (Python-optimized, less suited for multi-language untrusted code)
  • Volumes require explicit reload to see changes from other containers
  • Less suited for persistent, long-lived environments vs microVM solutions

Modal vs E2B for AI Agents:

| Aspect | Modal | E2B |
|---|---|---|
| Isolation | gVisor containers | Firecracker microVMs |
| Cold Start | 2-5s typical, <3s with snapshot | ~150ms |
| Session Duration | Up to 24h (stateless) | Up to 24h (Pro), persistent |
| Self-Hosting | No (managed only) | Experimental |
| Multi-Language | Python-focused | Python, JS, Ruby, C++ |
| Network Control | Granular egress policies | Allow/deny lists |
| Best For | Python ML/AI, batch workloads | Multi-language agent sandboxes |



Cloudflare Sandboxes

  • Open Beta (announced June 2025), still experimental
  • Edge-native (330+ global locations)
  • Pay for active CPU only (not provisioned resources)
  • Best if already in Cloudflare ecosystem
  • R2 bucket mounting via FUSE enables data persistence (added November 2025)
  • Git operations support (added August 2025)
  • Rich output: charts, tables, HTML, JSON, images

Philosophy: Cloudflare takes a security-first approach using a "bindings" model where code has zero network access by default and can only access external APIs through explicitly defined bindings. This eliminates entire classes of security vulnerabilities by making capabilities explicitly opt-in.
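The bindings idea can be illustrated independently of Cloudflare's actual (JavaScript/TypeScript) API. In the conceptual sketch below, sandboxed code receives a `Bindings` object and can reach only the capabilities explicitly placed in it; every name here (`Bindings`, `run_task`, `weather_api`) is hypothetical.

```python
from typing import Callable


class Bindings:
    """Holds the capabilities a task was explicitly granted."""

    def __init__(self, **capabilities: Callable[..., object]) -> None:
        self._capabilities = dict(capabilities)

    def get(self, name: str) -> Callable[..., object]:
        try:
            return self._capabilities[name]
        except KeyError:
            # Anything not bound is unreachable: access is opt-in.
            raise PermissionError(f"no binding named {name!r}") from None


def run_task(bindings: Bindings) -> object:
    """Stand-in for sandboxed code: it can only call what it was given."""
    fetch_weather = bindings.get("weather_api")
    return fetch_weather("Lisbon")


granted = Bindings(weather_api=lambda city: f"sunny in {city}")
print(run_task(granted))
```

Calling `run_task(Bindings())` raises `PermissionError`, which is the point: the default is no access, and each capability has to be granted by name.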

Unique Features:

  • Edge-native execution: Run sandboxes in 330+ global locations
  • Bindings model: Zero network access by default; explicit opt-in for external APIs
  • R2 FUSE mounting: S3-compatible storage mounting for persistence (R2, S3, GCS, Backblaze B2, MinIO)
  • Preview URLs: Public URLs for exposing services from sandboxes
  • keepAlive: true: Option for indefinite runtime

Pricing (as of November 2025):

| Resource | Cost | Included (Workers Paid) |
|---|---|---|
| Base Plan | $5/month | - |
| CPU | $0.000020/vCPU-s | 375 vCPU-minutes |
| Memory | $0.0000025/GiB-s | 25 GiB-hours |
| Disk | $0.00000007/GB-s | 200 GB-hours |
| Network Egress | $0.025-$0.05/GB | Varies by region |

Instance Types (added October 2025):

| Type | vCPU | Memory | Disk |
|---|---|---|---|
| lite | 1/16 | 256 MiB | 2 GB |
| basic | 1/4 | 1 GiB | 4 GB |
| standard-1 | 1 | 3 GiB | 5 GB |
| standard-2 | 2 | 6 GiB | 10 GB |
| standard-4 | 4 | 12 GiB | 20 GB |
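Choosing an instance type is a matter of finding the smallest row that covers the workload. The sketch below hard-codes the values from the table above; the helper name `smallest_fit` is hypothetical, and the figures should be rechecked against current Cloudflare documentation before relying on them.

```python
from typing import List, NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    vcpu: float
    memory_gib: float
    disk_gb: float


# Ordered smallest to largest, values copied from the table above.
INSTANCE_TYPES: List[InstanceType] = [
    InstanceType("lite", 1 / 16, 0.25, 2),
    InstanceType("basic", 1 / 4, 1, 4),
    InstanceType("standard-1", 1, 3, 5),
    InstanceType("standard-2", 2, 6, 10),
    InstanceType("standard-4", 4, 12, 20),
]


def smallest_fit(vcpu: float, memory_gib: float, disk_gb: float) -> Optional[str]:
    """Return the first (smallest) type meeting all three requirements."""
    for t in INSTANCE_TYPES:
        if t.vcpu >= vcpu and t.memory_gib >= memory_gib and t.disk_gb >= disk_gb:
            return t.name
    return None  # nothing in the table is large enough


print(smallest_fit(vcpu=0.5, memory_gib=2, disk_gb=4))
```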

Specs:

| Spec | Value |
|---|---|
| Isolation | Container |
| Cold Start | 1-5 seconds |
| Edge Locations | 330+ global |
| Storage | Ephemeral; persistent via R2 FUSE mount |
| Network | Bindings model (zero access by default) |
| Max Memory | 400 GiB concurrent (account limit) |
| Max CPU | 100 vCPU concurrent (account limit) |
| Max Disk | 2 TB concurrent |
| Image Storage | 50 GB per account |

Limitations:

  • Cold starts 1-5 seconds (slower than Workers' milliseconds)
  • Network control is binary (all or nothing) unless bindings are used
  • Bucket mounting only works with wrangler deploy, not wrangler dev
  • SDK/container version must match
  • Sandbox ID case sensitivity issues with preview URLs
  • Still in open beta; ecosystem maturing



Comparison Table

| Feature | Sprites | E2B | Modal | Cloudflare |
|---|---|---|---|---|
| Setup | Easy | Very Easy | Easy | Easy |
| Free Tier | $30 credit | $100 credit | $30/month | $5/mo Workers Paid |
| Isolation | Firecracker microVM | Firecracker microVM | gVisor container | Container |
| Cold Start | <1 second | ~150ms | 2-5s (or <3s w/snap) | 1-5 seconds |
| Max Timeout | None (persistent) | 24 hours (Pro) | 24 hours | Configurable |
| Claude CLI | Pre-installed | Prebuilt template | Manual | Manual |
| Git Support | Yes | Yes | Yes | Yes |
| Persistent Files | Yes (permanent) | 24 hours | Via Volumes | Via R2 FUSE mount |
| Checkpoints | Yes (~300ms) | Pause/Resume (Beta) | Memory Snapshots | No |
| Network Controls | Layer 3 policies | Allow/deny lists | CIDR allowlists | Bindings model |
| Edge Locations | Fly.io regions | - | - | 330+ global |
| Max Concurrent | 10 active (base) | Plan-based | 10,000+ | Plan-based |
| Self-Hosting | Fly.io only | Experimental | No | No |
| MCP Tools | - | 200+ (Docker) | - | - |
| Best For | Long-running agents | AI agent loops | Python ML workloads | Edge apps |
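A comparison like this can also be kept as data and filtered by requirement. The sketch below transcribes two rows of the table (Isolation and Claude CLI); the `shortlist` helper and its parameters are hypothetical and only show the idea.

```python
from typing import Dict, List

# Two rows transcribed from the comparison table above.
PROVIDERS: Dict[str, Dict[str, str]] = {
    "Sprites": {"isolation": "Firecracker microVM", "claude_cli": "Pre-installed"},
    "E2B": {"isolation": "Firecracker microVM", "claude_cli": "Prebuilt template"},
    "Modal": {"isolation": "gVisor container", "claude_cli": "Manual"},
    "Cloudflare": {"isolation": "Container", "claude_cli": "Manual"},
}


def shortlist(require_microvm: bool = False,
              require_claude_ready: bool = False) -> List[str]:
    """Return providers that satisfy the requested constraints."""
    names = []
    for name, row in PROVIDERS.items():
        if require_microvm and "microVM" not in row["isolation"]:
            continue
        if require_claude_ready and row["claude_cli"] == "Manual":
            continue
        names.append(name)
    return names


print(shortlist(require_microvm=True))
```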

Other Options

Daytona

Daytona was founded by the creators of Codeanywhere (2009) and pivoted in February 2025 from development environments to AI code execution infrastructure. The project has 35,000+ GitHub stars and is licensed under AGPL-3.0.

Key Features:

  • Sub-90ms sandbox creation (container-based, faster than E2B's ~150ms microVM)
  • Python SDK (daytona_sdk on PyPI) and TypeScript SDK
  • Official LangChain integration (langchain-daytona-data-analysis)
  • MCP Server support for Claude/Anthropic integrations
  • OCI/Docker image compatibility
  • Built-in Git and LSP support
  • GPU support for ML workloads (enterprise tier)
  • Unlimited persistence (sandboxes can live forever via object storage archiving)
  • Virtual desktops (Linux, Windows, macOS with programmatic control)

Philosophy: Daytona believes AI will automate the majority of programming tasks. Their agent-agnostic architecture enables parallel sandboxed environments for testing solutions simultaneously without affecting the developer's primary workspace.

Unique Features:

  • Fastest cold start: ~90ms (container-based, faster than E2B's microVM)
  • LangChain integration: Official langchain-daytona-data-analysis package
  • MCP Server: Native Claude/Anthropic integration support
  • Virtual desktops: Linux, Windows, macOS with programmatic control
  • Unlimited persistence: Sandboxes can live forever via object storage archiving

Pricing:

| Item | Cost |
|---|---|
| Free Credits | $200 (requires credit card) |
| Startup Program | Up to $50k in credits |
| Small Sandbox | ~$0.067/hour (1 vCPU, 1 GiB RAM) |
| Billing | Pay-per-second; stopped/archived minimal cost |

Specs:

| Spec | Default | Maximum |
|---|---|---|
| vCPU | 1 | 4 (contact for more) |
| RAM | 1 GB | 8 GB |
| Disk | 3 GiB | 10 GB |
| Auto-stop | 15 min idle | Disabled |
| Cold start | ~90ms | - |
| Isolation | Docker/OCI | Kata/Sysbox optional |

Network Egress Tiers:

  • Tier 1 & 2: Restricted network access
  • Tier 3 & 4: Full internet with custom CIDR rules
  • All tiers whitelist essential services (NPM, PyPI, GitHub, Anthropic/OpenAI APIs, etc.)

Limitations:

  • Container isolation by default (not microVM like E2B)
  • Cannot snapshot running sandboxes
  • Long-session stability still maturing
  • Young ecosystem compared to E2B
  • Requires credit card for free credits

Daytona vs E2B:

| Aspect | Daytona | E2B |
|---|---|---|
| Isolation | Docker containers | Firecracker microVM |
| Cold start | ~90ms | ~150ms |
| Free credits | $200 (CC required) | $100 (no CC) |
| Max session | Unlimited | 24 hours (Pro) |
| GitHub stars | 35k+ | 10k+ |
| Network controls | Tier-based | Allow/deny lists |



Google Cloud Run

Google Cloud's serverless container platform with strong security isolation, designed for production workloads at scale.

Key Features:

  • Two-layer sandbox isolation (hardware + kernel)
  • Automatic scaling (including scale-to-zero)
  • Usage-based billing, rounded up to the nearest 100 ms
  • NVIDIA L4 GPU support for AI inference (24 GB VRAM)
  • Direct VPC egress with firewall controls
  • Cloud Storage and NFS volume mounts for persistence
  • Request timeout up to 60 minutes (services), 7 days (jobs)
  • Up to 1000 concurrent requests per instance
  • Built-in HTTPS, IAM, Secret Manager integration
  • Source-based deployment (no Dockerfile required)

Philosophy: Google's approach treats Cloud Run as the "easy button" for serverless containers. Unlike dedicated AI sandbox providers, Cloud Run is a general-purpose platform that happens to work well for AI agents. The security model provides defense-in-depth through gVisor (1st gen) or Linux microVMs (2nd gen), with seccomp filtering in both. For AI-specific workloads, Google offers Agent Engine (fully managed) and GKE Agent Sandbox (Kubernetes-native) as alternatives.

Unique Features:

  • Dual execution environments: 1st gen (gVisor-based, smaller attack surface) or 2nd gen (Linux microVM, more compatibility)
  • GPU scale-to-zero: L4 GPUs spin down when idle, eliminating GPU idle costs
  • Startup CPU Boost: Temporarily increases CPU during cold start (up to 50% faster startups for Java)
  • VPC Flow Logs: Full visibility into network traffic for compliance
  • Network tags: Granular firewall rules via VPC network tags on revisions
  • Volume mounts: Cloud Storage FUSE or NFS (Cloud Filestore) for persistent data

Pricing (Tier 1 regions, e.g., us-central1):

| Resource | On-Demand | Free Tier (Monthly) |
|---|---|---|
| CPU | $0.000024/vCPU-second | 180,000 vCPU-seconds (~50 hrs) |
| Memory | $0.0000025/GiB-second | 360,000 GiB-seconds (~100 hrs @ 1 GiB) |
| Requests | $0.40/million | 2 million |
| GPU (L4) | ~$0.67/hour | None |
  • Always-on billing is ~30% cheaper than on-demand
  • Tier 2 regions (Asia, South America) are ~40% more expensive
  • GPU requires minimum 4 vCPU + 16 GiB memory
  • New customers: $300 free credits for 90 days
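To see how the free tier interacts with the on-demand rates, the following sketch estimates a monthly bill. It assumes Tier 1 on-demand pricing from the table above and that the free allowances are simply subtracted before billing; GPU, egress, and the always-on discount are ignored. The function name is hypothetical.

```python
def cloud_run_monthly_cost(vcpu_seconds: float,
                           gib_seconds: float,
                           requests: int) -> float:
    """Estimated monthly USD cost after subtracting the free tier."""
    cpu = max(0, vcpu_seconds - 180_000) * 0.000024
    mem = max(0, gib_seconds - 360_000) * 0.0000025
    req = max(0, requests - 2_000_000) / 1_000_000 * 0.40
    return cpu + mem + req


# 100 hours of one instance with 1 vCPU and 1 GiB, serving 100k requests.
seconds = 100 * 3600
estimate = cloud_run_monthly_cost(seconds, seconds, 100_000)
print(f"${estimate:.2f}")
```

In this example the memory and request usage fall within the free tier, so only the 50 CPU-hours beyond the free allowance are billed.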

Specs:

| Spec | Value |
|---|---|
| Isolation | Two-layer: hardware (x86 virtualization) + kernel (gVisor/microVM) |
| Cold Start | 2-5 seconds typical; sub-second with Startup CPU Boost + min instances |
| Max Timeout | 60 minutes (services), 168 hours/7 days (jobs) |
| Max Memory | 32 GiB |
| Max CPU | 8 vCPU (or 4 vCPU with GPU) |
| Max Concurrency | 1000 requests/instance (default 80) |
| Max Instances | 1000 (configurable) |
| GPU | NVIDIA L4 (24 GB VRAM), 1 per instance, <5s startup |
| Storage | Ephemeral; use Cloud Storage or NFS mounts for persistence |

Network/Egress Controls:

  • Direct VPC egress without Serverless VPC Access connector
  • Network tags on service revisions for firewall rules
  • VPC Service Controls for data exfiltration prevention
  • Organization policies to enforce VPC-only egress
  • Cloud NAT supported for outbound IP control
  • VPC Flow Logs for traffic visibility

Limitations:

  • No persistent local disk (must use Cloud Storage or NFS volume mounts)
  • Cold start latency higher than E2B/Sprites (2-5s vs <1s) without pre-warming
  • Setup complexity: Requires GCP project, billing, IAM configuration
  • VPC complexity: Network egress controls require VPC setup
  • Job connection breaks: Jobs >1 hour may experience connection breaks during maintenance
  • GPU regions limited: L4 GPUs only available in select regions
  • No pre-built AI agent template (unlike E2B)
  • Memory/session management must be built manually

Google Cloud AI Agent Options Comparison:

| Criteria | Cloud Run | Agent Engine | GKE Agent Sandbox |
|---|---|---|---|
| Setup Complexity | Medium | Low | High |
| Infrastructure Control | Medium | Low | High |
| Memory/Session Mgmt | Manual | Built-in | Manual |
| Isolation | gVisor/microVM | Built-in sandbox | gVisor + Kata |
| Cold Start | 2-5s (sub-second w/pre-warm) | Sub-second | Sub-second (warm pools) |
| Best For | Flexible serverless | Fastest to prod | Enterprise scale |

SDK/API Options:

  • Python: pip install google-cloud-run
  • Node.js: npm install @google-cloud/run
  • Go, Java, .NET, Ruby, PHP, Rust client libraries available
  • REST API and gcloud CLI
  • Terraform provider for IaC



Replit

Full development environment with built-in LLM agent (Agent 3).

Key Features:

  • Agent 3 can run autonomously for up to 200 minutes without supervision
  • Self-testing loop: executes code, identifies errors, fixes, and reruns until tests pass
  • Proprietary testing system claimed to be 3x faster and 10x more cost-effective than Computer Use models
  • Can build other agents and automations from natural language descriptions
  • Built on 10+ years of infrastructure investment (custom file system, VM orchestration)
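The self-testing loop described above follows a common pattern: run the tests, feed the failure to a fixer, and repeat. The sketch below is a generic version with a hard attempt cap, not Replit's proprietary system; `run_tests` and `fix` are hypothetical callables the caller supplies.

```python
from typing import Callable, Optional, Tuple


def self_testing_loop(code: str,
                      run_tests: Callable[[str], Optional[str]],
                      fix: Callable[[str, str], str],
                      max_attempts: int = 5) -> Tuple[str, bool]:
    """Repeatedly test and fix `code`; returns (final_code, passed)."""
    for _ in range(max_attempts):
        error = run_tests(code)       # None means the tests passed
        if error is None:
            return code, True
        code = fix(code, error)       # ask the fixer for a new version
    # Check the result of the last fix before giving up.
    return code, run_tests(code) is None


# Toy stand-ins: the "test" fails on division by zero, the "fix" removes it.
def toy_tests(code: str) -> Optional[str]:
    return "ZeroDivisionError" if "/ 0" in code else None


def toy_fix(code: str, error: str) -> str:
    return code.replace("/ 0", "/ 1")


print(self_testing_loop("x = 1 / 0", toy_tests, toy_fix))
```

The `max_attempts` cap is a deliberate design choice: without it, a fixer that makes no progress would loop indefinitely.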

Philosophy: Replit positions itself as an "agent-first" platform focused on eliminating "accidental complexity" (per CEO Amjad Masad). Its target audience is anyone who wants to build software, not just engineers. The goal is to make Agent the primary interface for software creation.

Unique Features:

  • Agent 3 autonomy: Up to 200 minutes of autonomous execution
  • Self-testing loop: Automatic error detection and fixing
  • Agent building: Can create other agents from natural language
  • Full IDE integration: Complete development environment, not just sandbox
  • MCP support: Integration guide available

Pricing:

| Plan | Monthly | Agent Access | Compute | Storage |
|---|---|---|---|---|
| Starter | Free | Limited (daily cap) | 1 vCPU, 2 GiB RAM | 2 GiB |
| Core | $25 | Full Agent 3 | 4 vCPU, 8 GiB RAM | 50 GiB |
| Teams | $40/user | Full Agent 3 + RBAC | 4 vCPU, 8 GiB RAM | 50 GiB |
| Enterprise | Custom | Full + SSO/SAML | Custom | Custom |
  • Core plan includes $25/month in AI credits
  • Teams plan includes $40/month in credits + 50 viewer seats
  • Annual billing: ~20% discount

Specs:

| Spec | Starter | Core/Teams |
|---|---|---|
| vCPU | 1 | 4 |
| RAM | 2 GiB | 8 GiB |
| Storage | 2 GiB | 50 GiB |
| Agent Time | Daily limits | Up to 200 min |

Limitations:

  • No public API for programmatic Agent access — designed exclusively for in-browser interactive use, not for CI/CD pipelines or external autonomous agent orchestration
  • Agent frequently gets stuck in loops on simple tasks
  • Over-autonomy risk (can override user intent)
  • External API authentication problems reported
  • Unpredictable credit consumption ($100-300/month reported overages)
  • Over 60% of developers report agent stalls/errors regularly (per surveys)
  • Notable July 2025 incident where Agent deleted a production database



Local Docker Options

Docker Official Sandboxes

Quick Start:

docker sandbox run claude                  # Basic
docker sandbox run -w ~/my-project claude  # Custom workspace
docker sandbox run claude "your task"      # With prompt
docker sandbox run claude -c               # Continue last session

Key Details:

  • Credentials stored in persistent volume docker-claude-sandbox-data
  • --dangerously-skip-permissions enabled by default
  • Base image includes: Node.js, Python 3, Go, Git, Docker CLI, GitHub CLI, ripgrep, jq
  • Container persists in background; re-running reuses same container
  • Non-root user with sudo access
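For scripted use, the Quick Start invocations can be assembled programmatically. This sketch builds the argument list using only the forms shown above (`-w`, a prompt, `-c`); combining them in a single call is an assumption, and `sandbox_command` is a hypothetical helper rather than part of Docker's tooling.

```python
import os
import shlex
from typing import List, Optional


def sandbox_command(workspace: Optional[str] = None,
                    prompt: Optional[str] = None,
                    continue_session: bool = False) -> List[str]:
    """Build an argv list for `docker sandbox run ... claude`."""
    cmd = ["docker", "sandbox", "run"]
    if workspace:
        # Expand "~" here because no shell will do it for an argv list.
        cmd += ["-w", os.path.expanduser(workspace)]
    cmd.append("claude")
    if continue_session:
        cmd.append("-c")
    if prompt:
        cmd.append(prompt)
    return cmd


argv = sandbox_command(workspace="/tmp/my-project", prompt="your task")
print(shlex.join(argv))
```

The list can be passed to `subprocess.run(argv)`; keeping it as a list rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.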

Links: https://docs.docker.com/ai/sandboxes/claude-code/


Comparison: E2B vs Docker Local

| Aspect | E2B (Cloud) | Docker Local |
|---|---|---|
| Setup | SDK call | docker sandbox run |
| Isolation | Firecracker microVM | Container |
| Cost | ~$0.05/hr | Free (your hardware) |
| Max Duration | 24 hours | Unlimited |
| Network | Full internet | Full internet |
| State Persistence | Session-based | Volume-based |
| Multi-tenant Safe | Yes | No (local only) |
| Best For | Production, CI/CD | Local dev, prototyping |

Recommendation for This Project

For Production/Multi-tenant: Use E2B

  1. Pre-built Claude Code template = zero setup friction
  2. 24-hour sessions handle long-running autonomous agents
  3. Full filesystem for progress.txt, prd.json, git repos
  4. Purpose-built for AI agents and LLM workflows
  5. True isolation (Firecracker microVM)
  6. 200+ MCP tools via Docker partnership

For Long-Running Persistent Agents: Use Sprites

  1. No session time limits (persistent environments)
  2. Transactional snapshots for version control of entire OS
  3. Auto-sleep when idle reduces costs
  4. Pre-installed Claude Code and AI CLI tools
  5. Best for agents that need to maintain state across days/weeks

For Local Development: Use Docker Sandboxes

  1. Quick prototyping: docker sandbox run claude
  2. With git automation: claude-sandbox (TextCortex)
  3. Minimal setup: Uses persistent credentials volume
  4. Free - runs on your own hardware
  5. Unlimited session duration