A secure code execution service and LLM evaluation harness that runs user-submitted or AI-generated code in isolated Docker containers. Perfect for online code editors, interview platforms, or educational tools.
- 🐳 Secure Execution: Code runs in isolated Docker containers with network disabled
- ⚡ Multi-Language Support: Python and Golang out of the box (easily extensible)
- 🤖 LLM Integration: Built-in Ollama client for local LLM inference
- 🛡️ Resource Limits: Configurable memory and CPU quotas prevent abuse
- ⏱️ Timeout Control: Enforce execution time limits per request
- ✅ Robust Validation: Input validation for language support and source code
- 📊 Structured Response: JSON responses with stdout, stderr, exit codes, and error details
- 📈 Metrics Tracking: Built-in request and error tracking for observability
- 🛑 Graceful Shutdown: Handles SIGINT/SIGTERM with proper cleanup of in-flight requests and containers
- 🚦 Rate Limiting: In-memory IP-based rate limiting (10 requests/minute by default)
- 🐋 Docker Compose Ready: Complete orchestration with Ollama service
- Go 1.24+ - Install Go
- Docker & Docker Compose - Install Docker
- Docker daemon must be running
Dependencies are automatically downloaded when you run go mod download or go build. Key dependencies include:
- github.com/docker/docker - Docker SDK for Go
- golang.org/x/time/rate - Rate limiting implementation
- github.com/ollama/ollama/api - Ollama API client for local LLM inference
This service uses Docker containers to isolate user code execution:
- Network Disabled: Containers run with network access disabled to prevent unauthorized network calls
- Resource Limits: Memory and CPU quotas restrict resource usage (default: 128MB, 50k CPU quota)
- Ephemeral Containers: Containers are automatically removed after execution
- Context Timeouts: Execution is enforced with context timeouts to prevent hanging processes
- Graceful Shutdown: Server catches SIGINT/SIGTERM signals and properly cleans up all active containers
- Rate Limiting: In-memory IP-based rate limiting prevents abuse (10 requests/minute with 2 burst)
⚠️ Important: While Docker provides strong isolation, this service should still be run behind additional security layers (authentication, firewall, etc.) in production environments.
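As a rough illustration of the settings above, creating a locked-down container with the Docker SDK for Go might look like the sketch below. The actual logic lives in internal/sandbox/docker.go and may differ in detail; exact SDK types also vary slightly between docker/docker versions.

```go
package sandbox

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// createSandbox is a hypothetical helper showing the isolation knobs used by
// this service: no network, capped memory/CPU, and automatic removal.
func createSandbox(ctx context.Context, cli *client.Client, image string, cmd []string) (string, error) {
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{
			Image:           image,
			Cmd:             cmd,
			NetworkDisabled: true, // no network access inside the container
		},
		&container.HostConfig{
			AutoRemove: true, // ephemeral: removed after execution
			Resources: container.Resources{
				Memory:   128 * 1024 * 1024, // 128MB memory limit
				CPUQuota: 50000,             // 50k CPU quota
			},
		},
		nil, nil, "")
	if err != nil {
		return "", err
	}
	return resp.ID, nil
}
```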
# Clone the repository
git clone <repository-url>
cd gexec-sandbox
# Install dependencies
go mod download
# Copy environment file and configure
cp .env.example .env
# Edit .env to set your preferred Ollama model

Create a .env file in the project root:
# Required: Ollama model to use (set in .env before running)
OLLAMA_MODEL=qwen3:4b
# Optional: Ollama host URL (default: http://localhost:11434)
OLLAMA_HOST=http://localhost:11434

# Start all services (Ollama + Evaluator)
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down

This will:
- Pull and start Ollama service with your configured model
- Build and start the evaluator service
- Connect the evaluator to Ollama automatically
# Start Ollama locally first (if not using Docker)
docker run -d -p 11434:11434 ollama/ollama
docker exec -it <container_id> ollama pull qwen3:4b
# Start the evaluator
go run ./cmd/evaluator
# Or build and run
go build -o evaluator ./cmd/evaluator
./evaluator

The evaluator will start on http://localhost:8080
# Build the Docker image
docker build -t gexec-sandbox .
# Run the container (requires mounting Docker socket)
docker run -p 8080:8080 \
-v /var/run/docker.sock:/var/run/docker.sock \
-e OLLAMA_HOST=http://host.docker.internal:11434 \
-e OLLAMA_MODEL=qwen3:4b \
gexec-sandbox

Endpoint: POST /execute
Request Body:
{
"language": "python",
"source_code": "print('Hello, World!')",
"timeout_ms": 5000
}

Response:
{
"stdout": "Hello, World!\n",
"stderr": "",
"exit_code": 0,
"error": ""
}
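For programmatic use, a minimal Go client for this endpoint might look like the sketch below. It is not part of the repository; the request and response shapes simply mirror the JSON shown above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ExecuteRequest and ExecuteResponse mirror the JSON bodies shown above.
type ExecuteRequest struct {
	Language   string `json:"language"`
	SourceCode string `json:"source_code"`
	TimeoutMS  int    `json:"timeout_ms,omitempty"`
}

type ExecuteResponse struct {
	Stdout   string `json:"stdout"`
	Stderr   string `json:"stderr"`
	ExitCode int    `json:"exit_code"`
	Error    string `json:"error"`
}

func main() {
	body, _ := json.Marshal(ExecuteRequest{
		Language:   "python",
		SourceCode: "print('Hello, World!')",
		TimeoutMS:  5000,
	})
	resp, err := http.Post("http://localhost:8080/execute", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out ExecuteResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("exit=%d stdout=%q stderr=%q\n", out.ExitCode, out.Stdout, out.Stderr)
}
```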
Endpoint: GET /ping

Response:
{
"status": "ok"
}

Endpoint: GET /metrics
Response:
{
"total_requests": 42,
"total_errors": 3
}
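These counters can be kept with plain atomic variables. The following is only a minimal sketch of what internal/metrics/metrics.go might contain; the names and structure here are assumptions, not the actual implementation.

```go
package metrics

import "sync/atomic"

// Hypothetical counters backing GET /metrics.
var (
	totalRequests atomic.Int64
	totalErrors   atomic.Int64
)

// IncRequest records one handled request.
func IncRequest() { totalRequests.Add(1) }

// IncError records one failed request.
func IncError() { totalErrors.Add(1) }

// Snapshot returns the values serialized by the /metrics handler.
func Snapshot() map[string]int64 {
	return map[string]int64{
		"total_requests": totalRequests.Load(),
		"total_errors":   totalErrors.Load(),
	}
}
```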
The /execute endpoint is rate limited to 10 requests per minute per IP address (configurable).

Response when rate limited (HTTP 429):
{
"error": "Too many requests"
}

You can adjust the rate limit in cmd/evaluator/main.go by modifying the RateLimitMiddleware parameters.
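For reference, a per-IP limiter built on golang.org/x/time/rate that matches the signature used in main.go might look roughly like this sketch. The real implementation lives in internal/middleware/rate_limiter.go and may differ (for example, it may evict idle limiters).

```go
package middleware

import (
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// RateLimitMiddleware returns middleware enforcing a per-IP token bucket.
func RateLimitMiddleware(limit rate.Limit, burst int) func(http.Handler) http.Handler {
	var (
		mu       sync.Mutex
		limiters = make(map[string]*rate.Limiter)
	)
	getLimiter := func(ip string) *rate.Limiter {
		mu.Lock()
		defer mu.Unlock()
		l, ok := limiters[ip]
		if !ok {
			l = rate.NewLimiter(limit, burst)
			limiters[ip] = l
		}
		return l
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Use the client IP as the rate-limiting key.
			ip, _, err := net.SplitHostPort(r.RemoteAddr)
			if err != nil {
				ip = r.RemoteAddr
			}
			if !getLimiter(ip).Allow() {
				w.Header().Set("Content-Type", "application/json")
				w.WriteHeader(http.StatusTooManyRequests)
				w.Write([]byte(`{"error":"Too many requests"}`))
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
```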
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{
"language": "python",
"source_code": "print(\"Hello from Python!\")\nfor i in range(5):\n print(i)"
}'

Response:
{
"stdout": "Hello from Python!\n0\n1\n2\n3\n4\n",
"stderr": "",
"exit_code": 0,
"error": ""
}

curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{
"language": "go",
"source_code": "package main\n\nimport \"fmt\"\n\nfunc main() {\n fmt.Println(\"Hello from Go!\")\n for i := 0; i < 3; i++ {\n fmt.Println(i)\n }\n}"
}'

Response:
{
"stdout": "Hello from Go!\n0\n1\n2\n",
"stderr": "",
"exit_code": 0,
"error": ""
}

Unsupported Language (HTTP 400):
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "ruby", "source_code": "puts \"hi\"}'
# Response: {"error":"unsupported language: ruby"}Empty Source Code (HTTP 400):
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code": ""}'
# Response: {"error":"source_code cannot be empty"}The evaluator reads configuration from environment variables:
- OLLAMA_HOST: Ollama server URL (default: http://localhost:11434)
- OLLAMA_MODEL: Model name to use (required, e.g., qwen3:4b, codellama:latest)
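A minimal sketch of how these variables might be read is shown below. The real logic lives in internal/config/config.go; loadFromEnv is a hypothetical helper, and only the two relevant Config fields are repeated here.

```go
package config

import (
	"fmt"
	"os"
)

// Only the fields relevant to this sketch; the full Config struct is shown below.
type Config struct {
	OLLAMAHost  string
	OLLAMAModel string
}

// loadFromEnv applies OLLAMA_HOST and OLLAMA_MODEL from the environment.
func loadFromEnv(cfg *Config) error {
	if host := os.Getenv("OLLAMA_HOST"); host != "" {
		cfg.OLLAMAHost = host // overrides the default http://localhost:11434
	}
	model := os.Getenv("OLLAMA_MODEL")
	if model == "" {
		return fmt.Errorf("OLLAMA_MODEL must be set (e.g. qwen3:4b)")
	}
	cfg.OLLAMAModel = model
	return nil
}
```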
Edit internal/config/config.go to customize:
Config{
DefaultTimeoutMS: 60000, // Default timeout in milliseconds
MaxMemoryMB: 256, // Maximum memory per container (MB)
OLLAMAHost: "http://localhost:11434",
OLLAMAModel: "qwen3:4b",
Languages: map[string]string{
"python": "python:3.9-slim",
"py": "python:3.9-slim",
"golang": "golang:1.24-alpine",
"go": "golang:1.24-alpine",
},
}

Rate Limiting Configuration
Edit cmd/evaluator/main.go to adjust rate limiting:
// Current: 10 requests per minute with burst of 2
mux.Handle("/execute", middleware.RateLimitMiddleware(
rate.Every(6*time.Second), // 1 request every 6 seconds = 10/min
2, // Burst allowance
)(http.HandlerFunc(executeHandler(cfg))))

Graceful Shutdown Configuration
Edit cmd/evaluator/main.go to adjust shutdown timeout:
// Current: 30 second graceful shutdown timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
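In context, the shutdown flow looks roughly like the sketch below, assuming a standard net/http server. See cmd/evaluator/main.go for the actual wiring; runWithGracefulShutdown is a hypothetical helper, not a function from the repository.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runWithGracefulShutdown starts the server and drains it on SIGINT/SIGTERM.
func runWithGracefulShutdown(srv *http.Server) {
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Block until SIGINT or SIGTERM arrives.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	// Give in-flight requests up to 30 seconds to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
	// Any still-running sandbox containers would be cleaned up here as well.
}

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}
	runWithGracefulShutdown(srv)
}
```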
To add support for a new language (see the sketch below):

- Add the Docker image to the Languages map in config.go
- Update the getCommand() function in docker.go to return the correct execution command
- Update the getExtension() function in docker.go to return the correct file extension
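As a hypothetical example, adding Node.js support might look like the sketch below. In the project the image map lives on the Config struct, and the exact signatures of getCommand and getExtension in docker.go may differ.

```go
package sandbox

// 1. Register the Docker image for the new language key (in config.go this is
//    the Languages field of Config).
var languages = map[string]string{
	"python": "python:3.9-slim",
	"go":     "golang:1.24-alpine",
	"node":   "node:20-alpine", // new entry
}

// 2. Return the command used to run a source file for the language.
func getCommand(language, file string) []string {
	switch language {
	case "node":
		return []string{"node", file}
	case "python", "py":
		return []string{"python", file}
	default:
		return []string{"go", "run", file}
	}
}

// 3. Return the source file extension for the language.
func getExtension(language string) string {
	switch language {
	case "node":
		return ".js"
	case "python", "py":
		return ".py"
	default:
		return ".go"
	}
}
```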
gexec-sandbox/
├── cmd/
│ └── evaluator/
│ └── main.go # HTTP server, LLM integration, graceful shutdown, and handlers
├── data/
│ └── problems.json # Benchmark problem dataset
├── internal/
│ ├── api/
│ │ └── types.go # Request/response types
│ ├── benchmark/
│ │ ├── harness.go # Benchmark harness for running evaluations
│ │ └── types.go # Benchmark-specific type definitions
│ ├── config/
│ │ └── config.go # Configuration management with env var support
│ ├── llm/
│ │ └── llm.go # Ollama client for LLM inference and model management
│ ├── metrics/
│ │ └── metrics.go # Request and error metrics tracking
│ ├── middleware/
│ │ └── rate_limiter.go # IP-based rate limiting middleware
│ └── sandbox/
│ └── docker.go # Docker container execution logic with cleanup
├── .env.example # Environment variable template
├── .gitignore # Git ignore patterns
├── docker-compose.yml # Multi-service orchestration (Ollama + Evaluator)
├── Dockerfile # Evaluator container definition
├── go.mod # Go module definition
├── go.sum # Go dependencies checksums
└── README.md # This file
# Test health endpoint
curl http://localhost:8080/ping
# Test Python execution
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code": "print(2+2)"}'
# Test metrics
curl http://localhost:8080/metrics

# Build for current platform
go build -o evaluator ./cmd/evaluator
# Build for Linux (for Docker)
GOOS=linux GOARCH=amd64 go build -o evaluator-linux ./cmd/evaluator

The evaluator gracefully handles shutdown signals (SIGINT/SIGTERM):
# Start the evaluator
go run ./cmd/evaluator
# In another terminal, send a request
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code":"import time; time.sleep(60)"}'
# Press Ctrl+C in the evaluator terminal
# Evaluator will wait up to 30 seconds for in-flight requests to complete
# All active containers will be cleaned up automatically

# Send 10 rapid requests (should all succeed)
for i in {1..10}; do
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code":"print('$i')"}'
done
# 11th request will be rate limited
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code":"print(11)"}'
# Returns: HTTP 429 Too Many Requests

This project is being developed as a complete LLM benchmarking engine. Here's the current status based on the project goals:
- Local LLM Evaluation Harness
  - ✅ Architected Go-based evaluation system orchestrating local inference (Ollama)
  - ✅ Docker Compose orchestration for Ollama and evaluator services
  - ✅ Environment-based configuration for model selection and host settings
  - ✅ Connection handling and availability checking for Ollama service
- Secure Multi-Language Execution Sandbox
  - ✅ Engineered secure sandbox isolating untrusted code in Docker containers
  - ✅ Network restrictions on containers (network disabled by default)
  - ✅ Memory and CPU resource limits to prevent abuse
  - ✅ Support for Python and Golang with easy extensibility
  - ✅ Safe execution of AI-generated code output
- Docker Orchestration
  - ✅ Docker Compose setup with Ollama service
  - ✅ Automatic model pulling on Ollama startup
  - ✅ Service networking and volume management
  - ✅ Environment variable configuration for flexibility
- Core Infrastructure
  - ✅ stdin/stdout piping for precise test case evaluation
  - ✅ Timeout enforcement and graceful container cleanup
  - ✅ Rate limiting and request metrics
  - ✅ Structured JSON API responses
  - ✅ Graceful shutdown with container cleanup
- Benchmarking Pipeline
  - ✅ Benchmark harness infrastructure (internal/benchmark/)
  - ✅ Problem dataset structure (data/problems.json)
  - 🚧 Integration of LLM code generation with execution
  - 🚧 Dataset management system for problems and test cases
- Evaluation Metrics
  - 🚧 Pass@k metric calculation (k=1, k=5, k=10)
  - 🚧 Statistical analysis and reporting
  - 🚧 Performance benchmarking across multiple models
- LeetCode-Style Challenges
  - 🚧 Dataset of programming problems with test cases
  - 🚧 Problem parser and test case runner
  - 🚧 Validation against reference solutions
  - 🚧 Support for multiple problem categories and difficulty levels
- Enhanced Features
  - 🚧 Batch evaluation mode for comparing multiple models
  - 🚧 Result caching and persistence
  - 🚧 Progress tracking and status reporting
  - 🚧 Web UI or CLI interface for running benchmarks
- Multi-model comparison support
- Distributed evaluation setup
- Custom test case format support
- Performance profiling and optimization
- Result visualization and dashboards
- Integration with additional LLM backends
MIT License - feel free to use this in your projects!