A secure code execution service and LLM evaluation harness that runs user-submitted or AI-generated code in isolated Docker containers. Perfect for online code editors, interview platforms, or educational tools.
- 🐳 Secure Execution: Code runs in isolated Docker containers with network disabled
- ⚡ Multi-Language Support: Python and Golang out of the box (easily extensible)
- 🤖 LLM Integration: Built-in Ollama client for local LLM inference
- 🛡️ Resource Limits: Configurable memory and CPU quotas prevent abuse
- ⏱️ Timeout Control: Enforce execution time limits per request
- ✅ Robust Validation: Input validation for language support and source code
- 📊 Structured Response: JSON responses with stdout, stderr, exit codes, and error details
- 📈 Metrics Tracking: Built-in request and error tracking for observability
- 🛑 Graceful Shutdown: Handles SIGINT/SIGTERM with proper cleanup of in-flight requests and containers
- 🚦 Rate Limiting: In-memory IP-based rate limiting (10 requests/minute by default)
- 🐋 Docker Compose Ready: Complete orchestration with Ollama service
- Go 1.24+ - Install Go
- Docker & Docker Compose - Install Docker
- Docker daemon must be running
Dependencies are automatically downloaded when you run go mod download or go build. Key dependencies include:
- github.com/docker/docker - Docker SDK for Go
- golang.org/x/time/rate - Rate limiting implementation
- github.com/ollama/ollama/api - Ollama API client for local LLM inference
This service uses Docker containers to isolate user code execution:
- Network Disabled: Containers run with network access disabled to prevent unauthorized network calls
- Resource Limits: Memory and CPU quotas restrict resource usage (default: 128MB, 50k CPU quota)
- Ephemeral Containers: Containers are automatically removed after execution
- Context Timeouts: Execution is enforced with context timeouts to prevent hanging processes
- Graceful Shutdown: Server catches SIGINT/SIGTERM signals and properly cleans up all active containers
- Rate Limiting: In-memory IP-based rate limiting prevents abuse (10 requests/minute with 2 burst)
⚠️ Important: While Docker provides strong isolation, this service should still be run behind additional security layers (authentication, firewall, etc.) in production environments.
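As a rough illustration of the settings above, creating a locked-down container with the Docker SDK for Go might look like the sketch below. The actual logic lives in internal/sandbox/docker.go and may differ in detail; exact SDK types also vary slightly between docker/docker versions.

```go
package sandbox

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// createSandbox is a hypothetical helper showing the isolation knobs used by
// this service: no network, capped memory/CPU, and automatic removal.
func createSandbox(ctx context.Context, cli *client.Client, image string, cmd []string) (string, error) {
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{
			Image:           image,
			Cmd:             cmd,
			NetworkDisabled: true, // no network access inside the container
		},
		&container.HostConfig{
			AutoRemove: true, // ephemeral: removed after execution
			Resources: container.Resources{
				Memory:   128 * 1024 * 1024, // 128MB memory limit
				CPUQuota: 50000,             // 50k CPU quota
			},
		},
		nil, nil, "")
	if err != nil {
		return "", err
	}
	return resp.ID, nil
}
```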
# Clone the repository
git clone <repository-url>
cd gexec-sandbox
# Install dependencies
go mod download
# Copy environment file and configure
cp .env.example .env
# Edit .env to set your preferred Ollama model

Create a .env file in the project root:
# Required: Ollama model to use (set in .env before running)
OLLAMA_MODEL=qwen3:4b
# Optional: Ollama host URL (default: http://localhost:11434)
OLLAMA_HOST=http://localhost:11434

# Start all services (Ollama + Evaluator)
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down

This will:
- Pull and start Ollama service with your configured model
- Build and start the evaluator service
- Connect the evaluator to Ollama automatically
# Start Ollama locally first (if not using Docker)
docker run -d -p 11434:11434 ollama/ollama
docker exec -it <container_id> ollama pull qwen3:4b
# Start the evaluator
go run ./cmd/evaluator
# Or build and run
go build -o evaluator ./cmd/evaluator
./evaluator

The evaluator will start on http://localhost:8080
# Build the Docker image
docker build -t gexec-sandbox .
# Run the container (requires mounting Docker socket)
docker run -p 8080:8080 \
-v /var/run/docker.sock:/var/run/docker.sock \
-e OLLAMA_HOST=http://host.docker.internal:11434 \
-e OLLAMA_MODEL=qwen3:4b \
gexec-sandbox

Endpoint: POST /execute
Request Body:
{
"language": "python",
"source_code": "print('Hello, World!')",
"timeout_ms": 5000
}

Response:
{
"stdout": "Hello, World!\n",
"stderr": "",
"exit_code": 0,
"error": ""
}
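For programmatic use, a minimal Go client for this endpoint might look like the sketch below. It is not part of the repository; the request and response shapes simply mirror the JSON shown above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ExecuteRequest and ExecuteResponse mirror the JSON bodies shown above.
type ExecuteRequest struct {
	Language   string `json:"language"`
	SourceCode string `json:"source_code"`
	TimeoutMS  int    `json:"timeout_ms,omitempty"`
}

type ExecuteResponse struct {
	Stdout   string `json:"stdout"`
	Stderr   string `json:"stderr"`
	ExitCode int    `json:"exit_code"`
	Error    string `json:"error"`
}

func main() {
	body, _ := json.Marshal(ExecuteRequest{
		Language:   "python",
		SourceCode: "print('Hello, World!')",
		TimeoutMS:  5000,
	})
	resp, err := http.Post("http://localhost:8080/execute", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out ExecuteResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("exit=%d stdout=%q stderr=%q\n", out.ExitCode, out.Stdout, out.Stderr)
}
```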
Endpoint: GET /ping

Response:
{
"status": "ok"
}

Endpoint: GET /metrics
Response:
{
"total_requests": 42,
"total_errors": 3
}
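These counters can be kept with plain atomic variables. The following is only a minimal sketch of what internal/metrics/metrics.go might contain; the names and structure here are assumptions, not the actual implementation.

```go
package metrics

import "sync/atomic"

// Hypothetical counters backing GET /metrics.
var (
	totalRequests atomic.Int64
	totalErrors   atomic.Int64
)

// IncRequest records one handled request.
func IncRequest() { totalRequests.Add(1) }

// IncError records one failed request.
func IncError() { totalErrors.Add(1) }

// Snapshot returns the values serialized by the /metrics handler.
func Snapshot() map[string]int64 {
	return map[string]int64{
		"total_requests": totalRequests.Load(),
		"total_errors":   totalErrors.Load(),
	}
}
```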
The /execute endpoint is rate limited to 10 requests per minute per IP address (configurable).

Response when rate limited (HTTP 429):
{
"error": "Too many requests"
}

You can adjust the rate limit in cmd/evaluator/main.go by modifying the RateLimitMiddleware parameters.
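For reference, a per-IP limiter built on golang.org/x/time/rate that matches the signature used in main.go might look roughly like this sketch. The real implementation lives in internal/middleware/rate_limiter.go and may differ (for example, it may evict idle limiters).

```go
package middleware

import (
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// RateLimitMiddleware returns middleware enforcing a per-IP token bucket.
func RateLimitMiddleware(limit rate.Limit, burst int) func(http.Handler) http.Handler {
	var (
		mu       sync.Mutex
		limiters = make(map[string]*rate.Limiter)
	)
	getLimiter := func(ip string) *rate.Limiter {
		mu.Lock()
		defer mu.Unlock()
		l, ok := limiters[ip]
		if !ok {
			l = rate.NewLimiter(limit, burst)
			limiters[ip] = l
		}
		return l
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// Use the client IP as the rate-limiting key.
			ip, _, err := net.SplitHostPort(r.RemoteAddr)
			if err != nil {
				ip = r.RemoteAddr
			}
			if !getLimiter(ip).Allow() {
				w.Header().Set("Content-Type", "application/json")
				w.WriteHeader(http.StatusTooManyRequests)
				w.Write([]byte(`{"error":"Too many requests"}`))
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
```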
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{
"language": "python",
"source_code": "print(\"Hello from Python!\")\nfor i in range(5):\n print(i)"
}'

Response:
{
"stdout": "Hello from Python!\n0\n1\n2\n3\n4\n",
"stderr": "",
"exit_code": 0,
"error": ""
}

curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{
"language": "go",
"source_code": "package main\n\nimport \"fmt\"\n\nfunc main() {\n fmt.Println(\"Hello from Go!\")\n for i := 0; i < 3; i++ {\n fmt.Println(i)\n }\n}"
}'

Response:
{
"stdout": "Hello from Go!\n0\n1\n2\n",
"stderr": "",
"exit_code": 0,
"error": ""
}

Unsupported Language (HTTP 400):
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "ruby", "source_code": "puts \"hi\"}'
# Response: {"error":"unsupported language: ruby"}Empty Source Code (HTTP 400):
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code": ""}'
# Response: {"error":"source_code cannot be empty"}The evaluator reads configuration from environment variables:
- OLLAMA_HOST: Ollama server URL (default: http://localhost:11434)
- OLLAMA_MODEL: Model name to use (required, e.g., qwen3:4b, codellama:latest)
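A minimal sketch of how these variables might be read is shown below. The real logic lives in internal/config/config.go; loadFromEnv is a hypothetical helper, and only the two relevant Config fields are repeated here.

```go
package config

import (
	"fmt"
	"os"
)

// Only the fields relevant to this sketch; the full Config struct is shown below.
type Config struct {
	OLLAMAHost  string
	OLLAMAModel string
}

// loadFromEnv applies OLLAMA_HOST and OLLAMA_MODEL from the environment.
func loadFromEnv(cfg *Config) error {
	if host := os.Getenv("OLLAMA_HOST"); host != "" {
		cfg.OLLAMAHost = host // overrides the default http://localhost:11434
	}
	model := os.Getenv("OLLAMA_MODEL")
	if model == "" {
		return fmt.Errorf("OLLAMA_MODEL must be set (e.g. qwen3:4b)")
	}
	cfg.OLLAMAModel = model
	return nil
}
```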
Edit internal/config/config.go to customize:
Config{
DefaultTimeoutMS: 60000, // Default timeout in milliseconds
MaxMemoryMB: 256, // Maximum memory per container (MB)
OLLAMAHost: "http://localhost:11434",
OLLAMAModel: "qwen3:4b",
Languages: map[string]string{
"python": "python:3.9-slim",
"py": "python:3.9-slim",
"golang": "golang:1.24-alpine",
"go": "golang:1.24-alpine",
},
}

Rate Limiting Configuration
Edit cmd/evaluator/main.go to adjust rate limiting:
// Current: 10 requests per minute with burst of 2
mux.Handle("/execute", middleware.RateLimitMiddleware(
rate.Every(6*time.Second), // 1 request every 6 seconds = 10/min
2, // Burst allowance
)(http.HandlerFunc(executeHandler(cfg))))

Graceful Shutdown Configuration
Edit cmd/evaluator/main.go to adjust shutdown timeout:
// Current: 30 second graceful shutdown timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
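In context, the shutdown flow looks roughly like the sketch below, assuming a standard net/http server. See cmd/evaluator/main.go for the actual wiring; runWithGracefulShutdown is a hypothetical helper, not a function from the repository.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runWithGracefulShutdown starts the server and drains it on SIGINT/SIGTERM.
func runWithGracefulShutdown(srv *http.Server) {
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Block until SIGINT or SIGTERM arrives.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	// Give in-flight requests up to 30 seconds to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
	// Any still-running sandbox containers would be cleaned up here as well.
}

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}
	runWithGracefulShutdown(srv)
}
```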
To add support for a new language (see the sketch below):

- Add the Docker image to the Languages map in config.go
- Update the getCommand() function in docker.go to return the correct execution command
- Update the getExtension() function in docker.go to return the correct file extension
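As a hypothetical example, adding Node.js support might look like the sketch below. In the project the image map lives on the Config struct, and the exact signatures of getCommand and getExtension in docker.go may differ.

```go
package sandbox

// 1. Register the Docker image for the new language key (in config.go this is
//    the Languages field of Config).
var languages = map[string]string{
	"python": "python:3.9-slim",
	"go":     "golang:1.24-alpine",
	"node":   "node:20-alpine", // new entry
}

// 2. Return the command used to run a source file for the language.
func getCommand(language, file string) []string {
	switch language {
	case "node":
		return []string{"node", file}
	case "python", "py":
		return []string{"python", file}
	default:
		return []string{"go", "run", file}
	}
}

// 3. Return the source file extension for the language.
func getExtension(language string) string {
	switch language {
	case "node":
		return ".js"
	case "python", "py":
		return ".py"
	default:
		return ".go"
	}
}
```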
gexec-sandbox/
├── cmd/
│ └── evaluator/
│ └── main.go # HTTP server, LLM integration, graceful shutdown, and handlers
├── data/
│ └── problems.json # Benchmark problem dataset
├── internal/
│ ├── api/
│ │ └── types.go # Request/response types
│ ├── benchmark/
│ │ ├── harness.go # Benchmark harness for running evaluations
│ │ └── types.go # Benchmark-specific type definitions
│ ├── config/
│ │ └── config.go # Configuration management with env var support
│ ├── llm/
│ │ └── llm.go # Ollama client for LLM inference and model management
│ ├── metrics/
│ │ └── metrics.go # Request and error metrics tracking
│ ├── middleware/
│ │ └── rate_limiter.go # IP-based rate limiting middleware
│ └── sandbox/
│ └── docker.go # Docker container execution logic with cleanup
├── .env.example # Environment variable template
├── .gitignore # Git ignore patterns
├── docker-compose.yml # Multi-service orchestration (Ollama + Evaluator)
├── Dockerfile # Evaluator container definition
├── go.mod # Go module definition
├── go.sum # Go dependencies checksums
└── README.md # This file
# Test health endpoint
curl http://localhost:8080/ping
# Test Python execution
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code": "print(2+2)"}'
# Test metrics
curl http://localhost:8080/metrics

# Build for current platform
go build -o evaluator ./cmd/evaluator
# Build for Linux (for Docker)
GOOS=linux GOARCH=amd64 go build -o evaluator-linux ./cmd/evaluator

The evaluator gracefully handles shutdown signals (SIGINT/SIGTERM):
# Start the evaluator
go run ./cmd/evaluator
# In another terminal, send a request
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code":"import time; time.sleep(60)"}'
# Press Ctrl+C in the evaluator terminal
# Evaluator will wait up to 30 seconds for in-flight requests to complete
# All active containers will be cleaned up automatically

# Send 10 rapid requests (should all succeed)
for i in {1..10}; do
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code":"print('$i')"}'
done
# 11th request will be rate limited
curl -X POST http://localhost:8080/execute \
-H "Content-Type: application/json" \
-d '{"language": "python", "source_code":"print(11)"}'
# Returns: HTTP 429 Too Many Requests

This project is being developed as a complete LLM benchmarking engine. Here's the current status based on the project goals:
- Local LLM Evaluation Harness
  - ✅ Architected Go-based evaluation system orchestrating local inference (Ollama)
  - ✅ Docker Compose orchestration for Ollama and evaluator services
  - ✅ Environment-based configuration for model selection and host settings
  - ✅ Connection handling and availability checking for Ollama service
- Secure Multi-Language Execution Sandbox
  - ✅ Engineered secure sandbox isolating untrusted code in Docker containers
  - ✅ Network restrictions on containers (network disabled by default)
  - ✅ Memory and CPU resource limits to prevent abuse
  - ✅ Support for Python and Golang with easy extensibility
  - ✅ Safe execution of AI-generated code output
- Docker Orchestration
  - ✅ Docker Compose setup with Ollama service
  - ✅ Automatic model pulling on Ollama startup
  - ✅ Service networking and volume management
  - ✅ Environment variable configuration for flexibility
- Core Infrastructure
  - ✅ stdin/stdout piping for precise test case evaluation
  - ✅ Timeout enforcement and graceful container cleanup
  - ✅ Rate limiting and request metrics
  - ✅ Structured JSON API responses
  - ✅ Graceful shutdown with container cleanup
- Benchmarking Pipeline
  - ✅ Benchmark harness infrastructure (internal/benchmark/)
  - ✅ Problem dataset structure (data/problems.json)
  - 🚧 Integration of LLM code generation with execution
  - 🚧 Dataset management system for problems and test cases
- Evaluation Metrics
  - 🚧 Pass@k metric calculation (k=1, k=5, k=10)
  - 🚧 Statistical analysis and reporting
  - 🚧 Performance benchmarking across multiple models
- LeetCode-Style Challenges
  - 🚧 Dataset of programming problems with test cases
  - 🚧 Problem parser and test case runner
  - 🚧 Validation against reference solutions
  - 🚧 Support for multiple problem categories and difficulty levels
- Enhanced Features
  - 🚧 Batch evaluation mode for comparing multiple models
  - 🚧 Result caching and persistence
  - 🚧 Progress tracking and status reporting
  - 🚧 Web UI or CLI interface for running benchmarks
- Multi-model comparison support
- Distributed evaluation setup
- Custom test case format support
- Performance profiling and optimization
- Result visualization and dashboards
- Integration with additional LLM backends
MIT License - feel free to use this in your projects!