Automated deployment script to run your private, self-hosted LLM inference server on Akamai Cloud GPU instances. Pre-configured with OpenAI's gpt-oss-120b (a 120B-parameter open-source model), the most intelligent American open-weights model. Get vLLM and Open-WebUI up and running in minutes with a single command.
gpt-oss-120b is OpenAI's flagship open-source model with 116.8B total parameters (5.1B active per token via its MoE architecture). It achieves an Intelligence Index score of 58 on the Artificial Analysis benchmark, placing it in the top tier of open-weights models available.
Hardware requirements: This deployment uses 4x RTX 4000 Ada GPUs (20GB VRAM each, 80GB total) to accommodate the model size (~69GB total, ~17GB per GPU with tensor parallelism) and support the full 128K token context length with FP8 KV cache.
Key advantages:
- State-of-the-art performance: Achieves 90% on MMLU, 90% on MMLU-Pro, 80.9% on GPQA (PhD-level science), and 97.9% on AIME 2025 math benchmarks
- Near o4-mini parity: Matches or exceeds OpenAI o4-mini on competition coding (Codeforces), general problem solving, and tool calling
- Efficient architecture: MoE design with only 5.1B active parameters per token enables high throughput despite large total parameter count
- Production-ready: Released under Apache 2.0 license, instruction-tuned for reliable, high-quality responses out of the box
- Multi-GPU optimized: Tensor parallelism across 4x RTX 4000 Ada GPUs for optimal inference performance
Check out these other quickstart repositories:
| Model | Parameters | Description | Repository |
|---|---|---|---|
| GPT-OSS-120B | 120B | OpenAI's flagship open-source model (this repo) | ai-quickstart-gpt-oss-120b |
| GPT-OSS-20B | 20B | Compact open-source GPT model | ai-quickstart-gpt-oss-20b |
| Qwen3-14B-FP8 | 14B | Qwen3 with FP8 quantization | ai-quickstart-qwen3-14b-fp8 |
| NVIDIA Nemotron Nano 9B v2 | 9B | NVIDIA's efficient Nemotron model | ai-quickstart-nvidia-nemotron-nano-9b-v2 |
Just run this single command:
curl -fsSL https://raw.githubusercontent.com/linode/ai-quickstart-gpt-oss-120b/main/deploy.sh | bash
That's it! The script will download required files and guide you through the interactive deployment process.
- Fully Automated Deployment: handles instance creation with real-time progress tracking
- Basic AI Stack: vLLM for LLM inference with pre-loaded model and Open-WebUI for chat interface
- Cross-Platform Support: Works on macOS and Windows (Git Bash/WSL)
- Ubuntu 24.04 LTS with NVIDIA drivers
- Docker & NVIDIA Container Toolkit
- Systemd service for automatic startup on reboot
- Active Linode account with GPU access enabled
- Required: bash, curl, ssh, jq
- Note: jq will be auto-installed if missing
No installation required - just run:
curl -fsSL https://raw.githubusercontent.com/linode/ai-quickstart-gpt-oss-120b/main/deploy.sh | bash
Or download the script and run it locally:
curl -fsSLO https://raw.githubusercontent.com/linode/ai-quickstart-gpt-oss-120b/main/deploy.sh
bash deploy.sh
If you prefer to inspect or customize the scripts:
git clone https://github.com/linode/ai-quickstart-gpt-oss-120b
cd ai-quickstart-gpt-oss-120b
./deploy.sh
Note: If you'd like to add more services, check out the Docker Compose template file:
vi setup/docker-compose.yml
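As an illustration, an extra container can be added under the existing `services:` key of that file. The Watchtower service below is purely an example and not part of the shipped stack; the existing service names and settings in the template may differ:

```yaml
services:
  # Example add-on service: auto-updates running containers when new images appear
  watchtower:
    image: containrrr/watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```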
The script will ask you to:
- Choose a region (e.g., us-east, eu-west)
- Select GPU instance type (see the listing sketch after this list)
- Provide instance label
- Select or generate SSH keys
- Confirm deployment
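If you want to browse the available options before the script prompts you, the Linode CLI can list them. A minimal sketch, assuming `linode-cli` is installed and configured locally (the deploy script itself only needs bash, curl, ssh, and jq):

```bash
# List available regions (e.g. us-east, eu-west)
linode-cli regions list

# List instance plans; GPU plans include the RTX 4000 Ada types used by this quickstart
linode-cli linodes types
```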
The script automatically:
- Creates the GPU instance in your Linode account
- Monitors cloud-init installation progress
- Waits for Open-WebUI health check
- Waits for vLLM model loading (a manual polling sketch follows this list)
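If you ever need to check readiness yourself (for example after a reboot), you can poll the same endpoints the script waits on. A minimal sketch, run on the instance, assuming the default ports shown in the troubleshooting section (8080 for Open-WebUI, 8000 for vLLM):

```bash
# Wait for Open-WebUI to answer its health check
until curl -fsS http://localhost:8080/health >/dev/null; do
  echo "Waiting for Open-WebUI..."; sleep 10
done

# Wait for vLLM to finish loading the model and list it on the models endpoint
until curl -fsS http://localhost:8000/v1/models >/dev/null; do
  echo "Waiting for vLLM model load..."; sleep 10
done
echo "All services are ready."
```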
Once complete, you'll see:
Setup Complete!
Your AI LLM instance is now running!
Access URLs:
Open-WebUI: https://<ip-label>.ip.linodeusercontent.com
Access Credentials:
SSH: ssh -i /path/to/your/key root@<instance-ip>
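Once you can SSH in, you can also query the OpenAI-compatible vLLM API directly. A minimal sketch, assuming the default port 8000 used in the troubleshooting section; the served model name below is an assumption, so confirm it first with `curl http://localhost:8000/v1/models`:

```bash
# Ask the deployed model a question via the OpenAI-compatible chat endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```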
# Bootstrap script executed by cloud-init (installs drivers, Docker, downloads setup files)
/opt/ai-quickstart-gpt-oss-120b/bootstrap.sh
# Setup script that runs after containers start (waits for services to be ready)
/opt/ai-quickstart-gpt-oss-120b/setup.sh
# Docker compose file called by systemctl at startup
/opt/ai-quickstart-gpt-oss-120b/docker-compose.yml
# Caddy reverse proxy configuration
/opt/ai-quickstart-gpt-oss-120b/Caddyfile
# Systemd service definitions
/etc/systemd/system/ai-quickstart-gpt-oss-120b.service # Main stack service
/etc/systemd/system/ai-quickstart-gpt-oss-120b-setup.service # Setup service (runs once)
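For orientation, the main stack service is essentially a thin systemd wrapper around `docker compose`. The unit below is only an illustrative sketch of that pattern; the real file is generated during bootstrap and may differ:

```ini
[Unit]
Description=ai-quickstart-gpt-oss-120b stack
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=true
WorkingDirectory=/opt/ai-quickstart-gpt-oss-120b
ExecStart=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target
```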
To delete a deployed instance:
# Remote execution
curl -fsSL https://raw.githubusercontent.com/linode/ai-quickstart-gpt-oss-120b/main/delete.sh | bash -s -- <instance_id>
# Or download script and run
curl -fsSLO https://raw.githubusercontent.com/linode/ai-quickstart-gpt-oss-120b/main/delete.sh
bash delete.sh <instance_id>
The script will show instance details and ask for confirmation before deletion.
ai-quickstart-gpt-oss-120b/
├── deploy.sh                  # Main deployment script
├── delete.sh                  # Instance deletion script
├── script/
│   └── quickstart_tools.sh    # Shared functions (API, OAuth, utilities)
├── setup/
│   ├── docker-compose.yml     # Docker Compose configuration
│   ├── Caddyfile              # Caddy reverse proxy configuration
│   └── setup.sh               # Setup script (waits for services to be ready)
└── template/
    ├── cloud-init.yaml        # Cloud-init configuration
    └── bootstrap.sh           # Post-boot installation script (installs drivers, Docker)
- Configure Cloud Firewall (Recommended)
  - Create a Linode Cloud Firewall (a minimal API sketch follows this list)
  - Restrict access to ports 80/443 by source IP
  - Allow SSH (port 22) from trusted IPs only
- SSH Security
  - SSH key authentication required
  - Root password provided for emergency console access only
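For the firewall recommendation above, one way to create and attach a Cloud Firewall is via the Linode API v4 networking endpoint. This is a hedged sketch only: the token, source CIDR, and instance ID are placeholders you must replace, and you should verify the rule schema against the current API docs:

```bash
# Creates a firewall that drops inbound traffic except HTTP/HTTPS and SSH from a
# trusted CIDR, and attaches it to the deployed instance (placeholder ID below).
curl -s -X POST https://api.linode.com/v4/networking/firewalls \
  -H "Authorization: Bearer $LINODE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "label": "ai-quickstart-fw",
        "rules": {
          "inbound_policy": "DROP",
          "outbound_policy": "ACCEPT",
          "inbound": [
            {"label": "allow-web", "action": "ACCEPT", "protocol": "TCP",
             "ports": "80,443", "addresses": {"ipv4": ["203.0.113.0/24"]}},
            {"label": "allow-ssh", "action": "ACCEPT", "protocol": "TCP",
             "ports": "22", "addresses": {"ipv4": ["203.0.113.0/24"]}}
          ]
        },
        "devices": {"linodes": [12345678]}
      }'
```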
The deployed vLLM instance runs with the following configuration (an illustrative launch command follows the table):
| Specification | Value |
|---|---|
| GPU Memory Utilization | 95% |
| Max Context Length | 131,072 tokens |
| KV Cache Type | FP8 |
| KV Cache Size | ~139K tokens |
| Available KV Cache Memory | 1.19 GiB |
| Model Memory Usage | ~69 GiB total (~17 GiB per GPU) |
| Max Concurrent Requests | 2 (full context) |
| Tensor Parallel Size | 4 GPUs |
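As a rough sketch, these settings correspond to vLLM launch flags along the following lines. The actual invocation lives in the deployed docker-compose.yml, and the model identifier shown here is an assumption:

```bash
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8
```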
Benchmarked using `vllm bench serve` with a random dataset (a reproduction sketch follows the table):
| Metric | 128 input / 128 output | 512 input / 256 output |
|---|---|---|
| Mean TTFT | 84ms | 232ms |
| P99 TTFT | 106ms | 633ms |
| Output Throughput | 48.8 tok/s | 21.3 tok/s |
| Peak Throughput | 78 tok/s | 78 tok/s |
| Mean ITL | 27ms | 29ms |
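A hedged sketch of how the 128 input / 128 output column could be reproduced against the running server; the flag names are assumed from vLLM's benchmarking CLI and may vary by version:

```bash
# Random-dataset serving benchmark, 128 input / 128 output tokens per request
vllm bench serve \
  --model openai/gpt-oss-120b \
  --dataset-name random \
  --random-input-len 128 \
  --random-output-len 128 \
  --num-prompts 200
```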
# SSH into your instance
ssh -i /path/to/your/key root@<instance-ip>
# Check container status
docker ps -a
# Check Docker containers log
cd /opt/ai-quickstart-gpt-oss-120b && docker compose logs -f
# Check systemd service status
systemctl status ai-quickstart-gpt-oss-120b.service
# View systemd service logs
journalctl -u ai-quickstart-gpt-oss-120b.service -n 100
# Check cloud-init logs
tail -n 100 -f /var/log/cloud-init-output.log
# Restart all services
systemctl restart ai-quickstart-gpt-oss-120b.service
# Check NVIDIA GPU status
nvidia-smi
# Check vLLM loaded models
curl http://localhost:8000/v1/models
# Check Open-WebUI health
curl http://localhost:8080/health
# Check vLLM container logs
docker logs vllm
Successful deployments register anonymous statistics (project name, region, instance type) to help improve the service. To opt out, remove the deployment_complete call from deploy.sh.
Issues and pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the Apache License 2.0.
