A web-based calculator for estimating GPU memory requirements and maximum concurrent requests for self-hosted LLM inference.
🔗 Live Demo: https://selfhostllm.org
This tool helps you calculate how many concurrent requests your GPU setup can handle when running large language models (LLMs) locally. It takes into account:
- GPU VRAM capacity
- Model size and parameters
- Context window length
- Quantization methods
- KV cache overhead
The core formula is:

Max Concurrent Requests = Available Memory / KV Cache per Request

It is built up in the following steps (a worked Python sketch follows the list):

- Total VRAM Available: Total VRAM = Number of GPUs × VRAM per GPU
- Model Memory (Adjusted for Quantization): Adjusted Model Memory = Base Model Memory × Quantization Factor. The model weights are loaded once and stay in memory.
- KV Cache per Request: KV Cache = (Context Length × Adjusted Model Memory × KV Overhead) / 1000. This memory is needed for each active request's attention cache.
- Available Memory for Inference: Available = Total VRAM - System Overhead - Model Memory. This is what's left for KV caches after loading the model.
- Maximum Concurrent Requests: Max Requests = Available Memory / KV Cache per Request
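A minimal sketch of these steps in Python. The function, its parameter names, and the example values (INT8 factor 0.5, KV overhead 0.2) are illustrative assumptions, not the calculator's actual code or defaults:

```python
def max_concurrent_requests(
    num_gpus: int,
    vram_per_gpu_gb: float,
    base_model_memory_gb: float,
    quantization_factor: float,
    context_length_tokens: int,
    kv_overhead: float,
    system_overhead_gb: float = 2.0,  # ~2GB for frameworks and OS (see caveats below)
) -> int:
    # Total VRAM = Number of GPUs × VRAM per GPU
    total_vram = num_gpus * vram_per_gpu_gb

    # Adjusted Model Memory = Base Model Memory × Quantization Factor
    model_memory = base_model_memory_gb * quantization_factor

    # KV Cache = (Context Length × Adjusted Model Memory × KV Overhead) / 1000
    kv_cache_per_request = context_length_tokens * model_memory * kv_overhead / 1000

    # Available = Total VRAM - System Overhead - Model Memory
    available = total_vram - system_overhead_gb - model_memory

    # Max Requests = Available Memory / KV Cache per Request
    return max(0, int(available // kv_cache_per_request))


# Example: 2 × 24GB GPUs running a 14GB (FP16) model at INT8 (assumed factor 0.5),
# with an 8K context window and an assumed KV overhead multiplier of 0.2.
print(max_concurrent_requests(2, 24, 14, 0.5, 8192, 0.2))  # -> 3
```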
The calculator automatically detects Mixture-of-Experts (MoE) models and handles them differently:
- Total Parameters: The full model size (e.g., Mixtral 8x7B has ~47B total parameters, since the experts share attention layers)
- Active Parameters: Only a subset of the experts is used per token (e.g., ~14B active)
- Memory Calculation: Uses active memory instead of total memory for accurate estimates
- Why this matters: You only need RAM for active experts, not the entire model
Example: Mixtral 8x7B shows "~94GB total, ~16GB active"; the calculations use the 16GB active figure.
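A sketch of how the MoE adjustment could slot into the same calculation: the model-memory term uses the active-parameter footprint instead of the total. The class and field names below are illustrative assumptions, not the calculator's code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelProfile:
    name: str
    total_memory_gb: float                     # all experts plus shared layers
    active_memory_gb: Optional[float] = None   # None for dense (non-MoE) models


def memory_for_calculation(profile: ModelProfile) -> float:
    """Pick the model-memory figure the per-request math should use.

    MoE models count only the active-expert footprint, matching the
    "uses active memory instead of total memory" rule above.
    """
    if profile.active_memory_gb is not None:
        return profile.active_memory_gb
    return profile.total_memory_gb


# Roughly matching the Mixtral 8x7B entry described above
mixtral = ModelProfile("Mixtral 8x7B", total_memory_gb=94, active_memory_gb=16)
print(memory_for_calculation(mixtral))  # 16 -> used as the model memory in the formulas above
```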
These estimates come with a few caveats:
- Worst-case scenario: The calculator assumes every request uses the full context window
- In reality, most requests use much less context, so you can often handle more concurrent requests
- KV cache grows linearly with the tokens actually used, not the maximum context (see the sketch after this list)
- System overhead is estimated at ~2GB for frameworks and OS
- Different attention mechanisms (MHA, MQA, GQA) affect memory usage
- Framework overhead and memory fragmentation can impact real-world performance
- Dynamic batching and memory management can improve real-world throughput
- MoE models: Memory requirements can vary based on routing algorithms and expert utilization patterns
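To make the worst-case point concrete: the KV cache term scales with the tokens a request actually holds, so a request using a quarter of the window needs roughly a quarter of the worst-case cache. A minimal numeric sketch of the formula above (the 8GB adjusted model memory and 0.2 KV overhead are assumed example values):

```python
def kv_cache_gb(tokens_in_use: int, adjusted_model_memory_gb: float, kv_overhead: float) -> float:
    # Same KV cache formula as above, applied to the tokens actually held by a request
    return tokens_in_use * adjusted_model_memory_gb * kv_overhead / 1000


worst_case = kv_cache_gb(32_000, 8.0, 0.2)  # request pinned at a full 32K context window
typical = kv_cache_gb(4_000, 8.0, 0.2)      # request that only accumulated 4K tokens
print(worst_case, typical)                  # 51.2 vs 6.4 GB -> ~8x more requests fit in practice
```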
Key features:
- Multiple input methods: Select from common models, specify parameters, or input memory directly
- Extensive model database: Includes latest 2025 models from all major providers
- MoE model support: Automatic detection and accurate memory calculations for MoE architectures
- Quantization options: FP16, INT8, INT4, MXFP4, and Extreme quantization (an illustrative factor mapping follows this list)
- Context presets: From 1K to 1M tokens
- Performance estimation: GPU-specific speed and throughput predictions
- URL sharing: Share your configuration with others
- Mobile responsive: Works on all devices
- Mac version: Specialized calculator for Apple Silicon Macs
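For reference, a plausible mapping of these quantization options to memory factors relative to FP16. The specific values below are assumptions for illustration (especially MXFP4 and Extreme); the calculator's exact factors may differ:

```python
# Memory relative to FP16 weights; illustrative assumptions, not the calculator's exact values
QUANTIZATION_FACTORS = {
    "FP16": 1.0,      # 2 bytes per parameter (baseline)
    "INT8": 0.5,      # 1 byte per parameter
    "INT4": 0.25,     # 0.5 bytes per parameter
    "MXFP4": 0.25,    # 4-bit microscaling format, roughly comparable to INT4
    "Extreme": 0.15,  # placeholder for very aggressive (~1-2 bit) schemes
}

base_model_memory_gb = 14  # e.g., a 7B model at FP16
print(base_model_memory_gb * QUANTIZATION_FACTORS["INT4"])  # 3.5 GB of weights
```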
The calculator includes memory profiles for 100+ models from all major providers:
- Moonshot AI Kimi: K2 Base/Instruct (1T params, 32B active)
- Alibaba Qwen 3: 270M to 235B including MoE variants
- DeepSeek V3/R1: Latest reasoning models with distilled versions
- Zhipu AI GLM: ChatGLM to GLM-4.5 including MoE models
- Google Gemma 3: 270M to 27B with multimodal capabilities
- Mistral Codestral: Latest code-specialized models
- Meta Llama: 1B to 405B parameters
- Alibaba Qwen: 2B to 110B including QwenCoder variants
- DeepSeek: 7B to 671B including reasoning models
- Mistral: 7B to 123B (Mistral Large)
- Mixtral MoE: 8x7B and 8x22B with accurate active memory calculations
- Google Gemma: 2B to 27B models
- Microsoft Phi: 3.8B to 14B optimized models
- Command R: 35B and 104B
- GPT-OSS: 20B and 120B open-source variants
Feedback, bug reports, and contributions are welcome! If you have:
- Suggestions for improving the calculation methodology
- Additional models to include
- Real-world performance data to share
- Feature requests or bug reports
please open an issue or submit a pull request with improvements.
MIT License
Copyright © 2025 Eran Sandler
- Twitter/X: @erans
- BlueSky: esandler.bsky.social
- LinkedIn: /in/erans
- GitHub: @erans
This calculator is based on community knowledge and research about LLM memory requirements. The estimates are approximate and actual usage may vary based on specific implementations and optimizations.