A latency prediction tool for ML engineers: estimate inference time, throughput (FPS), and deployment feasibility across cloud GPUs, edge devices, and CPUs - trained on real benchmark data from Lambda Labs, MLPerf, and NVIDIA.
Deploying machine learning models to production requires understanding how they will perform on target hardware. Engineers face critical questions:
- "Will my model run fast enough on edge devices?"
- "Which GPU should I provision for real-time inference?"
- "How does batch size affect throughput?"
- "Is INT8 quantization worth the accuracy trade-off for my latency budget?"
The Challenge: Answering these questions traditionally requires expensive hardware access, time-consuming benchmarking, and deep expertise in hardware-software optimization. Teams often over-provision expensive GPUs or discover latency issues late in deployment.
Solution: This application provides instant latency predictions based on real benchmark data, enabling engineers to make informed hardware decisions before deployment. By democratizing access to performance insights, teams can:
- Select appropriate hardware early in the development cycle
- Estimate infrastructure costs accurately
- Identify bottlenecks before they impact production
- Compare deployment options (cloud vs edge, GPU vs CPU)
The predictor uses a Random Forest Regressor trained on real benchmark data with the following configuration:
```python
RandomForestRegressor(
    n_estimators=100,
    max_depth=20,
    min_samples_split=3,
    random_state=42
)
```

Why Random Forest?
- Handles non-linear relationships between hardware specs and latency
- Robust to outliers in benchmark data
- Provides feature importance for interpretability
- No assumptions about data distribution
Latency values span multiple orders of magnitude (0.5 ms to 5000 ms+), so a log transformation is applied to the target variable:
```python
y_log = np.log1p(latency_ms)       # Training: fit on log-transformed latency
latency_ms = np.expm1(prediction)  # Inference: invert the transform
```

This improves prediction accuracy across both fast GPUs and slow edge devices.
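Putting the configuration and the log transform together, a minimal training sketch might look like the following (the target column name `latency_ms` and the exact preprocessing are assumptions; the feature columns are listed in the table below, and the real pipeline lives in `src/train.py`):

```python
# Minimal sketch of the training setup described above; the actual pipeline
# lives in src/train.py and may differ in details (the target column name
# "latency_ms" is an assumption about the CSV layout).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("src/data/inference_latency.csv")

feature_cols = [
    "tflops_fp16", "memory_gb", "tdp_watts", "hw_year",
    "model_params_m", "model_flops_g", "batch_size", "precision_bits",
    "hardware_encoded", "model_encoded", "hw_type_encoded", "category_encoded",
]
X = df[feature_cols]
y_log = np.log1p(df["latency_ms"])            # train on log-transformed latency

X_train, X_test, y_train, y_test = train_test_split(X, y_log, random_state=42)

model = RandomForestRegressor(
    n_estimators=100, max_depth=20, min_samples_split=3, random_state=42,
)
model.fit(X_train, y_train)

pred_ms = np.expm1(model.predict(X_test))     # invert the transform at inference

# Feature importances for interpretability
for name, score in sorted(zip(feature_cols, model.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:20s} {score:.3f}")
```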
| Feature | Description | Source |
|---|---|---|
| `tflops_fp16` | Hardware compute capacity (TFLOPS) | Hardware specs |
| `memory_gb` | GPU/device memory (GB) | Hardware specs |
| `tdp_watts` | Thermal design power (W) | Hardware specs |
| `hw_year` | Hardware release year | Hardware specs |
| `model_params_m` | Model parameters (millions) | Model specs |
| `model_flops_g` | Model compute requirement (GFLOPs) | Model specs |
| `batch_size` | Samples per inference | User input |
| `precision_bits` | Numeric precision (32/16/8) | User input |
| `hardware_encoded` | Hardware type (label encoded) | Categorical |
| `model_encoded` | Model architecture (label encoded) | Categorical |
| `hw_type_encoded` | Device class: gpu/edge_gpu/tpu/cpu | Categorical |
| `category_encoded` | Task: classification/detection/nlp | Categorical |
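The four `*_encoded` features are label encoded; a rough sketch of how that could be done with scikit-learn (the raw column names `hardware`, `model`, `hw_type`, and `category` are assumptions about the CSV layout):

```python
# Sketch of producing the *_encoded columns with scikit-learn's LabelEncoder.
# The raw categorical column names below are assumptions about the CSV layout.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("src/data/inference_latency.csv")

encoders = {}
for raw_col, enc_col in [
    ("hardware", "hardware_encoded"),
    ("model", "model_encoded"),
    ("hw_type", "hw_type_encoded"),
    ("category", "category_encoded"),
]:
    enc = LabelEncoder()
    df[enc_col] = enc.fit_transform(df[raw_col])
    encoders[raw_col] = enc  # keep encoders to map user input at prediction time
```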
The training data combines two sources:
| Source | Description | Count |
|---|---|---|
| Published Benchmarks | Real measurements from Lambda Labs, MLPerf, NVIDIA, Jetson reports | ~60% |
| Interpolated | Estimated values using hardware scaling laws for untested combinations | ~40% |
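The exact scaling rules used for interpolation are not documented here, but as a purely illustrative example, an untested combination might be estimated by scaling a measured latency by the ratio of compute capacities (the reference measurement below is hypothetical):

```python
# Hypothetical example of filling an untested (hardware, model) combination by
# scaling a measured latency with a simple compute ratio. The actual
# interpolation rules used to build the dataset may differ.
def interpolate_latency(measured_ms, measured_tflops, target_tflops):
    """Scale a measured latency to a device with different compute capacity,
    assuming latency is roughly inversely proportional to TFLOPS."""
    return measured_ms * (measured_tflops / target_tflops)

# e.g. a 2.0 ms measurement on a V100 (28.3 TFLOPS) scaled to a T4 (8.1 TFLOPS)
estimate_t4_ms = interpolate_latency(measured_ms=2.0,
                                     measured_tflops=28.3,
                                     target_tflops=8.1)   # ~7.0 ms
```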
Data is stored in `src/data/inference_latency.csv`.
| Metric | Value |
|---|---|
| MAE | 95.40 ms |
| RMSE | 990.79 ms |
| MAPE | 18.8% |
| R² | 0.6791 |
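These figures can be reproduced on a held-out split along the following lines (a sketch; the evaluation code in `src/train.py` may compute them differently):

```python
# Sketch of computing the reported metrics with scikit-learn, assuming
# y_true and y_pred are held-out latencies in milliseconds.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    r2 = r2_score(y_true, y_pred)
    print(f"MAE {mae:.2f} ms, RMSE {rmse:.2f} ms, MAPE {mape:.1f}%, R² {r2:.4f}")
```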
- Instant Latency Prediction: Get inference time estimates in milliseconds for any hardware-model combination
- Throughput Calculation: Automatic FPS (frames per second) computation for real-time applications (see the sketch after this list)
- Latency Categorization: Classifications from "real-time" (VR/AR ready) to "offline" (batch processing only)
- Hardware Comparison: Side-by-side performance comparison across multiple hardware platforms
- Batch Predictions: Predict multiple configurations in a single API call
- Precision Modes: FP32 (32-bit), FP16 (16-bit), INT8 (8-bit quantized)
- Batch Sizes: 1, 2, 4, 8, 16, 32 samples per inference
- 11 Hardware Platforms: From RTX 4090 to Raspberry Pi
- 6 Model Architectures: Classification, detection, and NLP models
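The FPS figure follows directly from the predicted latency and the batch size; a sketch of the calculation (whether the API folds batch size in exactly this way is an assumption):

```python
# Sketch of the throughput (FPS) calculation: samples processed per second,
# given a predicted per-batch latency. How exactly the API accounts for batch
# size is an assumption.
def throughput_fps(latency_ms: float, batch_size: int = 1) -> float:
    """Frames (samples) per second for one inference of `batch_size` samples."""
    return batch_size * 1000.0 / latency_ms

print(throughput_fps(latency_ms=25.0, batch_size=8))  # 320.0 FPS
```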
| Category | Latency | FPS | Use Cases |
|---|---|---|---|
| 🟢 Real-time | < 5ms | 200+ | VR/AR, robotics, autonomous systems |
| 🟢 Interactive | 5-16ms | 60-200 | Gaming, live video processing |
| 🟡 Smooth | 16-33ms | 30-60 | Video analytics, streaming |
| 🟡 Responsive | 33-100ms | 10-30 | Interactive applications |
| 🟠 Batch | 100-1000ms | 1-10 | Near real-time batch processing |
| 🔴 Offline | > 1000ms | < 1 | Offline batch processing only |
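The category boundaries above map to a simple threshold check; a minimal sketch (the function name is illustrative, not necessarily what `src/predict.py` uses):

```python
# Sketch of mapping a predicted latency to the categories in the table above.
def latency_category(latency_ms: float) -> str:
    if latency_ms < 5:
        return "real-time"      # VR/AR, robotics, autonomous systems
    if latency_ms < 16:
        return "interactive"    # gaming, live video processing
    if latency_ms < 33:
        return "smooth"         # video analytics, streaming
    if latency_ms < 100:
        return "responsive"     # interactive applications
    if latency_ms < 1000:
        return "batch"          # near real-time batch processing
    return "offline"            # offline batch processing only
```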
| Hardware | Type | Compute | Memory | TDP | Year |
|---|---|---|---|---|---|
| NVIDIA RTX 4090 | GPU | 82.6 TFLOPS | 24 GB | 450W | 2022 |
| NVIDIA RTX 3090 | GPU | 35.6 TFLOPS | 24 GB | 350W | 2020 |
| NVIDIA A100 | GPU | 78.0 TFLOPS | 40 GB | 400W | 2020 |
| NVIDIA V100 | GPU | 28.3 TFLOPS | 16 GB | 300W | 2017 |
| NVIDIA T4 | GPU | 8.1 TFLOPS | 16 GB | 70W | 2018 |
| Jetson Orin | Edge GPU | 5.3 TFLOPS | 32 GB | 60W | 2022 |
| Jetson Xavier | Edge GPU | 1.4 TFLOPS | 16 GB | 30W | 2018 |
| Jetson Nano | Edge GPU | 0.47 TFLOPS | 4 GB | 10W | 2019 |
| Coral Edge TPU | TPU | 4.0 TOPS | N/A | 2W | 2019 |
| Raspberry Pi 4 | CPU | 0.01 TFLOPS | 4 GB | 5W | 2019 |
| Intel Core i9 | CPU | 0.5 TFLOPS | 64 GB | 125W | 2022 |
| Model | Parameters | GFLOPs | Task | Complexity |
|---|---|---|---|---|
| MobileNetV2 | 3.4M | 0.3 | Classification | Lightweight |
| EfficientNet-B0 | 5.3M | 0.4 | Classification | Efficient |
| YOLOv5s | 7.2M | 16.5 | Object Detection | Real-time |
| ResNet-50 | 25.6M | 4.1 | Classification | Standard |
| BERT-base | 110M | 22.0 | NLP | Transformer |
| VGG-16 | 138M | 15.5 | Classification | Heavy |
```
ML-Inference-Latency-Predictor/
│
├── src/
│   ├── main.py                    # Flask API application
│   ├── predict.py                 # Prediction logic & model loading
│   ├── train.py                   # Model training script
│   ├── test_api.py                # API test suite
│   └── data/
│       └── inference_latency.csv  # Training dataset
│
├── model/
│   └── model.pkl                  # Trained model artifacts
│
├── frontend/
│   ├── Dockerfile                 # Frontend container
│   ├── requirements.txt           # Streamlit dependencies
│   ├── streamlit_app.py           # Streamlit application
│   └── config.py                  # API URL (gitignored)
│
├── Dockerfile                     # API container
├── requirements.txt               # API dependencies
├── .gitignore
└── README.md
```
- Python 3.11+
- Google Cloud SDK (`gcloud`) for deployment
- GCP account with billing enabled
```bash
git clone https://github.com/tengli-alaska/Flask-GCP-Lab-ML-Inference-Latency-Predictor.git
cd Flask-GCP-Lab-ML-Inference-Latency-Predictor

# Create virtual environment
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# (Optional) Retrain the model
python -m src.train

# Run the API
python -m src.main
```

The API will be running at http://localhost:8080
Test the API:

```bash
# Health check
curl http://localhost:8080/health

# Get valid options
curl http://localhost:8080/options

# Run test suite
python -m src.test_api
```

Set up the frontend:

```bash
cd frontend
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Create config file
cp config.example.py config.py
# Edit config.py: BACKEND_URL = "http://localhost:8080"
# Run Streamlit
streamlit run streamlit_app.py
```

The frontend will be running at http://localhost:8501
```bash
# Build container image
gcloud builds submit --tag gcr.io/[YOUR_PROJECT_ID]/inference-latency-api

# Deploy to Cloud Run
gcloud run deploy inference-latency-api \
  --image gcr.io/[YOUR_PROJECT_ID]/inference-latency-api \
  --platform managed \
  --region us-central1 \
  --port 8080 \
  --allow-unauthenticated
```

Save the output URL: https://inference-latency-api-xxxxx.us-central1.run.app
```bash
cd frontend

# Update config.py with your API URL
# BACKEND_URL = "https://inference-latency-api-xxxxx.us-central1.run.app"

# Build container image
gcloud builds submit --tag gcr.io/[YOUR_PROJECT_ID]/inference-latency-frontend

# Deploy to Cloud Run
gcloud run deploy inference-latency-frontend \
  --image gcr.io/[YOUR_PROJECT_ID]/inference-latency-frontend \
  --platform managed \
  --region us-central1 \
  --port 8501 \
  --allow-unauthenticated
```

The app is now live at: https://inference-latency-frontend-xxxxx.us-central1.run.app
| Source | URL | Data Used |
|---|---|---|
| Lambda Labs | lambdalabs.com/gpu-benchmarks | GPU inference throughput |
| MLPerf | mlcommons.org | Standardized inference benchmarks |
| TensorRT | developer.nvidia.com/tensorrt | Optimized inference benchmarks |
| Jetson Benchmarks | developer.nvidia.com | Edge device performance |
Alaska Tengli
MLOps Labs, Northeastern University
This project is licensed under the MIT License.