⚡ ML Inference Latency Predictor

A latency prediction tool for ML engineers: estimate inference time, throughput (FPS), and deployment feasibility across cloud GPUs, edge devices, and CPUs, trained on real benchmark data from Lambda Labs, MLPerf, and NVIDIA.

Python Flask Streamlit GCP


🎯 Problem Statement

Deploying machine learning models to production requires understanding how they will perform on target hardware. Engineers face critical questions:

  • "Will my model run fast enough on edge devices?"
  • "Which GPU should I provision for real-time inference?"
  • "How does batch size affect throughput?"
  • "Is INT8 quantization worth the accuracy trade-off for my latency budget?"

The Challenge: Answering these questions traditionally requires expensive hardware access, time-consuming benchmarking, and deep expertise in hardware-software optimization. Teams often over-provision expensive GPUs or discover latency issues late in deployment.

Solution: This application provides instant latency predictions based on real benchmark data, enabling engineers to make informed hardware decisions before deployment. By democratizing access to performance insights, teams can:

  • Select appropriate hardware early in the development cycle
  • Estimate infrastructure costs accurately
  • Identify bottlenecks before they impact production
  • Compare deployment options (cloud vs edge, GPU vs CPU)

🧠 Model & Dataset

Model Architecture

The predictor uses a Random Forest Regressor trained on real benchmark data with the following configuration:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=100,
    max_depth=20,
    min_samples_split=3,
    random_state=42
)

Why Random Forest?

  • Handles non-linear relationships between hardware specs and latency
  • Robust to outliers in benchmark data
  • Provides feature importance for interpretability (see the sketch after this list)
  • No assumptions about data distribution
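
To make the interpretability point above concrete: a fitted RandomForestRegressor exposes feature_importances_, which can be ranked against the training columns. A minimal sketch, assuming model is the fitted regressor from the snippet above and feature_names is a hypothetical list of the columns in the Feature Engineering table:

import pandas as pd

# Rank features by their impurity-based importance in the fitted forest.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))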

Target Transformation

Latency values span multiple orders of magnitude (0.5 ms to 5000 ms+), so a log transformation is applied to the target variable:

import numpy as np

y_log = np.log1p(latency_ms)       # Training: compress the wide latency range
latency_ms = np.expm1(prediction)  # Inference: map predictions back to milliseconds

This improves prediction accuracy across both fast GPUs and slow edge devices.

Feature Engineering

| Feature | Description | Source |
|---------|-------------|--------|
| tflops_fp16 | Hardware compute capacity (TFLOPS) | Hardware specs |
| memory_gb | GPU/device memory (GB) | Hardware specs |
| tdp_watts | Thermal design power (W) | Hardware specs |
| hw_year | Hardware release year | Hardware specs |
| model_params_m | Model parameters (millions) | Model specs |
| model_flops_g | Model compute requirement (GFLOPs) | Model specs |
| batch_size | Samples per inference | User input |
| precision_bits | Numeric precision (32/16/8) | User input |
| hardware_encoded | Hardware type (label encoded) | Categorical |
| model_encoded | Model architecture (label encoded) | Categorical |
| hw_type_encoded | Device class: gpu/edge_gpu/tpu/cpu | Categorical |
| category_encoded | Task: classification/detection/nlp | Categorical |
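
The categorical columns above are label encoded before training. As a rough illustration only (assuming scikit-learn's LabelEncoder; the actual preprocessing lives in src/train.py and may differ), a single feature row for ResNet-50 on an NVIDIA T4 could be assembled like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Numeric features come straight from the hardware and model spec tables below.
row = {
    "tflops_fp16": 8.1, "memory_gb": 16, "tdp_watts": 70, "hw_year": 2018,   # NVIDIA T4
    "model_params_m": 25.6, "model_flops_g": 4.1,                            # ResNet-50
    "batch_size": 1, "precision_bits": 16,
}

# Categorical features are label encoded; the label set here is illustrative.
hw_encoder = LabelEncoder().fit(["NVIDIA T4", "NVIDIA A100", "Jetson Nano"])
row["hardware_encoded"] = int(hw_encoder.transform(["NVIDIA T4"])[0])
# model_encoded, hw_type_encoded and category_encoded follow the same pattern.

X = pd.DataFrame([row])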

Dataset

The training data combines two sources:

| Source | Description | Share |
|--------|-------------|-------|
| Published Benchmarks | Real measurements from Lambda Labs, MLPerf, NVIDIA, and Jetson reports | ~60% |
| Interpolated | Estimated values using hardware scaling laws for untested combinations | ~40% |

The dataset is stored at src/data/inference_latency.csv.
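
For the interpolated rows, one plausible form of a scaling-law estimate is to project a measured latency onto untested hardware by the ratio of compute throughput. This is a sketch of the idea only, not necessarily the rule used to build the CSV:

# Hypothetical first-order compute scaling: slower hardware => proportionally
# higher latency, ignoring memory bandwidth and software-stack effects.
def estimate_latency_ms(measured_ms, tflops_measured, tflops_target):
    return measured_ms * (tflops_measured / tflops_target)

# e.g. a 10 ms measurement on a T4 (8.1 TFLOPS) projected onto a V100 (28.3 TFLOPS)
print(round(estimate_latency_ms(10.0, 8.1, 28.3), 1))  # ~2.9 ms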

Model Performance

| Metric | Value |
|--------|-------|
| MAE | 95.40 ms |
| RMSE | 990.79 ms |
| MAPE | 18.8% |
| R² | 0.6791 |

✨ Features

Core Functionality

  • Instant Latency Prediction: Get inference time estimates in milliseconds for any hardware-model combination
  • Throughput Calculation: Automatic FPS (frames per second) computation for real-time applications
  • Latency Categorization: Classifications from "real-time" (VR/AR ready) to "offline" (batch processing only)
  • Hardware Comparison: Side-by-side performance comparison across multiple hardware platforms
  • Batch Predictions: Predict multiple configurations in a single API call

Configuration Options

  • Precision Modes: FP32 (32-bit), FP16 (16-bit), INT8 (8-bit quantized)
  • Batch Sizes: 1, 2, 4, 8, 16, 32 samples per inference
  • 11 Hardware Platforms: From RTX 4090 to Raspberry Pi
  • 6 Model Architectures: Classification, detection, and NLP models

Latency Categories

| Category | Latency | FPS | Use Cases |
|----------|---------|-----|-----------|
| 🟢 Real-time | < 5 ms | 200+ | VR/AR, robotics, autonomous systems |
| 🟢 Interactive | 5-16 ms | 60-200 | Gaming, live video processing |
| 🟡 Smooth | 16-33 ms | 30-60 | Video analytics, streaming |
| 🟡 Responsive | 33-100 ms | 10-30 | Interactive applications |
| 🟠 Batch | 100-1000 ms | 1-10 | Near real-time batch processing |
| 🔴 Offline | > 1000 ms | < 1 | Offline batch processing only |
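
Throughput and the category label follow directly from the predicted latency. A minimal sketch of that mapping using the thresholds above (the function names are illustrative, not the API's actual helpers):

def throughput_fps(latency_ms, batch_size=1):
    # Samples processed per second for one inference over `batch_size` samples.
    return batch_size * 1000.0 / latency_ms

def latency_category(latency_ms):
    # Thresholds mirror the table above.
    if latency_ms < 5:
        return "real-time"
    if latency_ms < 16:
        return "interactive"
    if latency_ms < 33:
        return "smooth"
    if latency_ms < 100:
        return "responsive"
    if latency_ms <= 1000:
        return "batch"
    return "offline"

print(throughput_fps(25.0), latency_category(25.0))  # 40.0 smooth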

🖥️ Supported Hardware

| Hardware | Type | Compute | Memory | TDP | Year |
|----------|------|---------|--------|-----|------|
| NVIDIA RTX 4090 | GPU | 82.6 TFLOPS | 24 GB | 450W | 2022 |
| NVIDIA RTX 3090 | GPU | 35.6 TFLOPS | 24 GB | 350W | 2020 |
| NVIDIA A100 | GPU | 78.0 TFLOPS | 40 GB | 400W | 2020 |
| NVIDIA V100 | GPU | 28.3 TFLOPS | 16 GB | 300W | 2017 |
| NVIDIA T4 | GPU | 8.1 TFLOPS | 16 GB | 70W | 2018 |
| Jetson Orin | Edge GPU | 5.3 TFLOPS | 32 GB | 60W | 2022 |
| Jetson Xavier | Edge GPU | 1.4 TFLOPS | 16 GB | 30W | 2018 |
| Jetson Nano | Edge GPU | 0.47 TFLOPS | 4 GB | 10W | 2019 |
| Coral Edge TPU | TPU | 4.0 TOPS | N/A | 2W | 2019 |
| Raspberry Pi 4 | CPU | 0.01 TFLOPS | 4 GB | 5W | 2019 |
| Intel Core i9 | CPU | 0.5 TFLOPS | 64 GB | 125W | 2022 |

🧩 Supported Models

| Model | Parameters | GFLOPs | Task | Complexity |
|-------|------------|--------|------|------------|
| MobileNetV2 | 3.4M | 0.3 | Classification | Lightweight |
| EfficientNet-B0 | 5.3M | 0.4 | Classification | Efficient |
| YOLOv5s | 7.2M | 16.5 | Object Detection | Real-time |
| ResNet-50 | 25.6M | 4.1 | Classification | Standard |
| BERT-base | 110M | 22.0 | NLP | Transformer |
| VGG-16 | 138M | 15.5 | Classification | Heavy |

πŸ“ Project Structure

ML-Inference-Latency-Predictor/
│
├── src/
│   ├── main.py                 # Flask API application
│   ├── predict.py              # Prediction logic & model loading
│   ├── train.py                # Model training script
│   ├── test_api.py             # API test suite
│   └── data/
│       └── inference_latency.csv   # Training dataset
│
├── model/
│   └── model.pkl               # Trained model artifacts
│
├── frontend/
│   ├── Dockerfile              # Frontend container
│   ├── requirements.txt        # Streamlit dependencies
│   ├── streamlit_app.py        # Streamlit application
│   ├── config.example.py       # Config template
│   └── config.py               # API URL (gitignored)
│
├── Dockerfile                  # API container
├── requirements.txt            # API dependencies
├── .gitignore
└── README.md

🚀 Getting Started

Prerequisites

  • Python 3.11+
  • Google Cloud SDK (gcloud) for deployment
  • GCP account with billing enabled

Local Development

1. Clone the Repository

git clone https://github.com/tengli-alaska/Flask-GCP-Lab-ML-Inference-Latency-Predictor.git
cd Flask-GCP-Lab-ML-Inference-Latency-Predictor

2. Set Up the API

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# (Optional) Retrain the model
python -m src.train

# Run the API
python -m src.main

The API will be running at http://localhost:8080

Test the API:

# Health check
curl http://localhost:8080/health

# Get valid options
curl http://localhost:8080/options

# Run test suite
python -m src.test_api
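
Beyond the health check and options listing, predictions can also be requested from Python. The sketch below assumes the requests package and a /predict endpoint accepting this payload shape; neither is documented in this README, so check src/main.py and src/test_api.py for the actual contract:

import requests

# Hypothetical payload; field names follow the configuration options above.
payload = {
    "hardware": "NVIDIA T4",
    "model": "ResNet-50",
    "batch_size": 1,
    "precision": "FP16",
}
resp = requests.post("http://localhost:8080/predict", json=payload, timeout=10)
print(resp.status_code, resp.json())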

3. Set Up the Frontend

cd frontend

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Create config file
cp config.example.py config.py
# Edit config.py: BACKEND_URL = "http://localhost:8080"

# Run Streamlit
streamlit run streamlit_app.py

The frontend will be running at http://localhost:8501


☁️ GCP Deployment

Deploy API

# Build container image
gcloud builds submit --tag gcr.io/[YOUR_PROJECT_ID]/inference-latency-api

# Deploy to Cloud Run
gcloud run deploy inference-latency-api \
  --image gcr.io/[YOUR_PROJECT_ID]/inference-latency-api \
  --platform managed \
  --region us-central1 \
  --port 8080 \
  --allow-unauthenticated

Save the output URL: https://inference-latency-api-xxxxx.us-central1.run.app

Deploy Frontend

cd frontend

# Update config.py with your API URL
# BACKEND_URL = "https://inference-latency-api-xxxxx.us-central1.run.app"

# Build container image
gcloud builds submit --tag gcr.io/[YOUR_PROJECT_ID]/inference-latency-frontend

# Deploy to Cloud Run
gcloud run deploy inference-latency-frontend \
  --image gcr.io/[YOUR_PROJECT_ID]/inference-latency-frontend \
  --platform managed \
  --region us-central1 \
  --port 8501 \
  --allow-unauthenticated

The app is now live at: https://inference-latency-frontend-xxxxx.us-central1.run.app

📊 Data Sources

| Source | URL | Data Used |
|--------|-----|-----------|
| Lambda Labs | lambdalabs.com/gpu-benchmarks | GPU inference throughput |
| MLPerf | mlcommons.org | Standardized inference benchmarks |
| TensorRT | developer.nvidia.com/tensorrt | Optimized inference benchmarks |
| Jetson Benchmarks | developer.nvidia.com | Edge device performance |

👤 Author

Alaska Tengli
MLOps Labs, Northeastern University


πŸ“ License

This project is licensed under the MIT License.
