⚡ ML Inference Latency Predictor

A latency prediction tool for ML engineers: estimate inference time, throughput (FPS), and deployment feasibility across cloud GPUs, edge devices, and CPUs, trained on real benchmark data from Lambda Labs, MLPerf, and NVIDIA.

Python Flask Streamlit GCP


🎯 Problem Statement

Deploying machine learning models to production requires understanding how they will perform on target hardware. Engineers face critical questions:

  • "Will my model run fast enough on edge devices?"
  • "Which GPU should I provision for real-time inference?"
  • "How does batch size affect throughput?"
  • "Is INT8 quantization worth the accuracy trade-off for my latency budget?"

The Challenge: Answering these questions traditionally requires expensive hardware access, time-consuming benchmarking, and deep expertise in hardware-software optimization. Teams often over-provision expensive GPUs or discover latency issues late in deployment.

Solution: This application provides instant latency predictions based on real benchmark data, enabling engineers to make informed hardware decisions before deployment. By democratizing access to performance insights, teams can:

  • Select appropriate hardware early in the development cycle
  • Estimate infrastructure costs accurately
  • Identify bottlenecks before they impact production
  • Compare deployment options (cloud vs edge, GPU vs CPU)

🧠 Model & Dataset

Model Architecture

The predictor uses a Random Forest Regressor trained on real benchmark data with the following configuration:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=100,
    max_depth=20,
    min_samples_split=3,
    random_state=42
)

Why Random Forest?

  • Handles non-linear relationships between hardware specs and latency
  • Robust to outliers in benchmark data
  • Provides feature importance for interpretability (see the sketch after this list)
  • No assumptions about data distribution
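
To make the interpretability point above concrete: a fitted RandomForestRegressor exposes feature_importances_, which can be ranked against the training columns. A minimal sketch, assuming model is the fitted regressor from the snippet above and feature_names is a hypothetical list of the columns in the Feature Engineering table:

import pandas as pd

# Rank features by their impurity-based importance in the fitted forest.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))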

Target Transformation

Latency values span multiple orders of magnitude (0.5 ms to 5000 ms+), so a log transformation is applied to the target variable:

import numpy as np

y_log = np.log1p(latency_ms)       # Training: compress the wide latency range
latency_ms = np.expm1(prediction)  # Inference: map predictions back to milliseconds

This improves prediction accuracy across both fast GPUs and slow edge devices.

Feature Engineering

| Feature | Description | Source |
|---------|-------------|--------|
| tflops_fp16 | Hardware compute capacity (TFLOPS) | Hardware specs |
| memory_gb | GPU/device memory (GB) | Hardware specs |
| tdp_watts | Thermal design power (W) | Hardware specs |
| hw_year | Hardware release year | Hardware specs |
| model_params_m | Model parameters (millions) | Model specs |
| model_flops_g | Model compute requirement (GFLOPs) | Model specs |
| batch_size | Samples per inference | User input |
| precision_bits | Numeric precision (32/16/8) | User input |
| hardware_encoded | Hardware type (label encoded) | Categorical |
| model_encoded | Model architecture (label encoded) | Categorical |
| hw_type_encoded | Device class: gpu/edge_gpu/tpu/cpu | Categorical |
| category_encoded | Task: classification/detection/nlp | Categorical |
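
The categorical columns above are label encoded before training. As a rough illustration only (assuming scikit-learn's LabelEncoder; the actual preprocessing lives in src/train.py and may differ), a single feature row for ResNet-50 on an NVIDIA T4 could be assembled like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Numeric features come straight from the hardware and model spec tables below.
row = {
    "tflops_fp16": 8.1, "memory_gb": 16, "tdp_watts": 70, "hw_year": 2018,   # NVIDIA T4
    "model_params_m": 25.6, "model_flops_g": 4.1,                            # ResNet-50
    "batch_size": 1, "precision_bits": 16,
}

# Categorical features are label encoded; the label set here is illustrative.
hw_encoder = LabelEncoder().fit(["NVIDIA T4", "NVIDIA A100", "Jetson Nano"])
row["hardware_encoded"] = int(hw_encoder.transform(["NVIDIA T4"])[0])
# model_encoded, hw_type_encoded and category_encoded follow the same pattern.

X = pd.DataFrame([row])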

Dataset

The training data combines two sources:

| Source | Description | Share |
|--------|-------------|-------|
| Published Benchmarks | Real measurements from Lambda Labs, MLPerf, NVIDIA, and Jetson reports | ~60% |
| Interpolated | Estimated values using hardware scaling laws for untested combinations | ~40% |

The dataset is stored at src/data/inference_latency.csv.
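
For the interpolated rows, one plausible form of a scaling-law estimate is to project a measured latency onto untested hardware by the ratio of compute throughput. This is a sketch of the idea only, not necessarily the rule used to build the CSV:

# Hypothetical first-order compute scaling: slower hardware => proportionally
# higher latency, ignoring memory bandwidth and software-stack effects.
def estimate_latency_ms(measured_ms, tflops_measured, tflops_target):
    return measured_ms * (tflops_measured / tflops_target)

# e.g. a 10 ms measurement on a T4 (8.1 TFLOPS) projected onto a V100 (28.3 TFLOPS)
print(round(estimate_latency_ms(10.0, 8.1, 28.3), 1))  # ~2.9 ms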

Model Performance

| Metric | Value |
|--------|-------|
| MAE | 95.40 ms |
| RMSE | 990.79 ms |
| MAPE | 18.8% |
| R² | 0.6791 |

✨ Features

Core Functionality

  • Instant Latency Prediction: Get inference time estimates in milliseconds for any hardware-model combination
  • Throughput Calculation: Automatic FPS (frames per second) computation for real-time applications
  • Latency Categorization: Classifications from "real-time" (VR/AR ready) to "offline" (batch processing only)
  • Hardware Comparison: Side-by-side performance comparison across multiple hardware platforms
  • Batch Predictions: Predict multiple configurations in a single API call

Configuration Options

  • Precision Modes: FP32 (32-bit), FP16 (16-bit), INT8 (8-bit quantized)
  • Batch Sizes: 1, 2, 4, 8, 16, 32 samples per inference
  • 11 Hardware Platforms: From RTX 4090 to Raspberry Pi
  • 6 Model Architectures: Classification, detection, and NLP models

Latency Categories

| Category | Latency | FPS | Use Cases |
|----------|---------|-----|-----------|
| 🟢 Real-time | < 5 ms | 200+ | VR/AR, robotics, autonomous systems |
| 🟢 Interactive | 5-16 ms | 60-200 | Gaming, live video processing |
| 🟡 Smooth | 16-33 ms | 30-60 | Video analytics, streaming |
| 🟡 Responsive | 33-100 ms | 10-30 | Interactive applications |
| 🟠 Batch | 100-1000 ms | 1-10 | Near real-time batch processing |
| 🔴 Offline | > 1000 ms | < 1 | Offline batch processing only |
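
Throughput and the category label follow directly from the predicted latency. A minimal sketch of that mapping using the thresholds above (the function names are illustrative, not the API's actual helpers):

def throughput_fps(latency_ms, batch_size=1):
    # Samples processed per second for one inference over `batch_size` samples.
    return batch_size * 1000.0 / latency_ms

def latency_category(latency_ms):
    # Thresholds mirror the table above.
    if latency_ms < 5:
        return "real-time"
    if latency_ms < 16:
        return "interactive"
    if latency_ms < 33:
        return "smooth"
    if latency_ms < 100:
        return "responsive"
    if latency_ms <= 1000:
        return "batch"
    return "offline"

print(throughput_fps(25.0), latency_category(25.0))  # 40.0 smooth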

🖥️ Supported Hardware

| Hardware | Type | Compute | Memory | TDP | Year |
|----------|------|---------|--------|-----|------|
| NVIDIA RTX 4090 | GPU | 82.6 TFLOPS | 24 GB | 450W | 2022 |
| NVIDIA RTX 3090 | GPU | 35.6 TFLOPS | 24 GB | 350W | 2020 |
| NVIDIA A100 | GPU | 78.0 TFLOPS | 40 GB | 400W | 2020 |
| NVIDIA V100 | GPU | 28.3 TFLOPS | 16 GB | 300W | 2017 |
| NVIDIA T4 | GPU | 8.1 TFLOPS | 16 GB | 70W | 2018 |
| Jetson Orin | Edge GPU | 5.3 TFLOPS | 32 GB | 60W | 2022 |
| Jetson Xavier | Edge GPU | 1.4 TFLOPS | 16 GB | 30W | 2018 |
| Jetson Nano | Edge GPU | 0.47 TFLOPS | 4 GB | 10W | 2019 |
| Coral Edge TPU | TPU | 4.0 TOPS | N/A | 2W | 2019 |
| Raspberry Pi 4 | CPU | 0.01 TFLOPS | 4 GB | 5W | 2019 |
| Intel Core i9 | CPU | 0.5 TFLOPS | 64 GB | 125W | 2022 |

🧩 Supported Models

| Model | Parameters | GFLOPs | Task | Complexity |
|-------|------------|--------|------|------------|
| MobileNetV2 | 3.4M | 0.3 | Classification | Lightweight |
| EfficientNet-B0 | 5.3M | 0.4 | Classification | Efficient |
| YOLOv5s | 7.2M | 16.5 | Object Detection | Real-time |
| ResNet-50 | 25.6M | 4.1 | Classification | Standard |
| BERT-base | 110M | 22.0 | NLP | Transformer |
| VGG-16 | 138M | 15.5 | Classification | Heavy |

πŸ“ Project Structure

ML-Inference-Latency-Predictor/
│
├── src/
│   ├── main.py                 # Flask API application
│   ├── predict.py              # Prediction logic & model loading
│   ├── train.py                # Model training script
│   ├── test_api.py             # API test suite
│   └── data/
│       └── inference_latency.csv   # Training dataset
│
├── model/
│   └── model.pkl               # Trained model artifacts
│
├── frontend/
│   ├── Dockerfile              # Frontend container
│   ├── requirements.txt        # Streamlit dependencies
│   ├── streamlit_app.py        # Streamlit application
│   ├── config.example.py       # Config template
│   └── config.py               # API URL (gitignored)
│
├── Dockerfile                  # API container
├── requirements.txt            # API dependencies
├── .gitignore
└── README.md

🚀 Getting Started

Prerequisites

  • Python 3.11+
  • Google Cloud SDK (gcloud) for deployment
  • GCP account with billing enabled

Local Development

1. Clone the Repository

git clone https://github.com/tengli-alaska/Flask-GCP-Lab-ML-Inference-Latency-Predictor.git
cd Flask-GCP-Lab-ML-Inference-Latency-Predictor

2. Set Up the API

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# (Optional) Retrain the model
python -m src.train

# Run the API
python -m src.main

The API will be running at http://localhost:8080

Test the API:

# Health check
curl http://localhost:8080/health

# Get valid options
curl http://localhost:8080/options

# Run test suite
python -m src.test_api
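
Beyond the health check and options listing, predictions can also be requested from Python. The sketch below assumes the requests package and a /predict endpoint accepting this payload shape; neither is documented in this README, so check src/main.py and src/test_api.py for the actual contract:

import requests

# Hypothetical payload; field names follow the configuration options above.
payload = {
    "hardware": "NVIDIA T4",
    "model": "ResNet-50",
    "batch_size": 1,
    "precision": "FP16",
}
resp = requests.post("http://localhost:8080/predict", json=payload, timeout=10)
print(resp.status_code, resp.json())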

3. Set Up the Frontend

cd frontend

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Create config file
cp config.example.py config.py
# Edit config.py: BACKEND_URL = "http://localhost:8080"

# Run Streamlit
streamlit run streamlit_app.py

The frontend will be running at http://localhost:8501


☁️ GCP Deployment

Deploy API

# Build container image
gcloud builds submit --tag gcr.io/[YOUR_PROJECT_ID]/inference-latency-api

# Deploy to Cloud Run
gcloud run deploy inference-latency-api \
  --image gcr.io/[YOUR_PROJECT_ID]/inference-latency-api \
  --platform managed \
  --region us-central1 \
  --port 8080 \
  --allow-unauthenticated

Save the output URL: https://inference-latency-api-xxxxx.us-central1.run.app

Deploy Frontend

cd frontend

# Update config.py with your API URL
# BACKEND_URL = "https://inference-latency-api-xxxxx.us-central1.run.app"

# Build container image
gcloud builds submit --tag gcr.io/[YOUR_PROJECT_ID]/inference-latency-frontend

# Deploy to Cloud Run
gcloud run deploy inference-latency-frontend \
  --image gcr.io/[YOUR_PROJECT_ID]/inference-latency-frontend \
  --platform managed \
  --region us-central1 \
  --port 8501 \
  --allow-unauthenticated

The app is now live at: https://inference-latency-frontend-xxxxx.us-central1.run.app

📊 Data Sources

| Source | URL | Data Used |
|--------|-----|-----------|
| Lambda Labs | lambdalabs.com/gpu-benchmarks | GPU inference throughput |
| MLPerf | mlcommons.org | Standardized inference benchmarks |
| TensorRT | developer.nvidia.com/tensorrt | Optimized inference benchmarks |
| Jetson Benchmarks | developer.nvidia.com | Edge device performance |

👤 Author

Alaska Tengli
MLOps Labs, Northeastern University


πŸ“ License

This project is licensed under the MIT License.
