Benchmarking and inference tools for Vision Language Models (VLMs) using ONNX Runtime GenAI.
Scripts provided:
vlm_run.py- Quick inference to test a model and see outputvlm_benchmark.py- Run benchmark for a single image/sizerun_all_benchmarks.py- Run benchmarks across multiple image sizes and save results
| Metric | Description |
|---|---|
| Preprocessing Time | Time spent in multimodal processor (image preprocessing) |
| TTFT | Time To First Token - includes prefill + first token generation |
| TPS (Decode) | Tokens per second during decode phase |
| Prefill Throughput | Tokens per second during prefill (prompt processing) |
| Image Tokens | Number of tokens generated from the image |
conda create -n vlm-bench python=3.12
conda activate vlm-bench
pip install onnxruntime-genai pillow numpy psutilNote: The above setup is for CPU models. To run AMD NPU models, activate the Ryzen AI conda environment instead.
python generate_test_image.pyThis creates test_image.jpg (4032x4032) with varied visual patterns for benchmarking.
python vlm_run.py -m /path/to/model -i image.jpgpython vlm_benchmark.py \
-m /path/to/model \
-i test_image.jpg \
--image_size 1024 1024 \
--max_tokens 50 \
-n 3 \
-w 1 \
-o results.json \
-vpython run_all_benchmarks.py -m /path/to/modelOr run specific sizes with verbose output:
python run_all_benchmarks.py -m /path/to/model --sizes 1024x1024 2048x2048 -vSimple inference script to quickly test a model and see output. No benchmarking, just runs the model once.
usage: vlm_run.py [-h] -m MODEL_PATH -i IMAGE_PATH
[-p PROMPT] [-e {cpu,dml,follow_config}]
[--image_size WIDTH HEIGHT]
[--max_tokens MAX_TOKENS] [--max_length MAX_LENGTH]
[--show_prompt]
Arguments:
-m, --model_path Path to the ONNX model folder
-i, --image_path Path to input image
-p, --prompt Prompt text (default: "Describe this image in detail.")
-e, --execution_provider
Execution provider: cpu, dml, follow_config
--image_size Resize image to this size (width height)
--max_tokens Max tokens to generate (default: 256)
--max_length Max context length (default: 16384)
--show_prompt Print the full prompt before generation
Examples:
# Basic inference
python vlm_run.py -m ./model -i image.jpg
# Custom prompt
python vlm_run.py -m ./model -i image.jpg -p "What's in this image?"
# Resize image before running
python vlm_run.py -m ./model -i image.jpg --image_size 1024 1024
# Generate more tokens
python vlm_run.py -m ./model -i image.jpg --max_tokens 512
# Debug prompt formatting
python vlm_run.py -m ./model -i image.jpg --show_promptBenchmark engine for a single image/size combination. Use this for quick single-size tests.
usage: vlm_benchmark.py [-h] -m MODEL_PATH -i IMAGE_PATH
[-e {cpu,dml,follow_config}]
[-p PROMPT] [--image_size WIDTH HEIGHT]
[--max_tokens MAX_TOKENS] [--max_length MAX_LENGTH]
[-n NUM_ITERATIONS] [-w WARMUP_ITERATIONS]
[-o OUTPUT_JSON] [-v]
Arguments:
-m, --model_path Path to the ONNX model folder
-i, --image_path Path to input image
-e, --execution_provider
Execution provider: cpu, dml, follow_config
-p, --prompt Prompt text for the VLM
--image_size Target image size (width height)
--max_tokens Max tokens to generate (default: 100)
--max_length Max context length (default: 16384)
-n, --num_iterations Benchmark iterations (default: 5)
-w, --warmup_iterations
Warmup iterations (default: 2)
-o, --output_json Export results to JSON
-v, --verbose Enable verbose output (shows generated text)
Examples:
# Quick test
python vlm_benchmark.py -m ./model -i image.jpg --image_size 1024 1024
# Full benchmark with JSON export
python vlm_benchmark.py -m ./model -i image.jpg --image_size 1024 1024 \
-n 5 -w 2 --max_tokens 100 -o results.json -vBatch runner for benchmarking across multiple image sizes. Outputs:
benchmark_<size>.json- per-size resultsall_benchmarks_combined.json- combined JSONsummary.md- markdown table
usage: run_all_benchmarks.py [-h] -m MODEL_PATH [-i IMAGE] [-o OUTPUT_DIR]
[-p PROMPT] [--max_tokens MAX_TOKENS]
[--max_length MAX_LENGTH]
[-n NUM_ITERATIONS] [-w WARMUP_ITERATIONS]
[--sizes SIZES [SIZES ...]] [-v] [--compact]
Arguments:
-m, --model_path Path to the ONNX model folder (required)
-i, --image Source image path (default: ./test_image.jpg)
-o, --output_dir Output directory for results (default: ./benchmark_results)
-p, --prompt Prompt text
--max_tokens Max tokens to generate (default: 30)
--max_length Max context length (default: 16384)
-n, --num_iterations Benchmark iterations (default: 2)
-w, --warmup_iterations Warmup iterations (default: 1)
--sizes Specific sizes to benchmark (e.g., 1024x1024 2048x2048)
-v, --verbose Enable verbose output
--compact Print compact summary (Total Tokens, Image Tokens, TTFT, Decode TPS)
Examples:
# Run all sizes (448x448 to 4032x4032)
python run_all_benchmarks.py -m ./phi-4-mm-onnx
# Custom image and output directory
python run_all_benchmarks.py -m ./model -i my_image.jpg -o ./my_results
# Run specific sizes only
python run_all_benchmarks.py -m ./model --sizes 1024x1024 2048x2048
# Verbose mode (shows generated text)
python run_all_benchmarks.py -m ./model -v --sizes 1024x1024
# Compact output (simplified table)
python run_all_benchmarks.py -m ./model --compactGenerates synthetic test images with visual complexity for benchmarking.
The generated image includes:
- Charts (bar, line, pie)
- Text at multiple sizes
- Geometric shapes and patterns
- Varying complexity regions
# Default: 4032x4032
python generate_test_image.py
# Custom size and output
python generate_test_image.py -s 2048 -o my_test_image.jpg============================================================
VLM BENCHMARK RESULTS
============================================================
Model Type: phi4mm
Image Size: 1024x1024
Number of runs: 3
Token Breakdown:
Total Prompt Tokens: 2635
Text Tokens: 16
Image Tokens: 2619
Preprocessing Time (image processing):
Average: 74.41 ms
Time To First Token (TTFT):
Average: 46948.55 ms
Prefill Throughput:
Average: 56.97 tokens/sec
Token Generation (excluding first token):
Average: 277.03 ms/token
TPS: 3.61 tokens/sec
Peak Memory Usage:
8192.45 MB (8.00 GB)
============================================================
==============================================================================================================
BENCHMARK SUMMARY
==============================================================================================================
Model: ./phi-4-mm-fine-tuned-onnx
Image: ./test_image.jpg
Prompt: "Describe this image in detail."
Text tokens (constant across sizes): 16
Size Img Tok Total Tok Preproc (ms) TTFT (s) Prefill Decode Decode Lat
(tokens/s) (tokens/s) (ms/token)
--------------------------------------------------------------------------------------------------------------
448x448 539 555 9.76 8.94 62.10 4.05 246.94
1024x1024 2619 2635 74.41 46.95 56.97 3.61 277.03
2048x2048 5765 5781 198.19 137.52 42.05 2.29 436.58
==============================================================================================================
============================================================
BENCHMARK SUMMARY
============================================================
Model: ./phi-4-mm-fine-tuned-onnx
Image: ./test_image.jpg
Prompt: "Describe this image in detail."
Text tokens (constant across sizes): 16
Size Total Tok Img Tok TTFT (s) TPS
(tok/s)
------------------------------------------------------------
448x448 555 539 8.95 4.05
1024x1024 2635 2619 47.02 3.61
2048x2048 5781 5765 137.72 2.29
============================================================
Results exported with -o include:
{
"config": {
"model_path": "...",
"model_type": "phi4mm",
"image_size": [1024, 1024],
"prompt": "...",
"max_tokens": 50,
"max_length": 16384
},
"summary": {
"token_breakdown": {
"prompt_tokens": 2635,
"text_tokens": 16,
"image_tokens": 2619
},
"preprocessing_ms": { "avg": 74.41, "p50": 74.95, ... },
"ttft_ms": { "avg": 46948.55, "p50": 47005.69, ... },
"prefill_throughput_tps": { "avg": 56.97, "p50": 57.06 },
"token_generation_ms": { "avg": 277.03, "tps_avg": 3.61, ... },
"peak_memory_bytes": 8589545472,
"peak_memory_mb": 8192.45
},
"raw_measurements": {
"preprocessing_times_ms": [...],
"ttft_times_ms": [...],
"token_gen_times_ms": [...]
}
}