A practical benchmark suite for the NVIDIA DGX Spark
The DGX Spark is a strange and fascinating machine: 128 GB of unified Blackwell memory in a low-power box that doesn’t behave like a 4090 and doesn’t pretend to be an H100. When I first unboxed mine, I quickly realized that the usual GPU intuition breaks down: the theoretical numbers don’t map to real workloads, and there was no clean, honest way to understand what this machine is actually good at.
So ChatGPT and I created this benchmark.
The goal is to provide a practical, real-world performance snapshot across the kinds of workloads people actually run today: GEMM throughput, memory bandwidth, kernel latency, SD1.5, SDXL, SDXL Turbo, LLM tokens/sec, and usable unified memory. The script also includes a reference comparison across Spark, L40S, H200, GH200, 4090, and H100 so you can see where Spark fits in the broader GPU landscape.
This isn’t meant to replicate NVIDIA’s theoretical peak numbers. It’s meant to show what the DGX Spark actually delivers in user space, using standard PyTorch and diffusers pipelines. It gives owners a consistent baseline, helps set proper expectations, and highlights Spark’s real advantage:
Large-memory, inference-first workloads that don’t fit well on traditional GPUs.
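To give a flavor of the timing methodology behind numbers like GEMM throughput, here is a minimal CPU-only sketch using NumPy. This is a hypothetical illustration of the approach, not code from spark_bench.py (which uses PyTorch on the GPU): time a matrix multiply, count its 2·n³ floating-point operations, and divide.

```python
import time
import numpy as np

def gemm_tflops(n=1024, iters=10):
    # Time an n x n matmul and convert to TFLOP/s (2 * n^3 FLOPs per GEMM).
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up so allocation and dispatch costs don't skew the timing
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    dt = (time.perf_counter() - t0) / iters
    return 2 * n**3 / dt / 1e12

print(f"~{gemm_tflops():.3f} TFLOP/s (CPU, FP32)")
```

The GPU version is the same idea with `torch.matmul` plus `torch.cuda.synchronize()` around the timer, since GPU kernels launch asynchronously.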
This assumes:
- Your DGX Spark is already set up and updated
- You can SSH into the machine or access it through NVIDIA Sync
- Docker is already working (it is by default on Spark)
The container ships with:
- PyTorch 2.9.0a0+50eac811a6 with official Blackwell (sm_121) support
- CUDA 13.0.1
- cuDNN, NCCL, Apex, and NVIDIA optimizations
- A clean, known-good environment for reproducible benchmarks
Pull and run NVIDIA's PyTorch container:

```bash
docker run --gpus all -it nvcr.io/nvidia/pytorch:25.09-py3
```

You should now be inside:

```
root@<container>:/workspace#
```
We explicitly pin NumPy < 2.0 because PyTorch extensions in these containers are still built against NumPy 1.x.
```bash
pip install --upgrade pip \
  diffusers transformers accelerate sentencepiece safetensors \
  huggingface_hub opencv-python "numpy<2"
```

Some models (SDXL Turbo, LLMs, etc.) require a Hugging Face login:

```bash
hf auth login
```

Paste your token when prompted.
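After installing, you can sanity-check that the NumPy pin took effect with a quick snippet (a hypothetical helper, not part of spark_bench.py):

```python
# Verify the NumPy major version matches what the container's
# PyTorch extensions were built against (1.x).
import numpy as np

major = int(np.__version__.split(".")[0])
status = "OK for this container" if major < 2 else "WARNING: expected numpy<2"
print(f"NumPy {np.__version__} - {status}")
```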
Clone and run the benchmark:

```bash
git clone https://github.com/rossingram/Spark-DGX-Benchmark.git
cd Spark-DGX-Benchmark
chmod +x spark_bench.py
python spark_bench.py
```

You should now see a full benchmark report including:
- Environment info
- GEMM throughput
- Memory bandwidth
- Kernel latency
- SD1.5 / SDXL / SDXL Turbo performance
- LLM tokens/sec
- Unified memory slewing limits
- Comparison table vs 4090, L40S, H200, GH200, and H100
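One of those report lines, LLM tokens/sec, reduces to a simple timing ratio: decoded tokens divided by wall-clock generation time. A minimal sketch (the sleep-based lambda is a stand-in for a real model's generate call, not actual benchmark code):

```python
import time

def tokens_per_sec(generate, prompt, new_tokens):
    """Time a generation call and return decoded tokens per second."""
    t0 = time.perf_counter()
    generate(prompt, new_tokens)
    return new_tokens / (time.perf_counter() - t0)

# A sleep-based stub stands in for a real model's generate() call
# (here simulating ~100 tok/s).
tps = tokens_per_sec(lambda p, n: time.sleep(0.01 * n), "hello", 32)
print(f"{tps:.1f} tok/s")
```

For batch-1 inference this single number is the one that matters most, which is exactly the regime the Spark targets.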
The DGX Spark’s architecture is optimized for inference, memory, and efficiency, not brute-force GPU throughput. A few high-level notes:

It excels at:

- Huge models that don’t fit on consumer GPUs
- Unified-memory workloads where the CPU and GPU share 128 GB seamlessly
- Batch-1 inference (e.g., agents, RAG, copilots, local LLMs)
- Long-context or MoE models
- Large diffusion models that would instantly OOM a 4090
- Running multiple concurrent models without sharding

It is not built for:

- High-throughput FP16/BF16 training
- Multi-teraflop tensor-core GEMM workloads
- Anything that assumes H100-class tensor cores
- PCIe-bandwidth-dependent multi-GPU setups (the Spark is single-GPU)
In short, the DGX Spark is a 128 GB Blackwell inference appliance: closer to a GH200 cousin than to a gaming or workstation GPU.
It trades raw FLOPs for:
- massive memory,
- low power draw,
- unified architecture,
- and ease of running huge models locally.
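On the Spark, "massive memory" is simply the system pool, since the CPU and GPU share one unified address space. A quick way to see the size of that pool from Python (a hypothetical helper; it reads /proc/meminfo, so Linux only):

```python
# Report the total memory pool visible to the OS (Linux only).
# On Spark's unified architecture this is the same ~128 GB pool
# the CPU and GPU share.
def total_mem_gb(path="/proc/meminfo"):
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) / 1024**2  # kB -> GB
    except OSError:
        pass
    return None

mem = total_mem_gb()
print(f"{mem:.1f} GB total" if mem else "meminfo unavailable")
```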
This benchmark helps quantify that tradeoff in real numbers.
Example output (abridged):

```
================================================================================
FINAL SUMMARY — SPARK vs 4090 vs H100 vs L40S vs H200 vs GH200
================================================================================
RAW COMPUTE (TFLOPs, FP16/BF16 — Measured)
Spark:  ~11–12 TFLOPs
4090:   ~330 TFLOPs
H100:   ~1000 TFLOPs
...
```
If you have improvements, discoveries, or want to contribute additional tests:
PRs welcome. Issues welcome. Benchmark screenshots welcome.