---
layout: post
title: "CPU and GPU: Two answers to the same problem"
date: 2025-12-08 21:19:21 +0100
toc: true
description: "If you've ever wondered why we use both CPUs and GPUs,
the answer might surprise you: it's not really about acceleration or computation speed.
In fact, it's all about waiting for data. Specifically, it's about how these chips
handle the agonizing eternity (in processor time) it takes to fetch data from
memory and bring it closer to the computation units. Both CPUs and GPUs
acknowledge this problem, but they propose radically different solutions."
---

If you've ever wondered why we use both CPUs and GPUs, the answer might surprise you: it's not really
about acceleration or computation speed. In fact, it's all about waiting for data. Specifically,
it's about how these chips handle the agonizing eternity (in processor time) it takes to fetch data
from memory and bring it closer to the computation units.

## The Fundamental Problem: Memory is Slow

Here's the uncomfortable truth about modern computers: your processor is absurdly fast,
and your memory is comparatively glacial.

Consider this: a modern CPU can execute an instruction in less than a nanosecond.
But fetching data from RAM? That takes 50-100 nanoseconds, or even more.
This is the tyranny of memory latency, and it's the defining challenge of modern processor design.
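
You can feel this gap on your own machine with a pointer-chasing microbenchmark: each load depends on the previous one, so the processor can't overlap or prefetch anything and pays the full trip to RAM on almost every hop. A minimal sketch (exact numbers will vary with your hardware):

```cpp
// pointer_chase.cpp -- chase pointers through a buffer far larger than
// any cache, so almost every hop pays the full latency of a RAM access.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t N = 1 << 24;  // 16M entries (~128 MB), far beyond cache capacity
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), 0);

    // Sattolo's algorithm: a random permutation forming a single cycle,
    // so the chase below visits every entry in a cache-hostile order.
    std::mt19937_64 rng{42};
    for (size_t i = N - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }

    // Each load depends on the previous one: no prediction or
    // out-of-order trickery can hide the latency here.
    size_t idx = 0;
    const size_t hops = 10'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < hops; ++i) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
    std::printf("~%.1f ns per dependent load (idx=%zu)\n", ns, idx);
}
```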

But why is memory so slow? The answer lies in physics and economics.
DRAM (your RAM) uses just 1 transistor and 1 capacitor per bit, making it dense and cheap,
but those capacitors need constant refreshing and take time to charge and discharge.
We can build faster memory like SRAM (the memory used in caches), but it requires 6 transistors per bit
and consumes significantly more power. You simply can't fit gigabytes of SRAM on a chip: it would
take up too much area, run too hot, and cost a fortune.
Distance matters too: signals traveling between the RAM and the CPU cover several centimeters of motherboard,
a huge distance even at the speed of light when you're measuring in nanoseconds.
Memory is slow because fast memory is expensive, power-hungry, and takes up too much space.

Both CPUs and GPUs acknowledge this problem, but they propose radically different solutions.

## The CPU's Strategy: Predict, Prefetch, and Stay Busy

The CPU is like a brilliant but impatient genius that has learned to cope by employing a few key strategies.

### Massive Caches

CPUs dedicate enormous amounts of space to cache memory, which is ultra-fast memory built directly into the chip.
A typical modern CPU might have 3 levels of cache, accessible in ~4, ~12 and ~40 cycles respectively.
If the data you need is already in cache, you don't have to wait for that 100-cycle trip to RAM.
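
Caches reward programs that touch memory in order, because a whole cache line is fetched at once. A minimal sketch of the effect: the two loops below perform exactly the same additions, but one walks the matrix along its rows (using every byte of each fetched line) while the other jumps down columns (using one float per line before it's evicted).

```cpp
// cache_order.cpp -- identical work, very different cache behavior.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t R = 8192, C = 8192;  // 8192 x 8192 floats, ~256 MB
    std::vector<float> m(R * C, 1.0f);

    auto time_sum = [&](bool row_major, const char* label) {
        double sum = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        if (row_major) {
            for (size_t r = 0; r < R; ++r)      // sequential walk: every byte of
                for (size_t c = 0; c < C; ++c)  // a fetched cache line gets used
                    sum += m[r * C + c];
        } else {
            for (size_t c = 0; c < C; ++c)      // strided walk: one float per cache
                for (size_t r = 0; r < R; ++r)  // line, then the line is evicted
                    sum += m[r * C + c];
        }
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %.0f ms (sum=%.0f)\n", label,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
    };

    time_sum(true, "row-major");
    time_sum(false, "column-major");
}
```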

### Sophisticated Branch Prediction

CPUs employ complex strategies to predict which code path you'll take next, fetching data ahead of time.
Modern CPUs achieve 95%+ accuracy in branch prediction. When they guess right, the data is already waiting.
When they guess wrong, they've wasted cycles, but that's still better than always waiting.
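
The classic way to feel this is to run the same branchy loop over random data and then over sorted data. The values and the total work are identical; the only thing that changes is how predictable the branch is. A rough sketch:

```cpp
// branchy.cpp -- same work twice; only the branch's predictability changes.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng{42};
    for (int& x : data) x = rng() % 256;

    auto time_it = [&](const char* label) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (int x : data)
            if (x >= 128) sum += x;  // taken ~50% of the time
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %.0f ms (sum=%lld)\n", label,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
    };

    time_it("unsorted (unpredictable branch)");
    std::sort(data.begin(), data.end());  // same values, but branch outcomes now come in long runs
    time_it("sorted (predictable branch)");
}
```

One caveat: at high optimization levels the compiler may replace the branch with a branchless conditional move, flattening the difference, so treat this as an illustration rather than a guaranteed result.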

### Out-of-Order Execution

While waiting for data from memory, a CPU doesn't just sit idle. It looks ahead in your program,
identifies instructions that don't depend on the delayed data, and executes those instead.
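
You can glimpse what the out-of-order engine exploits by handing it independent work. In the sketch below, the first loop is one long dependency chain (each add needs the previous result), while the second splits the sum into four independent chains the core can keep in flight at once. A rough illustration, not a rigorous benchmark:

```cpp
// ilp.cpp -- one dependency chain vs. four independent ones.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t N = 1 << 24;
    std::vector<float> a(N, 1.0001f);

    // One accumulator: every add depends on the previous result,
    // so the adds execute strictly one after another.
    auto t0 = std::chrono::steady_clock::now();
    float s = 0.0f;
    for (size_t i = 0; i < N; ++i) s += a[i];
    auto t1 = std::chrono::steady_clock::now();

    // Four accumulators: four independent chains the out-of-order
    // core can keep in flight simultaneously.
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < N; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    auto t2 = std::chrono::steady_clock::now();

    std::printf("1 chain: %.0f ms, 4 chains: %.0f ms (sums %.0f / %.0f)\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count(),
                (double)s, (double)(s0 + s1 + s2 + s3));
}
```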

### Hyperthreading

CPUs can juggle multiple threads on a single core. When one thread stalls waiting for memory,
the CPU switches to another thread.

**The CPU's philosophy**: We have a small number of very complex cores that are incredibly
good at staying busy even when data is delayed. We'll predict what you need, prefetch it,
cache recently used data and address translations, and do other work while waiting.

## The GPU's Strategy: Hide Latency behind Massive Parallelism

The GPU takes a completely different approach to the same slow-memory problem.

### Thousands of Simple Cores

Where a CPU might have 8-16 powerful cores, a modern GPU has tens of thousands of simpler cores.
These cores are much simpler than CPU cores: they can't predict branches, execute out of order,
or do most of the clever tricks CPUs do. But what they lack in sophistication, they make up for in sheer numbers.

### SIMT: The Assembly Line Model

GPUs use an execution model called SIMT (Single Instruction, Multiple Threads).
Think of it as having 32 cashiers at a supermarket, all scanning items at the same time, but for different customers.

In SIMT, groups of 32 threads (called a warp in NVIDIA terminology) execute the exact same instruction simultaneously.
When the instruction says "add these two numbers," all 32 threads add their respective numbers at the same time.

This is fundamentally different from a CPU, where each core runs its own instruction stream
as fast as possible. On a GPU, those 32 threads must execute the same instruction,
but each operates on its own data. Why does this matter for memory?
Because when those 32 threads hit a memory access, they hit it together: the warp waits as one
while the hardware serves their 32 neighboring accesses as a few wide memory transactions.
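
Here's what this looks like in CUDA, as a minimal sketch (the kernel and buffer names are ours, purely for illustration): every thread computes its own index and handles one element, and all 32 threads of a warp execute each line in lockstep.

```cuda
// vec_add.cu -- minimal SIMT sketch: every thread in a warp runs the
// same instruction on its own element.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks one element
    if (i < n) c[i] = a[i] + b[i];  // 32 threads of a warp do this add together
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                         // 8 warps per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover every element
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    std::printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```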

### Organizing the Chaos: Blocks and Grids

GPUs organize threads hierarchically:

- **Threads** are grouped into **warps** (32 threads that execute the same instruction)
- **Warps** are grouped into **blocks** (typically 128-1024 threads)
- **Blocks** are grouped into a **grid** (the entire problem you're solving)

When you launch a GPU program, for example to render a 1920x1080 image, you might say
"I need to process over 2 million pixels. Organize them into blocks of 256 threads each".
The GPU creates ~8,000 blocks and distributes them across its streaming multiprocessors
(the physical units that contain those GPU cores).
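
In CUDA, that launch might look like the sketch below, with a hypothetical `shadePixel` kernel standing in for the real per-pixel work: 16x16 blocks of 256 threads each, and 120x68 = 8,160 blocks covering the whole image.

```cuda
// grid_shape.cu -- block/grid hierarchy for a 1920x1080 image,
// matching the "blocks of 256 threads" example in the text.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void shadePixel(unsigned char* img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) img[y * w + x] = (x ^ y) & 0xff;  // stand-in for real shading
}

int main() {
    const int w = 1920, h = 1080;
    unsigned char* img;
    cudaMallocManaged(&img, w * h);

    dim3 threads(16, 16);                       // 16x16 = 256 threads (8 warps) per block
    dim3 blocks((w + 15) / 16, (h + 15) / 16);  // 120 x 68 = 8160 blocks, the "~8,000" above
    shadePixel<<<blocks, threads>>>(img, w, h);
    cudaDeviceSynchronize();

    std::printf("%d blocks launched\n", blocks.x * blocks.y);
    cudaFree(img);
}
```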

Threads within a block can share fast on-chip "shared memory" and synchronize with one another,
but blocks themselves are independent. This independence is crucial: it means the GPU can schedule
blocks in any order, on any available hardware, maximizing utilization.

### Oversubscription: The Key to Hiding Latency

Here's the magic: a modern GPU might have 50,000+ threads in flight simultaneously across all its streaming multiprocessors.
When one warp hits a memory access and has to wait 100+ cycles, the GPU doesn't try to keep those threads busy.
It just switches to another warp, instantly and for free.

This context switch is essentially free because the GPU hardware is designed for it.
Unlike a software context switch on a CPU, which has to save and restore state, the GPU keeps
dozens of warps' worth of state (registers, program counters) permanently resident in fast on-chip memory.

The math is simple: if you have 50,000 threads on 16,000 cores, then even if 70% of your threads are
waiting for memory, you still have 15,000 threads ready to execute. That's enough to keep nearly all your cores busy.

### Minimal Caching, Maximum Throughput

GPUs have some cache, but it's proportionally tiny compared to CPUs.
A CPU might dedicate 60% of its die to cache; a GPU might dedicate 10%.
Instead, that space goes to more cores, more registers to hold more thread contexts, and more execution units.
It's a different way of spending the same area, power, and transistor budget.

**The GPU's philosophy**: Throughput over latency. Memory is slow and that's okay.
We have so many threads organized in warps and blocks that while thousands are waiting for memory,
thousands more are ready to compute. We'll execute them all in lockstep (SIMT),
switch between them for free, and never let an execution unit sit idle.

## The Tradeoff

This explains why you can't just replace a CPU with a GPU. They solve the same problem (memory is frustratingly slow)
in different ways. Neither approach is "better": they're optimized for different workloads.

CPUs excel at:

- Tasks that require low latency for a single stream of instructions
- Code with lots of branches and unpredictable control flow (GPUs support if statements, but when threads take different paths,
the warp must execute the different sections one after another instead of simultaneously, as the sketch after this list shows)
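
A minimal CUDA sketch of that divergence penalty: in the kernel below, even and odd lanes of the same warp take different branches, so the hardware runs the two paths back to back with half the lanes masked off each time.

```cuda
// divergence.cu -- when threads in one warp branch differently,
// the two paths execute serially, not simultaneously.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void diverge(float* out) {
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * 2.0f;  // even lanes run while odd lanes sit masked off...
    else
        out[i] = i * 3.0f;  // ...then odd lanes run while even lanes wait
}

int main() {
    float* out;
    cudaMallocManaged(&out, 32 * sizeof(float));
    diverge<<<1, 32>>>(out);  // exactly one warp
    cudaDeviceSynchronize();
    std::printf("out[0]=%.0f out[1]=%.0f\n", out[0], out[1]);  // 0 and 3
    cudaFree(out);
}
```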

GPUs excel at:

- Problems that can be broken into millions of independent operations
- Regular, predictable memory access patterns
- Throughput-oriented workloads where total work done matters more than individual task completion time

The next time you're writing code, ask yourself: "Am I doing one smart thing, or a million stupid things?"
The answer will tell you whether you need the CPU's cleverness or the GPU's brute-force parallelism.