---
layout: post
title: "CPU and GPU: Two answers to the same problem"
date: 2025-12-08 21:19:21 +0100
toc: true
description: "If you've ever wondered why we use both CPUs and GPUs,
the answer might surprise you: it's not really about acceleration or computation speed.
In fact, it's all about waiting for data. Specifically, it's about how these chips
handle the agonizing eternity (in processor time) it takes to fetch data from
memory and bring it closer to the computation units. Both CPUs and GPUs
acknowledge this problem, but they propose radically different solutions."
---

If you've ever wondered why we use both CPUs and GPUs, the answer might surprise you: it's not really
about acceleration or computation speed. In fact, it's all about waiting for data. Specifically,
it's about how these chips handle the agonizing eternity (in processor time) it takes to fetch data
from memory and bring it closer to the computation units.

## The Fundamental Problem: Memory is Slow

Here's the uncomfortable truth about modern computers: your processor is absurdly fast,
and your memory is comparatively glacial.

Consider this: a modern CPU can execute an instruction in less than a nanosecond.
But fetching data from RAM? That takes 50-100 nanoseconds, or even more.
This is the tyranny of memory latency, and it's the defining challenge of modern processor design.
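
You can feel this gap on your own machine with a pointer-chasing microbenchmark: each load depends on the previous one, so the processor can't overlap or prefetch anything and pays the full trip to RAM on almost every hop. A minimal sketch (exact numbers will vary with your hardware):

```cpp
// pointer_chase.cpp -- chase pointers through a buffer far larger than
// any cache, so almost every hop pays the full latency of a RAM access.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t N = 1 << 24;  // 16M entries (~128 MB), far beyond cache capacity
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), 0);

    // Sattolo's algorithm: a random permutation forming a single cycle,
    // so the chase below visits every entry in a cache-hostile order.
    std::mt19937_64 rng{42};
    for (size_t i = N - 1; i > 0; --i) {
        std::uniform_int_distribution<size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }

    // Each load depends on the previous one: no prediction or
    // out-of-order trickery can hide the latency here.
    size_t idx = 0;
    const size_t hops = 10'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < hops; ++i) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
    std::printf("~%.1f ns per dependent load (idx=%zu)\n", ns, idx);
}
```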

But why is memory so slow? The answer lies in physics and economics.
DRAM (your RAM) uses just 1 transistor and 1 capacitor per bit, making it dense and cheap,
but those capacitors need constant refreshing and take time to charge and discharge.
We can build faster memory like SRAM (the memory used in caches), but it requires 6 transistors per bit
and consumes significantly more power. You simply can't fit gigabytes of SRAM on a chip: it would
take up too much area, run too hot, and cost a fortune.
Distance matters too: signals traveling between the RAM and the CPU cover several centimeters of motherboard,
a huge distance even at the speed of light when you're measuring in nanoseconds.
Memory is slow because fast memory is expensive, power-hungry, and takes up too much space.

Both CPUs and GPUs acknowledge this problem, but they propose radically different solutions.

## The CPU's Strategy: Predict, Prefetch, and Stay Busy

The CPU is like a brilliant but impatient genius that has learned to cope by employing a few key strategies.

### Massive Caches

CPUs dedicate enormous amounts of space to cache memory, which is ultra-fast memory built directly into the chip.
A typical modern CPU might have 3 levels of cache, accessible in ~4, ~12 and ~40 cycles respectively.
If the data you need is already in cache, you don't have to wait for that 100-cycle trip to RAM.
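
Caches reward programs that touch memory in order, because a whole cache line is fetched at once. A minimal sketch of the effect: the two loops below perform exactly the same additions, but one walks the matrix along its rows (using every byte of each fetched line) while the other jumps down columns (using one float per line before it's evicted).

```cpp
// cache_order.cpp -- identical work, very different cache behavior.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t R = 8192, C = 8192;  // 8192 x 8192 floats, ~256 MB
    std::vector<float> m(R * C, 1.0f);

    auto time_sum = [&](bool row_major, const char* label) {
        double sum = 0.0;
        auto t0 = std::chrono::steady_clock::now();
        if (row_major) {
            for (size_t r = 0; r < R; ++r)      // sequential walk: every byte of
                for (size_t c = 0; c < C; ++c)  // a fetched cache line gets used
                    sum += m[r * C + c];
        } else {
            for (size_t c = 0; c < C; ++c)      // strided walk: one float per cache
                for (size_t r = 0; r < R; ++r)  // line, then the line is evicted
                    sum += m[r * C + c];
        }
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %.0f ms (sum=%.0f)\n", label,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
    };

    time_sum(true, "row-major");
    time_sum(false, "column-major");
}
```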

### Sophisticated Branch Prediction

CPUs employ complex strategies to predict which code path you'll take next, fetching data ahead of time.
Modern CPUs achieve 95%+ accuracy in branch prediction. When they guess right, the data is already waiting.
When they guess wrong, they've wasted cycles, but that's still better than always waiting.
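
The classic way to feel this is to run the same branchy loop over random data and then over sorted data. The values and the total work are identical; the only thing that changes is how predictable the branch is. A rough sketch:

```cpp
// branchy.cpp -- same work twice; only the branch's predictability changes.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng{42};
    for (int& x : data) x = rng() % 256;

    auto time_it = [&](const char* label) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (int x : data)
            if (x >= 128) sum += x;  // taken ~50% of the time
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %.0f ms (sum=%lld)\n", label,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
    };

    time_it("unsorted (unpredictable branch)");
    std::sort(data.begin(), data.end());  // same values, but branch outcomes now come in long runs
    time_it("sorted (predictable branch)");
}
```

One caveat: at high optimization levels the compiler may replace the branch with a branchless conditional move, flattening the difference, so treat this as an illustration rather than a guaranteed result.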

### Out-of-Order Execution

While waiting for data from memory, a CPU doesn't just sit idle. It looks ahead in your program,
identifies instructions that don't depend on the delayed data, and executes those instead.
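
You can glimpse what the out-of-order engine exploits by handing it independent work. In the sketch below, the first loop is one long dependency chain (each add needs the previous result), while the second splits the sum into four independent chains the core can keep in flight at once. A rough illustration, not a rigorous benchmark:

```cpp
// ilp.cpp -- one dependency chain vs. four independent ones.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t N = 1 << 24;
    std::vector<float> a(N, 1.0001f);

    // One accumulator: every add depends on the previous result,
    // so the adds execute strictly one after another.
    auto t0 = std::chrono::steady_clock::now();
    float s = 0.0f;
    for (size_t i = 0; i < N; ++i) s += a[i];
    auto t1 = std::chrono::steady_clock::now();

    // Four accumulators: four independent chains the out-of-order
    // core can keep in flight simultaneously.
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < N; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    auto t2 = std::chrono::steady_clock::now();

    std::printf("1 chain: %.0f ms, 4 chains: %.0f ms (sums %.0f / %.0f)\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count(),
                (double)s, (double)(s0 + s1 + s2 + s3));
}
```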

### Hyperthreading

CPUs can juggle multiple threads on a single core. When one thread stalls waiting for memory,
the CPU switches to another thread.

**The CPU's philosophy**: We have a small number of very complex cores that are incredibly
good at staying busy even when data is delayed. We'll predict what you need, prefetch it,
cache recently used data and address translations, and do other work while waiting.

## The GPU's Strategy: Hide Latency behind Massive Parallelism

The GPU takes a completely different approach to the same slow-memory problem.

### Thousands of Simple Cores

Where a CPU might have 8-16 powerful cores, a modern GPU has tens of thousands of simpler cores.
These cores are much simpler than CPU cores: they can't predict branches, execute out of order,
or do most of the clever tricks CPUs do. But what they lack in sophistication, they make up for in sheer numbers.

### SIMT: The Assembly Line Model

GPUs use an execution model called SIMT (Single Instruction, Multiple Threads).
Think of it as having 32 cashiers at a supermarket, all scanning items at the same time, but for different customers.

In SIMT, groups of 32 threads (called a warp in NVIDIA terminology) execute the exact same instruction simultaneously.
When the instruction says "add these two numbers," all 32 threads add their respective numbers at the same time.

This is fundamentally different from a CPU, where each core runs its own instruction stream
as fast as possible. On a GPU, those 32 threads must execute the same instruction,
but each operates on its own data. Why does this matter for memory?
Because when those 32 threads hit a memory access, they hit it together: the warp waits as one
while the hardware serves their 32 neighboring accesses as a few wide memory transactions.
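
Here's what this looks like in CUDA, as a minimal sketch (the kernel and buffer names are ours, purely for illustration): every thread computes its own index and handles one element, and all 32 threads of a warp execute each line in lockstep.

```cuda
// vec_add.cu -- minimal SIMT sketch: every thread in a warp runs the
// same instruction on its own element.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks one element
    if (i < n) c[i] = a[i] + b[i];  // 32 threads of a warp do this add together
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                         // 8 warps per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover every element
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    std::printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```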

### Organizing the Chaos: Blocks and Grids

GPUs organize threads hierarchically:

- **Threads** are grouped into **warps** (32 threads that execute the same instruction)
- **Warps** are grouped into **blocks** (typically 128-1024 threads)
- **Blocks** are grouped into a **grid** (the entire problem you're solving)

When you launch a GPU program, for example to render a 1920x1080 image, you might say
"I need to process over 2 million pixels. Organize them into blocks of 256 threads each".
The GPU creates ~8,000 blocks and distributes them across its streaming multiprocessors
(the physical units that contain those GPU cores).
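
In CUDA, that launch might look like the sketch below, with a hypothetical `shadePixel` kernel standing in for the real per-pixel work: 16x16 blocks of 256 threads each, and 120x68 = 8,160 blocks covering the whole image.

```cuda
// grid_shape.cu -- block/grid hierarchy for a 1920x1080 image,
// matching the "blocks of 256 threads" example in the text.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void shadePixel(unsigned char* img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) img[y * w + x] = (x ^ y) & 0xff;  // stand-in for real shading
}

int main() {
    const int w = 1920, h = 1080;
    unsigned char* img;
    cudaMallocManaged(&img, w * h);

    dim3 threads(16, 16);                       // 16x16 = 256 threads (8 warps) per block
    dim3 blocks((w + 15) / 16, (h + 15) / 16);  // 120 x 68 = 8160 blocks, the "~8,000" above
    shadePixel<<<blocks, threads>>>(img, w, h);
    cudaDeviceSynchronize();

    std::printf("%d blocks launched\n", blocks.x * blocks.y);
    cudaFree(img);
}
```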

Threads within a block can share fast on-chip "shared memory" and synchronize with one another,
but blocks themselves are independent. This independence is crucial: it means the GPU can schedule
blocks in any order, on any available hardware, maximizing utilization.

### Oversubscription: The Key to Hiding Latency

Here's the magic: a modern GPU might have 50,000+ threads in flight simultaneously across all its streaming multiprocessors.
When one warp hits a memory access and has to wait 100+ cycles, the GPU doesn't try to keep those threads busy.
It just switches to another warp, instantly and for free.

This context switch is essentially free because the GPU hardware is designed for it.
Unlike a software context switch on a CPU, which has to save and restore state, the GPU keeps
dozens of warps' worth of state (registers, program counters) permanently resident in fast on-chip memory.

The math is simple: if you have 50,000 threads on 16,000 cores, then even if 70% of your threads are
waiting for memory, you still have 15,000 threads ready to execute. That's enough to keep nearly all your cores busy.

### Minimal Caching, Maximum Throughput

GPUs have some cache, but it's proportionally tiny compared to CPUs.
A CPU might dedicate 60% of its die to cache; a GPU might dedicate 10%.
Instead, that space goes to more cores, more registers to hold more thread contexts, and more execution units.
It's a different way of spending the same area, power, and transistor budget.

**The GPU's philosophy**: Throughput over latency. Memory is slow and that's okay.
We have so many threads organized in warps and blocks that while thousands are waiting for memory,
thousands more are ready to compute. We'll execute them all in lockstep (SIMT),
switch between them for free, and never let an execution unit sit idle.

## The Tradeoff

This explains why you can't just replace a CPU with a GPU. They solve the same problem (memory is frustratingly slow)
in different ways. Neither approach is "better": they're optimized for different workloads.

CPUs excel at:

- Tasks that require low latency for a single stream of instructions
- Code with lots of branches and unpredictable control flow (GPUs support if statements, but when threads take different paths,
the warp must execute the different sections one after another instead of simultaneously, as the sketch after this list shows)
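
A minimal CUDA sketch of that divergence penalty: in the kernel below, even and odd lanes of the same warp take different branches, so the hardware runs the two paths back to back with half the lanes masked off each time.

```cuda
// divergence.cu -- when threads in one warp branch differently,
// the two paths execute serially, not simultaneously.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void diverge(float* out) {
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * 2.0f;  // even lanes run while odd lanes sit masked off...
    else
        out[i] = i * 3.0f;  // ...then odd lanes run while even lanes wait
}

int main() {
    float* out;
    cudaMallocManaged(&out, 32 * sizeof(float));
    diverge<<<1, 32>>>(out);  // exactly one warp
    cudaDeviceSynchronize();
    std::printf("out[0]=%.0f out[1]=%.0f\n", out[0], out[1]);  // 0 and 3
    cudaFree(out);
}
```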

GPUs excel at:

- Problems that can be broken into millions of independent operations
- Regular, predictable memory access patterns
- Throughput-oriented workloads where total work done matters more than individual task completion time

The next time you're writing code, ask yourself: "Am I doing one smart thing, or a million stupid things?"
The answer will tell you whether you need the CPU's cleverness or the GPU's brute-force parallelism.