---
layout: post
title: "CPU and GPU: Two answers to the same problem"
date: 2025-12-08 21:19:21 +0100
toc: true
description: "If you've ever wondered why we use both CPUs and GPUs,
the answer might surprise you: it's not really about acceleration or computation speed.
In fact, it's all about waiting for data. Specifically, it's about how these chips
handle the agonizing eternity (in processor time) it takes to fetch data from
memory and bring it closer to the computation units. Both CPUs and GPUs
acknowledge this problem, but they propose radically different solutions."
---

If you've ever wondered why we use both CPUs and GPUs, the answer might surprise you: it's not really
about acceleration or computation speed. In fact, it's all about waiting for data. Specifically,
it's about how these chips handle the agonizing eternity (in processor time) it takes to fetch data
from memory and bring it closer to the computation units.

## The Fundamental Problem: Memory is Slow

Here's the uncomfortable truth about modern computers: your processor is absurdly fast,
and your memory is comparatively glacial.

Consider this: a modern CPU can execute an instruction in less than a nanosecond.
But fetching data from RAM? That takes 50-100 nanoseconds, or even more.
This is the tyranny of memory latency, and it's the defining challenge of modern processor design.
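You can observe this latency directly with a pointer-chasing microbenchmark, where every load depends
on the previous one, so the CPU can't overlap them. A minimal sketch (the buffer size is arbitrary and
the numbers you'll see are machine-dependent):

```cuda
// Minimal pointer-chasing sketch: a chain of dependent loads through a
// buffer far larger than any cache, so most steps pay the full trip to RAM.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t N = 1 << 24;  // 16M entries (~128 MB), well beyond the caches
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), 0);
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});  // random chain

    const size_t steps = 10'000'000;
    size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; ++s) i = next[i];  // each load waits for the last
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
    printf("~%.1f ns per dependent load (final index %zu)\n", ns, i);
}
```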

But why is memory so slow? The answer lies in physics and economics.
DRAM (your RAM) uses just 1 transistor and 1 capacitor per bit, making it dense and cheap,
but those capacitors need constant refreshing and take time to charge and discharge.
We can build faster memory like SRAM (the memory used in caches), but it requires 6 transistors per bit
and consumes significantly more power. You simply can't fit gigabytes of SRAM on a chip: it would
take up too much space, run too hot, and cost an astronomical amount.
Distance matters too: signals traveling from RAM to the CPU cover several centimeters on the motherboard,
a huge distance even at the speed of light when you're measuring in nanoseconds.
Memory is slow because fast memory is expensive, power-hungry, and takes up too much space.

Both CPUs and GPUs acknowledge this problem, but they propose radically different solutions.

## The CPU's Strategy: Predict, Prefetch, and Stay Busy

The CPU is like a brilliant but impatient genius that has learned to cope by employing certain strategies.

### Massive Caches

CPUs dedicate enormous amounts of space to cache memory, which is ultra-fast memory built directly into the chip.
A typical modern CPU might have 3 levels of cache, accessible in ~4, ~12 and ~40 cycles respectively.
If the data you need is already in cache, you don't have to wait for that ~100-cycle trip to RAM.
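You can see the hierarchy in action by timing random reads inside working sets of increasing size: the
per-access cost jumps each time the working set outgrows a cache level. A rough sketch (the cache sizes,
and therefore the inflection points, vary by CPU, and the random-number generator adds a constant
overhead to every access):

```cuda
// Sketch: time random reads inside buffers of increasing size. Per-access
// time rises each time the working set exceeds a cache level (L1 -> L2 ->
// L3 -> RAM). Sizes and timings below are machine-dependent.
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

double ns_per_access(size_t bytes) {
    size_t n = bytes / sizeof(int);
    std::vector<int> data(n, 1);
    std::mt19937 rng{123};
    std::uniform_int_distribution<size_t> idx(0, n - 1);

    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int s = 0; s < 1'000'000; ++s) sum += data[idx(rng)];  // random hits
    auto t1 = std::chrono::steady_clock::now();
    if (sum == 42) printf("!");  // keep the loop from being optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / 1e6;
}

int main() {
    for (size_t kb : {16, 256, 4096, 65536, 524288})  // ~L1, L2, L3, beyond
        printf("%8zu KB: %.2f ns/access\n", kb, ns_per_access(kb * 1024));
}
```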

### Sophisticated Branch Prediction

CPUs employ complex strategies to predict which code path you'll take next, fetching data ahead of time.
Modern CPUs achieve 95%+ accuracy in branch prediction. When they guess right, the data is already waiting.
When they guess wrong, they've wasted cycles, but that's still better than always waiting.
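The classic way to feel this is to run the same branchy loop over the same data twice: once as-is, once
sorted. The values are identical; only the predictability of the branch changes. A sketch (timings are
illustrative and compiler-dependent; an optimizer that turns the branch into a conditional move will
hide the effect):

```cuda
// Sketch: the same loop over the same values, but sorting the data makes
// the branch almost perfectly predictable.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

double run(const std::vector<int>& v) {
    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int x : v)
        if (x >= 128) sum += x;   // taken ~50% of the time on random data
    auto t1 = std::chrono::steady_clock::now();
    if (sum == -1) printf("!");   // keep the loop alive
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<int> v(1 << 24);
    std::mt19937 rng{7};
    for (int& x : v) x = rng() % 256;

    double unsorted = run(v);
    std::sort(v.begin(), v.end());   // same values, predictable branch
    double sorted = run(v);
    printf("unsorted: %.1f ms, sorted: %.1f ms\n", unsorted, sorted);
}
```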

### Out-of-Order Execution

While waiting for data from memory, a CPU doesn't just sit idle. It looks ahead in your program,
identifies instructions that don't depend on the delayed data, and executes those instead.
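A small way to see this machinery at work: sum an array once as a single dependency chain, and once as
four independent chains the core can overlap. A sketch (assumes you compile with optimizations but
without -ffast-math, which would re-associate the first loop for you):

```cuda
// Sketch: the same additions, expressed as one long dependency chain vs.
// four independent chains. Out-of-order hardware can overlap the latter.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> v(1 << 24, 1.0f);

    // One chain: each add must wait for the previous one to finish.
    float s = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (float x : v) s += x;
    auto t1 = std::chrono::steady_clock::now();

    // Four chains: independent adds the core can execute in parallel.
    float a = 0, b = 0, c = 0, d = 0;
    for (size_t i = 0; i < v.size(); i += 4) {
        a += v[i]; b += v[i + 1]; c += v[i + 2]; d += v[i + 3];
    }
    auto t2 = std::chrono::steady_clock::now();

    printf("1 chain: %.1f ms, 4 chains: %.1f ms (sums %.0f / %.0f)\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t1).count(),
           s, a + b + c + d);
}
```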

### Hyperthreading

Via simultaneous multithreading (which Intel brands "Hyper-Threading"), CPUs can juggle multiple threads
on a single core. When one thread stalls waiting for memory, the CPU switches to another thread.

**The CPU's philosophy**: We have a small number of very complex cores that are incredibly
good at staying busy even when data is delayed. We'll predict what you need, prefetch it,
cache what you've used recently (even page addresses, in the TLB), and do other work while waiting.

## The GPU's Strategy: Hide Latency behind Massive Parallelism

The GPU takes a completely different approach to mitigating slow memory.

### Thousands of Simple Cores

Where a CPU might have 8-16 powerful cores, a modern GPU has more than ten thousand simpler cores.
These cores are much simpler than CPU cores: they can't predict branches, execute out of order,
or do most of the clever tricks CPUs do. But what they lack in sophistication, they make up for in sheer numbers.

### SIMT: The Assembly Line Model

GPUs use an execution model called SIMT (Single Instruction, Multiple Threads).
Think of it as having 32 cashiers at a supermarket, all scanning items at the same time, but for different customers.

In SIMT, groups of 32 threads (called a warp in NVIDIA terminology) execute the exact same instruction simultaneously.
When the instruction says "add these two numbers," all 32 threads add their respective numbers at the same time.

This is fundamentally different from a CPU, where each core tries to run completely different
instructions as fast as possible. On a GPU, those 32 threads must execute the same instruction,
but each operates on its own data. Why does this matter for memory?
Because when those 32 threads hit a memory access, their requests can be combined (coalesced) into
one big chunk of memory, and they all wait for it together.
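In CUDA code this model is almost invisible: you write one scalar-looking function, and the hardware
runs it across every thread of every warp. A minimal, hypothetical kernel:

```cuda
// One function, many threads: each warp of 32 threads executes this same
// instruction stream in lockstep, but indexes different data.
__global__ void vecAdd(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) out[i] = a[i] + b[i];  // 32 adjacent loads coalesce into wide transactions
}
```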

### Organizing the Chaos: Blocks and Grids

GPUs organize threads hierarchically:

- **Threads** are grouped into **warps** (32 threads that execute the same instruction)
- **Warps** are grouped into **blocks** (typically 128-1024 threads)
- **Blocks** are grouped into a **grid** (the entire problem you're solving)

When you launch a GPU program, for example to render a 1920x1080 image, you might say
"I need to process over 2 million pixels. Organize them into blocks of 256 threads each".
The GPU creates ~8,100 blocks and distributes them across its streaming multiprocessors
(the physical units that contain those GPU cores).

Threads within a block can share fast on-chip "shared memory" and synchronize with one another,
but blocks are independent of each other. This independence is crucial: it means the GPU can
schedule blocks in any order, on any available hardware, maximizing utilization.
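The launch arithmetic for that image looks like this in CUDA (shadePixel is a hypothetical stand-in for
real per-pixel work):

```cuda
// One thread per pixel: 2,073,600 pixels / 256 threads per block = 8100 blocks.
__global__ void shadePixel(unsigned char* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] = 255;  // stand-in for real shading work
}

int main() {
    int pixels = 1920 * 1080;
    unsigned char* img;
    cudaMalloc(&img, pixels);

    int threadsPerBlock = 256;
    int blocks = (pixels + threadsPerBlock - 1) / threadsPerBlock;  // round up: 8100
    shadePixel<<<blocks, threadsPerBlock>>>(img, pixels);

    cudaDeviceSynchronize();
    cudaFree(img);
}
```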

### Oversubscription: The Key to Hiding Latency

Here's the magic: a modern GPU might have 50,000+ threads in flight simultaneously across all its streaming multiprocessors.
When one warp hits a memory access and has to wait 100+ cycles, the GPU doesn't try to keep those threads busy.
It just switches to another warp, instantly.

This context switch is essentially free because the GPU hardware is designed for it.
Unlike a CPU, which has to save and restore state when switching threads, the GPU keeps dozens of warps' worth of state
(registers, program counters) permanently resident in fast on-chip memory.

The math is simple: if you have 50,000 threads and 16,000 cores, then even if 70% of your threads are waiting for memory,
you still have 15,000 threads ready to execute. That's enough to keep nearly all your cores busy.
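You can ask the CUDA runtime how much of this latency-hiding slack a given kernel gets, i.e. how many
blocks can be resident on each streaming multiprocessor at once. A sketch, reusing the hypothetical
vecAdd kernel from earlier:

```cuda
// Sketch: query how many blocks of a kernel can be resident per streaming
// multiprocessor, i.e. how many warps the scheduler can switch between.
#include <cstdio>

__global__ void vecAdd(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

int main() {
    int device = 0, numSMs = 0, blocksPerSM = 0, blockSize = 256;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, vecAdd, blockSize, 0);
    printf("%d SMs x %d blocks x %d threads = %d threads resident at once\n",
           numSMs, blocksPerSM, blockSize, numSMs * blocksPerSM * blockSize);
}
```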

### Minimal Caching, Maximum Throughput

GPUs have some cache, but it's proportionally tiny compared to CPUs.
A CPU might dedicate 60% of its die to cache; a GPU might dedicate 10%.
Instead, that space goes to more cores, more registers to hold more thread contexts, and more execution units.
It's simply a different way of spending the die area, power, and transistor budget than the CPU's.

**The GPU's philosophy**: Throughput over latency. Memory is slow and that's okay.
We have so many threads organized in warps and blocks that while thousands are waiting for memory,
thousands more are ready to compute. We'll execute them all in lockstep (SIMT),
switch between them for free, and never let an execution unit sit idle.

## The Tradeoff

This explains why you can't just replace a CPU with a GPU. They solve the same problem (memory is frustratingly slow)
in different ways. Neither approach is "better"; they're optimized for different workloads.

CPUs excel at:

- Tasks that require low latency for a single stream of instructions
- Code with lots of branches and unpredictable control flow (GPUs support if statements, but when
  threads in a warp take different paths, the warp must execute the paths one after another instead
  of simultaneously, as the sketch after this list shows)
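Here is what that warp divergence looks like in a hypothetical CUDA kernel:

```cuda
// Sketch of divergence: when threads in one warp disagree on a branch,
// the hardware runs the two paths one after the other, masking lanes off.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)            // even and odd threads share every warp:
        x[i] = x[i] * 2.0f;    // pass 1: even lanes run, odd lanes idle
    else
        x[i] = x[i] + 1.0f;    // pass 2: odd lanes run, even lanes idle
}
// If the condition were (i < n / 2) instead, whole warps would take the
// same path and there would be no divergence penalty.
```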
GPUs excel at:

- Problems that can be broken into millions of independent operations
- Regular, predictable memory access patterns
- Throughput-oriented workloads where total work done matters more than individual task completion time

The next time you're writing code, ask yourself: "Am I doing one smart thing, or a million stupid things?"
The answer will tell you whether you need the CPU's clever performance or the GPU's brute-force parallelism.
