A Visual Guide to How Memory Really Works

Why "moving data", not "doing math", is the hidden force that decides whether your computer is fast or slow. A journey from a single leaking capacitor all the way to feeding a modern GPU, told so that anyone can feel it.

June 6, 2026

Here's a riddle that has quietly humbled a lot of smart programmers.

You write two tiny programs. They do the exact same math, the same number of additions, on the same data. One finishes in a blink. The other can take ten times longer.

Same math. Same answer. Same chip. So where did the time go?

Two Programs, One Riddle
identical math, identical data, identical chip

                  PROGRAM A
                  FINISHED
                
                  additions done: 0 / 16,777,216
                  0%
                
                  PROGRAM B
                  still running...
                
                  additions done: 0 / 16,777,216
                  0%
                
Same math. Same answer.
?
so where did the time go?

Two programs, the same additions on the same data. One blinks past the finish line; the other crawls. Nothing about the math explains it.

That little mystery hides a secret almost nobody gets taught: a computer isn't really a calculator. We picture it as a machine that does arithmetic, and the faster it does arithmetic, the faster our programs run. Feels obvious. It's also mostly wrong.

Your processor spends most of its life not calculating. It spends it waiting, waiting for a number to make its way in from somewhere far away.

This guide is the story of that waiting. Where it comes from. Why it can't be wished away. And the long, beautiful chain of tricks engineers built to hide it, caches, blocks, prefetchers, thousands of threads, each one a patch for the limits of the last.

We'll start from the most basic fact there is and build it up, one honest question at a time, until you can answer the one that decides whether real code is fast or slow: is my program limited by math, or by memory?

Every number in here is real, and you'll find exactly where each one comes from at the end.

Take your time. This one's meant to be felt, not skimmed.

Why is memory slow at all?
If fast memory exists, why isn't all memory fast?
What actually happens when you ask for one number?
How can a cache a thousand times smaller help almost every time?
When I read one byte, how much really moves?
Where does a block of memory live once it arrives?
What breaks when two cores touch the same memory?
If memory is so slow, how does anything run fast?
So, is my program limited by math, or by memory?
Watch it all collide: why a giant AI types one word at a time
The whole story, in one breath

1. Why is memory slow at all?

Let's start with something that sounds far too simple to be the foundation of anything: a number has to physically travel from where it is kept to where it is used.

And nothing in the universe travels instantly.

This is not a flaw that a clever engineer will someday patch. It's a wall built into reality itself.

Electrical signals move at a fraction of the speed of light, roughly 30 centimeters per nanosecond in a vacuum, and only about 18 centimeters per nanosecond when crawling down a real copper wire etched onto a circuit board.

Read that again, because it's the seed of the whole guide: distance is time.

A memory that sits farther away takes longer to reach. Not because it's badly made, but because the signal has farther to go.

The pioneering computer scientist Grace Hopper understood that people couldn't feel a nanosecond.

So she used to hand them a piece of wire about 30 centimeters long and call it exactly that, "a nanosecond," the distance light covers in one. She wanted the cost of distance to sit in your palm.

And once it clicks, it's a little jarring. A signal just can't cross a big chip, travel out to a memory module, and come back, anywhere near as fast as the processor can do a single piece of arithmetic.

Now layer history on top of physics.

For decades, two things improved at wildly different rates. Processors got faster at a breathtaking pace, while main memory's speed improved only slowly; its capacity grew huge, but the time to fetch any one item barely budged.

The two curves pulled apart year after year until the gap became the defining problem of computer performance.

It even got a name: the memory wall. Back in 1995, two researchers, Wulf and McKee, wrote a whole paper about it, with a wonderfully blunt title: "Hitting the Memory Wall: Implications of the Obvious."

Their point was simple, and honestly hard to argue with. If the processor keeps racing ahead of memory, then sooner or later the only thing that decides your speed is how long you sit around waiting for memory.

Two curves pulling apart: compute raced ahead while memory speed barely moved. The widening space between them is the memory wall.

And they were right.

Today a processor can perform an addition in less than a third of a nanosecond, but reaching all the way out to main memory and back costs roughly 60 to 100 nanoseconds.

In the time it waits for one number from main memory, it could have done hundreds of calculations. It just sits there instead.

Picture it. Imagine a chef who can chop an onion in the blink of an eye, superhuman, dazzling knife skills.

But the onions are kept at a store across town. No matter how impossibly fast she chops, her entire day is governed by driving to the store and back.

Her chopping speed was never the thing holding her up. The trips were. A processor is that chef, and main memory is that store.

So the first move practically writes itself: if a big, far-away memory is slow, then keep a small memory close by and stash the things you're using in it.

Simple.

But it instantly raises a sharper question, and the answer to it shapes the entire architecture of every computer you've ever touched.

If close, fast memory is so obviously wonderful, why on earth don't we just build the whole machine out of it?

2. If fast memory exists, why isn't all memory fast?

The answer comes down to how you store a single bit, a single 1 or 0.

There are really only two ways to do it, and they pull against each other in a way you can't escape.

The fast way uses six tiny transistors, wired into a little loop that holds its own value.

Two of them form a pair that keeps each other in place (one holds a value, the other holds the opposite, and each one stops the other from flipping), and two more act as doors that let you read or write the value when the cell is selected.

Because the loop holds itself, the value stays put with no extra effort, like a light switch that stays exactly where you flicked it.

This is SRAM (static RAM), and it's what your fast caches are built from.

It's gloriously quick, and it's bulky and expensive, because every single bit costs you six whole transistors.

SRAM vs DRAM
SRAM bit
6 transistors

                    keeps
                    "1"
                  
→
←

                    keeps
                    "0"
                  
a loop that props itself up
holds itself forever
fast, but 6x the transistors
DRAM bit
1 transistor + 1 capacitor

                      ↓leaks
                    
wordline
transistor
(door)
capacitor
(bucket)
cheap and dense
but forgets without constant refresh

              
            

Two ways to hold one bit. The six-transistor SRAM latch holds itself forever, the one-capacitor DRAM cell is tiny and cheap but leaks its charge away.

The cheap way uses just one transistor and one tiny capacitor, a tiny bucket that holds a little bit of electric charge.

A charged bucket means 1; an empty one means 0. The transistor is simply a door that connects the bucket to the outside world when you want to read or write it.

This is DRAM (dynamic RAM), and it's what your many gigabytes of main memory are made of.

One bit costs barely more than that single capacitor, so you can pack billions of them onto a chip for very little money.

That tiny leaking bucket is, and I mean this literally, the most important part in your whole computer.

And that leak causes three problems you can't get around, problems that end up shaping everything else:

It forgets. Charge slowly leaks out of the bucket, so every DRAM bit has to be topped up on a strict schedule before it fades away to nothing.

The standard (set by JEDEC, the group that defines memory) is to refresh every cell every 64 milliseconds.

The memory controller goes through the rows doing this over and over, and while a given row is being refreshed, nobody can read it. This "dynamic" forgetting-and-refreshing is exactly why it's called dynamic RAM.

Reading it destroys it. The charge in the bucket is so weak that to read it, the hardware tips it out onto a sensor, and that empties the bucket.

So after every single read, the value must be immediately written back to restore it. Every read is secretly a read-and-rewrite.

Filling it takes real time. A bucket doesn't fill up in an instant; it takes a moment to charge.

That little delay, more than anything else, is the real physical reason DRAM is slow to respond.

SRAM has none of these problems, no leaking, no refresh, no destructive reads, no slow fill.

So why not use it everywhere? Because at six transistors per bit, building gigabytes of SRAM would be wildly expensive and take up a huge amount of space.

The numbers force a compromise, and that compromise quietly defines all of computing: a huge, cheap, slow main memory made of DRAM, patched over by a tiny, expensive, fast cache made of SRAM.

Historically the cache has been around one-thousandth the size of main memory, and that ratio has stayed roughly true for decades.

Picture it. An SRAM bit is a light switch: flick it on and it stays on, instantly readable, never needs attention, but it's costly to install, so you can only afford a few.

A DRAM bit is a bucket with a pinhole leak: dirt cheap, so you can have billions, but a tireless helper must run around constantly refilling them before they drain (that's refresh), and the only way to check a bucket's level is to tip it out and quickly pour the water back (that's the destructive read).

Here's the part worth remembering: "fast memory is small" was never a choice anyone wanted to make.

It's a tax that physics charges us, and there's no refund.

And here's why that matters for everything ahead: it's exactly as true on a graphics card.

A GPU's massive global memory is DRAM, with all the same leaking and refreshing. Its lightning-fast registers and shared memory are SRAM.

When a GPU programmer says "shared memory is fast and global memory is slow," they are, without realizing it, simply restating "SRAM versus DRAM."

So we're stuck with slow DRAM for the bulk of our storage.

So the next thing to figure out is how you actually talk to it. Because DRAM, it turns out, has strong and surprising opinions about the order you ask for things.

3. What actually happens when you ask for one number?

Here's a fact that surprises almost everyone: DRAM isn't slow in the same way all the time.

Where your next request lands can change its cost by a factor of ten or more.

To really get why, deeply enough that it changes how you write code, you have to peek at what physically happens inside the chip.

The bits are arranged in an enormous grid of rows and columns. Crucially, you cannot reach in and pluck a single bit on its own.

To read anything, you must first open an entire row, a step that reads thousands of those weak buckets all at once and copies them into a special holding area called the row buffer.

This opening step is the slow, expensive part; it's where the waiting happens.

Only after a row is open can you quickly read any column you like from the row buffer.

And when you're done and want a different row, you must first close the current one (writing its contents safely back, since reading was destructive) before opening the next.

Opening a whole row is the slow part. Once it's sitting in the row buffer, tapping any column from it is cheap, so neighbors are nearly free, but jumping to a new row costs the full penalty again. Step through the three scenarios above to see each cost.

This one quirk is the whole personality of DRAM.

Opening a row is costly; reading more columns from a row that's already open is nearly free.

So if your next number happens to live in the same row you just opened, you get it almost instantly, a "row hit."

If it lives in a different row, you pay the full close-then-open penalty all over again, a "row conflict."

Going in order, neighbor by neighbor, gives you a lovely run of row hits. Jumping around all over the place gives you a painful stream of row conflicts.

Real memory systems claw some of that speed back by doing lots of things at once. It's worth seeing how they're layered, because these names show up everywhere.

A modern memory system is built in several independent layers, and each layer is a chance to do more at once:

Bank Parallelism
CH A
bank 0

                  row open (slow)
                
                  streaming
                
overlap
CH B
bank 1

                  streaming
                
                  row open
                
time
→
row open (slow, ~50 ns)
              
channel A streaming
              
channel B streaming

The system is layered so requests can overlap: while one bank slowly opens a row, another bank is already streaming data on another channel.

This is why a memory controller works so hard to spread your requests across many banks and channels.

While one bank is slowly opening a row, another can already be streaming data. Idle waiting in one place is hidden behind useful work in another, a theme that will return, much larger, in section 8.

One more bit of jargon, because people get this one wrong constantly.

DDR stands for Double Data Rate, and it means the memory hands over two chunks of data per clock tick instead of one (it uses both the rising and falling edge of the clock).

This doubles throughput, how much data arrives per second, but it does absolutely nothing for latency, how long the very first chunk takes to show up after you ask.

The bucket still takes exactly as long to read.

People mix up throughput and latency all the time, so let me just say it flatly: they are different things, and improving one does nothing for the other.

Each new DDR generation (DDR3, then DDR4 reaching 3,200 mega-transfers per second, then DDR5 reaching 4,800 and beyond) has mostly raised throughput; the latency to first byte has barely improved in twenty years.

How bad does random access get in practice?

In one careful measurement, a single isolated, scattered access kept the memory bus busy for only 2 cycles out of 7, meaning a bus rated at a theoretical 6.4 gigabytes per second delivered a real-world 1.8 gigabytes per second.

You only ever approach the number printed on the box with long, orderly, sequential streams that turn into row hits.

Picture it. A clerk fetching files from a cabinet has to heave open a heavy drawer (open the row, slow and effortful), then flip to the right folder inside it (pick a column, quick).

If your next file is in the same open drawer, she just flips to it, instant. If it's in a different drawer, she has to shove the first one shut and heave open another.

Ask her for files scattered across twenty drawers and she'll spend her whole day slamming drawers. Ask for twenty files from the same drawer, in order, and she opens it once and flips through them in seconds.

And if she has several drawers already open at her colleagues' desks (the banks), she can grab from one while another is still sliding open.

So memory rewards you, generously, for asking for things that sit next to each other, and for reusing what you've already opened.

Which sets up a small miracle that we now have to explain: how on earth can a cache that is a thousand times smaller than main memory manage to help on nearly every single access?

4. How can a cache a thousand times smaller help almost every time?

The answer is that real programs are not random.

They have a deep, reliable pattern to them, and a cache is really just a clever bet on that pattern.

The structure comes in two flavors, and once you see them you cannot unsee them.

The first: if you use a piece of data right now, you'll very probably use it again soon. A loop reading the same counter ten thousand times; a function called over and over.

The second: if you touch one location, you'll very probably touch its neighbor next. Stepping through an array element by element; reading the fields of a record one after another.

Computer scientists call these temporal locality and spatial locality, but you don't need the vocabulary.

You just need to feel the underlying truth: code is predictable, and a cache is a small, fast notepad that wins by betting on that predictability. It keeps the things you just used, and the things sitting next to them, close at hand.

How good is the bet? Let's actually do the arithmetic, because the answer is genuinely startling.

Suppose reaching main memory costs 200 cycles, reaching the cache costs 15, and your program works with 100 pieces of data, each used 100 times:

   Without any cache:
       10,000 accesses × 200 cycles                       = 2,000,000 cycles

   With a cache:
       100 first-ever loads      × 200 cycles (from RAM)   =    20,000 cycles
       the other 9,900 reuses    ×  15 cycles (from cache) =   148,500 cycles
                                                              ─────────────
                                                     total   =   168,500 cycles

   Improvement:   1 − (168,500 / 2,000,000)   =   a 91.5% reduction

A cache far smaller than the data cuts the memory time by more than nine-tenths, purely because the data gets reused.

That's the whole game.

And when you actually measure this, by slowly growing the amount of data a program juggles and timing each access, you don't get a smooth ramp.

You get flat shelves separated by sudden cliffs, the unmistakable fingerprint of a layered memory system:

Cache Hierarchy Latency
slower →
→
time per access, as the working set grows
amount of data the program works on →
L1 fills up
L2 fills up
~4 cycles
fits in L1 (nearest)
~12 cycles
spills to L2
~200+ cycles
spills to DRAM

              
            

Not a smooth ramp. Flat shelves and sudden cliffs. As long as the data fits a level you stay fast; cross the size boundary and you fall to the next, much slower one.

As long as your working data fits inside a given level, you stay on a flat, fast shelf.

The instant it grows too big to fit, you tumble off a cliff to the next level down, and the drop is not a few percent, it's a multiple.

Those numbers on the left are real and typical of a modern processor: roughly 4 cycles to reach the nearest cache, about 12 for the next, around 40 for the last-level cache, and a few hundred cycles out to main memory.

Each step out is a different shelf.

It's worth knowing why a cache miss happens at all, because there are exactly three reasons (a classic list every chip designer knows by heart):

   The three reasons a cache misses:

   • COMPULSORY - the very first time you ever touch a piece of data, it can't
                  possibly be cached yet. Unavoidable; you pay it once.
   • CAPACITY   - your working set is simply bigger than the cache, so older data
                  got pushed out to make room and now you want it back.
   • CONFLICT   - the data WOULD have fit, but too many items map to the same
                  small region of the cache and evicted each other. (More on this in §6.)

This whole idea, every level, every kind of miss, compresses into one line that every performance engineer keeps loaded in their head at all times, the average memory access time:

   average time  =  (time for a hit)  +  (chance of a miss × penalty for a miss)

Keep more of your working data in cache (lower the miss chance), or hide the misses you can't avoid (shrink the penalty, which is exactly what section 8 is about), and everything you run gets faster.

That one little equation is, quietly, the entire job of performance engineering.

Picture it. You never carry all your toys around the house. You keep the two or three you're playing with right now in your lap, and the rest stay in the toybox across the room.

Because you tend to keep replaying the same few favorites (you'll want them again soon) and tend to grab toys that were lying near each other (neighbors), keeping just a small handful in your lap saves you almost every single trip back to the box.

A cache is your lap. The toybox is main memory.

But notice we've been quietly vague about one crucial detail.

A cache keeps data "close", but in what units? Does it track one byte at a time? One number?

The answer is the single most useful fact in this whole guide. It explains more everyday slowdowns than anything else here.

5. When I read one byte, how much really moves?

Not one byte. Sixty-four of them. Every single time.

Memory doesn't move one byte at a time. It moves in fixed-size blocks called cache lines, and on virtually every PC and server processor a line is 64 bytes wide.

When you ask for a single number, the hardware quietly grabs the whole 64-byte block around it and hauls the entire thing into the cache at once.

The byte you actually wanted and its 63 neighbors all arrive together, whether you asked for them or not.

Cache Line Fetch
Main Memory

              ↓↓
            
Cache

Ask for one byte (blue cell) and the hardware hauls in the whole 64-byte line, all 64 cells, betting you'll want a neighbor next.

This is why opening a DRAM row pays off: you fill an entire cache line from one open row in a single burst.

But it has a stunning consequence that should change how you write loops forever: reading 1 byte and reading 64 neighboring bytes cost exactly the same.

You already paid for all 64 the moment you touched one. Which means the order in which you walk through memory silently decides your speed:

walking in order

0% wasted

every byte you paid for is used

jumping with a big stride

~88% wasted

+ a fresh slow miss for each block

Same 64-byte cost either way. Used in order, the whole line is useful; strided, you use one cell of every line and throw the rest away.

Remember the riddle this guide opened with, two programs, same math, one of them many times slower? Here it is, finally solved, and it's the most famous demonstration in all of performance programming.

In most languages, a two-dimensional grid of numbers is laid out in memory one full row after another.

Walking across a row marches you through neighboring bytes, beautiful. Walking down a column leaps across an entire row's worth of memory on every step, landing on a brand-new cache line nearly every time, disastrous.

Both loops (you'll see the exact, runnable code further down, where I measure them) compute the identical sum with the identical number of additions. The processor does exactly the same amount of arithmetic.

And yet on a real machine the second version routinely runs several times slower, sometimes by an order of magnitude, for one reason only: it wastes most of every cache line it drags in.

Nothing about the math changed. Only the path through memory did.

This same idea changes how experienced programmers lay out their data.

Consider storing a thousand particles, each with a position and a velocity. You can store them as an array of structures (each particle's fields bundled together) or a structure of arrays (all positions together, all velocities together):

AoS vs SoA
Array of Structures — fields bundled per particle
...
1 cache line (64 bytes)
2 of 8 cells useful — rest loaded but never read (75% wasted)
Structure of Arrays — all of one field together
...
1 cache line from xs[ ] (64 bytes)
8 of 8 cells useful — every byte is an x value (zero waste)

              
            

Same data, two layouts. Which one wastes cache lines depends entirely on which fields you touch together.

Neither is "correct": the right choice depends entirely on which fields you touch together.

But you can only make that choice well once you truly feel that memory moves in 64-byte blocks and the goal is to use the whole block.

And this is the bridge straight to the GPU, where the very same law wears a different name: coalescing.

A graphics card runs its threads in tight bundles of 32 (a "warp"), all moving in perfect step, and it fetches memory in small fixed chunks.

When the 32 threads of a warp ask for neighboring addresses, the hardware merges their requests into a few big, efficient transfers.

When those same 32 threads ask for scattered addresses, the request shatters into lots of small wasteful ones, and the warp slows to a crawl.

"Make neighboring GPU threads read neighboring memory" and "respect the cache line" are not two rules; they are the same rule of physics, on two different chips.

Picture it. You need one egg for a recipe. But the store doesn't sell single eggs; it only sells the carton of twelve. So you carry home twelve every time, like it or not.

If your recipe uses all twelve, wonderful, nothing wasted. But if you use one egg, throw the other eleven in the bin, then drive back to the store for another whole carton just to use one more egg… you've thrown away almost everything and made a dozen pointless trips.

Memory sells 64-byte cartons. The entire art is arranging your work to use the whole carton before you reach for the next.

Proof on my own machine (RTX 3060)

I didn't want to just tell you this. So I wrote two tiny benchmarks and ran them on my own computer, which has an RTX 3060. Here's what actually came back.

On the CPU, the same sum over a 128 MiB grid of numbers, the only difference being whether the inner loop walks rows (contiguous) or columns (strided):

row-major  (contiguous): 0.015 s
column-major (strided) : 0.107 s
slowdown               : 7.2x

Same math. Same answer. 7.2 times slower, purely from touching memory in the wrong order. That's the riddle from the top of the page, measured.

On the GPU, the same copy kernel, run with a growing stride so the 32 threads of a warp scatter further and further apart:

stride |  useful GB/s   (RTX 3060, peak ≈ 360 GB/s)
-------+-------------
   1   |     290.4      ← neighbors → fully coalesced
   2   |     136.0
   4   |      70.8
   8   |      34.1
  16   |      31.9
  32   |      25.5      ← fully scattered

When neighboring threads read neighboring memory (stride 1), the card hits about 290 GB/s, roughly 80% of its theoretical peak. Scatter those same reads (stride 32) and the useful bandwidth collapses to about 25 GB/s, more than 10 times slower, for the exact same number of reads. The bytes wasted in every 32-byte chunk are the whole difference.

Here is the exact code that produced those numbers, nothing hidden. The CPU one first:

/* cache_line_demo.c: same matrix, same additions; the only difference is the
 * ORDER we touch memory. Build: gcc -O2 -o cache_line_demo cache_line_demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096                 /* 4096*4096 doubles = 128 MiB, far bigger than any CPU cache */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof(double));
    if (!a) { perror("malloc"); return 1; }
    for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1.0;   /* touch every page */

    volatile double sink = 0.0;
    double t0, t1;

    /* ROW-MAJOR: inner loop walks neighbors → contiguous → cache-friendly */
    double sum1 = 0.0;
    t0 = now_sec();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum1 += a[(size_t)i * N + j];
    t1 = now_sec();
    sink += sum1;
    double row_time = t1 - t0;

    /* COLUMN-MAJOR: inner loop jumps a whole row → strided → cache-hostile */
    double sum2 = 0.0;
    t0 = now_sec();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum2 += a[(size_t)i * N + j];
    t1 = now_sec();
    sink += sum2;
    double col_time = t1 - t0;

    printf("row-major  (contiguous): %.3f s\n", row_time);
    printf("column-major (strided) : %.3f s\n", col_time);
    printf("slowdown               : %.1fx\n", col_time / row_time);
    printf("(checksum %.0f, ignore)\n", (double)sink);
    free(a);
    return 0;
}

And the GPU one, the same copy kernel launched at growing strides:

/* coalesce_demo.cu: same useful work at every stride; only the access pattern
 * changes. Build: nvcc -O3 -arch=sm_86 -o coalesce_demo coalesce_demo.cu */
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(x) do { cudaError_t err__=(x); if(err__){ \
    printf("CUDA error %s:%d: %s\n",__FILE__,__LINE__,cudaGetErrorString(err__)); return 1;} } while(0)

/* thread idx touches element idx*stride → neighbors scatter as stride grows */
__global__ void strideCopy(float* out, const float* in, long n, int stride) {
    long idx = (long)blockIdx.x * blockDim.x + threadIdx.x;
    long i = idx * stride;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main(void) {
    const long baseN = 1L << 24;          /* 16.7M threads do useful work each launch */
    const int  maxStride = 32;            /* one warp = 32 threads */
    const long bytes = baseN * maxStride * sizeof(float);   /* ~2.1 GB per buffer */

    float *d_in, *d_out;
    CHECK(cudaMalloc(&d_in,  bytes));
    CHECK(cudaMalloc(&d_out, bytes));
    CHECK(cudaMemset(d_in, 1, bytes));

    int threads = 256, blocks = (int)((baseN + threads - 1) / threads);
    cudaEvent_t s, e;
    CHECK(cudaEventCreate(&s));
    CHECK(cudaEventCreate(&e));

    printf("stride |  useful GB/s\n");
    for (int stride = 1; stride <= maxStride; stride *= 2) {
        long n = baseN * stride;
        strideCopy<<<blocks, threads>>>(d_out, d_in, n, stride);   /* warm up */
        CHECK(cudaDeviceSynchronize());

        const int reps = 30;
        CHECK(cudaEventRecord(s));
        for (int r = 0; r < reps; r++)
            strideCopy<<<blocks, threads>>>(d_out, d_in, n, stride);
        CHECK(cudaEventRecord(e));
        CHECK(cudaEventSynchronize(e));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, s, e));
        double useful_gb = (double)reps * baseN * 2 * sizeof(float) / 1e9; /* read+write */
        printf("  %2d   |   %7.1f\n", stride, useful_gb / (ms / 1e3));
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

That -arch=sm_86 matters: it builds natively for the 3060's Ampere chip. Leave it off and you may hit a "PTX was compiled with an unsupported toolchain" error. The exact numbers wobble a few percent from run to run (GPU clock boost, system noise), but the pattern is the same every single time.

So a 64-byte block arrives at the cache.

And immediately a new question appears, one with a surprisingly elegant answer: out of all the slots in the cache, which one does this block get to live in?

6. Where does a block of memory live once it arrives?

This is a real engineering dilemma, and watching how it gets solved is a small, satisfying lesson in trade-offs all by itself.

Consider the two extreme answers.

The first: give every possible memory address exactly one permitted slot in the cache.

Finding a block is then instant; there's only one place to look, so a single check tells you if it's there.

But it's terribly fragile. Two pieces of data your program leans on constantly might happen to be assigned the very same slot, and they'll spend the whole program kicking each other out, over and over, like two kids assigned the same coat hook, each knocking the other's coat to the floor.

These are the conflict misses from section 4, and they can cripple a program even when the cache had plenty of room overall.

The opposite extreme: let any block live in any slot, anywhere in the cache.

Now there are no forced collisions at all; a block only gets evicted if the cache is genuinely full. Wonderful for hit rates.

But finding a block now means the hardware has to search every slot simultaneously to see if it's present, which is way too slow and power-hungry to build for a big cache.

So real caches take a sensible middle road, called set-associative.

A block isn't tied to one single slot, nor allowed to roam everywhere. Instead it's assigned to one small set of slots, say 8 of them, and may live in any slot within that set.

To find it, the hardware checks just those 8 in parallel: fast enough to be practical, while collisions become rare enough to ignore.

Cache Placement
where can an incoming block live?
DIRECT-MAPPED
one slot only
fast lookup
but high collision risk
SET-ASSOCIATIVE
any slot within one set
the practical sweet spot
(8-way is standard)
FULLY ASSOCIATIVE
any slot at all
zero forced collisions
but comparator cost explodes
Real caches use ~8 slots per set: parallel comparison is fast and collisions become rare enough to ignore.

              
            

Two bad extremes and the wise compromise in between: a block maps to one small set and may sit in any slot within it.

To make this work, the address itself gets split into three parts, and once you see the split, the whole thing clicks:

Memory Address Fields
one memory address, sliced into three fields

                tag
                high bits
              

                set index
                middle bits
              

                offset
                low 6 bits
              
confirms you found
the right block
picks which
set of slots
byte inside
the line (2^6=64)

              
            

The low 6 bits pick the byte inside the line; the next bits pick the set; the high bits are the tag the hardware compares to confirm a hit.

The low 6 bits pick the byte inside the 64-byte block (since 2 to the 6th is 64).

The next chunk of bits picks which set.

And the remaining high bits are stored alongside the cached block as a name-tag, so that when you come looking, the hardware can compare tags and confirm it found the exact block you meant rather than a different one that maps to the same set.

And the sizes all tie together in one neat little formula worth remembering:

   total cache size  =  line size  ×  slots per set  ×  number of sets

So a 4-megabyte cache with 64-byte lines and 8 slots per set has exactly 4 MB ÷ (64 × 8) = 8,192 sets, and a lookup compares just 8 name-tags in parallel, easily fast.

(The tiniest, most speed-critical structures, like the address-translation table we'll meet in section 8, sometimes do use the "search everything" fully-associative design, precisely because they're small enough to afford it.)

One last wrinkle: when a set is full and a new block needs in, which of the existing blocks gets evicted?

Ideally you'd evict the one you'll need again furthest in the future, but the hardware can't see the future.

So it approximates with a sensible rule of thumb: evict the least recently used block, betting (using that same idea about reuse) that whatever you haven't touched in a while, you probably won't need soon.

True least-recently-used tracking is expensive to build precisely, so real chips use clever cheap rough versions of it. The principle, though, is pure section 4: bet on predictability.

Picture it. "One hook per kid" means two kids assigned the same hook spend recess knocking each other's coats onto the floor.

"Hang it literally anywhere" means no fights, but the teacher has to scan the entire room every time to find a coat.

The real classroom does the sensible thing in between: your class gets a row of eight hooks: hang your coat on any of the eight, and the teacher only ever has to check your class's row. Fast to find, and fights are rare.

And when all eight hooks are taken and a new kid arrives, the coat that's been hanging there longest without anyone touching it comes down first.

So far, this whole story has quietly assumed a single worker.

But every modern processor, and certainly every GPU, has many cores, each with its own private cache, each potentially holding a copy of the very same 64-byte block.

What happens when two of them reach for the same memory at the same time?

7. What breaks when two cores touch the same memory?

Something sneaky, and, the first time it bites you, genuinely maddening, because your program is correct and yet slow for no obvious reason.

Start by realizing that writing is just as block-based as reading.

To change a single byte, a core pulls the whole 64-byte line into its private cache, edits it there, marks the line "dirty" (meaning it differs from main memory), and only writes it back out later.

On a lone core this is pure efficiency: you batch up changes and avoid touching slow memory until you must.

But now picture two cores, each holding its own cached copy of the same line.

The moment one core writes to its copy, every other copy is instantly wrong, silently, dangerously out of date.

Hardware absolutely refuses to allow that.

The cores constantly watch each other's memory traffic (this is called snooping), and the instant one core writes to a line, all other copies of that line are marked invalid and must be thrown away and re-fetched.

There's a well-known scheme that keeps track of each line's status, whether it's been changed, is held by just one core, is shared by a few, or is invalid, and shuffles lines between caches so there's always one source of truth.

The name of the scheme doesn't matter; what matters is that this coordination is real work, and it happens over the connections between cores.

False Sharing
the same 64-byte line, cached in two cores
CORE 1
line: SHARED
writes c.a++
"I changed it!"
→
CORE 2
line: SHARED
writes c.b++
The line ping-pongs between cores thousands of times per second. That is false sharing.

              
            

Even when two cores touch different variables, if those variables share one line, each write invalidates the other's copy and the line bounces back and forth.

All of this stays completely invisible, until it quietly destroys your performance through a trap with an innocent-sounding name: false sharing.

struct counters { long a; long b; };   // a and b are neighbors → SAME 64-byte line
struct counters c;

// Core 1 spends all day doing:   c.a++;     // it only ever touches 'a'
// Core 2 spends all day doing:   c.b++;     // it only ever touches 'b'

These two cores are working on completely different variables. Logically, they share nothing and should never interfere.

But because a and b happen to sit next to each other, they land in the same 64-byte line, and so every time Core 1 writes a, it invalidates Core 2's entire cached copy of that line (including b), and vice versa.

The line gets yanked back and forth between the two cores, over and over, thousands of times a second.

The result is genuinely shocking the first time you see it: code you carefully made "parallel" runs slower than if you'd just used one core.

The cure is almost comically simple once you understand the cause: shove the two variables onto separate lines with a little padding between them:

struct counters { long a; char gap[56]; long b; };   // assuming 8-byte long: 8 + 56 = 64 → 'b' starts on the next line

There's a deeper, almost poetic truth hiding here.

The 64-byte block gives (neighbors ride along for free: that's spatial locality, your friend) and the very same block takes away (unrelated data forced to share it collide: that's false sharing, your enemy).

The block is simultaneously the unit of efficiency and the unit of sharing. You cannot have one property without the other; they are the same coin.

It gets one layer harder on big machines.

When a computer has multiple processor sockets, memory is physically split, some attached close to one socket, some to another.

A core can reach its own socket's memory quickly, but reaching the other socket's memory means a longer journey across a link between them, at noticeably higher latency.

This is called non-uniform memory access (NUMA), and it means that on a large server, where your data physically lives relative to the core using it becomes yet another performance lever.

The distance tax from section 1 returns, now measured in centimeters of motherboard.

Picture it. Two kids each have their own box of crayons, completely separate sets. But they're forced to keep both boxes in one shared drawer.

Every single time one kid opens the drawer to grab a crayon, the other kid has to stop what they're doing, look over, and double-check that none of their crayons got bumped or moved.

They never once use the same crayon, but the mere fact of sharing the drawer makes them trip over each other endlessly, all day long.

Give each kid their own separate drawer (that's the padding), and suddenly both of them fly, never bothering each other again.

By now we've assembled an impressive toolkit: caches, locality, blocks, smart placement, and coherent sharing.

And yet, that fundamental ~100-nanosecond wait to reach main memory is still sitting there, unavoidable, every single time we miss in every cache.

So here's the question that should be nagging at you: if that long wait can't be removed, how does any real program ever manage to feel fast?

8. If memory is so slow, how does anything run fast?

It comes down to the most beautiful idea in the whole story: you don't make the wait shorter.

You make the wait overlap with useful work, until it stops costing you anything at all.

Think about waiting for a kettle to boil.

You don't stand frozen, staring at it, doing nothing for two minutes. You butter the toast, fetch the mugs, crack the eggs, all while the kettle heats.

The total time shrinks not because anything got faster, but because several slow things happened at the same time.

A processor does exactly this with memory.

Rather than firing off one request and idly waiting 100 nanoseconds for it to return, it fires off a request and immediately fires off the next, and the next, keeping dozens or even hundreds of memory requests on the move at the same time.

The slowness of any one request is still there; it's just hidden behind the slowness of all the others happening in parallel.

Chip designers call the number of requests you keep on the move your memory-level parallelism.

Two quiet helpers create this overlap automatically on a CPU.

The first is the prefetcher: a small circuit that watches how you're reading memory, notices when you're moving steadily through it in a regular pattern, and fetches the upcoming blocks before you've even asked for them.

This is the real, secret reason the orderly loop back in section 5 was so fast: the prefetcher saw you walking forward and ran ahead to have the next blocks waiting.

(Its one firm limitation: it generally won't leap past a memory-page boundary, because the next page might not belong to your program at all, and guessing wrong could cause real trouble.)

The second helper is a tiny, blazing-fast lookup table called the TLB (translation lookaside buffer).

Every address your program uses is a "virtual" address that must be translated into a real physical location in the memory chips, via large lookup tables called page tables.

Looking through those tables is slow, so the TLB keeps the most recent translations handy.

But it's small, typically only a few dozen entries at its fastest level, so if your program leaps around across too many different regions of memory at once, the TLB overflows and starts missing, and your program can slow to a crawl even when all of its actual data is sitting right there in the cache.

(This is exactly why "huge pages" exist: by making each page cover far more memory, a single TLB entry reaches much further, and the table stops overflowing.)

Now here is the calculation that ties this entire guide back to the fundamental nature of speed, and it's worth doing slowly, because it's one of the most eye-opening numbers in all of computing.

To keep a pipe completely full, the amount of stuff you need moving through it at any moment equals how fast the pipe delivers multiplied by how long each item takes to traverse it.

(This is a deep, general law of queues, often named after the mathematician John Little.)

Apply it to the memory of a common graphics card, the RTX 3060, which delivers about 360 billion bytes every second, where each round trip takes roughly 100 nanoseconds, and each trip carries one 64-byte block:

   bytes "in flight" to stay full  =  360 GB/s  ×  100 ns  ≈  36,000 bytes
   number of 64-byte requests      =  36,000 ÷ 64           ≈  560 requests

Memory Pipe
to keep memory busy, the pipe must stay full

              cores issue →→ memory
            
~560 requests must be in flight at every instant

Bandwidth times latency equals the amount that must be traveling at once. One thread issues about one request, so you need an army of threads just to keep the pipe full.

This is the real reason, not raw arithmetic, that a GPU runs tens of thousands of threads at once.

It needs that vast crowd not because it has that much math to do, but because while any one thread is stalled waiting for memory, thousands of others can run, and their collective waiting overlaps into one continuous stream of useful work.

On a GPU, occupancy, keeping enough threads resident to hide the latency, is the prefetcher.

And the same arithmetic scales straight up the product line: an A100 moves about 2 trillion bytes per second, an H100 about 3.35 trillion, each feeding many thousands of compute lanes, and the number of requests they must keep in flight to stay busy grows right alongside their bandwidth.

Picture it. A thoughtful big sibling notices you're working through a series of books and quietly pulls the next one off the shelf before you've finished the one in your hands, so you never sit waiting between books (that's the prefetcher running ahead).

They won't run off into a different room to guess which book you'll want next, though (that's the page boundary it won't cross).

And to find any book at all, they keep a small index card mapping "the dinosaur book" to "shelf 3, slot 7" (that's the TLB), quick and effortless, right up until you start asking for so many brand-new books that they have to keep going back to flip through the giant master catalog (that's the TLB overflowing).

So now we know why memory is slow, how that slowness gets hidden, and what it takes to keep it busy.

Which finally lets us ask, and actually answer, the one question that matters most when you sit down to make real code fast.

9. So, is my program limited by math, or by memory?

This is the question the whole guide has been building toward, and now we can answer it for real, instead of guessing.

For any piece of code, ask a single thing: how much arithmetic does it do for each byte it moves?

This ratio has a name, arithmetic intensity, and it's just the number of math operations divided by the number of bytes transferred.

A program with high intensity does a lot of computing on each byte it fetches, so it can keep its math units truly busy.

A program with low intensity barely glances at each byte before it needs a fresh one, so it spends its life waiting on memory, exactly the wait this whole guide has been describing.

Let's make it concrete with three real workloads, because the contrast is the whole lesson:

Workload	Math done	Bytes moved	Arithmetic intensity
Scale-and-add (y = a·x + y)	2 ops per element	~12 bytes per element	~0.17 → tiny → MEMORY-bound
Big matrix multiply	~2·N³ ops	~N² of data	grows with N → MATH-bound
AI text generation	~2 ops per weight	2 bytes per weight	~1 → tiny → MEMORY-bound

The scale-and-add does two operations but has to touch three numbers (read two, write one), barely any math per byte, so it's hopelessly memory-bound; a faster math unit would do nothing for it.

A large matrix multiply is the opposite: it does an enormous N³ amount of math while only moving N² worth of data, so as the matrices grow it becomes gloriously math-bound.

Which is exactly why matrix multiply is the beloved workhorse of GPUs, and why so much effort goes into expressing AI as big matrix multiplies.

(We'll come back to that third row, AI text generation, in the next section; it's the punchline.)

There's a wonderfully simple picture that captures all of this at once, called the roofline.

Your best achievable speed is whichever of two ceilings you bump into first: the raw top speed of your math units, or your arithmetic intensity multiplied by your memory bandwidth.

ROOFLINE MODEL

        attainable speed
        arithmetic intensity (math/byte)
        
        slope = bandwidth
        compute roof (peak FLOP/s)
        ridge point
        MEMORY-bound
        MATH-bound
        LLM gen (~1)

Land on the left ramp and you are memory-bound - bandwidth sets your speed. Land on the flat roof and you are compute-bound. LLM generation sits on the ramp.

If your code lands on the slanted left part of the roofline, you are memory-bound: your speed is set entirely by how fast you can move bytes, and buying faster math units does absolutely nothing.

If it lands against the flat top, you are math-bound, and only then does raw compute power help you.

The "ridge point" where the ramp meets the roof is itself just the machine's peak compute divided by its peak bandwidth; for a typical desktop GPU it sits somewhere around a few dozen operations per byte, meaning anything below that intensity is memory-bound.

And the uncomfortable truth is that a huge chunk of real-world code lives down on that left ramp.

Picture it. Imagine a bakery with a hundred ovens but only one narrow doorway.

If trays of dough can barely squeeze through that single door, then buying more ovens changes nothing at all: the ovens sit half-empty because dough can't get to them fast enough.

The bakery is doorway-limited, not oven-limited. The only ways to bake more bread are to widen the door (more bandwidth) or to load more dough onto each tray that goes through it (higher arithmetic intensity).

Staring at your idle ovens and ordering more is the single most common, most expensive performance mistake there is.

So before optimizing anything, you find where you sit on the roofline.

Get that wrong and you can pour weeks into speeding up math units that were never the bottleneck.

Now let's watch this exact drama play out in the most important computing workload of our moment.

10. Watch it all collide: why a giant AI types one word at a time

You've surely noticed that when a large language model answers you, it doesn't deliver the whole response at once: it types, word by word, at a visible pace.

That pace isn't a design choice or some fake animation.

It is the memory wall, made visible to your own eyes. Everything in this guide converges right here.

To generate each new word, the model must read every one of its weights, the billions of numbers that encode everything it knows, and do a little arithmetic with each.

Then, to generate the next word, it reads all of those weights again. And again, for every single word.

Here's the brutal math, and it's just sections 5 and 9 applied honestly:

   For each new word the model generates (at the common batch-of-one case):

   • it READS every weight once           →  about 2 bytes per weight (a 16-bit number)
   • it does about 2 math ops per weight   →  one multiply, one add

   arithmetic intensity  ≈  (2 ops) ÷ (2 bytes)  ≈  1 operation per byte
                                                     └─► far below the ridge → MEMORY-BOUND

An intensity of about 1 operation per byte puts this workload deep on the left ramp of the roofline.

The model's speed has almost nothing to do with how much math the chip can do, and almost everything to do with how fast it can stream the weights out of memory.

So we can predict the typing speed with nothing more than a division:

   words per second  ≈   memory bandwidth  ÷  (bytes that must be read per word)

   A 7-billion-weight model, at 2 bytes each, is about 14 gigabytes to read per word.

   • On an RTX 3060   (≈ 360 GB/s):     360 ÷ 14   ≈   ~25 words per second
   • On an H100       (≈ 3,350 GB/s):  3,350 ÷ 14   ≈   ~240 words per second

That's the whole reason the same model "types" several times faster on a high-end data-center card than on a desktop one.

It is not mainly that the big card computes faster (it does, but that barely matters here); it's that the big card has far more memory bandwidth to stream the weights.

The typing speed is a bandwidth meter you can watch with your own eyes.

And every other concept from this guide shows up in the engineering response:

The problem (memory-bound generation)	The fix, and which section it comes from
Reading 2 bytes per weight is too much	Use 1 byte, or less, per weight ("quantization") → fewer bytes to move (§5, §9)
One word at a time wastes the weight read	Generate for many users at once ("batching"): read each weight once, use it for everybody → higher intensity (§9)
The growing "memory" of the conversation (the KV cache) is re-read each step	Keep that running state laid out contiguously and read it in neighborly order (§3, §5)
The attention step kept dumping a huge scratch matrix to slow memory	"Flash attention": fuse the steps and keep the scratch data in fast on-chip SRAM instead of streaming to DRAM and back (§2, §4, §8)

Look closely at that last row, because it's the purest possible expression of this entire guide.

The famous "flash attention" technique, which made long-context AI practical, is not a cleverer piece of math. It gives the very same result as before.

Its whole insight is to stop moving data needlessly: instead of writing a giant intermediate scratch matrix all the way out to slow DRAM and immediately reading it back, it reorganizes the work to keep that scratch data in the tiny, fast on-chip SRAM (the shared memory from section 2) and never spills it.

It is, start to finish, a data-movement optimization.

The math was never the problem. The movement was. Which is, of course, the one thing this whole guide has been trying to make you feel.

Picture it. Imagine one librarian who must, to write each single word of a story, run the entire length of a vast library, touch all million books, and only then write that one word; then run the whole length again for the next word.

It's absurd, and it's slow, and it's slow no matter how fast the librarian can write.

The only real fixes are to make the library smaller (quantization), to have the librarian write a word for a hundred readers on each pass instead of one (batching), or to stop sending them on pointless extra laps (flash attention).

Speeding up their pen, buying more compute, does almost nothing.

11. The whole story, in one breath

Trace the whole path we just walked, and feel how each step forced the next into existence.

Distance costs time, because no signal moves infinitely fast, so engineers kept a little memory close to the processor.

Close, fast memory is built from expensive six-transistor cells while cheap memory is built from leaky one-capacitor cells, so the fast memory had to be tiny and the bulk of it slow.

That slow memory punishes scattered access and rewards orderly sequences, because of the cost of opening a row, so we learned to lean hard on the fact that real programs reuse their data and touch neighbors.

To exploit that reuse, memory is moved in fixed 64-byte blocks, so the order in which you touch memory came to decide your speed, and those blocks needed a home (associativity), and sharing a block between cores turned out to corrupt it (coherence and false sharing).

And because, after all of that, the fundamental wait was still there, we learned to hide it by keeping hundreds of requests traveling at once, so, in the end, whether your program is fast or slow comes down to a single ratio: how much math you do for every byte you move.

And when you watch a giant AI type one word at a time, you are watching that ratio, live.

The one sentence worth keeping forever:

A processor is a machine built to hide the wait for slow, far-away memory. So stop counting the operations your code performs, and start counting the bytes it moves, the distance each one has to travel, and how many you can keep in flight at the same time.

That is the whole secret.

Everything else, every cache, every block, every thread, every clever trick with an acronym, is just a detail built on top of it.

Sources & verification notes

Every claim above comes from established sources. A few numbers are rounded and flagged here so you know what is exact and what is representative.

Ulrich Drepper, "What Every Programmer Should Know About Memory" (2007) is the source for the core CPU facts: the 6-transistor SRAM versus 1-transistor-1-capacitor DRAM cell, the destructive read and write-back, the 64 ms refresh, the 64-byte line and its low 6-bit offset, the open-row RAS/CAS mechanism and row buffer, the 200-vs-15-cycle example giving a 91.5% reduction, the cache size = line × associativity × sets relation, and the 6.4 GB/s bus delivering only ~1.8 GB/s under scattered access (full PDF).
Wulf & McKee, "Hitting the Memory Wall" (1995) named the memory wall.
The three cache-miss categories (compulsory, capacity, conflict) are the standard grouping from architecture literature (Hill; Hennessy & Patterson).
The roofline model (Williams, Waterman & Patterson, 2009) gives arithmetic intensity and the compute-bound versus memory-bound split.
Flash attention (Dao et al., 2022) is the IO-aware reformulation that keeps attention intermediates in on-chip SRAM to cut reads and writes to slow memory.
GPU bandwidth figures are vendor specifications, rounded: RTX 3060 ≈ 360 GB/s; A100 (80 GB) ≈ 2.0 TB/s; H100 (SXM) ≈ 3.35 TB/s.
Little's Law (in-flight bytes = bandwidth × latency) is standard queueing theory; the ~560-request figure is illustrative, computed as 360 GB/s × ~100 ns ÷ 64 bytes, and real GPU latency varies.
Representative, rounded figures: the ~30 cm/ns speed of light and ~18 cm/ns in copper; the ~60 to 100 ns DRAM round-trip; cache latencies (~4, ~12, ~40 cycles, a few hundred for DRAM); DDR4/DDR5 rates and 16/32 banks per chip; TLB entry counts; and the LLM tokens-per-second estimates (idealized batch-of-one, 16-bit weights). The principles are exact; the specific numbers vary by chip.

← Back to blogs