Parallelism: hardware doesn't create parallelism

May 27, 2026

Diagram: parallelism comes from independence; uniform execution amortizes overhead, while dependencies force coordination and reduce scalability.
Parallelism comes from independence in the workload; hardware only consumes what's already there.

One of the deepest realizations I'm having about systems / GPU / distributed computing:

Hardware does NOT create parallelism.

It only consumes parallelism that already exists in the data / workload.

That changes everything.

Parallelism fundamentally comes from:

Not from:

Those are merely engines that exploit independence efficiently.

The deepest systems law

Independent work → scales well

Dependent work → coordination cost appears

And coordination creates:

That's why systems engineering becomes hard.

Amdahl's law makes this concrete. If a fraction $f$ of the work is parallelizable and the rest $(1-f)$ is serial, then with $p$ workers:

$$ S(p) \;=\; \frac{1}{(1-f) \;+\; \dfrac{f}{p}} \quad\xrightarrow{p \to \infty}\quad \frac{1}{1-f} $$
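To get a feel for how quickly that ceiling bites, here is a minimal Python sketch of the formula above. The function names and the sample value $f = 0.95$ are my own illustrative assumptions, not measurements.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Speedup S(p) = 1 / ((1 - f) + f / p) for parallel fraction f and p workers."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if p < 1:
        raise ValueError("p must be >= 1")
    return 1.0 / ((1.0 - f) + f / p)


def amdahl_limit(f: float) -> float:
    """Limit of S(p) as p grows without bound: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    f = 0.95  # assumed for illustration: 95% of the work is parallelizable
    for p in (1, 8, 64, 1024):
        print(f"p={p:5d}  S(p)={amdahl_speedup(f, p):6.2f}")
    print(f"ceiling   S   ={amdahl_limit(f):6.2f}")
```

Even with 95% of the work parallel, the serial 5% caps the speedup at $1/(1-f) = 20\times$, no matter how many workers you add.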

Distributed inference, suddenly clear

Single-GPU inference:

Distributed inference:

Now GPUs become dependent workers. And dependent workers create:

Which is why scaling distributed systems is hard.

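The communication cost is also a formula. With $\alpha$ the per-message latency and $\beta$ the transfer time per element, a ring all-reduce of an $m$-element tensor (activations in inference, gradients in training) across $p$ GPUs takes roughly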

$$ T_{\text{allreduce}}(p, m) \;\approx\; 2(p-1)\,\alpha \;+\; 2\,\frac{p-1}{p}\,\beta\,m $$
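The two terms in that formula behave very differently as $p$ grows, which a small Python sketch makes visible. I am reading $\alpha$ as per-message latency and $\beta$ as transfer time per element; the numeric values below are illustrative assumptions, not benchmarks of any real interconnect.

```python
def ring_allreduce_time(p: int, m: float, alpha: float, beta: float) -> float:
    """Cost model T ~ 2(p-1)*alpha + 2*((p-1)/p)*beta*m for a ring all-reduce.

    p:     number of GPUs in the ring
    m:     number of elements being reduced
    alpha: per-message latency in seconds (assumed meaning)
    beta:  transfer time per element in seconds (assumed meaning)
    """
    if p < 1:
        raise ValueError("p must be >= 1")
    latency_term = 2 * (p - 1) * alpha          # grows linearly with p
    bandwidth_term = 2 * (p - 1) / p * beta * m  # saturates near 2*beta*m
    return latency_term + bandwidth_term


if __name__ == "__main__":
    alpha = 5e-6   # assumed: 5 microseconds per message
    beta = 4e-11   # assumed: seconds per element
    m = 1e8        # assumed: 100 million elements
    for p in (2, 8, 64, 512):
        t = ring_allreduce_time(p, m, alpha, beta)
        print(f"p={p:4d}  T={t * 1e3:8.3f} ms")
```

The bandwidth term flattens out near $2\beta m$, but the latency term keeps growing with $p$: every extra GPU adds more hops of coordination.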

Modern hardware LOVES uniformity

Why? Because control overhead is expensive too:

If 32 workers execute DIFFERENT instructions: hardware pays those costs repeatedly.

If 32 workers execute the SAME instruction on different data: hardware fetches / decodes / schedules once, then broadcasts execution across all lanes.

This is amortization: sharing one overhead across massive amounts of work.

Put numbers on it. Let $t_f$, $t_d$, $t_e$ be the time for fetch, decode, execute, and let $W$ be the SIMT/SIMD width (e.g. $W = 32$ for a GPU warp). One vector instruction processes $W$ items in a single $(t_f + t_d + t_e)$ cycle. So for $N$ items (with $N$ a multiple of $W$ for simplicity):

$$ T_{\text{scalar}}(N) \;=\; N\,(t_f + t_d + t_e) \qquad\quad T_{\text{SIMT}}(N, W) \;=\; \frac{N}{W}\,(t_f + t_d + t_e) $$

The raw speedup is just the lane count:

$$ \text{Speedup} \;=\; \frac{T_{\text{scalar}}}{T_{\text{SIMT}}} \;=\; W $$

But the amortization is sharper when you look at the control cost per item:

$$ \underbrace{(t_f + t_d)}_{\text{scalar, per item}} \;\;\longrightarrow\;\; \underbrace{\frac{t_f + t_d}{W}}_{\text{SIMT, per item}} $$

Control overhead per useful op drops by a factor of $W$. With $W = 32$, fetch+decode is $32\times$ cheaper per item, because its cost is amortized across every item the warp touches. That is where the throughput comes from.
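The same arithmetic as a minimal Python sketch, using the idealized model above (no memory stalls, $N$ a multiple of $W$). The function names and the stage times are hypothetical, in arbitrary units.

```python
def scalar_time(n: int, t_f: float, t_d: float, t_e: float) -> float:
    """T_scalar(N) = N * (t_f + t_d + t_e): every item pays the full pipeline."""
    return n * (t_f + t_d + t_e)


def simt_time(n: int, w: int, t_f: float, t_d: float, t_e: float) -> float:
    """T_SIMT(N, W) = (N / W) * (t_f + t_d + t_e): one instruction covers W items."""
    if n % w != 0:
        raise ValueError("this simplified model assumes N is a multiple of W")
    return (n // w) * (t_f + t_d + t_e)


def control_cost_per_item(w: int, t_f: float, t_d: float) -> float:
    """Fetch+decode cost attributed to each item when shared across W lanes."""
    return (t_f + t_d) / w


if __name__ == "__main__":
    t_f, t_d, t_e = 1.0, 1.0, 2.0  # assumed stage times, arbitrary units
    n, w = 1024, 32
    print("speedup:", scalar_time(n, t_f, t_d, t_e) / simt_time(n, w, t_f, t_d, t_e))
    print("control cost per item, scalar:", control_cost_per_item(1, t_f, t_d))
    print("control cost per item, SIMT:  ", control_cost_per_item(w, t_f, t_d))
```

Setting `w = 1` recovers the scalar case, which is a handy sanity check on the model.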

That's the heart of:

Uniformity → amortized control cost → cheap parallelism

Diversity / irregularity → broken amortization → expensive coordination

This also explains:

Warp divergence has a clean cost formula. A warp of $W=32$ lanes that splits across $k$ distinct control-flow paths must execute each path serially, with the lanes that are not on the current path idle during that pass. If every path takes the same time $T_{\text{path}}$, effective utilization and runtime are

$$ \eta_{\text{warp}} \;=\; \frac{1}{k} \qquad\qquad T_{\text{warp}} \;=\; k \cdot T_{\text{path}} $$

So a 4-way branch makes the warp $4\times$ slower, not because the math got harder, but because the broadcast got broken.
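Here is a toy Python model of that serialization, assuming every path costs the same `t_path`. It is a sketch of the cost formula, not of how a real GPU scheduler works, and `simulate_warp` is a name I made up for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def simulate_warp(lane_paths: Sequence[int], t_path: float = 1.0) -> Tuple[float, float]:
    """Return (utilization, runtime) for one warp.

    lane_paths[i] is the id of the control-flow path taken by lane i.
    Assumption: each distinct path costs t_path and paths run one after another.
    """
    w = len(lane_paths)
    if w == 0:
        raise ValueError("a warp needs at least one lane")
    active_per_pass = Counter(lane_paths)   # path id -> lanes active in that pass
    k = len(active_per_pass)                # number of serialized passes
    runtime = k * t_path
    useful_lane_time = sum(active_per_pass.values()) * t_path  # equals w * t_path
    utilization = useful_lane_time / (w * runtime)             # equals 1 / k
    return utilization, runtime


if __name__ == "__main__":
    uniform = [0] * 32                      # all 32 lanes take the same path
    four_way = [i % 4 for i in range(32)]   # lanes split across 4 paths
    print("uniform :", simulate_warp(uniform))
    print("four-way:", simulate_warp(four_way))
```

Utilization comes out as $1/k$ however the lanes are distributed across the $k$ paths, because each lane does useful work in exactly one of the $k$ passes.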

Optimization is reducing dependency cost

Most advanced optimization is NOT "make arithmetic faster."

It is: "reduce dependency cost."

Examples:

All try to reduce: waiting, synchronization, coordination overhead.

Everything compresses into five ideas

  1. Find independence
  2. Preserve uniformity
  3. Minimize coordination
  4. Amortize overhead
  5. Keep hardware busy

And suddenly GPUs, CPUs, distributed systems, databases, networking, and operating systems all start feeling like variations of the same deep ideas.


The mechanism: why uniformity → throughput

Here's the mechanistic picture of what's actually happening in the instruction pipeline:

Diagram: Amortization: how SIMT turns one instruction into massive throughput. Left panel, scalar execution: 1 worker, 1 item at a time, so N items cost N × (fetch + decode + exec) and control is the bottleneck. Right panel, SIMT / SIMD execution: fetch once, decode once, broadcast the instruction to N execution lanes, so N items cost 1 × (fetch + decode) + N × exec and execution is the bottleneck, which is what you want. Divergence breaks the broadcast and the speedup collapses with it; this is why warp divergence hurts, why tensor cores want regular shapes, and why batching inference scales.
Scalar execution pays fetch+decode for every item. SIMT pays it once and broadcasts: overhead becomes a constant, and throughput scales with lane count. Break uniformity, and you collapse back toward the left.

The mental model I keep coming back to:
