Parallelism: hardware doesn't create parallelism

Diagram: parallelism comes from independence; uniform execution amortizes overhead, while dependencies force coordination and reduce scalability. — Parallelism comes from independence in the workload, hardware only consumes what's already there.

One of the deepest realizations I’m having about systems / GPU / distributed computing:

Hardware does NOT create parallelism.

It only consumes parallelism that already exists in the data / workload.

That changes everything.

Parallelism fundamentally comes from:

independence

Not from:

GPUs
CUDA
tensor cores
clusters

Those are merely engines that exploit independence efficiently.

The deepest systems law

Independent work → scales well

Dependent work → coordination cost appears

And coordination creates:

synchronization
communication overhead
scheduling complexity
ordering constraints
consistency problems
stalls
idle hardware

That’s why systems engineering becomes hard.

Amdahl’s law makes this concrete. If a fraction $f$ of the work is parallelizable and the rest $(1-f)$ is serial, then with $p$ workers:

\[S(p) \;=\; \frac{1}{(1-f) \;+\; \dfrac{f}{p}} \quad\xrightarrow{p \to \infty}\quad \frac{1}{1-f}\]

The serial fraction $(1-f)$, every bit of coordination, dependency, synchronization, sets a hard ceiling on speedup, no matter how many GPUs you throw at the problem.
Even $1\%$ dependent work caps speedup at $100\times$. That’s the cost of dependence, written as a number.

Distributed inference, suddenly clear

Single-GPU inference:

mostly local computation
minimal coordination

Distributed inference:

activation transfers
all-reduce
KV synchronization
pipeline dependencies

Now GPUs become dependent workers. And dependent workers create:

waiting
communication
synchronization
bottlenecks

Which is why scaling distributed systems is hard.

The communication cost is also a formula. A ring all-reduce of an $m$-element gradient across $p$ GPUs takes roughly

\[T_{\text{allreduce}}(p, m) \;\approx\; 2(p-1)\,\alpha \;+\; 2\,\frac{p-1}{p}\,\beta\,m\]

where $\alpha$ is per-message latency and $\beta$ is per-byte time.
As $p$ grows the latency term $2(p-1)\alpha$ blows up linearly, every extra GPU adds round-trips.
The data term $\frac{2(p-1)}{p}\beta m \to 2\beta m$ saturates.
So at scale, latency, not bandwidth, is what kills you and that latency exists only because the workers depend on each other.

Modern hardware LOVES uniformity

Why? Because control overhead is expensive too:

instruction fetch
decode
scheduling
coordination

If 32 workers execute DIFFERENT instructions: hardware pays those costs repeatedly.

If 32 workers execute the SAME instruction on different data: hardware fetches / decodes / schedules once, then broadcasts execution across all lanes.

This is amortization: sharing one overhead across massive amounts of work.

Put numbers on it. Let $t_f$, $t_d$, $t_e$ be the time for fetch, decode, execute, and let $W$ be the SIMT/SIMD width (e.g. $W = 32$ for a GPU warp). One vector instruction processes $W$ items in a single $(t_f + t_d + t_e)$ cycle. So for $N$ items (with $N$ a multiple of $W$ for simplicity):

\[T_{\text{scalar}}(N) \;=\; N\,(t_f + t_d + t_e) \qquad\quad T_{\text{SIMT}}(N, W) \;=\; \frac{N}{W}\,(t_f + t_d + t_e)\]

The raw speedup is just the lane count:

\[\text{Speedup} \;=\; \frac{T_{\text{scalar}}}{T_{\text{SIMT}}} \;=\; W\]

But the amortization is sharper when you look at the control cost per item:

\[\underbrace{(t_f + t_d)}_{\text{scalar, per item}} \;\;\longrightarrow\;\; \underbrace{\frac{t_f + t_d}{W}}_{\text{SIMT, per item}}\]

Control overhead per useful op drops by a factor of $W$. With $W = 32$, that’s a $32\times$ cheaper fetch+decode amortized across every item the warp touches and that is the throughput.

That’s the heart of:

SIMD
SIMT
tensor cores
batching
GPU throughput

Uniformity → amortized control cost → cheap parallelism

Diversity / irregularity → broken amortization → expensive coordination

This also explains:

warp divergence
why GPUs hate branches
why branch-free kernels are fast
why tensor cores want regular shapes
why batching inference is powerful

Warp divergence has a clean cost formula. A warp of $W=32$ lanes that splits across $k$ distinct control-flow paths must execute each path serially, with most lanes idle on every pass. Effective utilization and runtime are

\[\eta_{\text{warp}} \;=\; \frac{1}{k} \qquad\qquad T_{\text{warp}} \;=\; k \cdot T_{\text{path}}\]

So a 4-way branch turns one warp into a $4\times$ slowdown, not because the math got harder, but because the broadcast got broken.

Optimization is reducing dependency cost

Most advanced optimization is NOT “make arithmetic faster.”

It is: “reduce dependency cost.”

Examples:

overlap compute + communication
lock-free structures
batching
async execution
CUDA streams
speculative execution
pipeline balancing

All try to reduce: waiting, synchronization, coordination overhead.

Everything compresses into five ideas

Find independence
Preserve uniformity
Minimize coordination
Amortize overhead
Keep hardware busy

And suddenly: GPUs, CPUs, distributed systems, databases, networking, operating systems, they all start feeling like variations of the same deep ideas.

The mechanism: why uniformity → throughput

The high-level claim is “uniform execution amortizes overhead.”

Here’s the mechanistic picture, what’s actually happening at the gate level:

Scalar execution pays fetch+decode for every item. SIMT pays it once and broadcasts, overhead becomes a constant, throughput scales with lane count. Break uniformity, and you collapse back toward the left.

The mental model I keep coming back to:

Find independence in your workload, then design execution so the hardware can amortize control across it.
Everything else: kernels, schedulers, runtimes, communication layers, is plumbing in service of that.