Blog · May 12, 2024

Parallelizing a Physics Simulation: CUDA vs. MPI+OpenMP

How I implemented the same 1D particle storm simulation three ways — hybrid MPI+OpenMP and CUDA — and what each architecture taught me about parallel performance.

When I started the particle storm simulation project, my first instinct was to just pick one parallelization approach and run with it. Instead, I ended up implementing the same simulation twice — once as a hybrid MPI+OpenMP solution for CPU clusters, once in CUDA for GPU — and comparing them head to head. This post is a short write-up of what that comparison actually taught me.

The simulation, briefly

The setup is simple to describe: a 1D material layer of L cells gets hit by S sequential storms of high-energy particles. Each storm drives the layer through three phases: bombardment (energy deposition), relaxation (thermal diffusion via a 3-point stencil), and hotspot detection.

The difficulty isn’t the physics — it’s the scale. Bombardment alone costs O(SPL)O(S \cdot P \cdot L). For the heaviest test case (L=106L = 10^6, P=5000P = 5000, S=4S = 4), that’s 2×10102 \times 10^{10} cell-particle evaluations. On a single core: 86 seconds. Completely intractable for parameter sweeps.

Why two implementations instead of one

Each parallelization model makes fundamentally different trade-offs. Building both forced me to understand why each decision matters — not just whether it works.

The CPU path: hybrid MPI+OpenMP

The CPU implementation distributes the layer across MPI ranks using 1D domain decomposition. Each rank owns a contiguous segment of cells plus a 2-cell halo on each side — one cell wider than the conventional choice.

That extra cell is deliberate. The relaxation stencil needs one ghost cell per boundary, and by keeping two, each rank can run relaxation locally on the first ghost cell without waiting for a second round of communication. The result: one MPI_Sendrecv pair per storm instead of two, halving inter-node traffic.

Within each rank, OpenMP threads parallelize the bombardment inner loop. But raw threading isn’t enough — the particle data arrives in Array-of-Structures format (position and energy interleaved), which blocks auto-vectorization. Before the main loop, I transpose it to Structure-of-Arrays: separate contiguous arrays for positions, energies, and precomputed influence radii. GCC then auto-vectorizes the bombardment kernel with AVX2 (8 floats per cycle), and the kernel stays branchless — out-of-range contributions are masked with an integer multiplier rather than a conditional branch, keeping SIMD lanes full.

The last CPU-specific decision was NUMA-aware initialization. The cluster nodes are dual-socket AMD EPYC 7301 systems with 8 NUMA domains. If all buffers are initialized on a single thread, Linux allocates the memory on one NUMA node and all subsequent accesses from other sockets pay remote-memory latency. Initializing inside the same schedule(static) parallel region used during computation forces Linux’s first-touch policy to distribute the 400 MB working set across all 8 memory controllers — effectively multiplying available DRAM bandwidth.

Peak result: 21.08× speedup on 64 physical cores (8 ranks × 8 threads).

The GPU path: CUDA

The GPU implementation looks different at every level, starting with the kernel structure. A naive approach launches three separate kernels per storm (one per phase), each requiring a full VRAM round-trip. Instead, all three phases are fused into a single kernel that keeps intermediate values in registers and shared memory.

Fusion alone isn’t enough — the relaxation stencil needs data from adjacent cells, and those cells might belong to adjacent thread blocks. The solution is geometric block overlap: adjacent blocks overlap by 4 cells (block stride = BLOCK_SIZE − 4 = 252), so each block has the extra data it needs without any additional global memory access.

The second major decision was particle tiling. Without it, each of the 256 threads in a block independently loads the same particle data from global VRAM — 256 redundant reads per particle. The cooperative strategy has all 256 threads collaboratively load one tile of 256 particles into shared memory, then process the entire tile together. 8× fewer VRAM reads, and broadcast access patterns eliminate bank conflicts.

The most instructive moment of the project came from profiling. An early version precomputed a 400 MB inverse-square-root lookup table on the device and transferred it via PCIe before each run. Profiling showed that transfer cost 740 ms — more than the entire rest of the computation. Replacing the table with inline sqrtf() (4–8 cycles on the RTX 6000’s dedicated hardware units) dropped total runtime from ~756 ms to ~13 ms.

The most expensive line in the codebase

The PCIe transfer for a precomputed lookup table was costing more than the entire GPU computation. Profiling before optimizing is not optional — it’s the only way to know where the time actually goes.

Peak result: 1242× speedup on the compute-intensive test case, reducing 86 seconds of sequential runtime to under 70 ms.

What the comparison actually shows

The two implementations didn’t just produce different numbers — they required different ways of thinking.

The CPU work was about memory topology: how data is laid out in cache, how NUMA domains are populated, how MPI boundaries interact with the stencil. The bottleneck was almost always bandwidth, not arithmetic.

The GPU work was about data movement at every level: global VRAM reads, shared memory bank conflicts, PCIe transfers. The arithmetic was essentially free; the question was always whether the data was in the right place at the right time.

Neither is “better” — they’re optimized for different constraints. The CPU solution scales across nodes and fits hardware that most research groups already have. The GPU solution is two orders of magnitude faster for the right workload, but requires thinking about memory at a level most CPU code never needs to.

The full methodology, benchmark tables, profiling breakdowns, and scaling analysis are written up in the project page.