Project · 2024
High-Performance Simulation of High Energy Particle Storms
Hybrid MPI+OpenMP and CUDA implementations of a 1D particle storm energy deposition and thermal diffusion simulation, achieving up to 1242× (CUDA) and 21× (MPI+OpenMP) speedup over a sequential baseline.
The physical problem
The simulation models S sequential particle storms hitting a one-dimensional material layer of L discrete cells. Each storm releases P high-energy particles and drives the layer through three sequential computational phases:
- Bombardment (Energy Deposition) — each particle deposits energy into every cell according to a square-root attenuation law:
with contributions below a material threshold discarded.
-
Relaxation (Thermal Diffusion) — accumulated energy diffuses through the layer via a 3-point moving-average stencil: . Boundary cells are left unchanged.
-
Hotspot Detection — the cell with the highest energy among all strict local maxima is recorded as the storm’s hotspot.
Why this is hard at scale
Bombardment alone costs . For the most compute-intensive test case ( cells, particles, storms), that is cell-particle evaluations — over 86 seconds on a single CPU core, completely intractable for the Monte Carlo parameter sweeps used in radiation shielding, proton therapy planning, and nuclear engineering.
Two implementations, two architectures
Both implementations are bit-exact against the sequential reference, verified across all test inputs.
Hybrid MPI + OpenMP (CPU)
Domain decomposition. The global layer is partitioned into contiguous segments, one per MPI rank. Each rank allocates an extended local buffer with 2-cell halos on each side. The 2-cell expansion (instead of the conventional 1-cell halo) is a communication-avoiding strategy: it allows relaxation to proceed locally on the first ghost cell at each boundary, reducing the required MPI exchanges to one MPI_Sendrecv pair per storm rather than two.
SoA layout and SIMD vectorization. Particle data arrives in Array-of-Structures format (position and energy interleaved), which prevents vectorization. Before the main loop, the data is transposed into Structure-of-Arrays — separate contiguous float arrays for positions, energies, and precomputed per-particle influence radii. This enables GCC to auto-vectorize the bombardment inner loop with 256-bit AVX2 instructions, processing 8 floats per cycle. The bombardment kernel is branchless by design: contributions outside the influence radius are masked with an integer multiplier (0 or 1) rather than a conditional branch, keeping SIMD lanes full.
NUMA-aware initialization. The test nodes are dual-socket AMD EPYC 7301 systems with 8 NUMA domains. Buffers are initialized inside the same schedule(static) parallel region used during computation, exploiting Linux’s first-touch policy to distribute the 400 MB layer across all 8 memory controllers — eliminating the single-NUMA bottleneck that would otherwise saturate one socket’s DRAM bandwidth.
Double buffering. Two pre-allocated layer buffers are alternated via pointer swapping at the end of each storm, removing a full copy pass per iteration.
CUDA (GPU)
Kernel fusion with geometric block overlap. A naive GPU implementation requires three separate kernel launches per storm, each incurring a full VRAM round-trip. All three phases are fused into a single kernel that keeps intermediate values in registers and shared memory. Adjacent thread blocks overlap by 4 cells (block stride = BLOCK_SIZE − 4 = 252), providing the extra halo data required for the relaxation stencil and local-maximum check without any additional global memory access.
Cooperative shared-memory particle tiling. Without tiling, each of the 256 threads in a block would independently load the same particle data from global VRAM. The cooperative strategy has all 256 threads collaboratively load one tile of 256 particles each into shared memory, then process the entire tile together. This reduces VRAM reads by 8× and eliminates bank conflicts via broadcast access patterns.
On-the-fly sqrtf. An early version precomputed a 400 MB inverse-square-root lookup table on the device and transferred it via PCIe. Profiling revealed this H2D copy cost 740 ms per run — more than the entire computation. Replacing it with inline sqrtf() (4–8 cycles on the RTX 6000’s dedicated hardware units) reduced total execution time from ~756 ms to ~13 ms: a 58× improvement from a single change.
Arena allocator. A single cudaMalloc covers all six device buffers (layer, block candidates, per-storm results, particle data, CUB scratch). Pointer arithmetic replaces individual allocation calls, reducing driver overhead by .
Asynchronous storm queue. All kernel and reduction launches are enqueued back-to-back in a single CUDA stream. Results are written directly into a pre-allocated d_result[S] array; a single cudaMemcpy at the end transfers all results in one shot, eliminating synchronization barriers.
Results
The CPU implementation peaks at 21.08× speedup on 64 physical cores ( MPI ranks OpenMP threads). Adding MPI ranks distributes the layer across NUMA memory controllers, effectively multiplying available DRAM bandwidth — at 16 ranks, each local segment fits in L3 cache, explaining super-linear scaling in certain configurations.
The GPU implementation reaches 313× application-level speedup (kernel-level ) on the memory-bandwidth-bound workload, and an outstanding 1242× on the compute-intensive case — reducing 86 seconds of sequential runtime to under 70 ms.
Full benchmark tables
Detailed per-test speedup numbers, profiling breakdowns, and strong-scaling analysis across all configurations are reported in the linked PDF below.
The source code is currently under evaluation at Sapienza — contact me if interested. The full report is available below.