MPI & OpenMP · Matteo Bernardi

MPI and OpenMP are my primary tools for parallel computing: MPI for distributed-memory systems (multi-node clusters) and OpenMP for shared-memory environments (single-node multithreading).

In the particle storm simulation project, I implemented a full hybrid MPI+OpenMP solution for a 1D energy deposition and thermal diffusion simulation, achieving a peak speedup of 21.08× on 64 physical cores (8 MPI ranks × 8 OpenMP threads).

Key design decisions in that implementation:

1D domain decomposition with 2-cell halo expansion, reducing per-storm MPI communication to one MPI_Sendrecv pair instead of two pairs
Structure-of-Arrays particle layout enabling AVX2 SIMD auto-vectorization of the branchless bombardment kernel (multiply-by-mask instead of conditional branches)
NUMA-aware first-touch initialization distributing the 400 MB working set across the 8 NUMA domains of the dual-socket AMD EPYC 7301 nodes — eliminating the single-controller memory bottleneck that limits pure-OpenMP scaling
Pointer-swapping double buffering to eliminate per-storm memory copies
Two-level reduction: thread-private accumulators updated in a nowait worksharing loop, followed by a single MPI_Reduce with MPI_MAXLOC for global hotspot detection
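
A minimal sketch of the shared-memory side of these decisions, written for illustration and not taken from the project code. The names Storm, Hotspot, alloc_first_touch, bombard, diffuse_and_find_max and run_storm are hypothetical, and the 1/distance attenuation and 3-point average are placeholder physics assumed here. The MPI stage (halo exchange and the final MPI_Reduce) is left out so the sketch builds without an MPI installation; the OpenMP pragmas are simply ignored if OpenMP is not enabled.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical Structure-of-Arrays storm: one contiguous array per field. */
typedef struct {
    size_t count;
    const float *pos;     /* impact position of each particle (cell units) */
    const float *energy;  /* energy carried by each particle */
} Storm;

/* Value and cell index of the hottest interior cell. */
typedef struct { float value; long index; } Hotspot;

/* First-touch allocation: each thread zeroes the chunk it will later work on,
 * using the same static schedule as the compute loops. Under a first-touch
 * page placement policy (assumed here), pages land near the touching thread. */
float *alloc_first_touch(size_t n)
{
    float *p = malloc(n * sizeof *p);
    if (p == NULL) return NULL;
    #pragma omp parallel for schedule(static)
    for (size_t k = 0; k < n; ++k) p[k] = 0.0f;
    return p;
}

/* Branchless deposition: every contribution is multiplied by a 0/1 mask
 * instead of being guarded by an if, which keeps the inner loop uniform. */
void bombard(float *layer, size_t n, const Storm *s, float threshold)
{
    #pragma omp parallel for schedule(static)
    for (size_t k = 0; k < n; ++k) {
        float acc = 0.0f;
        #pragma omp simd reduction(+:acc)
        for (size_t j = 0; j < s->count; ++j) {
            float dist = fabsf(s->pos[j] - (float)k) + 1.0f;
            float e = s->energy[j] / dist;          /* placeholder attenuation */
            float mask = (float)(e >= threshold);   /* 1.0f or 0.0f */
            acc += e * mask;
        }
        layer[k] += acc;
    }
}

/* One 3-point averaging step from src into dst (boundary cells are copied).
 * Each thread tracks its own maximum in a private accumulator; the loop is
 * nowait, and the per-thread results are merged in a short critical section.
 * Ties go to the lowest index, as MPI_MAXLOC does. */
Hotspot diffuse_and_find_max(const float *src, float *dst, size_t n)
{
    assert(n >= 3);
    Hotspot best = { -INFINITY, -1 };
    dst[0] = src[0];
    dst[n - 1] = src[n - 1];
    #pragma omp parallel
    {
        Hotspot local = { -INFINITY, -1 };
        #pragma omp for schedule(static) nowait
        for (size_t k = 1; k < n - 1; ++k) {
            dst[k] = (src[k - 1] + src[k] + src[k + 1]) / 3.0f;
            if (dst[k] > local.value) {
                local.value = dst[k];
                local.index = (long)k;
            }
        }
        #pragma omp critical
        {
            if (local.value > best.value ||
                (local.value == best.value && local.index < best.index))
                best = local;
        }
    }
    return best;
}

/* One storm: deposit, diffuse, then swap the buffer pointers so the diffused
 * layer becomes current without copying any data. */
Hotspot run_storm(float **cur, float **next, size_t n,
                  const Storm *s, float threshold)
{
    bombard(*cur, n, s, threshold);
    Hotspot h = diffuse_and_find_max(*cur, *next, n);
    float *tmp = *cur;
    *cur = *next;
    *next = tmp;
    return h;
}
```

In a hybrid build, each rank would call run_storm on its own subdomain and pass the returned value and index (shifted by the rank's global offset) to the MPI_MAXLOC reduction; that step is assumed rather than shown.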

I have also used both frameworks on the Sapienza CS Department cluster, with Slurm for job scheduling, and I am experienced in reasoning about compute-bound and memory-bandwidth-bound performance regimes and the different optimization strategies each requires.

Related projects

High-Performance Simulation of High Energy Particle Storms

Related articles

Parallelizing a Physics Simulation: CUDA vs. MPI+OpenMP