Skill
MPI & OpenMP
Parallel programming frameworks for distributed-memory (MPI) and shared-memory (OpenMP) computing in scientific and HPC environments.
MPI and OpenMP are my primary tools for exploiting parallel computing capabilities across distributed-memory systems (MPI, multi-node clusters) and shared-memory environments (OpenMP, single-node multi-threading).
In the particle storm simulation project, I implemented a full hybrid MPI+OpenMP solution for a 1D energy deposition and thermal diffusion simulation, achieving a peak speedup of 21.08× on 64 physical cores (8 MPI ranks × 8 OpenMP threads).
Key design decisions in that implementation:
- 1D domain decomposition with 2-cell halo expansion, reducing per-storm MPI communication to a single MPI_Sendrecv pair rather than two
- Structure-of-Arrays particle layout enabling AVX2 SIMD auto-vectorization of the branchless bombardment kernel (multiply-by-mask instead of conditional branches)
- NUMA-aware first-touch initialization distributing the 400 MB working set across the 8 NUMA domains of the dual-socket AMD EPYC 7301 nodes — eliminating the single-controller memory bottleneck that limits pure-OpenMP scaling
- Pointer-swapping double buffering to eliminate per-storm memory copies
- Two-level reduction: thread-private accumulators with nowait, followed by a single MPI_Reduce with MPI_MAXLOC for global hotspot detection
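Two of the items above, the branchless kernel and the pointer-swapped buffers, can be illustrated with a minimal single-process sketch in C. This is not the project's code: the `ParticlesSoA` struct, the function names, the linear attenuation `energy / distance`, and the 3-point averaging stencil are all assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical Structure-of-Arrays storage: one contiguous array per field. */
typedef struct {
    size_t       count;
    const int   *pos;     /* impact cell index of each particle */
    const float *energy;  /* energy carried by each particle    */
} ParticlesSoA;

/* Branchless bombardment. Every cell computes its contribution and a 0/1
 * mask zeroes the ones below the threshold, so the inner loop body has no
 * conditional branch and is a candidate for compiler auto-vectorization.
 * The attenuation law (energy / distance) is an assumed stand-in. */
void bombard(float *layer, int n, const ParticlesSoA *p, float threshold)
{
    for (size_t j = 0; j < p->count; ++j) {
        const int   pos = p->pos[j];
        const float e   = p->energy[j];
        for (int k = 0; k < n; ++k) {
            float dist    = (float)(abs(pos - k) + 1);
            float contrib = e / dist;
            float mask    = (float)(contrib >= threshold); /* 1.0f or 0.0f */
            layer[k] += contrib * mask;
        }
    }
}

/* One relaxation step: interior cells become the average of their
 * 3-point neighbourhood; boundary cells are copied unchanged. */
static void diffuse_step(const float *src, float *dst, int n)
{
    dst[0]     = src[0];
    dst[n - 1] = src[n - 1];
    for (int k = 1; k < n - 1; ++k)
        dst[k] = (src[k - 1] + src[k] + src[k + 1]) / 3.0f;
}

/* Runs `steps` relaxation steps over two caller-owned buffers of length n.
 * After each step the two pointers are swapped instead of copying data.
 * Returns whichever buffer holds the final state. */
float *diffuse(float *a, float *b, int n, int steps)
{
    assert(n >= 1);
    float *cur = a, *next = b;
    for (int s = 0; s < steps; ++s) {
        diffuse_step(cur, next, n);
        float *tmp = cur;   /* swap roles: O(1) instead of an O(n) copy */
        cur  = next;
        next = tmp;
    }
    return cur;
}
```

A caller would allocate both buffers once, call `bombard` on the current one for each storm, and continue from the pointer that `diffuse` returns. In a hybrid version, the halo exchange would take place between the two calls.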
I’ve also used these frameworks on the Sapienza CS Department Cluster, with Slurm for job scheduling. I have experience reasoning about both compute-bound and memory-bandwidth-bound performance regimes and the different optimization strategies each requires.