Skill

MPI & OpenMP

Parallel programming frameworks for distributed-memory (MPI) and shared-memory (OpenMP) computing in scientific and HPC environments.

MPI and OpenMP are my primary tools for parallel computing: MPI for distributed-memory systems (multi-node clusters) and OpenMP for shared-memory multi-threading within a single node.

In the particle storm simulation project, I implemented a full hybrid MPI+OpenMP solution for a 1D energy deposition and thermal diffusion simulation, achieving a peak speedup of 21.08× on 64 physical cores (8 MPI ranks × 8 OpenMP threads).

Key design decisions in that implementation:

  • 1D domain decomposition with 2-cell halo expansion, reducing per-storm MPI communication to a single MPI_Sendrecv pair rather than two
  • Structure-of-Arrays particle layout enabling AVX2 SIMD auto-vectorization of the branchless bombardment kernel (multiply-by-mask instead of conditional branches)
  • NUMA-aware first-touch initialization distributing the 400 MB working set across the 8 NUMA domains of the dual-socket AMD EPYC 7301 nodes — eliminating the single-controller memory bottleneck that limits pure-OpenMP scaling
  • Pointer-swapping double buffering to eliminate per-storm memory copies
  • Two-level reduction: thread-private accumulators updated in a nowait worksharing loop, followed by a single MPI_Reduce with MPI_MAXLOC for global hotspot detection
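
To make some of these decisions concrete, here is a minimal single-node sketch in C. It is not the project code. It illustrates the branchless mask kernel over a Structure-of-Arrays layout, first-touch initialization, pointer-swapping double buffering, and the thread-level half of the hotspot reduction. All names (ParticlesSoA, Hotspot, deposit, diffuse_step, find_hotspot, first_touch_zero) are hypothetical, and the 1/distance attenuation and 3-point average are placeholders for the real physics. The MPI halo exchange and the MPI_Reduce call are left out so that the sketch builds without an MPI installation. The OpenMP pragmas take effect when compiled with -fopenmp and are otherwise ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical Structure-of-Arrays layout: one contiguous array per field. */
typedef struct {
    const int   *pos;     /* impact cell of each particle */
    const float *energy;  /* energy carried by each particle */
    size_t       count;
} ParticlesSoA;

/* Same field order as MPI_FLOAT_INT, the pair type used with MPI_MAXLOC. */
typedef struct { float value; int index; } Hotspot;

/* First touch: initialize with the same static schedule the compute loops
   would use, so each page lands on the NUMA domain of the thread using it. */
void first_touch_zero(float *a, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (long k = 0; k < (long)n; ++k)
        a[k] = 0.0f;
}

/* Branchless deposition: contributions below the threshold are multiplied
   by a 0/1 mask instead of being skipped with an if. Placeholder physics. */
void deposit(float *layer, size_t n, const ParticlesSoA *p, float threshold)
{
    for (size_t j = 0; j < p->count; ++j) {
        const float e0 = p->energy[j];
        const float x0 = (float)p->pos[j];
        #pragma omp simd
        for (long k = 0; k < (long)n; ++k) {
            float d    = (float)k - x0;
            float dist = (d < 0.0f ? -d : d) + 1.0f;
            float e    = e0 / dist;
            float mask = (float)(e >= threshold);  /* 1.0f or 0.0f */
            layer[k] += e * mask;
        }
    }
}

/* One relaxation step with double buffering: write into *next, then swap
   the two pointers instead of copying the array back. */
void diffuse_step(float **cur, float **next, size_t n)
{
    assert(n >= 2);
    float *in = *cur, *out = *next;
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    #pragma omp parallel for schedule(static)
    for (long k = 1; k < (long)n - 1; ++k)
        out[k] = (in[k - 1] + in[k] + in[k + 1]) / 3.0f;
    *cur = out;
    *next = in;
}

/* Thread-level half of the two-level reduction: each thread keeps a private
   maximum, the loop has no barrier (nowait), and the partial results are
   merged once per thread. A hybrid code would then convert index to a
   global cell index and pass the pair to MPI_Reduce with MPI_MAXLOC. */
Hotspot find_hotspot(const float *layer, size_t n)
{
    assert(n >= 1);
    Hotspot best = { layer[0], 0 };
    #pragma omp parallel
    {
        Hotspot local = { layer[0], 0 };
        #pragma omp for nowait
        for (long k = 1; k < (long)n; ++k)
            if (layer[k] > local.value) {
                local.value = layer[k];
                local.index = (int)k;
            }
        #pragma omp critical
        if (local.value > best.value ||
            (local.value == best.value && local.index < best.index))
            best = local;
    }
    return best;
}
```

The tie-break on the lower index in the merge step mirrors what MPI_MAXLOC does across ranks, so the thread-level and rank-level results stay consistent.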

I’ve also used these frameworks on the Sapienza CS Department Cluster, with Slurm for job scheduling. I have experience reasoning about both compute-bound and memory-bandwidth-bound performance regimes and the different optimization strategies each requires.