Skill

CUDA

GPU programming for massively parallel computing in High Performance Computing contexts, including kernel fusion, shared-memory tiling, and asynchronous execution.

I use CUDA to parallelize numerical workloads on NVIDIA GPUs, mainly in High Performance Computing projects.

In the particle storm simulation project, I implemented a fully fused CUDA solution for a 1D energy deposition and thermal diffusion simulation. It achieved an application-level speedup of 313× (410× at kernel level) on the memory-bandwidth-bound workload and 1242× on the compute-intensive case, cutting 86 seconds of sequential runtime to under 70 ms.

Key techniques used:

  • Kernel fusion with geometric block overlap — all three simulation phases (bombardment, relaxation, hotspot detection) fused into a single kernel. Adjacent blocks overlap by 4 cells, providing the halo data needed for the stencil and local-maximum operations without any intermediate VRAM round-trips
  • Cooperative shared-memory particle tiling — all 256 threads in a block collaboratively load tiles of particles into shared memory, reducing VRAM reads by 8× and eliminating bank conflicts via broadcast access
  • On-the-fly sqrtf — profiling revealed that a 400 MB inverse-square-root lookup table transferred via PCIe was costing 740 ms per run, more than the entire computation. Replacing it with inline sqrtf() (4–8 cycles on dedicated hardware) reduced total execution time from ~756 ms to ~13 ms
  • Coalesced arena allocator — a single cudaMalloc covers all six device buffers, replacing six individual calls and reducing driver overhead
  • Non-blocking asynchronous storm queue — all S kernel and reduction launches are enqueued in one CUDA stream; results are written directly into a pre-allocated device array and transferred in a single cudaMemcpy at the end, eliminating S−1 synchronization barriers
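
The block-overlap idea behind the kernel fusion can be sketched as follows. This is a minimal illustration, not the project's kernel: it fuses only a 3-point relaxation with local-maximum detection, and it assumes the 4-cell overlap splits into a 2-cell halo on each side (one cell for the stencil, one for the peak test). The names relax_and_detect, fused_matches_cpu, HALO and OWNED are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

#define BLOCK 256                 // threads per block
#define HALO  2                   // assumed: 1 cell for the stencil + 1 for the peak test
#define OWNED (BLOCK - 2 * HALO)  // cells each block actually writes

// Relaxation (3-point average) and local-maximum detection in one launch.
__global__ void relax_and_detect(const float* in, float* out, int* is_peak, int n)
{
    __shared__ float raw[BLOCK];
    __shared__ float relaxed[BLOCK];
    int t = threadIdx.x;
    int g = blockIdx.x * OWNED + t - HALO;  // global cell; halo threads may fall outside [0, n)

    raw[t] = (g >= 0 && g < n) ? in[g] : 0.0f;
    __syncthreads();

    // Boundary cells (g == 0, g == n-1) and out-of-range slots keep their value.
    float r = raw[t];
    if (t > 0 && t < BLOCK - 1 && g > 0 && g < n - 1)
        r = (raw[t - 1] + raw[t] + raw[t + 1]) / 3.0f;
    relaxed[t] = r;
    __syncthreads();

    // Only interior threads write; their neighbours' relaxed values sit in shared memory,
    // so no intermediate result goes through global memory.
    if (t >= HALO && t < BLOCK - HALO && g < n) {
        out[g] = relaxed[t];
        is_peak[g] = (g > 0 && g < n - 1 &&
                      relaxed[t] > relaxed[t - 1] && relaxed[t] > relaxed[t + 1]) ? 1 : 0;
    }
}

// Runs the kernel on synthetic data and compares against a sequential reference (n >= 2).
bool fused_matches_cpu(int n)
{
    std::vector<float> h_in(n), ref(n);
    std::vector<int> ref_peak(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = (float)((i * 37) % 11);
    ref = h_in;
    for (int i = 1; i < n - 1; ++i) ref[i] = (h_in[i - 1] + h_in[i] + h_in[i + 1]) / 3.0f;
    for (int i = 1; i < n - 1; ++i)
        ref_peak[i] = (ref[i] > ref[i - 1] && ref[i] > ref[i + 1]) ? 1 : 0;

    float *d_in = nullptr, *d_out = nullptr;
    int* d_peak = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_peak, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + OWNED - 1) / OWNED;  // blocks advance by OWNED, so neighbours overlap by 4 cells
    relax_and_detect<<<blocks, BLOCK>>>(d_in, d_out, d_peak, n);

    std::vector<float> out(n);
    std::vector<int> peak(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(peak.data(), d_peak, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_peak);

    for (int i = 0; i < n; ++i)
        if (fabsf(out[i] - ref[i]) > 1e-5f || peak[i] != ref_peak[i]) return false;
    return true;
}
```

Each block redundantly loads and relaxes the four overlap cells, which trades a small amount of extra arithmetic for not having to write the relaxed layer out and launch a second kernel.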
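
A minimal sketch of cooperative particle tiling with an inline sqrtf(), under stated assumptions: one thread per cell, a tile size equal to the 256-thread block, and an illustrative attenuation of energy / sqrt(distance + 1). The Particle layout, the attenuation formula and the names deposit_tiled and tiled_matches_cpu are hypothetical, not taken from the project.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

#define TILE 256  // tile size; must equal blockDim.x in this sketch

// Hypothetical particle layout (impact cell and energy).
struct Particle { int pos; float energy; };

__global__ void deposit_tiled(float* layer, int n_cells,
                              const Particle* particles, int n_particles)
{
    __shared__ Particle tile[TILE];
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int base = 0; base < n_particles; base += TILE) {
        // Each thread loads one particle; the whole block then reuses the tile.
        int idx = base + threadIdx.x;
        if (idx < n_particles) tile[threadIdx.x] = particles[idx];
        __syncthreads();

        int count = min(TILE, n_particles - base);
        for (int j = 0; j < count; ++j) {
            // Every thread reads the same tile[j], which shared memory serves as a broadcast.
            float dist = fabsf((float)(tile[j].pos - cell)) + 1.0f;
            acc += tile[j].energy / sqrtf(dist);  // computed inline, no lookup table
        }
        __syncthreads();  // the tile must not be overwritten while others still read it
    }
    if (cell < n_cells) layer[cell] += acc;
}

// Runs the kernel on synthetic particles and compares against a sequential reference.
bool tiled_matches_cpu(int n_cells, int n_particles)
{
    std::vector<Particle> h_p(n_particles);
    for (int j = 0; j < n_particles; ++j) {
        h_p[j].pos = (j * 7919) % n_cells;
        h_p[j].energy = 1.0f + (float)(j % 5);
    }
    std::vector<float> ref(n_cells, 0.0f);
    for (int c = 0; c < n_cells; ++c)
        for (int j = 0; j < n_particles; ++j)
            ref[c] += h_p[j].energy / sqrtf(fabsf((float)(h_p[j].pos - c)) + 1.0f);

    float* d_layer = nullptr;
    Particle* d_p = nullptr;
    cudaMalloc(&d_layer, n_cells * sizeof(float));
    cudaMalloc(&d_p, n_particles * sizeof(Particle));
    cudaMemset(d_layer, 0, n_cells * sizeof(float));
    cudaMemcpy(d_p, h_p.data(), n_particles * sizeof(Particle), cudaMemcpyHostToDevice);

    deposit_tiled<<<(n_cells + TILE - 1) / TILE, TILE>>>(d_layer, n_cells, d_p, n_particles);

    std::vector<float> out(n_cells);
    cudaMemcpy(out.data(), d_layer, n_cells * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_layer); cudaFree(d_p);

    for (int c = 0; c < n_cells; ++c)
        if (fabsf(out[c] - ref[c]) > 1e-3f * fabsf(ref[c]) + 1e-5f) return false;
    return true;
}
```

Threads whose cell index falls past the end of the layer still take part in the loads and barriers, so every thread in the block reaches each __syncthreads() call.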
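
The arena idea can be sketched as one cudaMalloc plus offset bookkeeping. The DeviceArena type, its helper functions and the six buffer sizes are hypothetical. The 256-byte sub-buffer alignment is my assumption, chosen to match the minimum alignment cudaMalloc itself provides.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Hypothetical arena: a single device allocation, carved into sub-buffers by offset.
struct DeviceArena {
    char*  base = nullptr;
    size_t capacity = 0;
    size_t offset = 0;
};

static size_t align_up(size_t n, size_t a = 256) { return (n + a - 1) / a * a; }

cudaError_t arena_init(DeviceArena& a, size_t bytes)
{
    a.capacity = bytes;
    a.offset = 0;
    return cudaMalloc(reinterpret_cast<void**>(&a.base), bytes);  // the only cudaMalloc
}

// Returns an aligned sub-buffer of `count` elements, or nullptr if the arena is exhausted.
template <typename T>
T* arena_take(DeviceArena& a, size_t count)
{
    size_t start = align_up(a.offset);
    size_t end = start + count * sizeof(T);
    if (end > a.capacity) return nullptr;
    a.offset = end;
    return reinterpret_cast<T*>(a.base + start);
}

void arena_free(DeviceArena& a)
{
    cudaFree(a.base);  // one cudaFree releases all sub-buffers
    a = DeviceArena{};
}

// Carves six buffers out of one allocation and checks alignment and non-overlap.
bool arena_demo()
{
    const size_t counts[6] = {1000, 1000, 500, 500, 64, 64};  // hypothetical element counts
    size_t total = 0;
    for (size_t c : counts) total += align_up(c * sizeof(float));

    DeviceArena arena;
    if (arena_init(arena, total) != cudaSuccess) return false;

    float* buf[6];
    bool ok = true;
    for (int i = 0; i < 6; ++i) {
        buf[i] = arena_take<float>(arena, counts[i]);
        ok = ok && buf[i] != nullptr &&
             reinterpret_cast<uintptr_t>(buf[i]) % 256 == 0;
        if (ok && i > 0) ok = buf[i] >= buf[i - 1] + counts[i - 1];
    }
    ok = ok && arena_take<float>(arena, total) == nullptr;  // a request that cannot fit is refused
    arena_free(arena);
    return ok;
}
```

Aligning every sub-buffer keeps coalesced access patterns intact at the cost of a few padding bytes between buffers.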
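
A minimal sketch of the non-blocking queue pattern: every launch goes into one stream, each reduction writes its result into its own slot of a pre-allocated device array, and the host copies the array back once. The storm_kernel here is only a placeholder for the fused simulation kernel (storm s adds s + 1 to cell s), and StormResult, max_reduce and run_storms are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct StormResult { float max_val; int max_pos; };

// Placeholder for the fused simulation kernel: storm s adds (s + 1) to cell s % n.
__global__ void storm_kernel(float* layer, int n, int s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && i == s % n) layer[i] += (float)(s + 1);
}

// Single-block maximum reduction (launch with 256 threads); assumes non-negative values.
__global__ void max_reduce(const float* layer, int n, StormResult* out)
{
    __shared__ float vals[256];
    __shared__ int   idxs[256];
    int t = threadIdx.x;
    float best = -1.0f;
    int best_i = -1;
    for (int i = t; i < n; i += blockDim.x)
        if (layer[i] > best) { best = layer[i]; best_i = i; }
    vals[t] = best;
    idxs[t] = best_i;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride && vals[t + stride] > vals[t]) {
            vals[t] = vals[t + stride];
            idxs[t] = idxs[t + stride];
        }
        __syncthreads();
    }
    if (t == 0) { out->max_val = vals[0]; out->max_pos = idxs[0]; }
}

std::vector<StormResult> run_storms(int n, int S)
{
    float* d_layer = nullptr;
    StormResult* d_results = nullptr;
    cudaMalloc(&d_layer, n * sizeof(float));
    cudaMalloc(&d_results, S * sizeof(StormResult));  // one slot per storm

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemsetAsync(d_layer, 0, n * sizeof(float), stream);

    int blocks = (n + 255) / 256;
    for (int s = 0; s < S; ++s) {
        // Stream order guarantees storm s finishes before its reduction starts;
        // the host never waits inside this loop.
        storm_kernel<<<blocks, 256, 0, stream>>>(d_layer, n, s);
        max_reduce<<<1, 256, 0, stream>>>(d_layer, n, d_results + s);
    }

    std::vector<StormResult> h(S);
    cudaStreamSynchronize(stream);  // the single synchronization point
    cudaMemcpy(h.data(), d_results, S * sizeof(StormResult), cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream);
    cudaFree(d_layer);
    cudaFree(d_results);
    return h;
}
```

With n larger than S, storm s leaves the maximum s + 1 in cell s, which makes the per-storm results easy to check by hand.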

The most important lesson from this project: profile before you optimize. The bottleneck was in data movement, not arithmetic. Every optimization targeting the kernel before that discovery would have been wasted effort.
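
One lightweight way to see where the time goes is to bracket each phase with CUDA events. The sketch below is illustrative only: the kernel is a stand-in and the names PhaseTimes, sqrt_inplace and time_phases are hypothetical. It does not reproduce the project's profiling setup or its measurements.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct PhaseTimes { float h2d_ms; float kernel_ms; };

// Stand-in kernel; any real workload would go here.
__global__ void sqrt_inplace(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i]);
}

// Times the host-to-device copy and the kernel separately.
PhaseTimes time_phases(int n)
{
    std::vector<float> h(n, 4.0f);
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, copied, done;
    cudaEventCreate(&start);
    cudaEventCreate(&copied);
    cudaEventCreate(&done);

    cudaEventRecord(start);
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(copied);
    sqrt_inplace<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(done);
    cudaEventSynchronize(done);  // wait until the last event has actually been reached

    PhaseTimes t{};
    cudaEventElapsedTime(&t.h2d_ms, start, copied);     // transfer time in milliseconds
    cudaEventElapsedTime(&t.kernel_ms, copied, done);   // kernel time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(copied);
    cudaEventDestroy(done);
    cudaFree(d);
    return t;
}
```

Comparing h2d_ms with kernel_ms for a realistic input size shows whether the transfer or the arithmetic dominates before any kernel tuning begins.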