Skill
CUDA
GPU programming for massively parallel computing in High Performance Computing contexts, including kernel fusion, shared-memory tiling, and asynchronous execution.
CUDA is used to parallelize numerical workloads on NVIDIA GPUs, particularly in High Performance Computing projects.
In the particle storm simulation project, I implemented a fully fused CUDA solution for a 1D energy deposition and thermal diffusion simulation. It achieved an application-level speedup of 313× (kernel-level 410×) on the memory-bandwidth-bound workload and 1242× on the compute-intensive case, where 86 seconds of sequential runtime dropped to under 70 ms.
Key techniques used:
- Kernel fusion with geometric block overlap — all three simulation phases (bombardment, relaxation, hotspot detection) fused into a single kernel. Adjacent blocks overlap by 4 cells, providing the halo data needed for the stencil and local-maximum operations without any intermediate VRAM round-trips
- Cooperative shared-memory particle tiling — all 256 threads in a block collaboratively load tiles of particles into shared memory, reducing VRAM reads by 8× and eliminating bank conflicts via broadcast access
- On-the-fly `sqrtf` — profiling revealed that a 400 MB inverse-square-root lookup table transferred via PCIe was costing 740 ms per run, more than the entire computation. Replacing it with inline `sqrtf()` (4–8 cycles on dedicated hardware) reduced total execution time from ~756 ms to ~13 ms
- Coalesced arena allocator — a single `cudaMalloc` covers all six device buffers, replacing six individual calls and reducing driver overhead
- Non-blocking asynchronous storm queue — all S kernel and reduction launches are enqueued in one CUDA stream; results are written directly into a pre-allocated device array and transferred in a single `cudaMemcpy` at the end, eliminating S−1 synchronization barriers
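A minimal sketch of how three of these techniques might fit together: cooperative shared-memory tiling, a single-allocation arena, and a one-stream launch queue. It is not the project's code. The `Particle` layout, the `bombard` kernel, the `align_up` helper, the attenuation formula (energy divided by the square root of distance plus one), and all sizes are assumptions made for illustration. The sketch reuses one particle set for every storm and leaves out the relaxation and hotspot phases, the block overlap, and the reduction step.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block, matching the text

struct Particle { int pos; float energy; };  // hypothetical layout

// One thread per cell; the block streams particles through shared memory.
__global__ void bombard(float *layer, int n, const Particle *parts, int np) {
    __shared__ Particle tile[BLOCK];
    int cell = blockIdx.x * BLOCK + threadIdx.x;
    float acc = 0.0f;
    for (int base = 0; base < np; base += BLOCK) {
        int i = base + threadIdx.x;
        if (i < np) tile[threadIdx.x] = parts[i];  // cooperative, coalesced load
        __syncthreads();
        int count = min(BLOCK, np - base);
        for (int j = 0; j < count; ++j) {
            // Every thread reads the same tile[j]: a broadcast, no bank conflict.
            float dist = fabsf((float)(tile[j].pos - cell)) + 1.0f;
            acc += tile[j].energy / sqrtf(dist);  // attenuation computed inline
        }
        __syncthreads();  // do not overwrite the tile while it is still in use
    }
    if (cell < n) layer[cell] += acc;
}

// Round a byte offset up to 256 bytes so each carved buffer stays aligned.
static size_t align_up(size_t x) { return (x + 255) & ~size_t(255); }

int main() {
    const int n = 1000, np = 600, storms = 3;  // illustrative sizes
    std::vector<Particle> h_parts(np);
    for (int i = 0; i < np; ++i) h_parts[i] = { (i * 37) % n, 1.0f + i % 5 };

    // Arena: one cudaMalloc, with the two buffers carved out by offset.
    size_t off_parts = align_up(n * sizeof(float));
    size_t total = off_parts + align_up(np * sizeof(Particle));
    char *arena = nullptr;
    if (cudaMalloc((void **)&arena, total) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    float *d_layer = reinterpret_cast<float *>(arena);
    Particle *d_parts = reinterpret_cast<Particle *>(arena + off_parts);

    // Queue: everything goes into one stream, with no waits between storms.
    // (With pageable host memory the copies may not overlap with host work;
    // pinned memory from cudaMallocHost would be needed for that.)
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemsetAsync(d_layer, 0, n * sizeof(float), stream);
    cudaMemcpyAsync(d_parts, h_parts.data(), np * sizeof(Particle),
                    cudaMemcpyHostToDevice, stream);
    int blocks = (n + BLOCK - 1) / BLOCK;
    for (int s = 0; s < storms; ++s)
        bombard<<<blocks, BLOCK, 0, stream>>>(d_layer, n, d_parts, np);

    std::vector<float> h_layer(n);
    cudaMemcpyAsync(h_layer.data(), d_layer, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // the only synchronization point

    // CPU reference using the same per-storm accumulation order.
    double max_err = 0.0;
    for (int c = 0; c < n; ++c) {
        float acc = 0.0f;
        for (int j = 0; j < np; ++j)
            acc += h_parts[j].energy /
                   sqrtf(fabsf((float)(h_parts[j].pos - c)) + 1.0f);
        float ref = 0.0f;
        for (int s = 0; s < storms; ++s) ref += acc;
        max_err = fmax(max_err, fabs((double)h_layer[c] - ref) / ref);
    }
    printf("max relative error vs CPU: %g\n", max_err);

    cudaStreamDestroy(stream);
    cudaFree(arena);  // a single free releases every carved buffer
    return max_err < 1e-4 ? 0 : 1;
}
```

The sketch should build with something like `nvcc -O2 sketch.cu -o sketch`. Threads whose cell index falls past the end of the layer still take part in the tile loads and barriers, which keeps `__syncthreads()` safe in the last block.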
The most important lesson from this project: profile before you optimize. The bottleneck was in data movement, not arithmetic. Every optimization targeting the kernel before that discovery would have been wasted effort.
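One way to see where the time goes before touching a kernel is to bracket the transfer and the launch with CUDA events. The sketch below assumes a placeholder kernel (`scale`) and an arbitrary buffer size, neither taken from the project. It only shows the measurement pattern, and a profiler such as Nsight Systems would give a fuller picture.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder workload; stands in for whatever kernel is being studied.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;  // illustrative size (64 MiB of floats)
    std::vector<float> h(n, 1.0f);
    float *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0);                       // before the PCIe transfer
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);                       // between transfer and kernel
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(t2);                       // after the kernel
    cudaEventSynchronize(t2);                  // wait until t2 has completed

    float ms_copy = 0.0f, ms_kernel = 0.0f;
    cudaEventElapsedTime(&ms_copy, t0, t1);    // milliseconds between events
    cudaEventElapsedTime(&ms_kernel, t1, t2);
    printf("host-to-device copy: %.3f ms\n", ms_copy);
    printf("kernel:              %.3f ms\n", ms_kernel);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(d);
    return 0;
}
```

If the copy time is much larger than the kernel time, as it was with the lookup table described above, kernel tuning is unlikely to help much until the transfer is removed or shrunk.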