We'll demonstrate acceleration of a large, preexisting Fortran fluid dynamics solver using Kokkos, a C++ library that enables a single codebase to achieve high performance on multiple parallel architectures, including NVIDIA GPUs. We'll describe the complete process: identifying performance-critical physics subroutines, porting and optimizing these routines, integrating Kokkos C++ with the main Fortran code in a minimally invasive way, and tuning cluster-level performance. We'll compare the performance achieved when Kokkos uses NVIDIA Tesla K40 GPUs, Knight's Corner Xeon Phis, and Xeon CPUs.
We'll also present some GPU-specific optimizations. For "trivially parallel" physics calculations, assigning one NVIDIA CUDA thread to each grid point may not be ideal. If a small team works cooperatively on each grid point, performance can improve due to the larger amount of effective cache available to each team.