The direct solution of the N-body problem is a simple, yet scientifically important and ubiquitous showcase algorithm for modern GPUs. However, the computational complexity is O(N^2). The fast multipole method is an algorithm that reduces runtime and complexity to optimal O(N) for any required precision. We'll present an optimized, fully NVIDIA CUDA-enabled, templated C++ implementation of the FMM, which considers all stages of the method, from particle input to the forces extraction. We compare different parallelization approaches and show the performance improvement when going from a dynamic parallelization to a presorted list-based approach that fits particular system constraints such as periodic boundary conditions. We'll discuss how to exploit the FMM operators such that both memory access overhead and the number of complex multiplications are minimized. Thereby the kernels are led to the compute bound range, and performance is increased.