We'll review a successful method for accelerating the 1D FFT by reducing the amount of communication required to be performed. The resulting method discards nearly two-thirds of the communication in exchange for the application of many hierarchical structured dense matrices, which can be applied efficiently via the fast multipole method (FMM). This FMM is formulated to be maximally computationally efficient on modern architectures and require little auxiliary space and data. We'll review the formulation, stages of computation, free parameters, and heuristics for choosing them, and efficient implementation strategies for an optimized FMM-FFT distributed across many GPUs. We'll present results obtained on up to eight Telsa P100 GPUs that show 1.2-2.2x speedup over the distributed 1D FFT provided by CUFFTXT 8.0.